Batch upload browser

From QSPR Wiki

Jump to:navigation, search

This browser allows to upload data in SDF or Excel format.

The batch upload allows to upload large pieces of data stored in Excel or SDF file. After the upload of data, user can explore them and decide whether he/she wants to store or discard them. If data have been stored after the upload, they all be placed in default unnamed Basket and can be further BATCH EDITed.


Excel file

One property record corresponds to one row in the Excel file. Each column correspond to different features of the property record. A sample file with description of columns can be downloaded. The BATCH UPLOAD TUTORIAL provides step-by-step instructions how the users can create and upload the data.

Private/Hidden data

In the excel sheet, create a column with title 'HIDDEN'. Each row that contains "1" in this column will make the corresponding record private, that means hidden to other users. It is possible to publish the record later after it has been verified without uploading it again. Use the batch edit feature to publish multiple records at once.

Expert Estimations and Predicted Values"

Create a column with title 'ESTIMATED' in the excel sheet. Each row that contains "1" in this column will make the corresponding record an "Expert estimation / Prediction". If you wish to publish a static model with an applicability domain of solely N compounds for which you have expert/model predictions, upload the experimental data "normally" and, additionally the predictions using this flag. Then create a static model. Also, model interpretation methods are applicable to sets of expert estimations. If you wish to search for corresponding experimental information for your molecules, create a molecule tag using a basket of experimental or expert values and select this tag in the upper right corner of the browser. Refresh the measured properties browser.

Checklist before uploading excel

Don't forget to add a 'Reference Paper' column containing a paper ID or QID before starting the batch upload.

SDF file

Use the same convention for naming as in the Excel file.

Helping thread on how to upload data from arbitrary source if you have...

This guide assumes that you have excel installed on your machine. Sometimes, you will need to install additional software that is free and public available.

Badly formatted text input that cannot be directly opened to a valid excel sheet

To collect multiple spaces to one, so 'evil' 13-13-10 linebreaks etc. disappear you can use the tr on unix/mac os. Assuming the input text file is data.txt, correct to data.corr.txt by tr -s '[:space:]' < data.txt > BP\ SOMspl\ data.corr.txt Then open data.corr.txt in excel.

No SMILES but data files called 'compound_name.<something>' that contain structural information (in arbitrary subdirectory structure) and a <compound_name>s column

Use babel (-> download openbabel, free software) to convert all the neat files you get in all the subdirectorys into one smiles-to-filename-without-extension-table by the unix command

find . -name *.hin -execdir babel -b -h -i hin {} -o can \; | sed s/".hin"// > all_canonical_smiles.txt

or, with nitro group correction

find . -name *.hin -execdir babel -b -h -i hin {} -o can \; | sed -e "s/.hin//" -e "s/N(O)O/N(=O)=O/g" -e "s/ON(O)/O=N(=O)/g" > all_canonical_smiles.txt

Now, import the data (new.corr.txt) into excel (canonical). Import the all_canonical_smiles into excel, which is canonical as well. Now, use VLOOKUP to reference to a correct SMILES for each given 'Compound-it-that-was-selected-as-it-comes-and-contains-duplicates'. e.g. C2 should contain =VLOOKUP(A2 & "",$N:$O, 2, FALSE) if A# contains 'CompoundID', N# contains the "CompoundID-Filenames", O# contains the SMILES-from-filenames Then, copy the contents of the C# column to the clipboard, paste them into a _text_ editor, copy them back and paste them into the 'D#' column. (see also section Compound names and name-smiles pairs)

Compound names and name-smiles pairs

If you have molecular identifiers or given names which are ambiguous or cannot be resolved automatically in the excel sheet you want to upload, either add corresponding SMILES by hand or, if you have an excel sheet with name - smiles pairs, you can use vlookup.

E.g.: use "=VLOOKUP(B3 & "";$V:$W; 2;FALSE)" to automatically select SMILES. In this case, B3 is the field to take the SMILES, &"" is to make sure it is parsed as a string, $V:$W are the columns to search in, 2 means that SMILES are in the column 1 right of the NAME, FALSE means only return real hits. You might want to use TRIM(B3) instead of B3 to remove leading, trailing or duplicate spaces.

One or multiple columns containing 'data set numbers'

Use the float-basket conditions for batch importing if you want to have numbers here. Or, use the data set condition and let the column contain valid data set entities (Data 1, Data 2 etc.)