SampleData_Reference.md

Sample Datasets for Chemical Space Visualization

The sample datasets are distributed with the library. In the repository they can be found here. The sources of the datasets are cited below.

Name Formatting: type_size_name_num_of_classes.csv

  • type: R->Numerical and C->Categorical
  • size: Number of instances in the dataset
  • name: Name of dataset
  • num_of_classes: Number of classes (Categorical only)
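The naming convention above can be decoded programmatically. The following is a minimal sketch, not part of the library: `DatasetInfo` and `parse_dataset_filename` are hypothetical names introduced here for illustration, and the sketch assumes that the class count is present only for `C` files, as in the file names listed in this document.

```python
from pathlib import Path
from typing import NamedTuple, Optional


class DatasetInfo(NamedTuple):
    """Fields encoded in a sample dataset file name (hypothetical helper type)."""
    task_type: str               # "C" (categorical) or "R" (numerical)
    size: int                    # number of instances in the dataset
    name: str                    # dataset name
    num_classes: Optional[int]   # None for numerical datasets


def parse_dataset_filename(filename: str) -> DatasetInfo:
    """Split a name of the form type_size_name_num_of_classes.csv into its fields.

    Assumption: numerical ("R") files omit the trailing num_of_classes field.
    """
    parts = Path(filename).stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"Unexpected file name: {filename!r}")

    task_type = parts[0]
    size = int(parts[1])

    if task_type == "C":
        if len(parts) < 4:
            raise ValueError(f"Missing class count in: {filename!r}")
        # Everything between the size and the class count is the dataset name.
        name = "_".join(parts[2:-1])
        num_classes: Optional[int] = int(parts[-1])
    elif task_type == "R":
        name = "_".join(parts[2:])
        num_classes = None
    else:
        raise ValueError(f"Unknown dataset type {task_type!r} in: {filename!r}")

    return DatasetInfo(task_type, size, name, num_classes)


if __name__ == "__main__":
    for fname in ("C_1484_CLINTOX_2.csv", "R_642_SAMPL.csv"):
        print(fname, "->", parse_dataset_filename(fname))
```

Because the class count is taken from the last field and the name from everything in between, a dataset name that itself contains underscores would still be parsed, although none of the file names listed here needs this.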

Datasets and Sources

  1. Clintox dataset [1-4] (Toxicity) -> C_1484_CLINTOX_2.csv
  2. BACE dataset [5] (Inhibitor) -> C_1513_BACE_2.csv
  3. BBBP dataset [6] (Blood-brain barrier penetration) -> C_2039_BBBP_2.csv
  4. HIV dataset [7] -> C_41127_HIV_2.csv
  5. HIV dataset [7] -> C_41127_HIV_3.csv
  6. SAMPL dataset [8] (Hydration free energy) -> R_642_SAMPL.csv
  7. BACE dataset [5] (Binding affinity) -> R_1513_BACE.csv
  8. LOGP dataset [9] (Lipophilicity) -> R_4200_LOGP.csv
  9. LOGS dataset [10] (Aqueous Solubility) -> R_1291_LOGS.csv
  10. AQSOLDB dataset [11] (Aqueous Solubility) -> R_9982_AQSOLDB.csv

Note: Datasets 1-8 are edited versions of datasets from the MoleculeNet repository [12].

References:

[1] Gayvert, Kaitlyn M., Neel S. Madhukar, and Olivier Elemento. "A data-driven approach to predicting successes and failures of clinical trials." Cell chemical biology 23.10 (2016): 1294-1301.

[2] Artemov, Artem V., et al. "Integrated deep learned transcriptomic and structure-based predictor of clinical trials outcomes." bioRxiv (2016): 095653.

[3] Novick, Paul A., et al. "SWEETLEAD: an in silico database of approved drugs, regulated chemicals, and herbal isolates for computer-aided drug discovery." PloS one 8.11 (2013): e79568.

[4] Aggregate Analysis of ClinicalTrials.gov (AACT) Database. https://www.ctti-clinicaltrials.org/aact-database

[5] Subramanian, Govindan, et al. "Computational modeling of β-secretase 1 (BACE-1) inhibitors using ligand based approaches." Journal of chemical information and modeling 56.10 (2016): 1936-1949.

[6] Martins, Ines Filipa, et al. "A Bayesian approach to in silico blood-brain barrier penetration modeling." Journal of chemical information and modeling 52.6 (2012): 1686-1697.

[7] AIDS Antiviral Screen Data. https://wiki.nci.nih.gov/display/NCIDTPdata/AIDS+Antiviral+Screen+Data

[8] Mobley, David L., and J. Peter Guthrie. "FreeSolv: a database of experimental and calculated hydration free energies, with input files." Journal of computer-aided molecular design 28.7 (2014): 711-720.

[9] Hersey, A. ChEMBL Deposited Data Set - AZ dataset; 2015. https://doi.org/10.6019/chembl3301361

[10] Huuskonen, J. "Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology." Journal of chemical information and computer sciences 40.3 (2000): 773-777.

[11] Sorkun, M. C., A. Khetan, and S. Er. "AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds." Scientific data 6.1 (2019): 1-8.

[12] Wu, Zhenqin, et al. "MoleculeNet: a benchmark for molecular machine learning." Chemical science 9.2 (2018): 513-530.