...
To check and possibly increase the coverage of various chemistries in our training and test sets. Example molecules could be constructed (e.g.
[GitHub link]
Dataset ideas:
NCI 250K: https://cactus.nci.nih.gov/download/nci/
ChEMBL 30: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_30/chembl_30_release_notes.txt
PDB: http://ligand-expo.rcsb.org/dictionaries/Components-smiles-stereo-oe.smi
ZINC:
Pre-filtered by Riniker lab (also ChEMBL): https://www.research-collection.ethz.ch/handle/20.500.11850/230799
MMFF (Merck Molecular Force Field): maybe not for training, but for validation. Possible reference?
MOPAC training set
...
Searching the ChEMBL 33 database for molecules matching these parameters reveals that 41 11 of them (
[View file]
In light of this, I think t115h, t116i, t116j, t123, t130g, t130h, t132g, t133g, t133h, t142j, t142k, t142l, t143i are good candidates for deletion, while t59g, t60g, t61g, and t62g likely need some kind of coverage in the training set.
...
t122h, t132i, and t143l probably need additional training coverage from another data set. Of these, t31a, t122h, t132i, and t143l currently have a little training coverage, but t59g and t60g are “covered” only by the benchmarking set, so those two in particular need some type of training data.
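The triage described above could be sketched as a small script. This is only a rough illustration under stated assumptions: the rule (no database matches means deletion candidate, no training matches means training data is needed) is inferred from the notes, the `min_training` threshold is arbitrary, the names `Coverage` and `triage` are hypothetical, and the counts and parameter IDs are placeholders rather than real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Coverage:
    """Match counts for one parameter ID (all values below are placeholders)."""
    chembl: int     # molecules in the reference database that use the parameter
    training: int   # training-set molecules that use the parameter
    benchmark: int  # benchmark-set molecules that use the parameter


def triage(cov: Coverage, min_training: int = 5) -> str:
    """Classify a parameter; the rule and the threshold are assumptions."""
    if cov.chembl == 0:
        # Nothing in the reference database exercises this parameter.
        return "candidate for deletion"
    if cov.training == 0:
        # Possibly seen in the benchmark set, but never trained on.
        return "needs training data"
    if cov.training < min_training:
        return "needs more training data"
    return "ok"


# Hypothetical counts for made-up parameter IDs, for illustration only.
counts = {
    "tA": Coverage(chembl=0, training=0, benchmark=0),
    "tB": Coverage(chembl=120, training=0, benchmark=3),
    "tC": Coverage(chembl=85, training=2, benchmark=1),
    "tD": Coverage(chembl=900, training=40, benchmark=12),
}

report = {pid: triage(cov) for pid, cov in counts.items()}
for pid, status in report.items():
    print(f"{pid}: {status}")
```

In practice the counts would come from labeling each data set with the force field and tallying parameter IDs; that step is left out here because it depends on tooling not described in these notes.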
...
To do
| Chemistry | | |
|---|---|---|
| Sulfonic and phosphonic acids | | |
| Sulfur functional groups – sulfones, sulfonates, sulfinyl, sulfoxy, sulfoximines, sulfonamides, thioethers, thiazoles, sulfonimidamides, … | | |
| Nitrogen functional groups common in drugs | | |
...