Background
Sage 2.0 was trained on Gen2 data (List of QM training/benchmark datasets), while Sage 2.1 was trained on a combination of Gen1 and Gen2 data ( ). These datasets have known coverage gaps for particular chemistries, including molecules containing halogens, hypervalent sulfur groups, and sulfonic and phosphonic acids. These gaps may account for such problems as:
Many of these rarer functional groups are still important in medicinal chemistry:
Goal
To check and possibly increase the coverage of various chemistries in our training and test sets. Example molecules could be constructed (e.g. ) or filtered from various datasets, such as ChEMBL30 or https://enamine.net/building-blocks/medchem/view-all/sulfoximines .
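As a rough illustration of the coverage check, the sketch below counts how many molecules in a SMILES list match each functional-group pattern and reports the groups that fall below a threshold. The SMARTS strings, the names `PATTERNS`, `count_coverage`, `find_gaps` and `rdkit_matcher`, and the `min_count` threshold are all assumptions made for this example, not curated project definitions. The matcher is passed in as a callable so that either RDKit or another toolkit could be used; the RDKit variant shown relies only on `Chem.MolFromSmiles`, `Chem.MolFromSmarts` and `Mol.HasSubstructMatch`.

```python
from typing import Callable, Dict, Iterable, List

# Illustrative SMARTS only (assumption): these are not curated definitions
# and would need review before being used to filter real datasets.
PATTERNS: Dict[str, str] = {
    "sulfonic acid": "[#6][SX4](=O)(=O)[OX2H1,OX1-]",
    "phosphonic acid": "[#6][PX4](=O)([OX2H1,OX1-])[OX2H1,OX1-]",
    "sulfone": "[#6][SX4](=O)(=O)[#6]",
    "sulfonamide": "[SX4](=O)(=O)[NX3]",
    "sulfoximine": "[#6][SX4](=O)(=[NX2])[#6]",
    "thioether": "[#6][SX2][#6]",
    "halogen": "[F,Cl,Br,I]",
}

Matcher = Callable[[str, str], bool]


def rdkit_matcher(smiles: str, smarts: str) -> bool:
    """Return True if the molecule contains the SMARTS pattern (needs RDKit)."""
    from rdkit import Chem  # imported lazily so the rest works without RDKit

    mol = Chem.MolFromSmiles(smiles)
    pattern = Chem.MolFromSmarts(smarts)
    if mol is None or pattern is None:
        return False  # unparsable input counts as "no match"
    return mol.HasSubstructMatch(pattern)


def count_coverage(
    smiles_list: Iterable[str], patterns: Dict[str, str], matcher: Matcher
) -> Dict[str, int]:
    """Count, per pattern name, how many molecules match that pattern."""
    counts = {name: 0 for name in patterns}
    for smiles in smiles_list:
        for name, smarts in patterns.items():
            if matcher(smiles, smarts):
                counts[name] += 1
    return counts


def find_gaps(counts: Dict[str, int], min_count: int = 1) -> List[str]:
    """Return the pattern names matched by fewer than min_count molecules."""
    return sorted(name for name, n in counts.items() if n < min_count)


# Example usage with RDKit installed (not executed here):
#   counts = count_coverage(training_smiles, PATTERNS, rdkit_matcher)
#   print(find_gaps(counts, min_count=10))
```

The same functions could be run over a candidate source such as ChEMBL30 to see which under-represented groups it would add, before any QM dataset is built from it.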
Dataset ideas:
NCI 250K: https://cactus.nci.nih.gov/download/nci/
ChEMBL30: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_30/chembl_30_release_notes.txt
PDB: http://ligand-expo.rcsb.org/dictionaries/Components-smiles-stereo-oe.smi
ZINC:
Pre-filtered by Riniker lab (also ChEMBL): https://www.research-collection.ethz.ch/handle/20.500.11850/230799
MMFF (Merck Molecular Force Field): perhaps for validation rather than training. Possible reference?
MOPAC training set
To do
Chemistry | | |
---|---|---|
Sulfonic and phosphonic acids | | |
Sulfur functional groups: sulfones, sulfonates, sulfinyl, sulfoxy, sulfoximines, sulfonamides, thioethers, thiazoles, sulfonimidamides, … | | |
Nitrogen functional groups common in drugs | | |