Increasing dataset coverage

Background

Sage 2.0 was trained on Gen2 data (List of QM training/benchmark datasets), while Sage 2.1 was trained on a combination of Gen1 and Gen2 data ( ). These datasets have known coverage gaps for particular chemistries, including molecules containing halogens, hypervalent sulfur groups, and sulfonic and phosphonic acids. These gaps may account for problems such as:

https://openforcefield.atlassian.net/wiki/spaces/FF/pages/2592604191/Further+sulf+on+amide+improvements

Many of these rarer functional groups are still important in medicinal chemistry.

Goal

To check and possibly increase the coverage of various chemistries in our training and test sets. Example molecules could be constructed (e.g. ) or filtered from various datasets, such as ChEMBL30 or https://enamine.net/building-blocks/medchem/view-all/sulfoximines .

Dataset ideas:

NCI 250K: https://cactus.nci.nih.gov/download/nci/
ChEMBL30: https://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBLdb/releases/chembl_30/chembl_30_release_notes.txt
PDB: http://ligand-expo.rcsb.org/dictionaries/Components-smiles-stereo-oe.smi
Zinc:
- Full: https://zinc.docking.org/
- Pre-filtered by Riniker lab (also ChEMBL): https://www.research-collection.ethz.ch/handle/20.500.11850/230799
MMFF (Merck Molecular Force Field): perhaps for validation rather than training. Possible reference?
MOPAC training set

Proper Torsions

Assessing coverage

Starting from version 2 of the Torsion multiplicity force field, which includes split torsion parameters for every incorrect torsion identified by Pavan and Meghan, I computed the coverage for every proper torsion parameter: . As shown therein, the 20 parameters t18b, t59g, t60g, t61g, t62g, t87a, t115h, t116i, t116j, t123, t130g, t130h, t132g, t133g, t133h, t138a, t142j, t142k, t142l, and t143i are not covered by a single torsion in the existing Sage 2.1.0 torsion drive data set augmented with the torsion multiplicity data set. Furthermore, filtering by number of molecules rather than number of torsions reveals a much larger set of 94 parameters with fewer than 15 corresponding molecules in this training set: .
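The two coverage counts above (torsions per parameter and distinct molecules per parameter) can be sketched as follows. This is only an illustration in plain Python, not the script used to produce the numbers on this page: the input format (one `(molecule_id, parameter_id)` pair per labeled torsion), the function names, and the toy parameter IDs `tA`, `tB`, `tC` are all assumptions. In practice the pairs would come from labeling each training molecule with the force field.

```python
def coverage(assignments, all_parameters):
    """Count torsions and distinct molecules per parameter.

    assignments: iterable of (molecule_id, parameter_id) pairs, one per
        labeled torsion (hypothetical input format).
    all_parameters: every parameter ID in the force field, so that
        parameters with no matches still appear with zero counts.
    Returns {parameter_id: (n_torsions, n_molecules)}.
    """
    torsions = {p: 0 for p in all_parameters}
    molecules = {p: set() for p in all_parameters}
    for mol_id, param_id in assignments:
        torsions[param_id] = torsions.get(param_id, 0) + 1
        molecules.setdefault(param_id, set()).add(mol_id)
    return {p: (torsions[p], len(molecules[p])) for p in torsions}


def uncovered(cov):
    """Parameters not matched by a single torsion."""
    return sorted(p for p, (n_tors, _) in cov.items() if n_tors == 0)


def sparse(cov, min_molecules=15):
    """Parameters matched in fewer than min_molecules distinct molecules."""
    return sorted(p for p, (_, n_mols) in cov.items() if n_mols < min_molecules)


# Toy data only: molecule m1 contains two tA torsions and one tB torsion,
# molecule m2 contains one tA torsion, and nothing matches tC.
toy = [("m1", "tA"), ("m1", "tA"), ("m2", "tA"), ("m1", "tB")]
cov = coverage(toy, ["tA", "tB", "tC"])
```

Counting by molecule is the stricter test: a parameter can have many torsions that all come from the same few molecules, which is why the per-molecule filter flags more parameters than the per-torsion one.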

Searching the ChEMBL 33 database for molecules matching these parameters reveals that 11 of them (t31a, t59g, t60g, t122h, t123, t130g, t132g, t132i, t142l, t143i, t143l) do not match any of the 2.4 million molecules in the database, suggesting that they may correspond to rare chemistries. However, 6 of the 11 are covered by our existing training or testing data despite never appearing in ChEMBL, leaving only 5 parameters covered by neither: t123, t130g, t132g, t142l, and t143i. As denoted by the suffixes g-l, most of these are new parameters from the more extensive version of the torsion multiplicity force field; the exception is t123, which has previously been flagged for deletion. In light of this, I think t123, t130g, t132g, t142l, and t143i are good candidates for deletion.

On the other hand, t31a, t59g, t60g, t122h, t132i, and t143l probably need additional training coverage from another data set. Of these, t31a, t122h, t132i, and t143l currently have a little training coverage, but t59g and t60g are “covered” only by the benchmarking set, so these two particularly need some type of training data.
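The bookkeeping in the two paragraphs above reduces to a few set operations. The parameter IDs below are copied from the text; the variable names are just for illustration, and the `has_training_coverage` and `benchmark_only` sets list only the parameters among the 11 ChEMBL misses, not the full coverage of either data set.

```python
# The 11 parameters with no match in ChEMBL 33 (from the text above).
not_in_chembl = {
    "t31a", "t59g", "t60g", "t122h", "t123", "t130g",
    "t132g", "t132i", "t142l", "t143i", "t143l",
}

# Of those, the ones with at least a little training coverage ...
has_training_coverage = {"t31a", "t122h", "t132i", "t143l"}
# ... and the ones matched only in the benchmarking set.
benchmark_only = {"t59g", "t60g"}

covered_somewhere = has_training_coverage | benchmark_only

# Matched neither in ChEMBL nor in our own data: candidates for deletion.
deletion_candidates = not_in_chembl - covered_somewhere

# Rare in ChEMBL but present in our data: candidates for more training data,
# with the benchmark-only parameters being the most urgent.
needs_more_training = not_in_chembl & covered_somewhere

print("delete:", sorted(deletion_candidates))
print("add training data:", sorted(needs_more_training))
print("most urgent:", sorted(benchmark_only))
```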

To do

Chemistry
Sulfonic and phosphonic acids
Sulfur functional groups – sulfones, sulfonates, sulfinyl, sulfoxy, sulfoximines, sulfonamides, thioethers, thiazoles, sulfonimidamides, …
Nitrogen functional groups common in drugs