Background

Sage 2.0 was trained on Gen2 data (see List of QM training/benchmark datasets), while Sage 2.1 was trained on a combination of Gen1 and Gen2 data (https://github.com/openforcefield/sage-2.1.0/blob/main/inputs-and-outputs/2.1.0-dataset-curation.py). These datasets have known coverage gaps for particular chemistries, including molecules containing halogens, hypervalent sulfur groups, and sulfonic and phosphonic acids. These gaps may account for such problems as:

...

Proper Torsions

Assessing coverage

Starting from version 2 of the torsion multiplicity force field, which includes split torsion parameters for every incorrect torsion identified by Pavan and Meghan, I computed the coverage for every proper torsion parameter (attached file: v2-tm.dat). As shown therein, the 20 parameters t18b, t59g, t60g, t61g, t62g, t87a, t115h, t116i, t116j, t123, t130g, t130h, t132g, t133g, t133h, t138a, t142j, t142k, t142l, and t143i are not covered by a single torsion in the existing Sage 2.1.0 torsion drive data set augmented with the torsion multiplicity data set. Furthermore, filtering by number of molecules rather than number of torsions reveals a much larger set of 94 parameters with fewer than 15 corresponding molecules in this training set (attached file: want.params).
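The two coverage measures used above (number of matching torsions versus number of distinct molecules) can be tallied roughly as follows. This is a minimal sketch, not the script that produced v2-tm.dat or want.params. It assumes the parameter labels are already available as a mapping from a molecule identifier to the list of proper torsion parameter IDs assigned to that molecule's torsions (for example, collected from the OpenFF Toolkit's `ForceField.label_molecules`). The function names `torsion_coverage` and `undercovered` and the toy input are hypothetical.

```python
from collections import Counter


def torsion_coverage(labels):
    """Tally coverage per parameter ID.

    labels: dict mapping a molecule identifier to the list of proper
    torsion parameter IDs assigned to that molecule's torsions
    (assumed input format, one entry per matched torsion).
    Returns (matches counted per torsion, matches counted per molecule).
    """
    by_torsion = Counter()
    by_molecule = Counter()
    for param_ids in labels.values():
        by_torsion.update(param_ids)        # every matched torsion counts
        by_molecule.update(set(param_ids))  # each molecule counts once
    return by_torsion, by_molecule


def undercovered(all_params, by_molecule, min_molecules=15):
    """Parameters matched by fewer than min_molecules distinct molecules."""
    return sorted(p for p in all_params if by_molecule[p] < min_molecules)


# Toy input for illustration only; not taken from the real training set.
labels = {
    "mol-a": ["t1", "t1", "t17", "t59g"],
    "mol-b": ["t1", "t17"],
    "mol-c": ["t1", "t1", "t1"],
}
all_params = ["t1", "t17", "t59g", "t123"]

by_torsion, by_molecule = torsion_coverage(labels)
uncovered = [p for p in all_params if by_torsion[p] == 0]
sparse = undercovered(all_params, by_molecule, min_molecules=2)

print("no torsions at all:", uncovered)
print("fewer than 2 molecules:", sparse)
```

Counting by molecule is the stricter measure here because a single molecule with many symmetric torsions can inflate the per-torsion count without adding chemical diversity.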

Searching the ChEMBL 33 database for molecules matching these parameters reveals that 41 of them (attached file: no_chembl.dat) do not match any of the 2.4 million molecules in the database, suggesting that these may correspond to rare chemistries. However, 24 of the 41 are covered by our existing training data despite never appearing in ChEMBL, leaving only 17 parameters not covered by either one: t59g, t60g, t61g, t62g, t115h, t116i, t116j, t123, t130g, t130h, t132g, t133g, t133h, t142j, t142k, t142l, and t143i. Thus, only 3 of the 20 uncovered parameters, t18b, t87a, and t138a, are covered by ChEMBL but not the original data set. As denoted by the suffixes g-l, most of the 17 are new parameters from the more extensive version of the torsion multiplicity force field, with the exception of t123, which has been previously flagged for deletion.

As a final check, I also searched for these parameters in our industry benchmarking data set. This leaves only 13 parameters that are not covered by ChEMBL, the training set, or the benchmarking set: t115h, t116i, t116j, t123, t130g, t130h, t132g, t133g, t133h, t142j, t142k, t142l, and t143i. In other words, despite not being found in ChEMBL, t59g, t60g, t61g, and t62g are all covered by our industry benchmark.

In light of this, I think t115h, t116i, t116j, t123, t130g, t130h, t132g, t133g, t133h, t142j, t142k, t142l, and t143i are good candidates for deletion, while t59g, t60g, t61g, and t62g likely need some kind of coverage in the training set.
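The bookkeeping in the preceding paragraphs reduces to a few set differences. The sketch below reproduces the counts using only the parameter IDs listed in the text; the variable names are hypothetical, and the sets are typed in by hand rather than read from the attached files.

```python
# Parameters with no matching torsion in the training data (20, from the text).
not_in_training = {
    "t18b", "t59g", "t60g", "t61g", "t62g", "t87a", "t115h", "t116i",
    "t116j", "t123", "t130g", "t130h", "t132g", "t133g", "t133h",
    "t138a", "t142j", "t142k", "t142l", "t143i",
}

# Of those, the ones the text reports as matched in ChEMBL 33.
found_in_chembl = {"t18b", "t87a", "t138a"}

# Of the remainder, the ones the text reports as matched in the
# industry benchmarking data set.
found_in_benchmark = {"t59g", "t60g", "t61g", "t62g"}

# Not covered by the training data or by ChEMBL.
missing_from_both = not_in_training - found_in_chembl

# Not covered anywhere: candidates for deletion.
deletion_candidates = missing_from_both - found_in_benchmark

# Absent from training and ChEMBL but present in the benchmark:
# these likely need new training data.
needs_training_data = missing_from_both & found_in_benchmark

print(len(not_in_training), len(missing_from_both),
      len(deletion_candidates), len(needs_training_data))
```

Keeping each data source as its own set makes it straightforward to rerun the same partition if another data set is added to the search.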

...

To do

Chemistry

Sulfonic and phosphonic acids

Sulfur functional groups – sulfones, sulfonates, sulfinyl, sulfoxy, sulfoximines, sulfonamides, thioethers, thiazoles, sulfonimidamides, …

Nitrogen functional groups common in drugs

Attachments

Enamine_Sulfoximines_431cmpds_20210302.sdf

...