...

To check and possibly increase the coverage of various chemistries in our training and test sets. Example molecules could be constructed (e.g. https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-04-09-OpenFF-Gen3-Torsion-Set-v1.0 ) or filtered from various datasets, such as ChEMBL30 or https://enamine.net/building-blocks/medchem/view-all/sulfoximines .

Dataset ideas:

...

Searching the ChEMBL 33 database for molecules matching these parameters reveals that 11 of them (no_chembl.dat: t31a, t59g, t60g, t122h, t123, t130g, t132g, t132i, t142l, t143i, t143l) are not covered once in the 2.4 million molecules found in the database, suggesting that these may correspond to rare chemistries. However, 6/11 are actually covered by our existing training or testing data despite never appearing in ChEMBL, leaving only 5 parameters not covered by either one: t123, t130g, t132g, t142l, and t143i. Thus, only 3 parameters, t18b, t87a, and t138a, are covered by ChEMBL but not the original data set. As denoted by the suffixes g-l, most of the uncovered parameters are new parameters from the more extensive version of the torsion multiplicity force field, with the exception of t123, which has been previously flagged for deletion. As a final check, I also searched for these parameters in our industry benchmarking data set.

In light of this, I think t123, t130g, t132g, t142l, and t143i are good candidates for deletion. On the other hand, t31a, t59g, t60g, t122h, t132i, and t143l probably need additional training coverage from another data set. Of these, t31a, t122h, t132i, and t143l have a little training coverage currently, but t59g and t60g are “covered” only by the benchmarking set. These two in particular need some type of training data.
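
The bookkeeping above is plain set arithmetic, so it can be double-checked with a few lines of Python. This is a minimal sketch rather than the script used for the analysis: the parameter IDs are copied from the text, while the names `uncovered`, `in_training`, and `in_benchmark_only` are hypothetical and not part of any OpenFF tooling.

```python
import re

# Parameters with no match in ChEMBL 33 (the no_chembl.dat list quoted above).
not_in_chembl = {
    "t31a", "t59g", "t60g", "t122h", "t123", "t130g",
    "t132g", "t132i", "t142l", "t143i", "t143l",
}

# Coverage outside ChEMBL, as described in the text (assumed grouping).
in_training = {"t31a", "t122h", "t132i", "t143l"}   # "a little training coverage"
in_benchmark_only = {"t59g", "t60g"}                # benchmarking set only


def param_key(pid):
    """Sort key: numeric part of the ID first, then the optional letter suffix."""
    match = re.fullmatch(r"t(\d+)([a-z]?)", pid)
    return int(match.group(1)), match.group(2)


def uncovered(candidates, *covered_sets):
    """Return the candidates that appear in none of the covered sets."""
    covered = set().union(*covered_sets)
    return sorted(candidates - covered, key=param_key)


deletion_candidates = uncovered(not_in_chembl, in_training, in_benchmark_only)
print(deletion_candidates)
```

With the sets above, the remainder is the five parameters proposed for deletion, and the two benchmark-only parameters are the ones flagged as most in need of training data.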

...

To do

Chemistry

Sulfonic and phosphonic acids

Sulfur functional groups – sulfones, sulfonates, sulfinyl, sulfoxy, sulfoximines, sulfonamides, thioethers, thiazoles, sulfonimidamides, …

Nitrogen functional groups common in drugs

...