Torsion multiplicity

Background

Torsion parameters in OpenFF describe the energy of rotating around the central bond of the torsion. A central bond can have many different torsions applied. The number of torsions applied is the multiplicity. In general, a torsional parameter should only apply to bonds of a particular multiplicity. However, previous analysis of our parameters has uncovered many torsions that are not specific enough: https://openforcefield.atlassian.net/wiki/spaces/FF/pages/2178416641

The tar file is a copy of the Google Drive folder linked in “Analysis of Torsion multiplicity” minus the very large “old_images” directory.

Goal

We should split our torsions so that each applies to a single multiplicity. Even if this does not substantially improve benchmarks, applying one parameter across several multiplicities is a philosophical error worth correcting. If the split makes benchmarks substantially worse, we should re-assess.

Generated Datasets

Sage 2.0.0

  • Force field file with the new torsion terms (added to Sage 2.0.0), obtained from JM, that needs to be fit

    • An initial attempt to refit with the new torsions showed degraded performance on benchmarks

Sage 2.1.0

The torsion terms initially added to Sage 2.0.0 are first ported to Sage 2.1.0 by:

  1. Removing the torsions present in Sage 2.0.0 but not present in the torsion multiplicity (TM) force field above

  2. Adding the torsions present in the TM force field and not present in Sage 2.0.0, appending an “x” to any IDs already used in Sage 2.1.0

  3. Sorting the torsions by ID, first by their numerical part and then by their string suffix

These steps are handled by this script.
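The three porting steps can be sketched as follows. This is a minimal illustration, not the linked script: it assumes each force field has been reduced to a hypothetical mapping from torsion ID to SMIRKS, and that "present" means an identical SMIRKS string. The names `split_id` and `port_torsions` are invented here.

```python
import re


def split_id(param_id):
    """Split an ID such as 't142d' into (142, 'd') for sorting."""
    match = re.fullmatch(r"t(\d+)(\w*)", param_id)
    if match is None:
        raise ValueError(f"unexpected torsion ID: {param_id}")
    return int(match.group(1)), match.group(2)


def port_torsions(sage_200, tm, sage_210):
    """Port the TM torsion changes onto Sage 2.1.0.

    Each argument is a dict mapping torsion ID -> SMIRKS (an assumed,
    simplified stand-in for the real force field files).
    """
    old_smirks = set(sage_200.values())
    tm_smirks = set(tm.values())

    # Step 1: drop torsions that Sage 2.0.0 has but the TM force field lacks.
    ported = {
        pid: smirks
        for pid, smirks in sage_210.items()
        if not (smirks in old_smirks and smirks not in tm_smirks)
    }

    # Step 2: add torsions that are new in the TM force field, appending
    # "x" to any ID already used in Sage 2.1.0.
    used = set(sage_210)
    for pid, smirks in tm.items():
        if smirks in old_smirks:
            continue
        while pid in used:
            pid += "x"
        used.add(pid)
        ported[pid] = smirks

    # Step 3: sort by the numerical part, then by the string suffix.
    return dict(sorted(ported.items(), key=lambda item: split_id(item[0])))
```

Sorting on the (number, suffix) pair places t10 after t2b, which a plain string sort would not do.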

Training coverage

The training data for the TM force field is a combination of two sources:

  1. The original Sage 2.1.0 optimization and torsion drive data sets

  2. The torsion multiplicity data sets generated and linked above

The Sage 2.1.0 sets are taken from the Sage 2.1.0 repo and the TM sets are taken from the links above. The Sage 2.1.0 sets are also re-filtered with this script for compatibility with the updated environment, removing any molecules with AM1BCC-ELF10 charging issues. The TM data sets are taken through a more general filtering scheme from this script. The resulting subsets are then combined to yield the full TM training data.

Here I define a parameter as “covered” if its SMIRKS pattern matches at least one part of at least one molecule in the data set, as determined by the ForceField.label_molecules method. Using this metric, the initial coverage for the Sage 2.1.0 proper torsion parameters and the Sage 2.1.0 torsion drive data set is 97.2%, with 176/181 parameters covered. The 5 without training data (uncovered) are t18a, t18b, t87a, t123, and t138a. The full report can be found in this file:
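A coverage check of this kind might look like the sketch below. It is not the script used for the report. It assumes `force_field` is an OpenFF toolkit ForceField and `molecules` is an iterable of toolkit Molecule objects; the helper names `matched_torsion_ids` and `coverage` are hypothetical.

```python
def matched_torsion_ids(force_field, molecules):
    """Return the IDs of proper torsions matched by at least one molecule."""
    matched = set()
    for molecule in molecules:
        # label_molecules returns one dict of labels per molecule in the
        # topology, keyed by parameter handler name.
        for labels in force_field.label_molecules(molecule.to_topology()):
            matched.update(p.id for p in labels["ProperTorsions"].values())
    return matched


def coverage(all_ids, matched_ids):
    """Return (percent of parameters covered, sorted uncovered IDs)."""
    all_ids = set(all_ids)
    uncovered = sorted(all_ids - set(matched_ids))
    percent = 100 * (len(all_ids) - len(uncovered)) / len(all_ids)
    return percent, uncovered
```

The full list of torsion IDs to pass as `all_ids` could be read from `force_field.get_parameter_handler("ProperTorsions").parameters`. As a binary metric, this counts a parameter as covered even when it matches only a handful of torsions.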

The coverage for the TM force field with the TM data set is given by the file below. In short, the relative coverage decreases to 93.9%, with 200/213 parameters covered. The missing parameters are now t18b, t87a, t116b, t116c, t123, t130a, t130b, t132a, t138a, t142d, t142e, t142f, and t143c. Four of the five previously uncovered parameters remain uncovered, along with 9 of the new parameters. Of course, the categorical covered/uncovered metric is not ideal because some of the covered parameters still have very little training data. For example, t18a, which was previously uncovered, is now considered covered, but it only matches 6 torsions in the whole TorsionDrive data set.

To separate the effects of the altered training data from the altered force field, I have also refit the Sage force field with the TM data set. The coverage for this set is given in the file below. This combination increases the coverage to 97.8% (177/181) by covering t18a, as described above.
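The coverage figures quoted in this section can be cross-checked with a few lines of set arithmetic. The parameter IDs and totals below are copied from the text; the variable and function names are only illustrative.

```python
TOTAL_SAGE = 181  # proper torsion parameters in Sage 2.1.0
TOTAL_TM = 213    # proper torsion parameters in the TM force field

uncovered_sage = {"t18a", "t18b", "t87a", "t123", "t138a"}
uncovered_tm = {
    "t18b", "t87a", "t116b", "t116c", "t123", "t130a", "t130b",
    "t132a", "t138a", "t142d", "t142e", "t142f", "t143c",
}
# Sage 2.1.0 refit on the TM data: t18a becomes covered.
uncovered_sage_tm_data = uncovered_sage - {"t18a"}

still_uncovered = uncovered_sage & uncovered_tm  # uncovered in both cases
newly_uncovered = uncovered_tm - uncovered_sage  # new TM parameters only


def percent_covered(total, uncovered):
    """Percentage of parameters with at least one match, to one decimal."""
    return round(100 * (total - len(uncovered)) / total, 1)


print(percent_covered(TOTAL_SAGE, uncovered_sage))          # 97.2
print(percent_covered(TOTAL_TM, uncovered_tm))              # 93.9
print(percent_covered(TOTAL_SAGE, uncovered_sage_tm_data))  # 97.8
print(len(still_uncovered), len(newly_uncovered))           # 4 9
```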

Results

[Figure image-20240108-190743.png: All records]

Overall, the performance of the TM force field seems comparable to the original Sage 2.1.0 force field. The DDE graph, in particular, shows the TM results performing a bit worse than the original, but this discrepancy appears to be within the margin of error found for repeated Sage 2.1.0 training runs (https://openforcefield.atlassian.net/wiki/spaces/FF/pages/2676457500).

[Figure image-20240108-204326.png: Records affected by the new parameters]

Restricting the results to those records matching one of the new parameters (defined as parameters present in the TM force field and not present in Sage 2.1.0; given in the file below) barely changes the qualitative results. This makes sense because 62870 of the 71760 records in the results (about 88%) match one of the new parameters.

Similarly, restricting the results to the records strictly not affected by the new parameters shows the same trend. If anything, these may look slightly farther away from the original Sage 2.1.0 values.
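The split into affected and unaffected records can be sketched as below. This assumes a hypothetical `record_params` mapping from each benchmark record ID to the torsion parameter IDs the TM force field assigns to it, and it compares parameters by ID; the actual analysis may identify new parameters differently.

```python
def split_records(record_params, tm_ids, sage_ids):
    """Partition records by whether any new TM parameter applies to them.

    record_params: dict mapping record ID -> iterable of torsion IDs
    tm_ids, sage_ids: torsion IDs in the TM and Sage 2.1.0 force fields
    """
    # "New" parameters are present in the TM force field but not in Sage 2.1.0.
    new_ids = set(tm_ids) - set(sage_ids)
    affected = {
        record for record, ids in record_params.items() if new_ids & set(ids)
    }
    unaffected = set(record_params) - affected
    return affected, unaffected
```

With the counts reported above, `affected` would hold 62870 of the 71760 records, so the unaffected subset is comparatively small.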

However, as the tables above demonstrate, many of the new parameters have higher average errors than their parents. For example, t122 is applied 9572 times with Sage 2.1.0 and has an average error of 0.16 kcal/mol. Its child parameters in the TM force field, t122b, t122c, and t122f, are applied a total of 9572 times and have average errors of 0.21, 0.17, and 0.10 kcal/mol, respectively. Only the last of these is lower than the original t122 value, and it also has the lowest count, so a count-weighted average gives a higher overall error of 0.20 kcal/mol. The trends are less clear for the Sage vs Sage-TM comparison. Table 3 shows that the addition of the TM training data leads to a decrease in the average error for t122, t130, and t143, but an increase for t164 and t142.
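The count-weighted average used to compare a parent parameter with its children can be written as a short helper. The per-child counts for t122 are not given in the text, so the example values below are toy placeholders rather than benchmark data, and `weighted_mean_error` is a name introduced here.

```python
def weighted_mean_error(counts, errors):
    """Average the per-parameter errors, weighting each by its match count.

    counts: dict mapping parameter ID -> number of times it is applied
    errors: dict mapping parameter ID -> average error in kcal/mol
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no matches to average over")
    return sum(counts[pid] * errors[pid] for pid in counts) / total


# Toy placeholder values, NOT taken from the benchmark tables.
toy_counts = {"child_a": 3, "child_b": 1}
toy_errors = {"child_a": 0.2, "child_b": 0.1}
print(weighted_mean_error(toy_counts, toy_errors))
```

Because the weights are match counts, a rarely applied child with a low error barely moves the average, which is why the overall t122 value ends up close to the error of its most frequently applied child.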

Another factor demonstrated by these tables is that many of the new child parameters are not covered by the benchmarking data set in this case. t143, for example, has a, b, c, d, e, and f variants in the TM force field, but only a, b, and e appear in this data. In contrast, only t143c was not covered by the training set.

Conclusion

Overall, I think the results look comparable to the Sage 2.1.0 benchmarks, at least on the DDE, RMSD, and TFD metrics. Combining this with the philosophical argument of separating the handling of torsions with different multiplicity values, I think these new torsion parameters are ready for inclusion in the main Sage force field. Further, augmenting the training data with additional coverage for the new (and existing) parameters should only improve the quality of the resulting force field.

Status

  • Completed: Generate data for torsion re-fits

  • Completed: Split out torsion parameters into specific multiplicities

  • Completed: Port the split torsions to 2.1.0 from the above FF file. Keep them in the same order as in the FF file above, as ordering matters in SMIRNOFF force fields.

  • Completed: Check the number of torsion training targets (i.e. TorsionDriveRecord) available for each of the new parameters. Any parameter with too few targets will not have enough training data to be fit reliably.

  • Completed: Re-fit the force field with the added torsions with split multiplicities. Re-fit Sage 2.1.0 to the same data for a strict comparison. The data should be all data used to fit Sage 2.1.0, with additional torsions from the new datasets listed above.

  • Completed: Run benchmarks. Hopefully show improvement in torsion profiles for parameters that got split.