Torsion multiplicity

Background

Torsion parameters in OpenFF describe the energy of rotating around the central bond of the torsion. A central bond can have many different torsions applied, and the number of torsions applied is the multiplicity. In general, a torsion parameter should only apply to bonds of a particular multiplicity. However, previous analysis of our parameters has uncovered many torsions that are not specific enough (see Analysis of Torsion multiplicity).

Goal

We should split our torsions so that each one applies to a single multiplicity. Letting one parameter span several multiplicities is a philosophical error, so the split is worthwhile even if it does not substantially improve benchmarks. If it makes benchmarks substantially worse, we should re-assess.

Generated Datasets

Sage 2.0.0

Force field file with the new torsion terms added to Sage 2.0.0, obtained from JM. It still needs to be fit.
- An initial attempt to refit with the new torsions showed degraded performance on benchmarks.

Sage 2.1.0

The torsion terms initially added to Sage 2.0.0 are first ported to Sage 2.1.0 by:

- Removing the torsions present in Sage 2.0.0 but not present in the torsion multiplicity (TM) force field above
- Adding the torsions present in the TM force field and not present in Sage 2.0.0, appending an “x” to any IDs already used in Sage 2.1.0
- Sorting the torsions by ID, first by their numerical part and then by their string suffix

These steps are handled by this script.
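
As a rough illustration of the three porting steps, and not the actual script, the sketch below works on plain dictionaries mapping torsion ID to SMIRKS. The function names `sort_key` and `port_torsions`, the ID format `t<number><suffix>`, and the choice to compare torsions by ID rather than by SMIRKS are all assumptions made for this example.

```python
import re


def sort_key(param_id):
    """Split an ID such as 't142d' into (142, 'd') for sorting."""
    match = re.fullmatch(r"t(\d+)([a-z]*)", param_id)
    if match is None:
        raise ValueError(f"unexpected torsion ID: {param_id}")
    return int(match.group(1)), match.group(2)


def port_torsions(sage_200, sage_210, tm):
    """Port the TM torsions onto Sage 2.1.0.

    Each argument is a hypothetical dict of torsion ID -> SMIRKS.
    """
    removed = set(sage_200) - set(tm)  # in Sage 2.0.0 but dropped by TM
    added = set(tm) - set(sage_200)    # introduced by TM

    # Step 1: drop the torsions that the TM force field removed.
    ported = {pid: s for pid, s in sage_210.items() if pid not in removed}

    # Step 2: add the new torsions, appending "x" on an ID collision.
    for pid in sorted(added, key=sort_key):
        new_id = pid
        while new_id in ported:
            new_id += "x"
        ported[new_id] = tm[pid]

    # Step 3: sort by numerical part, then by string suffix.
    return dict(sorted(ported.items(), key=lambda item: sort_key(item[0])))
```

Because ordering matters in SMIRNOFF force fields, the sort in the last step determines which parameter is assigned when several patterns match the same torsion, so the ported file should be checked against the intended hierarchy.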

Training coverage

The training data for the TM force field is a combination of two sources:

- The original Sage 2.1.0 optimization and torsion drive data sets
- The torsion multiplicity data sets generated and linked above

The Sage 2.1.0 sets are taken from the Sage 2.1.0 repo and the TM sets are taken from the links above. Additionally, the Sage 2.1.0 sets are re-filtered with this script to remove any molecules that have AM1BCC-ELF10 charging issues in the updated environment. The TM data sets are taken through a more general filtering scheme from this script. The resulting subsets are then combined to yield the full TM training data.

Here I am defining a parameter as “covered” if at least one part of at least one molecule in the data set matches the parameter’s SMIRKS pattern, as determined by the ForceField.label_molecules method. Using this metric, the initial coverage for the Sage 2.1.0 proper torsion parameters and the Sage 2.1.0 torsion drive data set is 97.2%, with 176/181 parameters covered. The 5 without training data (uncovered) are t18a, t18b, t87a, t123, and t138a. The full report can be found in this file:
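
The coverage metric can be sketched as follows. This is a minimal illustration, assuming the parameter IDs matched in each molecule have already been extracted from the output of ForceField.label_molecules into plain lists; the function names `coverage` and `percent` are hypothetical.

```python
def coverage(parameter_ids, ids_per_molecule):
    """Return (covered, uncovered) lists of parameter IDs.

    parameter_ids: every proper torsion ID in the force field.
    ids_per_molecule: one iterable of matched parameter IDs per molecule
    (assumed to be pre-extracted from the labeling results).
    """
    matched = set()
    for ids in ids_per_molecule:
        matched.update(ids)
    # A single match anywhere in the data set counts as covered.
    covered = [pid for pid in parameter_ids if pid in matched]
    uncovered = [pid for pid in parameter_ids if pid not in matched]
    return covered, uncovered


def percent(n_covered, n_total):
    """Coverage as a percentage rounded to one decimal place."""
    return round(100 * n_covered / n_total, 1)


# The figure quoted above: 176 of 181 parameters covered.
print(percent(176, 181))
```

Note that this is a categorical measure: a parameter matched by a single torsion counts the same as one matched by thousands.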

The coverage for the TM force field with the TM data set is given by the file below. In short, the relative coverage decreases to 93.9% with 200/213 parameters covered. The missing parameters are now t18b, t87a, t116b, t116c, t123, t130a, t130b, t132a, t138a, t142d, t142e, t142f, and t143c. 4/5 of the previously uncovered parameters are still uncovered, plus 9 of the new parameters. Of course, the categorical covered/uncovered metric is not ideal because some of the covered parameters still have very little training data. For example, t18a, which was previously uncovered, is now considered covered, but it only matches 6 torsions in the whole TorsionDrive data set.
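
The comparison between the two uncovered lists can be checked with simple set arithmetic. The sketch below only uses the parameter IDs and totals quoted above; the variable names are illustrative.

```python
# Uncovered parameters for Sage 2.1.0 with the Sage 2.1.0 torsion drive data.
sage_uncovered = {"t18a", "t18b", "t87a", "t123", "t138a"}

# Uncovered parameters for the TM force field with the TM data set.
tm_uncovered = {
    "t18b", "t87a", "t116b", "t116c", "t123", "t130a", "t130b",
    "t132a", "t138a", "t142d", "t142e", "t142f", "t143c",
}

still_uncovered = sage_uncovered & tm_uncovered  # uncovered in both cases
newly_covered = sage_uncovered - tm_uncovered    # gained coverage
additional = tm_uncovered - sage_uncovered       # only uncovered in the TM case

print(sorted(still_uncovered), sorted(newly_covered), len(additional))

# 213 parameters in total, 13 uncovered, so 200 covered.
n_covered = 213 - len(tm_uncovered)
print(n_covered, round(100 * n_covered / 213, 1))
```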

To separate the effects of the altered training data from the altered force field, I have also refit the Sage 2.1.0 force field with the TM data set. The coverage for this set is given in the file below. This combination increases the coverage to 97.8% (177/181) by covering t18a, as described above.

Results

Overall, the performance of the TM force field seems comparable to the original Sage 2.1.0 force field. The DDE graph, in particular, shows the TM results performing a bit worse than the original, but this discrepancy appears to be within the margin of error found for repeated Sage 2.1.0 training runs (Reproducibility).

Restricting the results to those records matching one of the new parameters (defined as parameters present in the TM force field and not present in Sage 2.1.0; given in the file below) barely changes the qualitative results. This makes sense because 62870 of the 71760 records in the results (about 87.6%) match one of the new parameters.

Similarly, restricting the results to the records strictly not affected by the new parameters shows the same trend. If anything, these may look slightly farther away from the original Sage 2.1.0 values.
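
The two subsets discussed above can be formed by partitioning the benchmark records on whether they match any new parameter. This is a minimal sketch, assuming each record has already been labeled with the set of torsion parameter IDs it matches; `split_records` and its inputs are hypothetical names.

```python
def split_records(record_params, new_parameter_ids):
    """Partition records by whether they match any new parameter.

    record_params: dict of record ID -> iterable of matched torsion IDs.
    new_parameter_ids: IDs present in the TM force field but not in
    Sage 2.1.0.
    """
    new_ids = set(new_parameter_ids)
    affected, unaffected = [], []
    for record_id, params in record_params.items():
        if new_ids & set(params):
            affected.append(record_id)
        else:
            unaffected.append(record_id)
    return affected, unaffected


# Toy example with made-up record IDs.
records = {"r1": {"t1", "t2a"}, "r2": {"t3"}, "r3": {"t2b", "t3"}}
affected, unaffected = split_records(records, ["t2a", "t2b"])
print(affected, unaffected)
```

The same benchmark metrics can then be recomputed on each subset separately.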

Conclusion

Overall, I think the results look comparable to the Sage 2.1.0 benchmarks, at least on the DDE, RMSD, and TFD metrics. Combining this with the philosophical argument of separating the handling of torsions with different multiplicity values, I think these new torsion parameters are ready for inclusion in the main Sage force field. Further, augmenting the training data with additional coverage for the new (and existing) parameters should only improve the quality of the resulting force field.

Status

Task	Status
Generate data for torsion re-fits	COMPLETED
Split out torsion parameters into specific multiplicities	COMPLETED
Port the split torsions to 2.1.0 from the above FF file. Keep them in the same order as in the FF file above, as ordering matters in SMIRNOFF force fields.	COMPLETED
Check the number of torsion training targets (i.e. `TorsionDriveRecord`) available for each of the new parameters. Any parameter with too few targets will not have enough training data to be fit reliably.	COMPLETED
Re-fit the force field with the added torsions with split multiplicities. Re-fit Sage 2.1.0 to the same data for a strict comparison. The data should be all data used to fit with Sage 2.1.0, with additional torsions from the new datasets listed above.	COMPLETED
Run benchmarks. Hopefully show improvement in torsion profiles for parameters that got split.	COMPLETED