Fitting torsions to small molecules

Background

OpenFF force field torsion parameters are fit to torsion drive profiles of molecules in the training set. Many of these molecules are quite large, meaning that the contribution of the torsional rotation to the final energy may be confounded by substantial non-bonded interactions. A new dataset of smaller molecules generated systematically by combining functional groups has been generated.

Additional reading on the dataset generation is here:

Generation of simple molecule set using Constructure

Goal

Fit torsion parameters to these smaller molecules and benchmark whether performance has improved.

To do

Task	Status	Notes
Generate small molecule dataset for torsion drives	COMPLETED	Note: this dataset does not have coverage for all Sage 2.1 torsions
Fit a force field to this new data	NOT STARTED
Benchmark the force field	NOT STARTED

Fitting torsion initial values to small molecules

The first approach for tackling this problem was to fit the initial torsion values to this small molecule dataset, and then optimize using the entire Sage training dataset. The goal of this approach is to start with initial values that do not incorporate the effects of e.g. large sterics, but allow the values to relax in the presence of larger molecules.

Methodology

The above small molecule dataset contains about 10 molecules per torsion parameter, but only has coverage for about half of the torsions in Sage 2.1.0. Since there are a few molecules per torsion, there are also multiple approaches to fitting the torsion initial values.

Smallest molecule

The first approach would be to choose the smallest molecule in the dataset, using either the number of heavy atoms or the total number of atoms. We then fit this molecule’s driven torsion parameters to the QM data using the following scheme:

Where k_tors refers to the force constants for the driven torsion, DT means “driven torsion,” N refers to the number of expansion terms in the analytical torsion profile, y refers to the periodicity, and x refers to the phase. The k value(s) that result from this fitting procedure are then used as initial values for this torsion parameter, and a full FF fit is done with the usual training data.

Average

The second approach would be to conduct the above fitting procedure for all molecules in the dataset, then average the resulting k value(s) across molecules for a given torsion. Then, the k value(s) that result from this fitting procedure are then used as initial values for this torsion parameter, and a full FF fit is done with the usual training data.

Fit intermediate FF based on all small molecules

The third approach would be to fit an intermediate force field to this small molecule dataset, considering all relevant molecules in the objective function for each torsion parameter. The benefit of this approach is that the fitting would be occurring simultaneously for all torsion parameters. The resulting FF could either be used on its own (e.g. replacing the previous fitting procedure), or those values could be used as initial values for fitting to the larger dataset.

Exploratory dataset analysis

Before beginning any fitting procedures, I conducted an exploratory dataset analysis, investigating the types of molecules and torsion parameters that were included in the dataset, how well Sage currently performed for these molecules, and what their torsion profiles looked like. From this analysis I identified 3 challenges:

Within one driven torsion parameter, molecules that share this torsion have vastly different torsion barrier heights.
Within one driven torsion parameter, molecules that share this torsion have different analytical torsion profiles.
Within one driven torsion parameter, the analytical torsion profile is incorrect for all molecules that share this torsion.

1. Molecules that share the same torsion have different barrier heights

In this scenario, molecules that share the same driven torsion have torsion drive profiles with (mostly) the same overall analytical form but different torsional barrier heights. This is not unexpected, but is a fundamental issue that arises due to describing different molecules with the same force field parameters. Two examples in this category are t6, which describes the SMIRKS pattern [#9:1]-[#6x4:2]-[#6x4:3]-[#9:4] (F-CX2-CX2-F) and t11, which describes the SMIRKS pattern [#1:1]-[#6x4:2]-[#6x4:3]-[#17:4] (H-CX2-CX2-Cl).

For both of these parameters, most of the molecules follow the analytical torsion’s form (e.g. 3 peaks), but the torsional barriers vary significantly, even when examining only the “ideal” energy of the driven torsion (bottom panel). This could be attributed either to the difficulty of capturing all molecules using one parameter, or to issues with other parameters that then propagate to the torsion profile.

Because these molecules' torsion profiles generally agree with the analytical form, we can optimize the driven torsion parameter for each molecule to greatly improve agreement with the QM energy profile, shown below for molecules covered by t6.

As a result, the fit for most of the molecules was quite good, shown below (left). For most molecules, the objective function is below a value of 10, which typically corresponds to a very good fit as shown above for Mol 0 and Mol 6. For reference, Mol 3 has the worst fit (shown above), and its objective function has a value of 25. While most of these molecules are able to get a good fit, the difference in barrier heights leads to a nearly uniform distribution of k values between -0.1 and 0.5 kcal/mol for k1, and -0.6 to -0.2 kcal/mol for k2, leading to the question of whether the k value optimized to one molecule would be transferrable to the others. We can see below (right) that when choosing the optimized k parameters for the smallest molecule (Mol 3) and applying those to the other molecules, the fit is still quite good, typically still below a value of 10 and typically still better than the original Sage fit.

Overall, this suggests that for “well-behaved” torsion parameters, e.g. those where molecules have natural differences in barrier height but generally are well described by the analytical form, these fit parameters should be relatively transferrable.

2. Molecules sharing the same torsion have different qualitative torsion profiles

For a number of torsion parameters, the molecules for that parameter’s dataset do not share the same qualitative energy profile, either for the QM torsion drive or for the driven torsion profile specifically. One example is t82, which covers the SMIRKS [*:1]-[#7X2, #7X3+1:2]-[#6X3:3]-[*:4] ( * = N - CX - *), shown below.

This parameter seems to have two classes of molecules, shown above. Molecules labeled Mol0, Mol1, and Mol4 all have one large peak around 0, while the rest of the molecules have a two-peak structure in agreement with the analytical torsion form shown by the blue dashed line. However, the agreement in shape between the analytical torsion and the molecules with a two-peak torsion profile is not obviously good, as the curvature is relatively different.

Indeed, from the MM profiles shown above using the new k value fit to each molecule’s torsion drive profile, the agreement is not really improved no matter which of the two categories of profiles the molecule falls in to.

Optimizing these k values leads to torsion profiles that have relatively high values of the objective function (e.g. poor fit) even after significant minimization, and a bimodal distribution of k values due to the two different profiles.

The k values from this parameter are also not very transferrable as a result. Using the optimized k values for the smallest molecule (Mol 1) yields high objective values for the other molecules, especially for Mol5 and Mol6, which have a different profile.

Of course, as with the previous section, it is possible that the disagreement in analytical torsion profile is due to errors in other terms that propagate to this.

3. Analytical form of torsion is qualitatively incorrect for all molecules

This situation is pretty similar to situation 2. Just highlighting a couple of other parameters where the analytical form of the torsion should probably be revisited. There were more than just these couple, and a future direction should be to find a good way to identify torsions whose form should be updated. Note that for t16, the specific driven torsion has larger energy than the QM!

For t13, while the form of the torsion is clearly qualitatively wrong (the torsion profile shows three peaks for all molecules), some of the molecules' profiles have curvature such that it isn’t so poorly represented by the single peak, leading to lower values of the objective function. For t16, none of the molecules are well represented by the single-peak structure, leading to uniformly high values and poor fits.

Moving forward, perhaps using high objective function values would be a good data-driven way to identify molecules with incorrect torsion profiles.