Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Task

Status

Notes

Generate small molecule dataset for torsion drives

Status
colourGreen
titlecompleted

Github link macro
linkhttps://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-04-09-OpenFF-Gen3-Torsion-Set-v1.0
Note: this dataset does not have coverage for all Sage 2.1 torsions

Fit a force field to this new data

Status
colourGreen
titlenot startedCOMPLETED

Benchmark the force field

Status
titlecolournot started

Fitting torsion initial values to small molecules

The first approach for tackling this problem was to fit the initial torsion values to this small molecule dataset, and then optimize using the entire Sage training dataset. The goal of this approach is to start with initial values that do not incorporate the effects of e.g. large sterics, but allow the values to relax in the presence of larger molecules.

Methodology

The above small molecule dataset contains about 10 molecules per torsion parameter, but only has coverage for about half of the torsions in Sage 2.1.0. Since there are a few molecules per torsion, there are also multiple approaches to fitting the torsion initial values.

Smallest molecule

The first approach would be to choose the smallest molecule in the dataset, using either the number of heavy atoms or the total number of atoms. We then fit this molecule’s driven torsion parameters to the QM data using the following scheme:

...

Where k_tors refers to the force constants for the driven torsion, DT means “driven torsion,” N refers to the number of expansion terms in the analytical torsion profile, y refers to the periodicity, and x refers to the phase. The k value(s) that result from this fitting procedure are then used as initial values for this torsion parameter, and a full FF fit is done with the usual training data.

Average

The second approach would be to conduct the above fitting procedure for all molecules in the dataset, then average the resulting k value(s) across molecules for a given torsion. Then, the k value(s) that result from this fitting procedure are then used as initial values for this torsion parameter, and a full FF fit is done with the usual training data.

Fit intermediate FF based on all small molecules

The third approach would be to fit an intermediate force field to this small molecule dataset, considering all relevant molecules in the objective function for each torsion parameter. The benefit of this approach is that the fitting would be occurring simultaneously for all torsion parameters. The resulting FF could either be used on its own (e.g. replacing the previous fitting procedure), or those values could be used as initial values for fitting to the larger dataset.

Green
titleCOMPLETED

Seemed to make structures worse

Exploratory dataset analysis

...

  1. Within one driven torsion parameter, molecules that share this torsion have vastly different torsion barrier heights.

  2. Within one driven torsion parameter, molecules that share this torsion have different analytical torsion profiles.

  3. Within one driven torsion parameter, the analytical torsion profile is incorrect for all molecules that share this torsion.

However, it is difficult to glean too much information from just the torsion profiles, as there will be significant contributions from other energy terms as one scans a torsion angle.

1. Molecules that share the same torsion have different barrier heights

In this scenario, molecules that share the same driven torsion have torsion drive profiles with (mostly) the same overall analytical form but different torsional barrier heights. This is not unexpected, but is a fundamental issue that arises due to describing different molecules with the same force field parameters. Often these differences come down to differences in other terms of the energy, e.g. one molecule has a slightly different barrier coming from a strained angle as the geometry is distorted. Two examples in this category are t6, which describes the SMIRKS pattern [#9:1]-[#6x4:2]-[#6x4:3]-[#9:4] (F-CX2-CX2-F) and t11, which describes the SMIRKS pattern [#1:1]-[#6x4:2]-[#6x4:3]-[#17:4] (H-CX2-CX2-Cl).

...

Image RemovedImage Removed

For both of these parameters, most of the molecules follow the analytical torsion’s form (e.g. 3 peaks), but the torsional barriers vary significantly, even when examining only the “ideal” energy of the driven torsion (bottom panel). This could be attributed either to the difficulty of capturing all molecules using one parameter, or to issues with other parameters that then propagate to the torsion profile.

Because these molecules' torsion profiles generally agree with the analytical form, we can optimize the driven torsion parameter for each molecule to greatly improve agreement with the QM energy profile, shown below for molecules covered by t6.

...

As a result, the fit for most of the molecules was quite good, shown below (left). For most molecules, the objective function is below a value of 10, which typically corresponds to a very good fit as shown above for Mol 0 and Mol 6. For reference, Mol 3 has the worst fit (shown above), and its objective function has a value of 25. While most of these molecules are able to get a good fit, the difference in barrier heights leads to a nearly uniform distribution of k values between -0.1 and 0.5 kcal/mol for k1, and -0.6 to -0.2 kcal/mol for k2, leading to the question of whether the k value optimized to one molecule would be transferrable to the others. We can see below (right) that when choosing the optimized k parameters for the smallest molecule (Mol 3) and applying those to the other molecules, the fit is still quite good, typically still below a value of 10 and typically still better than the original Sage fit.

...

Image Removed

Overall, this suggests that for “well-behaved” torsion parameters, e.g. those where molecules have natural differences in barrier height but generally are well described by the analytical form, these fit parameters should be relatively transferrable.Overall, these terms should be quite transferrable between molecules as they share the same approximate shape.

2. Molecules sharing the same torsion have different qualitative torsion profiles

...

This parameter seems to have two classes of molecules, shown above. Molecules labeled Mol0, Mol1, and Mol4 all have one large peak around 0, while the rest of the molecules have a two-peak structure in agreement with the analytical torsion form shown by the blue dashed line. However, the agreement in shape between the analytical torsion and the molecules with a two-peak torsion profile is not obviously good, as the curvature is relatively different.

...

Indeed, from the MM profiles shown above using the new k value fit to each molecule’s torsion drive profile, the agreement is not really improved no matter which of the two categories of profiles the molecule falls in to.

Optimizing these k values leads to torsion profiles that have relatively high values of the objective function (e.g. poor fit) even after significant minimization, and a bimodal distribution of k values due to the two different profiles.

...

The k values from this parameter are also not very transferrable as a result. Using the optimized k values for the smallest molecule (Mol 1) yields high objective values for the other molecules, especially for Mol5 and Mol6, which have a different profile.

...

Of course, as with the previous section, it is possible that the disagreement in analytical torsion profile is due to errors in other terms that propagate to this.A closer look at this parameter reveals that the molecules with a barrier around 0 degrees (Mol 0, Mol 1, and Mol 4) all have a steric clash at that geometry (shown below), leading to a large peak in the energy as the torsion profile is scanned. Including these molecules with a less-straightforward torsion profile may cause issues when attempting to fit a general parameter, and suggests that even selecting small molecules does not necessarily remove the effect of sterics and other types of energy from training the torsions.

...

3. Analytical form of torsion is qualitatively incorrect for all molecules

This situation is pretty similar to situation 2. Just highlighting a couple of other parameters where the analytical form of the torsion should probably maybe be revisited. There were more than just these couple, and a future direction should be to find a good way to identify torsions whose form should be updated. Note that for t16, the specific driven torsion has larger energy than the QM!

Image RemovedImage RemovedImage RemovedImage Removed

For t13, while the form of the torsion is clearly qualitatively wrong (the torsion profile shows three peaks for all molecules), some of the molecules' profiles have curvature such that it isn’t so poorly represented by the single peak, leading to lower values of the objective function. For t16, none of the molecules are well represented by the single-peak structure, leading to uniformly high values and poor fits.

Moving forward, perhaps using high objective function values would be a good data-driven way to identify molecules with incorrect torsion profiles

Looking at the energy breakdown for one molecule represented by t16, we can see that the overall MM energy is represented relatively poorly, but that the overall torsion contribution is less egregious than the individual t16 term. This appears to be due to a balance of several very large torsion parameters present for the three-membered rings: t13, t15, and t16, which all have peak magnitudes between 3-8 kcal/mol. This suggests there may be a systematic issue with three-membered ring torsions.

...

Screen Shot 2024-01-17 at 11.26.29 AM.pngImage Added

t13_torsion_only.pngImage Addedt15_torsion_only.pngImage Added

One potential source of this issue could be the fact that these torsions are applied to both bridging and internal ring torsions, which would behave extremely differently as the internal ring torsion would have a high energetic penalty. The torsion parameters are only explicitly trained on torsion drives for the bridging motif, although they are applied to the internal ring torsions in the geometry optimization targets, which could affect the value. However, given that it is unlikely for the geometry optimization targets to sample a particularly strained internal ring torsion conformation, I’m not sure this is a likely source of the error.

Fitting Methodology

This project was undertaken in two parts. First, we fit the initial torsion values to this small molecule dataset, and then optimize all the parameters using the entire Sage training dataset. The goal of this approach is to start with initial values that do not incorporate the effects of e.g. large sterics, but allow the values to relax in the presence of larger molecules.

The above small molecule dataset contains about 10 molecules per torsion parameter, but only has coverage for about half of the torsions in Sage 2.1.0. If this subset of torsions shows significant improvement, we can expand the dataset to train all the torsions on small molecules.

Fit initial FF based on small molecules

First, we fit an initial force field to the small molecule dataset. We tried two approaches, differing in the starting point for the force field fit. The first force field is started from the same starting point as Sage, e.g. using MSM values for bonds/angles (calculated based on the whole training dataset), and Sage 2.1.0 initial value torsions; this is referred to as Small Molecule Torsion (SMT) + MSM. The second is started from Sage 2.1.0 itself, e.g. bonds, angles, and torsions all taken from Sage 2.1.0.; this is referred to as Small Molecule Torsion (SMT) + Sage. For both force fields, the bonds and angles were held constant at these initial values, while the torsions were trained on the small molecule torsion drive dataset. These fits were performed with the Force Balance 1.9.3 software stack.

Re-fit Sage based on small molecule torsion starting point

Next, we re-fit Sage. For both approaches, we start from the MSM bonds/angles calculated for the whole training dataset, and the torsions taken from the initial force fields above (SMT + MSM and SMT + Sage). All parameters are then trained on the Sage training dataset. The fits were performed with the Force Balance 1.9.3 software stack.

Results

The results of benchmarking the two force fields, using the industry benchmark dataset, are shown below. While the ddE benchmark is comparable to Sage 2.1.0, the structural metrics RMSD and TFD are both substantially worse than Sage 2.1.0. All molecules in the dataset are covered by at least one of the re-trained torsions, so you can’t filter for only molecules affected by the re-train.

...

There are a couple of possible areas of improvement. One could be to allow the bonds and angles to optimize during the initial FF fit, although this was not pursued initially as it could cause issues if the bonds/angles change dramatically and then the torsions are fit using very different bonds/angles than Sage. Another potential area for improvement would be filtering out small molecules with large steric clashes, like those present in t82, as these could bias the torsions. A third area would be expanding the dataset to include more molecules per torsion, and cover all the torsions. However, these are not being pursued at this time due to the discouraging benchmark results above.