...
Fit torsion parameters to these smaller molecules and benchmark whether performance has improved.
To do
Task | Status | Notes | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Generate small molecule dataset for torsion drives |
|
| ||||||||||
Fit a force field to this new data |
|
| ||||||||
Benchmark the force field |
| Seemed to make structures worse |
Exploratory dataset analysis
Before beginning any fitting procedures, I conducted an exploratory dataset analysis, investigating the types of molecules and torsion parameters that were included in the dataset, how well Sage currently performed for these molecules, and what their torsion profiles looked like. From this analysis I identified 3 challenges:
Within one driven torsion parameter, molecules that share this torsion have vastly different torsion barrier heights.
Within one driven torsion parameter, molecules that share this torsion have different analytical torsion profiles.
Within one driven torsion parameter, the analytical torsion profile is incorrect for all molecules that share this torsion.
However, it is difficult to glean too much information from just the torsion profiles, as there will be significant contributions from other energy terms as one scans a torsion angle.
1. Molecules that share the same torsion have different barrier heights
In this scenario, molecules that share the same driven torsion have torsion drive profiles with (mostly) the same overall analytical form but different torsional barrier heights. This is not unexpected, but is a fundamental issue that arises due to describing different molecules with the same force field parameters. Often these differences come down to differences in other terms of the energy, e.g. one molecule has a slightly different barrier coming from a strained angle as the geometry is distorted. Two examples in this category are t6, which describes the SMIRKS pattern [#9:1]-[#6x4:2]-[#6x4:3]-[#9:4] (F-CX2-CX2-F) and t11, which describes the SMIRKS pattern [#1:1]-[#6x4:2]-[#6x4:3]-[#17:4] (H-CX2-CX2-Cl).
...
Overall, these terms should be quite transferrable between molecules as they share the same approximate shape.
2. Molecules sharing the same torsion have different qualitative torsion profiles
For a number of torsion parameters, the molecules for that parameter’s dataset do not share the same qualitative energy profile, either for the QM torsion drive or for the driven torsion profile specifically. One example is t82, which covers the SMIRKS [*:1]-[#7X2, #7X3+1:2]-[#6X3:3]-[*:4] ( * = N - CX - *), shown below.
...
This parameter seems to have two classes of molecules, shown above. Molecules labeled Mol0, Mol1, and Mol4 all have one large peak around 0, while the rest of the molecules have a two-peak structure in agreement with the analytical torsion form shown by the blue dashed line. However, the agreement in shape between the analytical torsion and the molecules with a two-peak torsion profile is not obviously good, as the curvature is relatively different.
A closer look at this parameter reveals that the molecules with a barrier around 0 degrees (Mol 0, Mol 1, and Mol 4) all have a steric clash at that geometry (shown below), leading to a large peak in the energy as the torsion profile is scanned. Including these molecules with a less-straightforward torsion profile may cause issues when attempting to fit a general parameter, and suggests that even selecting small molecules does not necessarily remove the effect of sterics and other types of energy from training the torsions.
...
3. Analytical form of torsion is qualitatively incorrect for all molecules
Just highlighting a couple of other parameters where the analytical form of the torsion should maybe be revisited. There were more than just these couple, and a future direction should be to find a good way to identify torsions whose form should be updated. Note that for t16, the specific driven torsion has larger energy than the QM!
Looking at the energy breakdown for one molecule represented by t16, we can see that the overall MM energy is represented relatively poorly, but that the overall torsion contribution is less egregious than the individual t16 term. This appears to be due to a balance of several very large torsion parameters present for the three-membered rings: t13, t15, and t16, which all have peak magnitudes between 3-8 kcal/mol. This suggests there may be a systematic issue with three-membered ring torsions.
...
One potential source of this issue could be the fact that these torsions are applied to both bridging and internal ring torsions, which would behave extremely differently as the internal ring torsion would have a high energetic penalty. The torsion parameters are only explicitly trained on torsion drives for the bridging motif, although they are applied to the internal ring torsions in the geometry optimization targets, which could affect the value. However, given that it is unlikely for the geometry optimization targets to sample a particularly strained internal ring torsion conformation, I’m not sure this is a likely source of the error.
Fitting Methodology
This project was undertaken in two parts. First, we fit the initial torsion values to this small molecule dataset, and then optimize all the parameters using the entire Sage training dataset. The goal of this approach is to start with initial values that do not incorporate the effects of e.g. large sterics, but allow the values to relax in the presence of larger molecules.
The above small molecule dataset contains about 10 molecules per torsion parameter, but only has coverage for about half of the torsions in Sage 2.1.0. If this subset of torsions shows significant improvement, we can expand the dataset to train all the torsions on small molecules.
Fit initial FF based on small molecules
First, we fit an initial force field to the small molecule dataset. We tried two approaches, differing in the starting point for the force field fit. The first force field is started from the same starting point as Sage, e.g. using MSM values for bonds/angles (calculated based on the whole training dataset), and Sage 2.1.0 initial value torsions; this is referred to as Small Molecule Torsion (SMT) + MSM. The second is started from Sage 2.1.0 itself, e.g. bonds, angles, and torsions all taken from Sage 2.1.0.; this is referred to as Small Molecule Torsion (SMT) + Sage. For both force fields, the bonds and angles were held constant at these initial values, while the torsions were trained on the small molecule torsion drive dataset. These fits were performed with the Force Balance 1.9.3 software stack.
Re-fit Sage based on small molecule torsion starting point
Next, we re-fit Sage. For both approaches, we start from the MSM bonds/angles calculated for the whole training dataset, and the torsions taken from the initial force fields above (SMT + MSM and SMT + Sage). All parameters are then trained on the Sage training dataset. The fits were performed with the Force Balance 1.9.3 software stack.
Results
The results of benchmarking the two force fields, using the industry benchmark dataset, are shown below. While the ddE benchmark is comparable to Sage 2.1.0, the structural metrics RMSD and TFD are both substantially worse than Sage 2.1.0. All molecules in the dataset are covered by at least one of the re-trained torsions, so you can’t filter for only molecules affected by the re-train.
...
There are a couple of possible areas of improvement. One could be to allow the bonds and angles to optimize during the initial FF fit, although this was not pursued initially as it could cause issues if the bonds/angles change dramatically and then the torsions are fit using very different bonds/angles than Sage. Another potential area for improvement would be filtering out small molecules with large steric clashes, like those present in t82, as these could bias the torsions. A third area would be expanding the dataset to include more molecules per torsion, and cover all the torsions. However, these are not being pursued at this time due to the discouraging benchmark results above.