...
Repeating Hyesu’s analysis with the new MSM-derived initial values may be worthwhile for determining the most optimal set of priors, or at least for systematic evidence that this set of priors is the best.
Results
Hyesu’s analysis used SMIRNOFF99Frosst parameters for analysis. I repeated Hyesu’s analysis with the parameters from Sage 2.1.0. I followed the same methodology. The general idea is to use a measure of the spread of initial parameters to determine what the prior (tolerance) should be during optimization. The distribution of each type of parameter is shown below, because the distributions are generally not Gaussian, I used the IQR instead of the standard deviation to characterize them.
Bonds
For bonds, I tried analyzing them all together (90 bonds total), but found in agreement with the previous result that there was a large spread due to type of bond and element composition. To try to account for this, I repeated the normalization procedure that was done previously. I sorted bond parameters by bond order and the elements involved (e.g. C-C single bond, C-N double bond, etc). For any bond type with more than 1 parameter value (e.g. there are several types of C-C single bonds), I subtracted the mean for that type from each parameter, and repeated the analysis on this dataset of normalized parameters (54 bonds total). The reduction in number of datapoints comes from removing points where there was only one type of bond (e.g. N-N triple bond), where subtracting the mean would not be meaningful.
Parameter | Parsley 1.3.0 Prior | Sage 2.1.0 prior | Sage 2.2 prior | SMIRNOFF99Frosst StDev | SMIRNOFF99Frosst IQR | Sage 2.1.0 StDev | Sage 2.1.0 IQR |
---|---|---|---|---|---|---|---|
Bond length (all bonds) (A) | 0.1 | 0.1 | 0.01 | 0.33 | 0.4 | 0.34 | 0.44 |
Bond length (normalized) (A) | 0.1 | 0.1 | 0.01 | -- | 0.1 | 0.04 | 0.06 |
Bond k (all bonds) (kcal/mol/A2) | 100 | 100 | 20 | 249.30 | 338.0 | 552.0 | 517.9 |
Bond k (normalized) (kcal/mol/A2) | 100 | 100 | 20 | -- | 87.5 | 220.8 | 184.0 |
For bond lengths, the all bond dataset shows similar standard deviation and IQR to the SMIRNOFF99Frosst values. However, the normalized dataset has a lower spread, which would support a lower prior.
In both the dataset with all bonds and the normalized dataset, the spread of bond force constant parameters is much larger in Sage 2.1.0 vs in SMIRNOFF99Frosst. This makes sense as Sage 2.1.0 has more bond types that correspond to diverse chemistries, which would lead to more different values. The types of bonds with the largest IQR are C-C double bonds (2 bonds, IQR 219), C-N double bonds (2 bonds, IQR 439), C-O bonds of unspecified order (2 bonds, IQR 291), N-N double bonds (2 bonds, IQR 675), and O-P bonds of unspecified type (2 bonds, IQR 403). Excluding bonds of unspecified type only reduces the IQR to 173.
This methodology would suggest a much larger prior should be used for optimizing bond force constants, on the order of 150-200 kcal/mol/A2. Pavan’s testing showed that a lower prior was needed for force constants, suggesting that this methodology may not yield useful suggestions for priors when we start from the MSM and QM-derived values.
Angles
Following the previous analysis, for angles I analyzed all the parameters together in one dataset, not accounting for variations due to different geometries. The results are below.
...
Parameter | Parsley 1.3.0 Prior | Sage 2.1.0 prior | Sage 2.2 prior | SMIRNOFF99Frosst StDev | SMIRNOFF99Frosst IQR | Sage 2.1.0 StDev | Sage 2.1.0 IQR |
---|---|---|---|---|---|---|---|
Angle (deg) | 10 | 5 | 1 | 24.24 | 10.5 | 23.1 | 12.5 |
Force constant (kcal/mol/rad2) | 50 | 100 | 20 | 95.15 | 40.0 | 76.6 | 121.9 |
Here, the spread of the angle parameter is relatively similar between SMIRNOFF99Frosst and Sage 2.1.0, and this analysis would support use of the Parsley 1.3.0 prior, but suggest that the Sage 2.1.0 and 2.2 priors are too small. For the force constant, once again the spread is much larger than it was in SMIRNOFF99Frosst, which makes sense due to increased chemical diversity in the parameters. This analysis would suggest that the Sage 2.1.0 prior of 100 would be appropriate, but that the Sage 2.2 would be too small. This is at odds with the actual performance of the force fields, where the Sage 2.1.0 priors were found to be too loose and the Sage 2.2 were found to be more appropriate. This suggests again that using the spread to determine the priors may not be appropriate for the MSM and QM-derived starting parameters.
Proper torsions
Proper torsions require some consideration of how to analyze them, as different SMIRKs patterns have different number of torsion force constants. The previous analysis for Parsley 1.3.0 uses the sum of all force constants for each SMIRKs pattern. In addition to this, I consider all of the force constants together, regardless of if it’s k1, k2, etc, as well as just looking at k1 for each SMIRKs. Results did not strongly depend on the method of pre-processing. Results are shown below.
...
Parameter | Parsley 1.3.0 Prior | Sage 2.1.0 prior | Sage 2.2 prior | SMIRNOFF99Frosst StDev | SMIRNOFF99Frosst IQR | Sage 2.1.0 StDev | Sage 2.1.0 IQR |
---|---|---|---|---|---|---|---|
Sum of k’s (kcal/mol) | 2 | 5 | 5 | -- | 2.3 | 2.9 | 1.7 |
All k’s (kcal/mol) | 2 | 5 | 5 | 1.32 | 1.2 | 2.3 | 1.1 |
k1 only (kcal/mol) | 2 | 5 | 5 | -- | -- | 2.8 | 1.2 |
This analysis suggests that Parsley 1.3.0’s prior of 2 was appropriate, and that 5 would be too loose.
Conclusion
Here, I repeated Hyesu’s analysis of the SMIRNOFF99Frosst parameters for Sage 2.1.0 parameters, to determine new priors for future fitting efforts. However, large spreads in the Sage parameters and Pavan’s previous results for Sage 2.2 suggest that these would not be useful values for priors using the MSM starting point.
The priors Pavan identified for Bond and Angle parameters in Sage 2.2 seem to work well, and follow the logic that we should not allow the parameters to stray far from their QM-inspired starting point. For future efforts, I will focus on determining both starting values and prior (tolerances) for the torsion parameters, as these were not modified for Sage 2.2 and require further investigation due to the starting point not being derived from QM.