
As part of the torsion multiplicity work, I attempted to reproduce the original Sage 2.1.0 fit and benchmark to make sure my scripts were functioning correctly. As shown in the figure below, my attempt to do so (sage-sage) performed considerably worse than the original Sage 2.1.0 fit. As one check, I cloned the Sage 2.1.0 repo and ran ForceBalance with my conda environment from the torsion multiplicity work, producing the my-sage-2.1.0 line, which again seems worse than Sage 2.1.0. I also had to modify the torsion multiplicity environment with an older version of OpenEye to get ForceBalance to run successfully (see this issue for more details). Other investigation into the torsion multiplicity results did reveal a data set filtering issue in my scripts, but correcting this and rerunning the fit didn’t lead to much improvement (sage-sage-new).

...

These three records, 36967197, 36998992, and 36998994, have the following SMILES strings, respectively (collected in a file because Confluence kept converting them to links):

Attached file: smi

Other differences

Similar trends are observed for my run of the Sage 2.1.0 refit using Pavan’s original input files but my updated environment, which has ForceBalance 1.9.5 instead of the 1.9.3 in Pavan’s environment. The figure below shows the DDE differences greater than 8 kcal/mol, which account for 176 of the 68837 records.

...

On the other hand, there are 8222 entries (about 12%) with differences greater than 1 kcal/mol. I take this to mean that many molecules have small differences rather than a few molecules exhibiting huge ones. The differences are fairly evenly split between the new and old versions, however: the absolute value of the DDE is lower in the new data in 27111 cases and lower in the old data in 35071, with the two equal in the remaining 6655 cases. Restricting the plot above to the cases where the old data is better produces a very similar distribution, but obviously reduces the counts. Notably, this also removes the greatest outlier, with a deviation greater than 100 kcal/mol.
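The bookkeeping above is straightforward to reproduce; here is a quick pure-Python sketch of the classification (the numbers below are toy values for illustration, not the real 68837-record data set):

```python
# Classify each record by whether the new or old force field gives a smaller
# |DDE|, and count how many records differ by more than a threshold.
def classify_ddes(old_ddes, new_ddes, threshold=1.0):
    """Return (new_better, old_better, equal, over_threshold) counts."""
    new_better = old_better = equal = over = 0
    for old, new in zip(old_ddes, new_ddes):
        if abs(new) < abs(old):
            new_better += 1
        elif abs(new) > abs(old):
            old_better += 1
        else:
            equal += 1
        if abs(new - old) > threshold:
            over += 1
    return new_better, old_better, equal, over

# toy data: four records' DDEs under the old and new force fields
old = [0.5, -2.0, 3.0, 1.0]
new = [0.4, -2.5, 3.0, -4.0]
print(classify_ddes(old, new))  # (1, 2, 1, 1)
```

On the real data the three "better/worse/equal" counts should sum to the total record count, which is a handy sanity check (27111 + 35071 + 6655 = 68837 here).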

...

The records for the cluster around 80 kcal/mol are 36975868, 36983564, and 36997441. These are clearly much worse in the new force field because the original Sage values are -7.8, -6.0, and -11.6, compared to the new values of -96.6, -94.7, and -101.1, respectively.

I also plotted a CDF for these, but there wasn’t much to gain from it. The old force field increases slightly faster than the new, as expected.
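For reference, the CDF here is just the cumulative fraction of records at or below each |DDE|; a curve that rises faster reaches low errors on more records. A minimal pure-Python sketch of how such a curve is built (toy values, not the real data):

```python
# Build the points of an empirical CDF over absolute DDE values.
def empirical_cdf(values):
    """Return (sorted |values|, cumulative fractions) for an empirical CDF."""
    xs = sorted(abs(v) for v in values)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys

xs, ys = empirical_cdf([0.2, -1.5, 0.7, 3.0])
print(xs)  # [0.2, 0.7, 1.5, 3.0]
print(ys)  # [0.25, 0.5, 0.75, 1.0]
```

Plotting xs against ys (e.g. with matplotlib's `step`) gives the CDF described above.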

Charge Models

Tom Potter reported an issue when running the Sage 2.1.0 refit with AmberTools/RDKit instead of OpenEye where the initial objective function value was “1.67619e+04, compared to 1.09618e+04 for the original Sage 2.1 run.” Initially he was using versions of the toolkit affected by the charge caching bug, but even after updating to the new versions, he “found it gave essentially the same results.” This was somewhat in line with what I had observed with my own Sage 2.1.0 reproduction runs before the charge caching fix, with some example objective function values from a single run below:

Code Block
Total                                                  1.61326e+04
Total                                                  1.54131e+04 ( -7.195e+02 )
Total                                                  1.53246e+04 ( -8.856e+01 )
Total                                                  1.51405e+04 ( -1.841e+02 )
Total                                                  1.46718e+04 ( -4.687e+02 )
Total                                                  1.46120e+04 ( -5.973e+01 )
Total                                                  1.43540e+04 ( -2.581e+02 )
Total                                                  1.42554e+04 ( -9.860e+01 )
Total                                                  1.41504e+04 ( -1.049e+02 )
Total                                                  1.41092e+04 ( -4.120e+01 )
Total                                                  1.42160e+04 ( +1.067e+02 )

These are somewhat similar to what Tom observed.
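Pulling these totals out of a ForceBalance log for comparison is easy to script; a sketch, where the regex is my assumption about the line layout shown above (not anything provided by ForceBalance itself):

```python
import re

# Match lines like "Total    1.61326e+04 ( -7.195e+02 )" and capture the
# objective-function value in the first column.
TOTAL_RE = re.compile(r"^Total\s+([0-9.e+-]+)")

def objective_values(lines):
    """Extract objective-function totals from ForceBalance-style log lines."""
    vals = []
    for line in lines:
        m = TOTAL_RE.match(line.strip())
        if m:
            vals.append(float(m.group(1)))
    return vals

log = """Total                1.61326e+04
Total                1.54131e+04 ( -7.195e+02 )
some other line"""
print(objective_values(log.splitlines()))  # [16132.6, 15413.1]
```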

Still, we wanted to see what difference the charge model itself would cause (OpenEye AM1-BCC ELF10 vs AmberTools/RDKit AM1-BCC), so I launched another Sage 2.1.0 refit starting from the targets and input file in the Sage 2.1.0 repo, but with my OpenEye license unexported. I only gave it 6 days of walltime, so it died after iteration 4, but I think that gives us most of the data we wanted to see:
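Unexporting the license just means removing it from the environment before launching the fit; a minimal shell sketch, assuming the license is pointed to by the standard OE_LICENSE variable (the commented ForceBalance invocation is illustrative only):

```shell
# Remove the OpenEye license from the environment so the OpenFF toolkit
# falls back to AmberTools/RDKit for charge assignment.
unset OE_LICENSE
if [ -z "${OE_LICENSE:-}" ]; then
    echo "OE_LICENSE unset: OpenEye toolkits will be unlicensed"
fi
# then launch the fit as usual, e.g.:
# ForceBalance.py optimize.in | tee optimize.out
```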

Code Block
   | Iter |  Total Obj. |
   |------+-------------|
   |    0 | 1.14056e+04 |
   |    1 | 1.00536e+04 |
   |    2 | 9.25233e+03 |
   |    3 | 9.17668e+03 |
   |    4 | 8.85820e+03 |

Overall these seem a bit closer to the original Sage 2.1.0 values, both initially and as they converge, than what Tom was observing or what I saw with the charge caching issue.
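The per-iteration improvement can be read straight off the table; a quick sketch using the values above:

```python
# Objective-function totals from the truncated AmberTools/RDKit refit.
vals = [1.14056e+04, 1.00536e+04, 9.25233e+03, 9.17668e+03, 8.85820e+03]

# Change at each iteration relative to the previous one.
deltas = [b - a for a, b in zip(vals, vals[1:])]
print(deltas)  # every entry is negative, i.e. monotonic improvement
```

The first step is by far the largest drop (about 1352), with diminishing returns afterwards, which is the typical ForceBalance convergence pattern.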

In case we want to follow up on this in the future, here are my input (and main output) files for the run:

Attached file: ambertools.tar.gz
The path on HPC3 is /dfs9/dmobley-lab/bwestbr1/refits/ambertools/new. The targets therein are the same as those used by Tom, which in turn are the same as those used for Sage 2.1.0 minus these targets:

Code Block
opt-geo-batch-175/2003384-18.pdb
opt-geo-batch-175/2003385-19.pdb
opt-geo-batch-178/2003481-3.pdb
opt-geo-batch-178/2003482-4.pdb
opt-geo-batch-178/2003484-7.pdb
opt-geo-batch-178/2003486-10.pdb
opt-geo-batch-9/18433500-29.pdb
torsion-18886452/
torsion-2703523/
torsion-2703524/
torsion-2703525/
torsion-2703526/
torsion-2703603/
torsion-2703604/
torsion-2703606/

which cause conformer generation errors in RDKit when running ForceBalance. I also had to remove opt-geo-batch-51/18438154-19 for the same reason, and torsion-18537145 failed 6 times in the initial iteration before finally succeeding, but it didn’t cause any further problems.
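Pruning targets like these from a ForceBalance input file can be scripted; a sketch, assuming the usual layout of `$target` ... `name <target>` ... `$end` blocks (the failing-target names below are a subset for illustration):

```python
# Targets to drop; a subset of the failing list above, for illustration.
FAILING = {"torsion-18886452", "torsion-2703523"}

def strip_targets(lines, failing):
    """Return input-file lines with $target blocks for failing targets removed."""
    out, block = [], None
    for line in lines:
        stripped = line.strip().lower()
        if stripped == "$target":
            block = [line]          # start buffering a target block
        elif block is not None:
            block.append(line)
            if stripped == "$end":  # block complete: keep it unless it failed
                names = [l.split()[1] for l in block
                         if l.split() and l.split()[0].lower() == "name"]
                if not any(n in failing for n in names):
                    out.extend(block)
                block = None
        else:
            out.append(line)        # lines outside any target block
    return out

demo = ["$target", "name torsion-18886452", "$end",
        "$target", "name torsion-2703606", "$end"]
print(strip_targets(demo, FAILING))  # keeps only the torsion-2703606 block
```

The corresponding target directories would also need to be removed (or left in place and simply unreferenced) under targets/.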