Reproducibility

As part of the https://openforcefield.atlassian.net/wiki/spaces/FF/pages/2603909164 work, I attempted to reproduce the original Sage 2.1.0 fit and benchmark to make sure my scripts were functioning correctly. As shown in the figure below, my attempt to do so (sage-sage) performed considerably worse than the original Sage 2.1.0 fit. As one check, I cloned the Sage 2.1.0 repo and ran ForceBalance with my conda environment from the torsion multiplicity work, giving the my-sage-2.1.0 line, which again seems worse than Sage 2.1.0. I also had to modify the torsion multiplicity environment with an older version of OpenEye to get ForceBalance to run successfully (see this issue for more details). Other investigation into the torsion multiplicity results did reveal a data set filtering issue in my scripts, but correcting this and rerunning the fit didn’t lead to much improvement (sage-sage-new).

After these observations, I abandoned my scripts and tried to reproduce the original Sage 2.1.0 fit exactly using the input files from the Sage 2.1.0 repo and Pavan’s original conda environment on HPC3 (/dfs4/dmobley-lab/pbehara/conda-env/fb_193). The results are shown in the figure below. My first attempt (Pavan env) performed a bit worse than Sage again, but at Lily’s suggestion I repeated the run with the same inputs and environment to measure the variability across runs. This second attempt (Pavan repeat) performed even better than the original Sage force field, so there is a substantial amount of variation between ForceBalance runs with the same inputs.

I also noticed that some of the ForceBalance runs were not converging fully. The table below shows a collection of force fields, their full and un-penalized final objective function values, and their final convergence statuses.

The poor performance of my attempts to reproduce Sage seems to be caused by a lack of convergence, but this lack of convergence might also be associated with the new environment. The pavan-repeat force field had the best performance, despite requiring the fewest number of optimization iterations. On the other hand, the three force fields optimized in the newer environment took as many, or more, steps as any other force field but failed to converge and had final objective function values nearly double those of the successful runs.

The table below contains a few measures for quantifying the difference between the various Sage 2.1.0 repeats in terms of the DDEs (in kcal/mol) since those were the most obvious differences above.

Force Field

Min

1st Qt.

Median

3rd Qt.

Max

Mean

Std. Dev.

Force Field

Min

1st Qt.

Median

3rd Qt.

Max

Mean

Std. Dev.

Sage 2.1.0

-101.80429

-0.94235

0.00000

0.99295

98.68331

-0.07525

2.921243

pavan

-111.86718

-0.92310

0.00000

1.06889

99.62502

-0.03159

2.997984

pavan-repeat

-101.03200

-0.95610

0.00000

0.94300

101.0590

-0.10050

2.709862

All of the data reported here is available in my benchmarking repo. In particular, the output/industry subdirectory contains subdirectories for each of the force fields described herein with the correspondences between the names used here and there given in the table below.

Name Here

Directory

Description

Name Here

Directory

Description

Sage 2.1.0

sage-2.1.0

Original Sage 2.1.0 fit

pavan

pavan-2.1.0

My refit of Sage 2.1.0 input files and Pavan’s original environment

pavan-repeat

pavan-repeat

Same as above, repeated

sage-sage

sage-sage

Sage 2.1.0 force field and training data but refiltered and regenerated through my valence-fitting scripts

sage-sage-new

sage-sage-new

Same as above but with new filtering scheme

my-sage-2.1.0

my-sage-2.1.0

Sage 2.1.0 force field and input files reoptimized with my new conda environment

Benchmark Repeat

Because these benchmarks are computed with Matt’s new ib benchmarking package, we also wanted to check that the variation was not coming from that. As shown in the figure below, repeating just the benchmark on the pavan-2.1.0 force field produced visually identical results, suggesting that the variation is due to ForceBalance itself, not an artifact of the benchmarking process.

Despite the visually-identical performance here, there are some small differences. Below is a histogram of all of the DDEs differing by more than 1 between the two runs:

These account for 72 records out of 68837 in the full benchmark, and plotting histograms of the results with differences greater than 0.1, for example, is pointless because the bar at 0 completely distorts the rest of the graph. The one record way over near 20 is 36967197 with a difference of 19.79927.

The results are similar for the RMSD, this time plotting the records greater than 0.01:

where the most extreme outlier is again 36967197, and the TFDs also greater than 0.01:

For this, there are actually two records in the bin near 0.4: 36998992 and 36998994.

These three records, 36967197, 36998992, and 36998994 have the following SMILES strings, respectively (in a file because CONFLUENCE wouldn’t stop converting them to links):

Other differences

Similar trends are observed for my run of the Sage 2.1.0 refitting with Pavan’s original input files, but my updated environment with ForceBalance 1.9.5 instead of 1.9.3 used in Pavan’s environment. The figure below shows the DDE differences greater than 8 kcal/mol, which account for 176/68837 records.

On the other hand, there are 8222 entries with differences greater than 1 kcal/mol or about 12%. I think this means that many molecules have small differences rather than a few molecules exhibiting huge differences. The differences are fairly evenly split between being better in the new and old versions, however. The absolute value of the new DDE is lower in 27111 cases and lower in the old data in 35071. The two are equal in the remaining 6655 cases. Restricting the plot above to the cases where the old data is better produces a very similar distribution, but obviously reduces the counts. Notably, this also removes the greatest outlier with a deviation greater than 100 kcal/mol.

The records for the cluster around 80 kcal/mol are 36975868, 36983564, and 36997441. These are clearly much worse in the new force field because the original Sage values are --7.8, --6.0, and --11.6, compared to the new values of --96.6, --94.7, and --101.1, respectively.

I also plotted a CDF for these, but there wasn’t much to gain from it. The old force field increases slightly faster than the new, as expected.

Charge Models

Tom Potter reported an issue when running the Sage 2.1.0 refit with AmberTools/RDKit instead of OpenEye where the initial objective function value was “1.67619e+04, compared to 1.09618e+04 for the original Sage 2.1 run.” Initially he was using versions of the toolkit affected by the charge caching bug, but even after updating to the new versions, he “found it gave essentially the same results.” This was somewhat in line with what I had observed with my own Sage 2.1.0 reproduction runs before the charge caching fix, with some example objective function values from a single run below:

Total 1.61326e+04 Total 1.54131e+04 ( -7.195e+02 ) Total 1.53246e+04 ( -8.856e+01 ) Total 1.51405e+04 ( -1.841e+02 ) Total 1.46718e+04 ( -4.687e+02 ) Total 1.46120e+04 ( -5.973e+01 ) Total 1.43540e+04 ( -2.581e+02 ) Total 1.42554e+04 ( -9.860e+01 ) Total 1.41504e+04 ( -1.049e+02 ) Total 1.41092e+04 ( -4.120e+01 ) Total 1.42160e+04 ( +1.067e+02 )

These are somewhat similar to what Tom observed.

Still, we wanted to see what difference the charge model itself would cause (OpenEye AM1-BCC ELF10 vs AmberTools/RDKit AM1-BCC), so I started another Sage 2.1.0 refit starting from the targets and input file from the Sage 2.1.0 repo but with my OpenEye license unexported. I only gave it 6 days of walltime, so it died after iteration 4, but I think that gives us most of the data we wanted to see:

| Iter | Total Obj. | |------+-------------| | 0 | 1.14056e+04 | | 1 | 1.00536e+04 | | 2 | 9.25233e+03 | | 3 | 9.17668e+03 | | 4 | 8.85820e+03 |

Overall these seem a bit closer to the original Sage 2.1.0 values, both initially and as they converge, than what Tom was observing or what I saw with the charge caching issue.

In case we want to follow up on this in the future, here are my input (and main output) files for the run: and the path on HPC3 is /dfs9/dmobley-lab/bwestbr1/refits/ambertools/new. The targets therein are the same as those used by Tom, which in turn are the same as those used for Sage 2.1.0 minus these targets:

opt-geo-batch-175/2003384-18.pdb opt-geo-batch-175/2003385-19.pdb opt-geo-batch-178/2003481-3.pdb opt-geo-batch-178/2003482-4.pdb opt-geo-batch-178/2003484-7.pdb opt-geo-batch-178/2003486-10.pdb opt-geo-batch-9/18433500-29.pdb torsion-18886452/ torsion-2703523/ torsion-2703524/ torsion-2703525/ torsion-2703526/ torsion-2703603/ torsion-2703604/ torsion-2703606/

which cause conformer generation errors in RDKit when running ForceBalance. I also found that I had to remove opt-geo-batch-51/18438154-19 for the same reason, and torsion-18537145 also failed 6 times in the initial iteration before finally succeeding, but it didn’t cause any further problems.