As part of the Torsion multiplicity work, I attempted to reproduce the original Sage 2.1.0 fit and benchmark to make sure my scripts were functioning correctly. As shown in the figure below, my attempt to do so (sage-sage) performed considerably worse than the original Sage 2.1.0 fit. As one check, I cloned the Sage 2.1.0 repo and ran ForceBalance with my conda environment from the torsion multiplicity work, giving the my-sage-2.1.0 line, which again seems worse than Sage 2.1.0. I also had to modify the torsion multiplicity environment with an older version of OpenEye to get ForceBalance to run successfully (see this issue for more details). Other investigation into the torsion multiplicity results did reveal a data set filtering issue in my scripts, but correcting this and rerunning the fit didn’t lead to much improvement (sage-sage-new).

After these observations, I abandoned my scripts and tried to reproduce the original Sage 2.1.0 fit exactly, using the input files from the Sage 2.1.0 repo and Pavan’s original conda environment on HPC3 (/dfs4/dmobley-lab/pbehara/conda-env/fb_193). The results are shown in the figure below. My first attempt (pavan) again performed a bit worse than Sage, but at Lily’s suggestion I repeated the run with the same inputs and environment to measure the variability across runs. This second attempt (pavan-repeat) performed even better than the original Sage force field, so there is a substantial amount of variation between ForceBalance runs with the same inputs.
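
For reference, these repeat fits amount to running ForceBalance on the Sage 2.1.0 input file inside the chosen conda environment. A minimal Python sketch of how such a run can be driven is below, mirroring what the ForceBalance.py entry point does; the input file name optimize.in is an assumption about the Sage 2.1.0 repo layout, so substitute the actual input file.

```python
# Minimal sketch of driving a ForceBalance fit from Python. The input file
# name "optimize.in" is an assumption; use the actual Sage 2.1.0 input file.
from forcebalance.parser import parse_inputs
from forcebalance.forcefield import FF
from forcebalance.objective import Objective
from forcebalance.optimizer import Optimizer

# Parse the global options and per-target options from the input file.
options, tgt_opts = parse_inputs("optimize.in")

# Build the force field, the objective over all fitting targets, and the
# optimizer, then run the optimization.
forcefield = FF(options)
objective = Objective(options, tgt_opts, forcefield)
optimizer = Optimizer(options, objective, forcefield)
optimizer.Run()
```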

I also noticed that some of the ForceBalance runs were not converging fully. The table below lists a collection of force fields, their full and unpenalized final objective function values, and their final convergence statuses.

The poor performance of my attempts to reproduce Sage seems to be caused by a lack of convergence, although that lack of convergence might itself be associated with the new environment. The pavan-repeat force field performed the best despite requiring the fewest optimization iterations. On the other hand, the three force fields optimized in the newer environment took as many steps as, or more steps than, any other force field but failed to converge and finished with objective function values nearly double those of the successful runs.

The table below contains a few measures quantifying the difference between the various Sage 2.1.0 repeats in terms of the DDEs (in kcal/mol), since those were the most obvious differences above.

| Force Field  | Min        | 1st Qt.  | Median  | 3rd Qt. | Max      | Mean     | Std. Dev. |
|--------------|------------|----------|---------|---------|----------|----------|-----------|
| Sage 2.1.0   | -101.80429 | -0.94235 | 0.00000 | 0.99295 | 98.68331 | -0.07525 | 2.921243  |
| pavan        | -111.86718 | -0.92310 | 0.00000 | 1.06889 | 99.62502 | -0.03159 | 2.997984  |
| pavan-repeat | -101.03200 | -0.95610 | 0.00000 | 0.94300 | 101.0590 | -0.10050 | 2.709862  |
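
A minimal sketch of how these summary statistics can be computed from the benchmark DDEs is below, assuming each force field’s DDEs are available as a CSV file with a dde column; the file and column names here are hypothetical stand-ins for the actual benchmark output.

```python
# Sketch: summary statistics (min, quartiles, median, max, mean, std. dev.)
# of the DDEs for each force field. The CSV file names and the "dde" column
# name are assumptions about how the benchmark output is stored.
import pandas as pd

files = {
    "Sage 2.1.0": "sage-2.1.0-dde.csv",
    "pavan": "pavan-2.1.0-dde.csv",
    "pavan-repeat": "pavan-repeat-dde.csv",
}

rows = []
for name, path in files.items():
    dde = pd.read_csv(path)["dde"]  # DDEs in kcal/mol
    rows.append(
        {
            "Force Field": name,
            "Min": dde.min(),
            "1st Qt.": dde.quantile(0.25),
            "Median": dde.median(),
            "3rd Qt.": dde.quantile(0.75),
            "Max": dde.max(),
            "Mean": dde.mean(),
            "Std. Dev.": dde.std(),
        }
    )

print(pd.DataFrame(rows).to_string(index=False))
```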

All of the data reported here is available in my benchmarking repo. In particular, the output/industry subdirectory contains a subdirectory for each of the force fields described here; the table below gives the correspondence between the names used in this writeup and the directory names.

| Name Here     | Directory     | Description                                                                                              |
|---------------|---------------|----------------------------------------------------------------------------------------------------------|
| Sage 2.1.0    | sage-2.1.0    | Original Sage 2.1.0 fit                                                                                    |
| pavan         | pavan-2.1.0   | My refit using the Sage 2.1.0 input files and Pavan’s original environment                                |
| pavan-repeat  | pavan-repeat  | Same as above, repeated                                                                                    |
| sage-sage     | sage-sage     | Sage 2.1.0 force field and training data, but refiltered and regenerated through my valence-fitting scripts |
| sage-sage-new | sage-sage-new | Same as above but with new filtering scheme                                                                |
| my-sage-2.1.0 | my-sage-2.1.0 | Sage 2.1.0 force field and input files reoptimized with my new conda environment                           |

Because these benchmarks are computed with Matt’s new ib benchmarking package, we also wanted to check that the variation was not coming from that. As shown in the figure below, repeating just the benchmark on the pavan-2.1.0 force field produced visually identical results, suggesting that the variation is due to ForceBalance itself, not an artifact of the benchmarking process.
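
As a rough sketch, two benchmark repeats can also be compared numerically rather than visually, for example by checking that the DDEs agree record by record. The CSV file names and the id and dde columns below are hypothetical; they stand in for however the benchmark output is exported.

```python
# Sketch: numerical comparison of two benchmark repeats on the same force
# field. The CSV file names and column names ("id", "dde") are assumptions.
import pandas as pd

first = pd.read_csv("pavan-2.1.0-dde.csv", index_col="id")
repeat = pd.read_csv("pavan-2.1.0-dde-repeat.csv", index_col="id")

# Align on record ID and report the largest absolute difference in DDE.
diff = (first["dde"] - repeat["dde"]).abs()
print(f"max |DDE difference| = {diff.max():.6f} kcal/mol over {len(diff)} records")
```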
