TM Approach II: More Data

 

Initial plan and approach: port to Sage 2.2 and generate additional training data

Overview

Summary

This second approach extends Approach I by generating several new datasets to expand training coverage and then refitting the force field. Sage 2.2.0 was also released during this project, so the new torsion splits were ported from Sage 2.1.0 to 2.2.0.

GitHub repo/branch

 

Status

COMPLETED

Milestones and metrics

Stage

Milestone/Benchmark

Contributors

Deadline

Status

Generate additional torsion drive data

Increased torsion coverage in the torsion drive training data

@Brent Westbrook (Unlicensed)

Jul 2024

Completed

Port torsion splits to 2.2.0

New initial force field

@Brent Westbrook (Unlicensed)

Jul 2024

Completed

Re-fit 2.2.0 to TM data

Re-fit 2.2.0 to TM data

@Brent Westbrook (Unlicensed)

Jul 2024

Completed

Benchmark

Improved or equivalent performance on industry benchmark data

@Brent Westbrook (Unlicensed)

Jul 2024

PASSED

 

Progress and findings

New Datasets

These datasets were generated from the ChEMBL 33 database to fill gaps in the proper torsion coverage in Sage 2.1.0/2.2.0.

Torsion Drive

Optimization
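As a rough illustration of what "gaps in the proper torsion coverage" means, the sketch below tallies how many molecules exercise each torsion parameter and flags those that fall below a threshold. Everything in it is hypothetical: the function name `torsion_coverage`, the parameter IDs, the SMILES keys, and the threshold are placeholders, and the per-molecule labels are assumed to come from some separate labeling step that is not shown. It is not the procedure actually used to build these datasets.

```python
from collections import Counter


def torsion_coverage(labeled_molecules, all_parameter_ids, min_count=1):
    """Return per-parameter molecule counts and the under-covered IDs.

    labeled_molecules: dict mapping a molecule key (e.g. SMILES) to the list
        of torsion parameter IDs assigned to it (hypothetical input).
    all_parameter_ids: every torsion parameter ID in the force field.
    min_count: minimum number of molecules required for adequate coverage.
    """
    counts = Counter()
    for param_ids in labeled_molecules.values():
        # Count each parameter once per molecule, however often it matches.
        counts.update(set(param_ids))
    gaps = sorted(pid for pid in all_parameter_ids if counts[pid] < min_count)
    return counts, gaps


# Hypothetical labels, for illustration only.
labeled = {
    "CCO": ["t1", "t2", "t2"],
    "CCN": ["t1", "t3"],
}
counts, gaps = torsion_coverage(labeled, ["t1", "t2", "t3", "t4"], min_count=2)
print(gaps)
```

Parameters reported in `gaps` would be the candidates for new molecules drawn from a source such as ChEMBL.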

Industry Benchmark

The new TM force field performs slightly better than both Sage 2.1.0 and 2.2.0 on the industry benchmark dataset. The one minor exception is some additional outliers in the bond, angle, and proper torsion internal coordinate RMSDs, but even there the aggregate statistics favor the TM force field, making this round of refits a success.

 

image-20241211-191213.png
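For readers unfamiliar with the internal coordinate RMSD metric, here is a minimal sketch of how such a value can be computed for one molecule, comparing MM-optimized coordinates against QM reference values. The function name `rmsd` and the sample numbers are made up for illustration, and the actual benchmark tooling may differ in its details. The one subtlety worth showing is that torsion differences should be wrapped into a 360-degree period before squaring.

```python
import math


def rmsd(reference, candidate, periodic=False):
    """Root-mean-square deviation between two equal-length value lists.

    With periodic=True the values are treated as angles in degrees and each
    difference is wrapped into [-180, 180) before squaring.
    """
    if len(reference) != len(candidate):
        raise ValueError("coordinate lists must have the same length")
    if not reference:
        raise ValueError("coordinate lists must not be empty")
    total = 0.0
    for ref, cand in zip(reference, candidate):
        diff = cand - ref
        if periodic:
            # Map e.g. -358 degrees to +2 degrees.
            diff = (diff + 180.0) % 360.0 - 180.0
        total += diff * diff
    return math.sqrt(total / len(reference))


# Made-up torsion values in degrees (QM reference vs. MM-optimized).
qm_torsions = [179.0, -60.0]
mm_torsions = [-179.0, -58.0]
torsion_rmsd = rmsd(qm_torsions, mm_torsions, periodic=True)
naive_rmsd = rmsd(qm_torsions, mm_torsions)
print(torsion_rmsd, naive_rmsd)
```

Without the wrapping step, a torsion that moves from 179 to -179 degrees would be scored as a 358-degree error instead of a 2-degree one, which would show up as a spurious outlier.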

Future Work

Despite the success, at least one question remains open: should we also try to adjust the shapes of the split torsions? This seems like a good idea in principle, but my initial exploration of torsion shapes demonstrated what seems to be commonly known in force field circles: there is little correspondence between the shape of the MM torsion potential and the QM potential. This makes it very difficult to reason about torsion shapes based on QM data, so it is unclear how to proceed in this direction without extensive manual trial and error.
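To make "shape" concrete, the sketch below evaluates the common periodic cosine form of an MM proper torsion term, E(phi) = sum over n of k_n * (1 + cos(n * phi - phase_n)). It assumes that "shape" refers to the periodicities and phases of the terms, while the force constants only scale the barrier heights; that reading is an assumption, not something stated above. The function name `torsion_energy` and the sample parameters are illustrative, and prefactor conventions vary between force field formats.

```python
import math


def torsion_energy(phi_deg, terms):
    """Energy of a periodic torsion at dihedral angle phi_deg (degrees).

    terms: list of (k, periodicity, phase_deg) tuples, one per cosine term.
    """
    phi = math.radians(phi_deg)
    return sum(
        k * (1.0 + math.cos(n * phi - math.radians(phase_deg)))
        for k, n, phase_deg in terms
    )


# Illustrative single-term torsion: periodicity 3, zero phase.
threefold = [(1.0, 3, 0.0)]
# Same shape with a doubled force constant.
threefold_scaled = [(2.0, 3, 0.0)]

for angle in (0.0, 60.0, 120.0):
    print(angle, torsion_energy(angle, threefold), torsion_energy(angle, threefold_scaled))
```

Doubling `k` doubles the barrier but leaves the minima and maxima at the same angles. Moving them requires changing the periodicities or phases, and the total torsion profile also includes nonbonded and other valence contributions, which is one reason a QM scan does not map directly onto these terms.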

 

Another, more obvious to-do item is refitting Sage 2.2.1, which has since been released, to the new TM data. This is basically trivial to do with my valence-fitting pipeline and should produce the same results observed here, but it has not been done yet.