TM Approach II: More Data

Initial plan and approach: port to Sage 2.2 and generate additional training data

Overview

Summary	This second approach extends Approach I by generating several new datasets to expand training coverage and then refitting the force field. Sage 2.2.0 was also released during this project, so the new torsion splits were ported from Sage 2.1.0 to 2.2.0.
GitHub repo/branch
Status	COMPLETED

1 Overview
2 Milestones and metrics
3 Progress and findings
- 3.1 New Datasets
  - 3.1.1 Torsion Drive
  - 3.1.2 Optimization
- 3.2 Industry Benchmark
- 3.3 Future Work

Milestones and metrics

Stage	Milestone/Benchmark	Contributors	Deadline	Status
Generate additional torsion drive data	Increased torsion coverage in the torsion drive training data	@Brent Westbrook	Jul 2024	Completed
Port torsion splits to 2.2.0	New initial force field	@Brent Westbrook	Jul 2024	Completed
Re-fit 2.2.0 to TM data	Re-fit 2.2.0 to TM data	@Brent Westbrook	Jul 2024	Completed
Benchmark	Improved or equivalent performance on industry benchmark data	@Brent Westbrook	Jul 2024	PASSED

Progress and findings

New Datasets

These datasets were generated from the ChEMBL 33 database to fill gaps in the proper torsion coverage in Sage 2.1.0/2.2.0.

Torsion Drive

Optimization

OpenFF Torsion Multiplicity Optimization Training Coverage Supplement v1.0

Industry Benchmark

The new TM force field performs slightly better than both Sage 2.1.0 and 2.2.0 on the industry benchmark dataset. The one minor exception is some additional outliers in the bond, angle, and proper torsion internal coordinate RMSDs. Even for these metrics, the aggregate statistics favor the TM force field, so this round of refits counts as a success.
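For readers unfamiliar with the internal coordinate RMSD metric, here is a minimal sketch of how such a value could be computed for one molecule. It is an illustration only, not the benchmarking code used for these results: the function names, the input lists, and the choice to wrap torsion differences onto [-180, 180) degrees are all assumptions.

```python
import math


def wrap_degrees(delta):
    """Map an angle difference in degrees onto the interval [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def internal_coordinate_rmsd(mm_values, qm_values, periodic=False):
    """RMSD between MM and QM values of one internal coordinate type.

    mm_values and qm_values are hypothetical, equal-length lists of bond
    lengths, angles, or torsions measured on the MM-optimized and QM
    reference geometries. Set periodic=True for torsions in degrees so that
    179 and -179 count as 2 degrees apart rather than 358.
    """
    if len(mm_values) != len(qm_values) or not mm_values:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [m - q for m, q in zip(mm_values, qm_values)]
    if periodic:
        diffs = [wrap_degrees(d) for d in diffs]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

An outlier in this sense is a molecule whose RMSD is large even though the distribution over the whole dataset shifts toward smaller values, which is the situation described above.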

Future Work

Despite the success, at least one question remains: should we also try to adjust the shapes of the split torsions? In principle this seems like a good idea, but my initial exploration of torsion shapes demonstrated what seems to be commonly known in force field circles: there is little correspondence between the shape of the MM torsion potential and the QM potential. This makes it very difficult to reason about torsion shapes from QM data, so it is unclear how to proceed in this direction without extensive manual trial and error.
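To make "shape" concrete, the sketch below evaluates the standard periodic proper torsion form, a sum of terms k * (1 + cos(n * phi - phase)). I am assuming here that "shape" refers to the periodicities n and phases of each term, with the force constants k setting the amplitudes. The parameter values are made up for illustration and are not taken from Sage or the TM force field.

```python
import math


def torsion_energy(phi, terms):
    """Energy of one proper torsion at dihedral angle phi (radians).

    terms is a list of (k, periodicity, phase) tuples, with phase in radians.
    Each term contributes k * (1 + cos(periodicity * phi - phase)).
    """
    return sum(k * (1.0 + math.cos(n * phi - phase)) for k, n, phase in terms)


# Hypothetical parameters: (k in kcal/mol, periodicity, phase in radians).
example_terms = [(1.0, 1, 0.0), (0.5, 3, 0.0)]

# Scan the dihedral from -180 to 180 degrees in 15 degree steps.
profile = [
    (deg, torsion_energy(math.radians(deg), example_terms))
    for deg in range(-180, 181, 15)
]
```

A QM torsion scan reports the total energy along the dihedral, whereas in the MM model this term is only one contribution alongside the other valence and nonbonded terms, which is one likely reason the two curves need not resemble each other.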

Another, more obvious, to-do item is refitting Sage 2.2.1, which has now also been released, to the new TM data. This should be straightforward with my valence-fitting pipeline and should produce the same results observed here, but it has not been done yet.