TM Approach II: More Data
Initial plan and approach: port to Sage 2.2 and generate additional training data
Overview
Summary | This second approach extends the work from Approach I by generating several new datasets to expand training coverage and then refitting the force field. Sage 2.2.0 was also released while this project was underway, so the new torsion splits were ported from Sage 2.1.0 to 2.2.0. |
---|---|
GitHub repo/branch | |
Status | COMPLETED |
- 1 Overview
- 2 Milestones and metrics
- 3 Progress and findings
- 3.1 New Datasets
- 3.1.1 Torsion Drive
- 3.1.2 Optimization
- 3.2 Industry Benchmark
- 3.3 Future Work
Milestones and metrics
Stage | Milestone/Benchmark | Contributors | Deadline | Status |
---|---|---|---|---|
Generate additional torsion drive data | Increased torsion coverage in the torsion drive training data | @Brent Westbrook | Jul 2024 | Completed |
Port torsion splits to 2.2.0 | New initial force field | @Brent Westbrook | Jul 2024 | Completed |
Re-fit 2.2.0 to TM data | Re-fit 2.2.0 to TM data | @Brent Westbrook | Jul 2024 | Completed |
Benchmark | Improved or equivalent performance on industry benchmark data | @Brent Westbrook | Jul 2024 | PASSED |
Progress and findings
New Datasets
These datasets were generated from the ChEMBL 33 database to fill gaps in the proper torsion coverage in Sage 2.1.0/2.2.0.
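As a rough illustration of what a "gap in coverage" can mean, the sketch below counts how many molecules exercise each proper torsion parameter and flags the ones below a threshold. It assumes coverage is measured per molecule; the parameter IDs, the example molecules, and the `min_molecules` threshold are all hypothetical, and this is not necessarily the procedure used to select the ChEMBL molecules here. In practice the labels could come from something like the OpenFF Toolkit's `ForceField.label_molecules`.

```python
from collections import Counter


def torsion_coverage(labels_by_molecule):
    """Count how many molecules exercise each proper torsion parameter."""
    counts = Counter()
    for param_ids in labels_by_molecule.values():
        # Count each parameter once per molecule, however often it matches.
        counts.update(set(param_ids))
    return counts


def undercovered(all_param_ids, counts, min_molecules=5):
    """Return parameter IDs matched by fewer than min_molecules molecules."""
    return sorted(p for p in all_param_ids if counts[p] < min_molecules)


# Hypothetical labels: molecule SMILES -> torsion parameter IDs matched in it.
labels = {
    "CCO": ["t1", "t2", "t2"],
    "CCN": ["t1", "t3"],
    "CCC": ["t1"],
}
counts = torsion_coverage(labels)
# "t4" is in the force field but matches nothing in the data set.
gaps = undercovered(["t1", "t2", "t3", "t4"], counts, min_molecules=2)
print(gaps)
```

Parameters that land in `gaps` are the ones for which new molecules would be sought in an external database.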
Torsion Drive
Optimization
Industry Benchmark
The new TM force field performs slightly better than both Sage 2.1.0 and 2.2.0 on the industry benchmark dataset. The one minor exception is some additional outliers in the bond, angle, and proper torsion internal coordinate RMSDs, but even for these metrics the aggregate statistics favor the TM force field, so this round of refits is a success.
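For readers unfamiliar with internal coordinate RMSDs, here is a minimal sketch of the metric for a single molecule, comparing MM-optimized values against QM reference values. The function names and the example numbers are hypothetical, and the actual benchmark tooling may differ in detail; the one point worth showing is that torsion differences need to be wrapped onto a periodic interval before squaring.

```python
import math


def wrap_degrees(delta):
    """Map an angle difference in degrees onto the interval [-180, 180)."""
    return (delta + 180.0) % 360.0 - 180.0


def rmsd(reference, candidate, periodic=False):
    """Root-mean-square deviation between two equal-length value lists."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for ref, cand in zip(reference, candidate):
        diff = cand - ref
        if periodic:
            # 179 and -179 degrees are 2 degrees apart, not 358.
            diff = wrap_degrees(diff)
        total += diff * diff
    return math.sqrt(total / len(reference))


# Hypothetical torsion values in degrees for one molecule.
qm_torsions = [179.0, -60.0, 65.0]
mm_torsions = [-179.0, -58.0, 61.0]

torsion_rmsd = rmsd(qm_torsions, mm_torsions, periodic=True)
print(f"{torsion_rmsd:.3f}")
```

Bond and angle RMSDs can use the same function with `periodic=False`, since those coordinates do not wrap around.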
Future Work
Despite the success, at least one question remains open: whether we should also try to adjust the shapes of the split torsions. In principle this seems like a good idea, but my initial exploration of torsion shapes confirmed what seems to be commonly known in force field circles: there is little correspondence between the shape of the MM torsion potential and that of the QM potential. This makes it very difficult to reason about torsion shapes from QM data, so it is unclear how to proceed in this direction without extensive manual trial and error.
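To make "torsion shape" concrete: in SMIRNOFF-style force fields such as Sage, a proper torsion is a sum of cosine terms, each with a force constant k, a periodicity n, and a phase. The sketch below evaluates that functional form; the term values are hypothetical and not taken from any fitted force field. It assumes that "shape" here refers to the choice of periodicities and phases, while the force constants only scale each term.

```python
import math


def torsion_energy(phi_deg, terms):
    """Evaluate sum of k * (1 + cos(n * phi - phase)) over all terms.

    terms is a list of (k, periodicity, phase_deg) tuples.
    """
    phi = math.radians(phi_deg)
    return sum(
        k * (1.0 + math.cos(n * phi - math.radians(phase_deg)))
        for k, n, phase_deg in terms
    )


# Hypothetical parameter with a onefold and a threefold term (k in kcal/mol).
terms = [(1.0, 1, 0.0), (0.5, 3, 0.0)]

# Scan the dihedral angle the way a torsion drive grid would.
profile = [(phi, torsion_energy(phi, terms)) for phi in range(-180, 180, 30)]
for phi, energy in profile:
    print(f"{phi:5d} {energy:8.4f}")
```

Changing the periodicities or phases in `terms` changes where the minima and maxima fall, which is the kind of adjustment the question above is about.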
Another, more obvious to-do item is refitting Sage 2.2.1, which has now also been released, to the new TM data. This should be straightforward with my valence-fitting pipeline and should produce the same results observed here, but it has not been done yet.