Hackathon: How to train your force field with smee
Resources
Reference blog post
smee / descent repositories:
https://github.com/SimonBoothroyd/descent-ff/tree/main
https://github.com/SimonBoothroyd/smee
https://github.com/SimonBoothroyd/descent
A notebook and conda env file are here. They should break the workflow down into its broad steps, but have not been re-checked in any detail. Much is drawn from Simon's, Josh's, and Brent's work; all mistakes are LW's.
Cache file:
Datasets
^^ some compressed test data that can be loaded if downloading is taking too long
Environment
– may need to add Jupyter / ipykernel too
Workflow
QM Data
Your favorite data set in sqlite
Input settings
Starting FF
openff-2.2.1
Enhancement ideas
Switch the Interchange conversion to use NAGL so it goes a lot faster (no need to assign expensive charges)
Data access
Extend/replace example with one that queries for specific molecules or chemistries rather than pulling all of a specific dataset
Make data download way faster somehow (use far less data? pull only a subset that we need?)
QCF feature requests
If an offline dataset view is told to fetch entries, have a way to disable the “this is an offline view” error (we had an include='**' view that we tried to plug into a pre-existing workflow for a workshop, but it was failing at a fetch command deep in a library even though we know that it already has the info)
Benchmarking the resulting force field
Fitting flexibility/examples:
What if I want to fit only a few specific parameters, e.g. a couple specific torsions?
LW: include=[torsion_smirks_1, torsion_smirks_2]
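As a rough illustration of what an include list does, the sketch below filters a table of torsion parameters down to the listed SMIRKS and treats everything else as frozen. It is plain Python, not the descent API: the parameter table, the SMIRKS strings and values, and the helper name select_trainable are all made up for the example.

```python
# Hypothetical torsion table: SMIRKS pattern -> barrier height (kcal/mol).
# Patterns and values are placeholders, not taken from openff-2.2.1.
torsions = {
    "[*:1]-[#6X4:2]-[#6X4:3]-[*:4]": 0.20,
    "[*:1]-[#6X4:2]-[#7X3:3]-[*:4]": 0.35,
    "[*:1]~[#6X3:2]-[#6X3:3]~[*:4]": 1.10,
}


def select_trainable(parameters, include):
    """Return a SMIRKS -> bool mask that is True only for included patterns."""
    missing = [smirks for smirks in include if smirks not in parameters]
    if missing:
        # Fail loudly: a typo in a SMIRKS string would otherwise train nothing.
        raise KeyError(f"SMIRKS not found in parameter table: {missing}")
    wanted = set(include)
    return {smirks: smirks in wanted for smirks in parameters}


include = [
    "[*:1]-[#6X4:2]-[#6X4:3]-[*:4]",
    "[*:1]-[#6X4:2]-[#7X3:3]-[*:4]",
]
mask = select_trainable(torsions, include)
trainable = [smirks for smirks, keep in mask.items() if keep]
frozen = [smirks for smirks, keep in mask.items() if not keep]
```

Raising on an unknown SMIRKS is a design choice for the sketch: silently ignoring it would leave the fit running with fewer trainable parameters than intended.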
Extend/replace example with one that queries for specific molecules or chemistries rather than pulling all of a specific dataset
For example, if I said I wanted to improve … bond parameters for a (insert specific SMARTS pattern) bond from specific OpenFF datasets, can you give me an example where I can provide a list of datasets and query for all the relevant data involving that bond, and pull it down (more rapidly?) and then fit to just that bond?
LW/TG: not currently possible. As a workaround, I (LW) downloaded a registry mapping SMILES to QCA IDs for all of QCArchive and query that.
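LW's registry approach can be sketched with an in-memory SQLite table standing in for the downloaded registry. The table layout, column names, dataset names, and record IDs below are all invented for illustration; the real registry format is not described in these notes. The lookup is an exact match on SMILES strings, so matching a SMARTS pattern for a specific bond would additionally need a cheminformatics toolkit.

```python
import sqlite3

# Stand-in registry: one row per (SMILES, dataset, record ID). All hypothetical.
rows = [
    ("CCO", "hypothetical-dataset-a", 101),
    ("CCO", "hypothetical-dataset-b", 202),
    ("CCN", "hypothetical-dataset-a", 103),
    ("c1ccccc1", "hypothetical-dataset-b", 204),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE registry (smiles TEXT, dataset TEXT, record_id INTEGER)")
conn.executemany("INSERT INTO registry VALUES (?, ?, ?)", rows)
conn.execute("CREATE INDEX idx_smiles ON registry (smiles)")


def record_ids(conn, smiles_list, datasets):
    """Return sorted record IDs for the given SMILES, restricted to the given datasets."""
    smiles_marks = ",".join("?" * len(smiles_list))
    dataset_marks = ",".join("?" * len(datasets))
    query = (
        "SELECT record_id FROM registry "
        f"WHERE smiles IN ({smiles_marks}) AND dataset IN ({dataset_marks}) "
        "ORDER BY record_id"
    )
    cursor = conn.execute(query, [*smiles_list, *datasets])
    return [row[0] for row in cursor]


ids = record_ids(conn, ["CCO", "CCN"], ["hypothetical-dataset-a"])
```

With the IDs in hand, only those records would need to be pulled down, rather than whole datasets.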
Add more logging (like how forcebalance prints out gradients each step). What all would we want to enable here?
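One possible starting point is to log the loss, the gradient norm, and the per-parameter gradients at every step. The sketch below does this for a toy quadratic loss with an analytic gradient and plain gradient descent; it is not the smee/descent training loop, and the parameter names, target values, and learning rate are placeholders.

```python
import logging
import math

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("fit")

# Toy target values the parameters should converge to (placeholders).
target = {"k_bond": 2.0, "k_angle": -1.0}


def loss_and_grad(params):
    """Sum of squared deviations from the target, and its analytic gradient."""
    loss = sum((params[name] - target[name]) ** 2 for name in params)
    grad = {name: 2.0 * (params[name] - target[name]) for name in params}
    return loss, grad


params = {"k_bond": 0.0, "k_angle": 0.0}
learning_rate = 0.25
history = []  # (loss, gradient norm) per step, kept for later inspection

for step in range(10):
    loss, grad = loss_and_grad(params)
    grad_norm = math.sqrt(sum(g * g for g in grad.values()))
    history.append((loss, grad_norm))
    grad_text = " ".join(f"{name}={g:+.4f}" for name, g in grad.items())
    log.info("step %2d  loss %.6f  |grad| %.6f  %s", step, loss, grad_norm, grad_text)
    # Plain gradient descent update.
    params = {name: params[name] - learning_rate * grad[name] for name in params}
```

Keeping the values in a list as well as logging them makes it easy to plot convergence afterwards or to compare two runs step by step.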
Tests for smee | Protein fit
Compare energies between FB, Interchange, and smee
CC experiment is to turn off denominator and attenuation in FB, then compare to default smee
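To make the experiment concrete, the sketch below shows how an energy denominator and an attenuation weight change a least-squares energy loss relative to a plain unweighted one. The attenuation form used here (full weight below the denominator, decaying above it, zero above an upper cutoff) is an assumption for illustration and should be checked against the ForceBalance source; the unweighted case is only a stand-in for the other side of the comparison, not a statement of what smee does by default. All energies are made-up numbers.

```python
import math


def attenuation_weights(qm, denom, upper):
    """Assumed attenuation: weights shrink with energy above the QM minimum."""
    e_min = min(qm)
    weights = []
    for energy in qm:
        rel = energy - e_min
        if rel > upper:
            weights.append(0.0)  # drop very high-energy points entirely
        elif rel < denom:
            weights.append(1.0 / denom)
        else:
            weights.append(1.0 / math.sqrt(denom**2 + (rel - denom) ** 2))
    total = sum(weights)
    return [w / total for w in weights]  # normalise so the weights sum to 1


def energy_loss(qm, mm, denom=1.0, upper=float("inf"), attenuate=False):
    """Weighted mean of squared (MM - QM) residuals, scaled by the denominator."""
    if attenuate:
        weights = attenuation_weights(qm, denom, upper)
    else:
        weights = [1.0 / len(qm)] * len(qm)
    return sum(w * ((m - q) / denom) ** 2 for w, q, m in zip(weights, qm, mm))


qm = [0.0, 1.0, 3.0, 10.0]  # placeholder reference energies
mm = [0.0, 1.5, 2.0, 4.0]   # placeholder force field energies

plain = energy_loss(qm, mm)  # denominator 1, no attenuation
attenuated = energy_loss(qm, mm, denom=1.0, upper=5.0, attenuate=True)
```

In this toy case the high-energy point dominates the plain loss and is removed entirely by the cutoff, which is the kind of difference the comparison would need to account for.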
Rewrite
Rewrite loss function inside optimization loop
match priors to scaling
Check availability of ForceBalance attenuation
match target weights
LW: not immediately sure how to do this… split the datasets into small-molecule vs protein and weight contribution to loss separately?
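One way to read LW's suggestion is to compute the loss separately for each group, normalise by the number of entries so the larger set does not dominate, and then combine the groups with explicit weights. The sketch below does that in plain Python; the group names, residuals, and weights are placeholders, and how the weights should be chosen to match ForceBalance is still the open question.

```python
def mean_squared(residuals):
    """Mean of squared residuals, so group size does not inflate the loss."""
    return sum(r * r for r in residuals) / len(residuals)


def combined_loss(groups, weights):
    """Weighted sum of per-group mean squared residuals."""
    missing = set(groups) - set(weights)
    if missing:
        raise KeyError(f"no weight given for groups: {sorted(missing)}")
    per_group = {name: mean_squared(res) for name, res in groups.items()}
    total = sum(weights[name] * per_group[name] for name in groups)
    return total, per_group


# Placeholder residuals (predicted minus reference) for each group.
groups = {
    "small_molecule": [0.5, -0.5, 1.0, -1.0],
    "protein": [2.0, -2.0],
}
weights = {"small_molecule": 1.0, "protein": 0.5}

total, per_group = combined_loss(groups, weights)
```

Returning the per-group values alongside the total makes it possible to log each contribution and see which target is driving the fit.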