Hackathon: How to train your force field with smee

Resources

Reference blog post

How to train your force field

smee / descent repositories:

https://github.com/SimonBoothroyd/descent-ff/tree/main
https://github.com/SimonBoothroyd/smee
https://github.com/SimonBoothroyd/descent

  • A notebook and conda env file here that should break the broad steps down. They have not been carefully re-checked; much is drawn from Simon’s, Josh’s, and Brent’s work, and all mistakes are LW’s

    • The environment may need Jupyter / ipykernel added too

      • Some compressed test data that can be loaded if downloading is taking too long

    • Cache file:

Datasets

Environment

Workflow

QM Data

Your favorite data set in sqlite

Input settings

 

Starting FF

openff-2.2.1

Enhancement ideas

  • Switch the Interchange conversion to use NAGL so it runs a lot faster, with no need to assign expensive charges

  • Data access

    • Extend/replace example with one that queries for specific molecules or chemistries rather than pulling all of a specific dataset

    • Make data download way faster somehow (use far less data? pull only a subset that we need?)

  • QCF feature requests

    • If an offline dataset view is told to fetch entries, have a way to disable the “this is an offline view” error (we had an include='**' view that we tried to plug into a pre-existing workflow for a workshop, but it failed at a fetch command deep in a library even though we knew the view already had the info)

  • Benchmarking the resulting force field

  • Fitting flexibility/examples:

    • What if I want to fit only a few specific parameters, e.g. a couple of specific torsions?

      • LW: include=[torsion_smirks_1, torsion_smirks_2]

    • Extend/replace example with one that queries for specific molecules or chemistries rather than pulling all of a specific dataset

      • For example, if I said I wanted to improve … bond parameters for a (insert specific SMARTS pattern) bond from specific OpenFF datasets, can you give me an example where I can provide a list of datasets and query for all the relevant data involving that bond, and pull it down (more rapidly?) and then fit to just that bond?

      • LW/TG: not currently possible. I (LW) personally downloaded a registry mapping SMILES to QCA IDs for all of QCArchive, and I query that.

  • Add more logging (like how ForceBalance prints out gradients at each step). What all would we want to enable here?
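On the data-access question above, LW's workaround of querying a local registry that maps SMILES to QCA IDs could look roughly like the sketch below. Everything in it is an assumption made for illustration: the table layout, the dataset names, the record IDs, and the helper names `build_registry` and `find_record_ids` are hypothetical and not part of smee, descent, or QCFractal. The substring check is only a stand-in for a real SMARTS match, which would need a cheminformatics toolkit.

```python
import sqlite3


def build_registry(rows):
    """Create an in-memory registry of (smiles, dataset, record_id) rows.

    The schema is a hypothetical stand-in for a locally downloaded registry.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE registry (smiles TEXT, dataset TEXT, record_id INTEGER)"
    )
    conn.executemany("INSERT INTO registry VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def find_record_ids(conn, datasets, matches):
    """Return record IDs from the given datasets whose SMILES pass `matches`.

    `matches` is any callable taking a SMILES string and returning a bool,
    so a real SMARTS matcher can be swapped in later.
    """
    datasets = list(datasets)
    if not datasets:
        return []
    placeholders = ",".join("?" for _ in datasets)
    query = (
        "SELECT smiles, record_id FROM registry "
        f"WHERE dataset IN ({placeholders}) ORDER BY record_id"
    )
    return [rid for smiles, rid in conn.execute(query, datasets) if matches(smiles)]


# Made-up example rows; the dataset names and IDs are not real QCArchive entries.
rows = [
    ("CC(=O)NC", "dataset-a", 101),
    ("CCO", "dataset-a", 102),
    ("CC(=O)NC", "dataset-b", 201),
    ("NC(=O)c1ccccc1", "dataset-c", 301),
]
conn = build_registry(rows)

# Naive substring test standing in for a SMARTS match on an amide C-N bond.
ids = find_record_ids(conn, ["dataset-a", "dataset-b"], lambda s: "C(=O)N" in s)
print(ids)
```

The resulting ID list could then be used to pull only the matching records, rather than a whole dataset, before fitting to the parameters of interest.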

 

 

Tests for smee | Protein fit

  • Compare energies between FB, Interchange, and smee

  • CC experiment is to turn off denominator and attenuation in FB, then compare to default smee

  • Rewrite to match energy evaluation in FB (no forces)

  • Rewrite loss function inside optimization loop

  • Match priors to scaling

  • Check availability of ForceBalance attenuation

  • Match target weights

    • LW: not immediately sure how to do this… split the datasets into small-molecule vs protein and weight each contribution to the loss separately?
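One way to read LW's suggestion is sketched below: compute a separate loss term for each group of targets, then combine the terms with per-group weights. This is a minimal plain-Python illustration, not how smee, descent, or ForceBalance implement target weights. The function names, the group labels, the energies, and the weights are all hypothetical.

```python
def mean_squared_error(predicted, reference):
    """Mean squared error between two equal-length sequences of energies."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


def weighted_loss(groups, weights):
    """Sum per-group losses, each scaled by its own weight.

    `groups` maps a group name to a (predicted, reference) pair;
    `weights` maps the same names to scalar weights.
    """
    return sum(
        weights[name] * mean_squared_error(predicted, reference)
        for name, (predicted, reference) in groups.items()
    )


# Made-up energies, purely for illustration.
groups = {
    "small_molecule": ([1.0, 2.0], [1.0, 4.0]),  # MSE = (0 + 4) / 2 = 2.0
    "protein": ([0.0], [3.0]),                   # MSE = 9 / 1 = 9.0
}
weights = {"small_molecule": 1.0, "protein": 0.5}

total = weighted_loss(groups, weights)
print(total)  # 1.0 * 2.0 + 0.5 * 9.0 = 6.5
```

Because each group is normalised by its own size before weighting, a large small-molecule set would not swamp a small protein set, and the weights then control the balance directly.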