OpenFF System Object Specification Notes

High-level objective(s)

MT:

A flexible container for storing the data necessary to specify a simulation, with clean interfaces to other APIs that enable this data to be (1) restructured into an initialized simulation to be shipped off to engines, (2) mapped to a 1-D array view to interact with machine learning-adjacent libraries, (3) serialized to portable representations, and (4) inspected by the user.

From JC

Allow us to ultimately construct functions that expose a 1-D vector of unique, mutable, continuous parameters, which we can use to compute energies and parameter gradients of the energy for the convenience of optimizers and samplers.
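A minimal sketch of what such a 1-D view might look like, using only the standard library. Everything here is an assumption for illustration: the HarmonicBond class, the flatten and bond_energy helpers, the harmonic functional form, and the numeric values are hypothetical, and a finite-difference gradient stands in for whatever autodiff library would actually be used.

```python
from dataclasses import dataclass


@dataclass
class HarmonicBond:
    """Hypothetical bond term whose parameters live in a shared table."""
    atoms: tuple   # (i, j) atom indices
    k_key: str     # key of the force constant in the parameter table
    r0_key: str    # key of the equilibrium length in the parameter table


def flatten(params):
    """Return (sorted keys, 1-D list of values) for the unique parameters."""
    keys = sorted(params)
    return keys, [params[key] for key in keys]


def bond_energy(theta, keys, bonds, distances):
    """Sum of 0.5 * k * (r - r0)**2 over all bonds, as a function of theta."""
    table = dict(zip(keys, theta))
    return sum(
        0.5 * table[b.k_key] * (distances[b.atoms] - table[b.r0_key]) ** 2
        for b in bonds
    )


def finite_difference_gradient(f, theta, h=1e-6):
    """Central-difference gradient of f with respect to each entry of theta."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


# Two bonds share one parameter pair, so theta has 2 entries rather than 4.
# Values are illustrative only.
params = {"k_CH": 680.0, "r0_CH": 1.09}
bonds = [
    HarmonicBond((0, 1), "k_CH", "r0_CH"),
    HarmonicBond((0, 2), "k_CH", "r0_CH"),
]
distances = {(0, 1): 1.10, (0, 2): 1.12}

keys, theta = flatten(params)
energy = bond_energy(theta, keys, bonds, distances)
gradient = finite_difference_gradient(
    lambda t: bond_energy(t, keys, bonds, distances), theta
)
```

The point of the sketch is that the optimizer only ever sees theta and a callable; the mapping from vector entries back to named parameters stays inside the container.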

From JW

The highest information content we can make: a data structure that contains anything we want, where the value comes from converting it to what we need (to_jax, to_openmm)

Diagram

http://docs.google.com/presentation/d/1P-DMu7hmecExRBXpRteOzef-9LJRjt-WQTS3Y3Yh8lU/edit#slide=id.p

Other desired features

Python class/package with a Python and Pythonic API

  • PEP8 compliant

  • Easily serializable to various formats

Units

  • OpenMM’s simtk.unit lacks the generality we want, doesn’t do

  • There are several options, but pint seems to be the best and most popular standard package

  • Sometimes we want just the bare values exposed, other times the values tagged with their units
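A rough sketch of the "bare values vs. tagged values" distinction. This deliberately does not use pint, to stay dependency-free; the Tagged class, the convert and magnitudes helpers, and the tiny conversion table are hypothetical stand-ins, and a real implementation would presumably delegate all of this to the chosen units package.

```python
from typing import NamedTuple


class Tagged(NamedTuple):
    """Hypothetical stand-in for a unit-tagged quantity (not pint's API)."""
    value: float
    unit: str


# Conversion factors to a reference length unit; illustrative subset only.
TO_REFERENCE = {
    "angstrom": 1.0,
    "nanometer": 10.0,  # 1 nm = 10 angstrom
}


def convert(quantity, unit):
    """Return the quantity expressed in another unit of the same dimension."""
    factor = TO_REFERENCE[quantity.unit] / TO_REFERENCE[unit]
    return Tagged(quantity.value * factor, unit)


def magnitudes(quantities, unit):
    """Strip units: plain floats, all expressed in the requested unit."""
    return [convert(q, unit).value for q in quantities]


# Tagged view: each value carries its unit.
lengths = [Tagged(1.09, "angstrom"), Tagged(0.153, "nanometer")]

# Bare view: plain floats in one consistent unit, for array-based code.
bare = magnitudes(lengths, "angstrom")
```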

Tracking parameter origins in parameterized systems

  • Immediate, practical use is “where did this parameter come from, i.e. where in the XML”

forcefield.get_parameter_handler('vdW').parameters['[#1:1]-[#7]'] # or forcefield['vdW']['[#1:1]-[#7]']
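One way the system side of this lookup could work is to store a small provenance record next to each applied parameter. The sketch below is an assumption, not an existing API: ParameterOrigin, AppliedParameters, the file name example.offxml, and the atom indices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterOrigin:
    """Hypothetical provenance record for one applied parameter."""
    force_field: str  # e.g. the source file name
    handler: str      # e.g. "vdW"
    smirks: str       # the pattern that matched


class AppliedParameters:
    """Maps atom-index tuples to the origin of their parameters."""

    def __init__(self):
        self._origins = {}

    def assign(self, atoms, origin):
        self._origins[tuple(atoms)] = origin

    def origin_of(self, atoms):
        """Answer 'where did this parameter come from?' for given atoms."""
        return self._origins[tuple(atoms)]

    def atoms_using(self, origin):
        """Reverse lookup: every atom tuple that received this parameter."""
        return sorted(a for a, o in self._origins.items() if o == origin)


applied = AppliedParameters()
h_on_n = ParameterOrigin("example.offxml", "vdW", "[#1:1]-[#7]")
applied.assign((3,), h_on_n)  # atom indices are made up for the example
applied.assign((4,), h_on_n)

source = applied.origin_of((3,))
users = applied.atoms_using(h_on_n)
```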

Get an MM energy quickly

  • without incurring a lot of OpenMM overhead

Clearer interface between force field and system

  • You can’t really get a discrete force field object from a given OpenMM system

Interoperability

  • Long term: interface to arbitrary engines through something InterMol-like, such as GMSO

    • ParmEd is great, but its Amber-centric design leads to some information loss

  • Currently believe we can do this later (so long as we don’t add in restrictions along the way)

“Plugin” architecture

  • Akin to how the toolkit’s ForceField class structures its handlers

  • Allows particular details to be implemented elsewhere and at a lower level, while at a high level a single call collects everything

  • Provides a reasonable structure for implementing future functional forms
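A small sketch of the plugin idea, assuming a plain registry of builder functions. The names HANDLERS, register, and create_terms, and the dictionary-based topology, are hypothetical and only meant to show the shape: details live in each handler, and one top-level call collects everything.

```python
from typing import Callable, Dict, List

# Registry of term builders, keyed by handler name (all names hypothetical).
HANDLERS: Dict[str, Callable[[dict], List[dict]]] = {}


def register(name):
    """Decorator that adds a term builder to the registry."""
    def wrap(func):
        HANDLERS[name] = func
        return func
    return wrap


@register("Bonds")
def build_bonds(topology):
    """Low-level detail: one term per bonded pair."""
    return [{"type": "bond", "atoms": pair} for pair in topology["bonds"]]


@register("vdW")
def build_vdw(topology):
    """Low-level detail: one term per atom."""
    return [{"type": "vdW", "atoms": (i,)} for i in range(topology["n_atoms"])]


def create_terms(topology):
    """Single high-level call: run every registered handler, collect results."""
    terms = []
    for name in sorted(HANDLERS):
        terms.extend(HANDLERS[name](topology))
    return terms


water = {"n_atoms": 3, "bonds": [(0, 1), (0, 2)]}
terms = create_terms(water)
```

A new functional form would then be one more decorated function, with no change to create_terms.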

Interfaces with other systems/APIs/domains/etc

  1. Quantum mechanics (QCArchive or in general): probably not a target, but some other glue may wish to interface to it as part of an optimization routine

  2. Optimization engines (ForceBalance/BayesBalance/etc.):

  3. Molecular mechanics engines (OpenMM/various file formats):

  4. Machine learning-ish libraries (JAX/TensorFlow/timemachine/etc.): needs to be able to talk to these agnostically; the important thing is that this object exposes a function-like object that something else can easily differentiate

  5. Should parameters be mutable (while being looked up)?

Existing decisions

Needs to contain

  • Force field parameters, tagged with units

  • Connectivity

    • Graph-like sufficient, or need bond orders?

May or may not contain

  • Atomic positions

  • Box vectors

Will not contain

  • Particular details of how a simulation will be executed (i.e. thermostat, timestep, etc.)
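The three lists above could be read as a dataclass with required and optional fields. The following is only a sketch under that assumption: SystemSketch, its field names, and the JSON round trip are hypothetical, and units are reduced to plain strings for brevity.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class SystemSketch:
    """Hypothetical container mirroring the lists above; not a settled API."""
    # Needs to contain: parameters as (value, unit) pairs, and connectivity.
    parameters: Dict[str, Tuple[float, str]]
    bonds: List[Tuple[int, int]]
    # May or may not contain: absent until coordinates are attached.
    positions: Optional[List[Tuple[float, float, float]]] = None
    box_vectors: Optional[List[Tuple[float, float, float]]] = None
    # Will not contain: thermostat, timestep, or other run settings.

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        data = json.loads(text)
        # JSON has no tuples, so restore them after loading.
        data["parameters"] = {
            key: tuple(value) for key, value in data["parameters"].items()
        }
        data["bonds"] = [tuple(bond) for bond in data["bonds"]]
        for key in ("positions", "box_vectors"):
            if data[key] is not None:
                data[key] = [tuple(row) for row in data[key]]
        return cls(**data)


system = SystemSketch(
    parameters={"k_CH": (680.0, "kcal/mol/angstrom**2")},  # illustrative value
    bonds=[(0, 1), (0, 2)],
)
restored = SystemSketch.from_json(system.to_json())
```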

Open questions

  1. Can we build an internal representation first, and come back later to do interoperability? Thinking yes

  2. Will the “y” that optimizers care about only ever be energies? This makes sense thermodynamically; thinking yes

  3. On top of parameters, SMIRKS, and functional forms, are there other sorts of inputs that should be considered, i.e. what would be “known unknowns” that we can separate out from “unknown unknowns”?

  4. Where in the current infrastructure does this fit (and not fit)?

  5. Should it contain the information needed for other libraries or should it contain everything we care about and just be good at transforming that data into what other libraries need?

  6. Should systems be combine-able? Probably yes (very valuable to users, but tricky to do technically)

  7. How much interoperability can we steal from existing code (mostly thinking InterMol, which is relatively mature in its scope)?

  8. If this uses pint and/or pydantic, should the toolkit be refactored to do the same?

 

Complications:

  • Interpolated torsions – dependent on two k values

    • How does system know which two parameters are involved?

    • Does system record partial bond order?

    • torsion smirks='a-b-c-d' k_bondorder1=1 kcal/mol k_bondorder2=3 kcal/mol ← interpolation based on partial bond order

      • JW: User should be able to modify k_bondorder1, k_bondorder2, and maybe partial bond order. But System should not even know integer bond order.

        • (General) – Will need to know “provenance” of each partial bond, so changes can be propagated

      • MT: All chemically identical things should have identical values

  • Should systems be combine-able?

    • What happens when two different parameters represent chemically identical things? This is bad

    • What happens if two FFs just so happen to describe them the same way? Does the system have 1 or 2 unique parameters?

    • JW: It’s hard to compare parameters, because their application to a particular molecule depends not just on THEIR data, but also on the OTHER parameters in the FF that DIDN’T apply to this particular molecule

  • parameters have meaning when they are put into a system

    • “Ridiculous idea”: prevent de-duplication by keeping track of a molecule’s SMILES and FF. Convoluted but possible? For identity checking, use some hash function of the canonical SMILES, the force field, and the individual parameter data (this would tell you whether they differ, but not how, and that’s probably fine)

      • This means that two bonds can be determined to be “identical” by comparing just their parent FF and the individual parameter they descend from in that FF, without knowing anything about the molecule

      • Angles, proper torsions, impropers follow the same rules as bonds (probably)

      • vdW assignment follows same rules as bonds (probably)

      • ELECTROSTATICS, HOWEVER…

        • Library charges follow same rules as bonds, but may get complicated given possible overlaps

        • charge_from_molecules…. ?

        • Geometry-based charge models (e.g. AM1-Mulliken)?

          • Could expose n_confs, rmsd cutoff, QM settings (grid res, convergence criteria, etc), QM basis, NONE of which are expected to be continuously differentiable

          • JW : We shouldn’t expose any of these in the system object, since they basically would all require re-running QM if changed, and they’re not expected to be differentiable.

        • Graph based charge models?

          • Could expose some featurization settings or model weights… But unclear if meaningfully differentiable

        • Charge increments?

          • Error correction terms applied on top of naive AM1-Mulliken charges (AM1 is the QM method; Mulliken is the electron population analysis).

          • Same rules as bonds (though, remember directionality!)

      • VirtualSites?

    • How does this account for completely unrelated changes in the FF? (like, having TIP4P water parameters ALSO loaded, but not applying the FF to any waters?)
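Two of the items above are concrete enough to sketch: interpolating a torsion k from a partial bond order, and fingerprinting a parameter by its parent force field plus its own data. Both functions are hypothetical. The interpolation is assumed to be linear between bond orders 1 and 2, and the fingerprint omits the canonical SMILES, following the observation that bonds can be compared without knowing the molecule.

```python
import hashlib
import json


def interpolate_k(k_bondorder1, k_bondorder2, bond_order):
    """Linear interpolation of k between bond orders 1 and 2 (assumed form)."""
    return k_bondorder1 + (k_bondorder2 - k_bondorder1) * (bond_order - 1.0)


def parameter_fingerprint(force_field_id, handler, smirks, values):
    """Hash of the parent force field plus one parameter's own data.

    Equal fingerprints mean "same parameter from the same force field".
    Unequal fingerprints say that something differs, but not what.
    """
    payload = json.dumps(
        {
            "ff": force_field_id,
            "handler": handler,
            "smirks": smirks,
            "values": values,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# With k_bondorder1 = 1 and k_bondorder2 = 3 (kcal/mol), a partial bond
# order of 1.5 lands halfway between the two.
k = interpolate_k(1.0, 3.0, 1.5)

# Identifiers and values below are made up for the example.
bond_values = {"k": 680.0, "length": 1.09}
a = parameter_fingerprint("ff-A", "Bonds", "[#6:1]-[#1:2]", bond_values)
b = parameter_fingerprint("ff-A", "Bonds", "[#6:1]-[#1:2]", bond_values)
c = parameter_fingerprint("ff-B", "Bonds", "[#6:1]-[#1:2]", bond_values)
```

Under this scheme, two force fields that happen to describe a bond with the same numbers still produce two distinct parameters (a differs from c), which is one possible answer to the "1 or 2 unique parameters" question rather than a settled one.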

Some input from Simon (04-13-20):

  • Would like the system object to contain an OpenFF Topology, and be built off of a revamped OpenFF Molecule class

  • JW points to some conflicts between the desire to have the object be mutable and how easy it is to get into weird, self-inconsistent states

    • Where do virtual sites go? Are they part of the molecule? If you take a water SMILES string (“O”) and use the molecule API to make it TIP5P, is it the same molecule?

  • “OpenMM view” vs. “GROMACS view”, i.e. is everything collapsed together (OpenMM), or can you make molecule-like objects and add them up (like an ITP file in GROMACS)?

  • Sub-/related question: what can be mutable?

  • DD: think about building a system up in stages (from molecules to topology to system); once you build something and add it, it’s locked down, i.e. once you construct a molecule and add it to a topology, you can’t modify the molecule.

    • OFFTK has a similar implementation with the “FrozenMolecule” class

    • a lot of API calls (in/out) should throw around copies or memory-efficient views

    • would be useful to scope out the things that we allow users to modify, and lock down the other things

      • FrozenMolecule is actually kinda mutable, but you need to go through internal methods (“buyer beware”)
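The "lock it down once it is added" idea might look roughly like the sketch below. MoleculeSketch and TopologySketch are hypothetical classes, not the toolkit's FrozenMolecule, and the freeze-in-place behavior is an assumption; storing a frozen copy instead would be an equally plausible design.

```python
class MoleculeSketch:
    """Hypothetical molecule that can be edited until it is frozen."""

    def __init__(self, name):
        self.name = name
        self.atoms = []
        self._frozen = False

    def add_atom(self, element):
        if self._frozen:
            raise RuntimeError(f"{self.name} is frozen and cannot be modified")
        self.atoms.append(element)

    def freeze(self):
        """Lock the public API; atoms become an immutable tuple."""
        self._frozen = True
        self.atoms = tuple(self.atoms)


class TopologySketch:
    """Hypothetical topology that locks molecules as they are added."""

    def __init__(self):
        self._molecules = []

    def add_molecule(self, molecule):
        molecule.freeze()
        self._molecules.append(molecule)

    @property
    def molecules(self):
        # Hand out a tuple so callers cannot reorder or append in place.
        return tuple(self._molecules)


water = MoleculeSketch("water")
for element in ("O", "H", "H"):
    water.add_atom(element)

topology = TopologySketch()
topology.add_molecule(water)

# After being added, the public API refuses further edits.
try:
    water.add_atom("H")
    locked = False
except RuntimeError:
    locked = True
```

As with FrozenMolecule, the lock here is only a convention: resetting the private _frozen flag would bypass it, which is the "buyer beware" case.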

Scattered references:

  • Genesis of the idea:

  • Differentiation things:

  • Units: pint

  • Interoperability stuff

    • InterMol:

    • ParmEd:

    • GMSO:

    • MolSSI role: TBD
