OpenFF System Object Specification Notes
High-level objective(s)
MT:
A flexible container for storing the data necessary to specify a simulation, with clean interfaces to other APIs that enable this data to be (1) restructured into an initialized simulation to be shipped off to engines, (2) mapped to a 1-D array view to interact with machine learning-adjacent libraries, (3) serialized to portable representations, and (4) inspected by the user.
From JC
Allow us to ultimately construct functions that expose a 1-D vector of unique, mutable, continuous parameters that we can use to compute energies and parameter gradients of the energy, for the convenience of optimizers and samplers.
From JW
Highest information-content we can make; a data structure that contains anything we want, but the value comes from converting it to what we want (to_jax, to_openmm)
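A minimal sketch of the "1-D vector of unique parameters" idea, using only the standard library. Everything here is an assumption for illustration: the dictionary layout, the names `to_vector`, `harmonic_bond_energy` and `finite_difference_gradient`, the harmonic functional form, and all numeric values. The finite-difference gradient is a stand-in for whatever autodiff library (to_jax, etc.) would actually be used.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical unique parameters: (parameter id, attribute) -> bare value.
# Made-up numbers; units (kcal/mol/A^2 and A) are left out of this sketch.
unique_params: Dict[Tuple[str, str], float] = {
    ("[#6:1]-[#1:2]", "k"): 680.0,
    ("[#6:1]-[#1:2]", "length"): 1.09,
}

# Fix an ordering once so the dict can be viewed as a 1-D vector.
keys: List[Tuple[str, str]] = sorted(unique_params)


def to_vector(params: Dict[Tuple[str, str], float]) -> List[float]:
    """Flatten the unique parameters into a 1-D list, ordered by `keys`."""
    return [params[key] for key in keys]


def harmonic_bond_energy(theta: List[float], distances: List[float]) -> float:
    """E = sum over bonds of 0.5 * k * (r - r0)^2, with theta = [k, r0]."""
    k, r0 = theta
    return sum(0.5 * k * (r - r0) ** 2 for r in distances)


def finite_difference_gradient(
    f: Callable[[List[float]], float], theta: List[float], h: float = 1e-6
) -> List[float]:
    """Central-difference dE/dtheta, one component per vector entry."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


distances = [1.10, 1.12]  # made-up C-H distances, in angstroms
theta = to_vector(unique_params)
energy = harmonic_bond_energy(theta, distances)
gradient = finite_difference_gradient(
    lambda t: harmonic_bond_energy(t, distances), theta
)
```

An optimizer or sampler would only ever see `theta`, `energy` and `gradient`; the mapping back to named parameters lives in `keys`.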
Diagram
http://docs.google.com/presentation/d/1P-DMu7hmecExRBXpRteOzef-9LJRjt-WQTS3Y3Yh8lU/edit#slide=id.p
Other desired features
Python class/package with a Python and Pythonic API
PEP8 compliant
Easily serializable to various formats
Units
OpenMM’s simtk.unit lacks the generality we want, doesn’t do
There are some options out there, but pint seems to be the best/most popular standard package
Wish to sometimes have just values exposed, other times values with their tagged units
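A rough sketch of "sometimes bare values, sometimes values with tagged units", written against the standard library so it runs anywhere. `TaggedValue` is a hypothetical stand-in, not pint itself; its `magnitude` attribute and `to` method are named to mirror what a pint Quantity offers. The epsilon value is made up, and the only unit conversion supported is kcal/mol to kJ/mol (1 kcal = 4.184 kJ).

```python
from dataclasses import dataclass

# Conversion factors to a reference energy unit (kJ/mol).
_TO_KJ_PER_MOL = {"kJ/mol": 1.0, "kcal/mol": 4.184}


@dataclass(frozen=True)
class TaggedValue:
    """Hypothetical unit-tagged value; a real implementation would use pint."""

    magnitude: float
    units: str

    def to(self, units: str) -> "TaggedValue":
        """Return a new value expressed in `units`."""
        factor = _TO_KJ_PER_MOL[self.units] / _TO_KJ_PER_MOL[units]
        return TaggedValue(self.magnitude * factor, units)


epsilon = TaggedValue(0.0157, "kcal/mol")  # made-up vdW well depth
bare_value = epsilon.magnitude             # just the number
converted = epsilon.to("kJ/mol")           # the number with its unit tag
```

The bare value is what a 1-D array view would consume; the tagged form is what a user inspecting the object would see.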
Tracking parameter origins in parameterized systems
Immediate, practical use is “where did this parameter come from, i.e. where in the XML”
forcefield.get_parameter_handler('vdW').parameters['[#1:1]-[#7]']
# or
forcefield['vdW']['[#1:1]-[#7]']
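One way the "where did this parameter come from" lookup could be supported is to store an origin record next to every assigned value. This is a sketch under assumptions: `ParameterOrigin`, `assigned` and `where_from` are hypothetical names, and the epsilon and rmin_half numbers are made up.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ParameterOrigin:
    """Hypothetical record of where an assigned value came from."""

    handler: str  # force field section name, e.g. "vdW"
    smirks: str   # the pattern that matched


# Hypothetical parameterized system: atom indices -> (values, origin).
assigned: Dict[Tuple[int, ...], Tuple[Dict[str, float], ParameterOrigin]] = {
    (3,): (
        {"epsilon": 0.0157, "rmin_half": 0.6},
        ParameterOrigin("vdW", "[#1:1]-[#7]"),
    ),
}


def where_from(atoms: Tuple[int, ...]) -> str:
    """Render the origin as a force field lookup expression."""
    _, origin = assigned[atoms]
    return f"forcefield['{origin.handler}']['{origin.smirks}']"


lookup = where_from((3,))
```

Here `lookup` is the string `forcefield['vdW']['[#1:1]-[#7]']`, i.e. the system can point back at the force field entry without holding the force field itself.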
Get an MM energy quickly
without needing to do a ton of OpenMM overhead
Clearer interface between force field and system
You can’t really get a discrete force field object from a given OpenMM system
Interoperability
Long term: interface to arbitrary engines through something InterMol-like, such as GMSO
ParmEd is great, but its Amber-centric-ness causes some information loss
Currently believe we can do this later (so long as we don’t add in restrictions along the way)
“Plugin” architecture
Akin to how the toolkit’s ForceField class structures its handlers
Allows for particular details to be implemented elsewhere and at a lower level, and at a high level it’s just one call that collects everything
Provides a reasonable structure for implementing future functional forms
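A minimal sketch of what such a plugin structure could look like: handlers register themselves by name, and one high-level call collects everything. The registry, the `register_handler` decorator, `build_system`, and the placeholder term strings are all hypothetical; this is not the toolkit's actual handler machinery.

```python
from typing import Callable, Dict, List

# Registry mapping a handler name to a function that produces its terms.
_HANDLERS: Dict[str, Callable[[List[str]], List[str]]] = {}


def register_handler(name: str):
    """Decorator that registers a handler under `name` (hypothetical API)."""

    def decorator(func):
        _HANDLERS[name] = func
        return func

    return decorator


@register_handler("Bonds")
def bond_terms(atoms: List[str]) -> List[str]:
    # Placeholder logic: one term per consecutive pair of atoms.
    return [f"bond({a},{b})" for a, b in zip(atoms, atoms[1:])]


@register_handler("vdW")
def vdw_terms(atoms: List[str]) -> List[str]:
    # Placeholder logic: one term per atom.
    return [f"vdw({a})" for a in atoms]


def build_system(atoms: List[str]) -> Dict[str, List[str]]:
    """The single high-level call: collect the output of every handler."""
    return {name: handler(atoms) for name, handler in _HANDLERS.items()}


system = build_system(["C", "H", "H"])
```

A future functional form would only need a new decorated function; `build_system` itself would not change.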
Interfaces with other systems/APIs/domains/etc
Quantum mechanics (QCArchive or in general): probably not a target, but some other glue may wish to interface to it as part of an optimization routine
Optimization engines (ForceBalance/BayesBalance/etc.):
Molecular mechanics engines (OpenMM/various file formats):
Machine learning-ish libraries (Jax/TensorFlow/timemachine/etc.): Needs to be able to talk to these agnostically; the important thing is that this object exposes a function-like object that something else can easily differentiate
Should parameters be mutable (while being looked up)?
Existing decisions
Needs to contain
Force field parameters, tagged with units
Connectivity
Graph-like sufficient, or need bond orders?
May or may not contain
Atomic positions
Box vectors
Will not contain
Particular details of how a simulation will be executed (i.e. thermostat, timestep, etc.)
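The contain / may contain / will not contain split above could be sketched as a dataclass with required and optional fields. `SystemSketch`, its field names, and the force constant value are assumptions for illustration only.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional, Tuple

Vector3 = Tuple[float, float, float]


@dataclass
class SystemSketch:
    """Hypothetical container mirroring the lists above."""

    # Needs to contain
    parameters: Dict[str, Tuple[float, str]]  # name -> (value, unit tag)
    bonds: List[Tuple[int, int]]              # graph-like connectivity
    # May or may not contain
    positions: Optional[List[Vector3]] = None
    box_vectors: Optional[Tuple[Vector3, Vector3, Vector3]] = None
    # Deliberately no thermostat, timestep, or other run settings.


water = SystemSketch(
    parameters={"k_OH": (1000.0, "kcal/mol/angstrom**2")},  # made-up value
    bonds=[(0, 1), (0, 2)],
)
field_names = [f.name for f in fields(water)]
```

Whether `bonds` needs to carry bond orders is left open here, as in the notes.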
Existing decisions
Can we build an internal representation first, and come back later to do interoperability? Thinking yes
Will the “y” that optimizers care about only ever be energies? This makes sense thermodynamically; thinking yes
On top of parameters, SMIRKS, and functional forms, what other sorts of inputs should be considered, i.e. what would be “known unknowns” that we can separate out from “unknown unknowns”?
Where in the current infrastructure does this fit (and not fit)?
Should it contain the information needed for other libraries or should it contain everything we care about and just be good at transforming that data into what other libraries need?
Should systems be combine-able? Probably yes (very valuable to users, but tricky to do technically)
How much interoperability can we steal from existing code (mostly thinking InterMol, which is relatively mature in its scope)?
If this uses pint and/or pydantic, should the toolkit be refactored to do the same?
Complications:
Interpolated torsions – dependent on two k values
How does system know which two parameters are involved?
Does system record partial bond order?
torsion smirks='a-b-c-d' k_bondorder1=1 kcal/mol k_bondorder2=3 kcal/mol
← Interpolation based on partial bond order
JW: User should be able to modify k_bondorder1, k_bondorder2, and maybe partial bond order. But System should not even know integer bond order.
(General) – Will need to know “provenance” of each partial bond, so changes can be propagated
MT: All chemically identical things should have identical values
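A worked sketch of the interpolated torsion case, assuming the interpolation is linear between bond orders 1 and 2 (the notes only say "interpolation based on partial bond order", so linearity is an assumption). The function names are hypothetical. The second function shows why the system has to know which two parameters are involved: the derived k depends on both, with weights set by the partial bond order.

```python
from typing import Dict


def interpolate_k(k_bondorder1: float, k_bondorder2: float, bond_order: float) -> float:
    """Linearly interpolate k between bond orders 1 and 2 (assumed form)."""
    fraction = (bond_order - 1.0) / (2.0 - 1.0)
    return k_bondorder1 + (k_bondorder2 - k_bondorder1) * fraction


def k_sensitivities(bond_order: float) -> Dict[str, float]:
    """dk/dk_bondorder1 and dk/dk_bondorder2 for the linear form above."""
    return {
        "k_bondorder1": 2.0 - bond_order,
        "k_bondorder2": bond_order - 1.0,
    }


# Values from the example above (kcal/mol), with a made-up partial bond order.
k = interpolate_k(1.0, 3.0, 1.5)
weights = k_sensitivities(1.5)
```

With a partial bond order of 1.5, `k` is 2.0 kcal/mol and each of the two parameters contributes with weight 0.5, so a change to either must propagate to this torsion.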
Should systems be combine-able?
What happens when two different parameters represent chemically identical things? This is bad
What happens if two FFs just so happen to describe them the same way? Does the system have 1 or 2 unique parameters?
JW: It’s hard to compare parameters, because their application to a particular molecule depends not just on THEIR data, but also on the OTHER parameters in the FF that DIDN’T apply to this particular molecule
parameters have meaning when they are put into a system
“Ridiculous idea”: prevent de-duplication by keeping track of a molecule’s SMILES and FF; convoluted but possible? For identity checking, use some hash function of the CANONICAL SMILES, the FORCE FIELD, and the INDIVIDUAL PARAMETER DATA (this would tell you whether they differ, but not how, and that’s probably fine)
This means that two bonds can be determined to be “identical” by comparing just their parent FF and the individual parameter they descend from in that FF, without knowing anything about the molecule
Angles, proper torsions, impropers follow the same rules as bonds (probably)
vdW assignment follows same rules as bonds (probably)
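A sketch of the hash-based identity check for bond-like terms, following the simpler variant where only the parent FF and the individual parameter are hashed (a canonical SMILES could be added to the payload in the same way). `parameter_identity`, the force field file names, and the parameter values are hypothetical.

```python
import hashlib
import json
from typing import Dict


def parameter_identity(
    force_field: str, handler: str, smirks: str, values: Dict[str, float]
) -> str:
    """Hash of the parent FF plus the individual parameter data."""
    payload = json.dumps(
        {"ff": force_field, "handler": handler, "smirks": smirks, "values": values},
        sort_keys=True,  # make the hash independent of dict ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


values = {"k": 680.0, "length": 1.09}  # made-up bond parameter

# Two bonds in different molecules, same FF and same parameter.
bond_a = parameter_identity("ff_a.offxml", "Bonds", "[#6:1]-[#1:2]", values)
bond_b = parameter_identity("ff_a.offxml", "Bonds", "[#6:1]-[#1:2]", values)

# Same numbers, but descended from a different force field.
bond_c = parameter_identity("ff_b.offxml", "Bonds", "[#6:1]-[#1:2]", values)
```

`bond_a` and `bond_b` compare equal without knowing anything about the molecules, while `bond_c` stays distinct even though its values coincide, so a combined system would keep two unique parameters in that case. As noted above, the hash says whether two parameters differ, not how.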
ELECTROSTATICS, HOWEVER…
Library charges follow same rules as bonds, but may get complicated given possible overlaps
charge_from_molecules…. ?
geometry-based charge models (eg. AM1-Mulliken)?
Could expose n_confs, rmsd cutoff, QM settings (grid res, convergence criteria, etc), QM basis, NONE of which are expected to be continuously differentiable
JW : We shouldn’t expose any of these in the system object, since they basically would all require re-running QM if changed, and they’re not expected to be differentiable.
Graph based charge models?
Could expose some featurization settings or model weights… But unclear if meaningfully differentiable
Charge increments?
Error correction terms applied on top of naive AM1 (QM method)-Mulliken (e- population analysis) charges.
Same rules as bonds (though, remember directionality!)
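A small illustration of why directionality matters for charge increments, assuming an increment adds to the first tagged atom and subtracts the same amount from the second so total charge is conserved. `apply_charge_increment` and all numbers are hypothetical.

```python
from typing import List


def apply_charge_increment(
    charges: List[float], atom1: int, atom2: int, increment: float
) -> List[float]:
    """Move `increment` of charge onto atom1 and off atom2 (assumed convention)."""
    updated = list(charges)
    updated[atom1] += increment
    updated[atom2] -= increment
    return updated


base = [-0.8, 0.4, 0.4]  # made-up starting charges for a three-atom molecule
forward = apply_charge_increment(base, 0, 1, 0.05)
reverse = apply_charge_increment(base, 1, 0, 0.05)
```

Swapping the atom order flips the sign of the correction on each atom, so the provenance of a charge increment has to record which atom was tagged first, not just which bond it applies to.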
VirtualSites?
How does this account for completely unrelated changes in the FF? (like, having TIP4P water parameters ALSO loaded, but not applying the FF to any waters?)
Some input from Simon (04-13-20):
Would like the system object to contain an OpenFF Topology, and be built off of a revamped OpenFF Molecule class
JW points to some conflicts between the desire to have the object be mutable and how easy it is to get into weird, self-inconsistent states
Where do virtual sites go? Are they part of the molecule? If you take a water SMILES string (“O”) and use the molecule API to make it TIP5P, is it the same molecule?
“OpenMM view” vs. “GROMACS view”, i.e. is everything collapsed together (OpenMM) or can you make a molecule-like object and add it up (like an ITP file in GROMACS)?
Sub-/related question: what can be mutable?
DD: think about building a system up (build up from molecules to topology to system) and once you build up something and add it, it’s locked down, i.e. once you construct a molecule and add it to a topology, you can’t modify the molecule.
OFFTK has a similar implementation with the “FrozenMolecule” class
a lot of API calls (in/out) should throw around copies or memory-efficient views
would be useful to scope out the things that we allow users to modify, and lock down the other things
FrozenMolecule is actually kinda mutable, but need to access internal methods (“buyer beware”)
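A sketch of the "locked down once added" idea combined with the "throw around copies" point: the topology stores a frozen copy, and its accessor hands out an immutable tuple. `MoleculeSketch` and `TopologySketch` are hypothetical classes, not the toolkit's Molecule, FrozenMolecule, or Topology. As with FrozenMolecule, the lock here is a convention: flipping the private `_frozen` flag would bypass it ("buyer beware").

```python
import copy
from typing import List, Tuple


class MoleculeSketch:
    """Hypothetical molecule that can be locked against modification."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.atoms: List[str] = []
        self._frozen = False

    def add_atom(self, element: str) -> None:
        if self._frozen:
            raise RuntimeError(f"{self.name} is frozen and cannot be modified")
        self.atoms.append(element)


class TopologySketch:
    """Hypothetical topology that freezes molecules as they are added."""

    def __init__(self) -> None:
        self._molecules: List[MoleculeSketch] = []

    def add_molecule(self, molecule: MoleculeSketch) -> None:
        stored = copy.deepcopy(molecule)  # the caller keeps their own object
        stored._frozen = True
        self._molecules.append(stored)

    @property
    def molecules(self) -> Tuple[MoleculeSketch, ...]:
        return tuple(self._molecules)  # callers cannot append or remove


water = MoleculeSketch("water")
for element in ("O", "H", "H"):
    water.add_atom(element)

topology = TopologySketch()
topology.add_molecule(water)

try:
    topology.molecules[0].add_atom("EP")
    locked = False
except RuntimeError:
    locked = True
```

The design choice in this sketch is that the user's original `water` stays editable, but later edits to it do not leak into the topology, which avoids one route into self-inconsistent states.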
Scattered references:
Genesis of the idea:
Differentiation things:
timemachine and associated paper
Units: pint
Interoperability stuff
InterMol:
ParmEd:
GMSO:
MolSSI role: TBD