High-level objective(s)
MT:
A flexible container for storing the data necessary to specify a simulation, with clean interfaces to other APIs that enable this data to be (1) restructured into an initialized simulation to be shipped off to engines, (2) mapped to a 1-D array view to interact with machine learning-adjacent libraries, (3) serialized to portable representations, and (4) inspected by the user
...
The highest-information-content object we can make: a data structure that can contain anything we want, whose value comes from converting it to what we need (to_jax, to_openmm)
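A minimal sketch of objectives (2) and (3) above, using only the standard library. The names Parameter, SystemSketch, to_flat, with_flat, to_json and from_json are hypothetical and chosen for illustration; none of them is an existing OpenFF API, and converters such as to_openmm or to_jax are deliberately left out.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Parameter:
    """One force field parameter: a numeric value tagged with a unit string."""
    name: str
    value: float
    unit: str


@dataclass
class SystemSketch:
    """Hypothetical container (illustration only, not an existing class)."""
    parameters: list = field(default_factory=list)

    def to_flat(self):
        """Objective (2): a 1-D view of the numeric values, in a fixed order."""
        return [p.value for p in self.parameters]

    def with_flat(self, values):
        """Build a new container from a 1-D array, keeping names and units."""
        if len(values) != len(self.parameters):
            raise ValueError("expected %d values" % len(self.parameters))
        return SystemSketch(
            [Parameter(p.name, float(v), p.unit)
             for p, v in zip(self.parameters, values)]
        )

    def to_json(self):
        """Objective (3): a portable text representation."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Inverse of to_json."""
        data = json.loads(text)
        return cls([Parameter(**p) for p in data["parameters"]])


system = SystemSketch([
    Parameter("bond_k", 500.0, "kcal/mol/angstrom**2"),
    Parameter("bond_length", 1.09, "angstrom"),
])
flat = system.to_flat()                          # hand this to an optimizer
updated = system.with_flat([450.0, 1.10])        # and map the result back
restored = SystemSketch.from_json(system.to_json())
```

Returning a new object from with_flat rather than mutating in place is one possible answer to the mutability questions raised further down; it is a design choice of this sketch, not a decision recorded in these notes.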
Diagram
http://docs.google.com/presentation/d/1P-DMu7hmecExRBXpRteOzef-9LJRjt-WQTS3Y3Yh8lU/edit#slide=id.p
Other desired features
Python class/package with a Python and Pythonic API
...
Akin to how the toolkit’s ForceField class structures its handlers
Allows for particular details to be implemented elsewhere and at a lower level, and at a high level it’s just one call that collects everything
Provides a reasonable structure for implementing future functional forms
Interfaces with other systems/APIs/domains/etc
Quantum mechanics (QCArchive or in general): probably not a target, but some other glue may wish to interface to it as part of an optimization routine
Optimization engines (ForceBalance/BayesBalance/etc.):
Molecular mechanics engines (OpenMM/various file formats):
Machine learning-ish libraries (Jax/TensorFlow/timemachine/etc.): Needs to be able to talk to these agnostically; the important thing is that this object exposes a function-like object that something else can easily differentiate over
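One way to picture the "function-like object" is a closure that maps a flat parameter list to an energy. The sketch below assumes a harmonic bond energy purely for illustration, and uses a central finite difference as a stand-in for whatever differentiation an external library (Jax, TensorFlow, timemachine) would perform; make_energy_function and finite_difference_gradient are hypothetical names.

```python
def make_energy_function(distances):
    """Return energy(params) for harmonic bonds at fixed distances.

    params is a flat list [k, r0]; the harmonic form is an assumption
    made for this sketch only.
    """
    def energy(params):
        k, r0 = params
        return sum(0.5 * k * (r - r0) ** 2 for r in distances)
    return energy


def finite_difference_gradient(func, params, step=1e-6):
    """Central-difference gradient of func with respect to each parameter.

    A stand-in for automatic differentiation by an external library.
    """
    gradient = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += step
        down[i] -= step
        gradient.append((func(up) - func(down)) / (2.0 * step))
    return gradient


energy = make_energy_function([1.0, 1.2])
params = [100.0, 1.1]            # [k, r0]
value = energy(params)
gradient = finite_difference_gradient(energy, params)
```

The caller only sees a plain callable over a 1-D parameter list, so the differentiating library does not need to know anything about molecules or force fields.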
Should parameters be mutable (while being looked up)?
Existing decisions
Needs to contain
Force field parameters, tagged with units
Connectivity
Graph-like sufficient, or need bond orders?
...
Particular details of how a simulation will be executed (i.e. thermostat, timestep, etc.)
Existing decisions
Can we build an internal representation first, and come back later to do interoperability? Thinking yes
Will the “y” that optimizers care about only ever be energies? This makes sense thermodynamically; thinking yes
On top of parameters, SMIRKS, and functional forms, what other sorts of inputs should be considered, i.e. what would be “known unknowns” that we can separate out from “unknown unknowns”?
Where in the current infrastructure does this fit (and not fit)?
Should it contain the information needed for other libraries or should it contain everything we care about and just be good at transforming that data into what other libraries need?
Should systems be combine-able? Probably yes (very valuable to users, but tricky to do technically)
How much interoperability can we steal from existing code (mostly thinking InterMol, which is relatively mature in its scope)?
If this uses pint and/or pydantic, should the toolkit be refactored to do the same?
...
Interpolated torsions – dependent on two k values
How does system know which two parameters are involved?
Does system record partial bond order?
torsion smirks='a-b-c-d' k_bondorder1=1 kcal/mol k_bondorder2=3 kcal/mol
← Interpolation based on partial bond order
JW: User should be able to modify k_bondorder1, k_bondorder2, and maybe partial bond order. But System should not even know integer bond order.
(General) – Will need to know “provenance” of each partial bond, so changes can be propagated
MT: All chemically identical things should have identical values
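For the interpolated torsion example above, a minimal sketch assuming linear interpolation between bond orders 1 and 2 (the notes do not fix the interpolation scheme, so this is an assumption). Writing the result as a weighted sum makes explicit which two parameters the applied k depends on, and by how much. The function names are hypothetical.

```python
def interpolation_weights(partial_bond_order):
    """Weights (w1, w2) such that k = w1 * k_bondorder1 + w2 * k_bondorder2.

    Assumes linear interpolation anchored at bond orders 1 and 2; values
    outside that range extrapolate along the same line.
    """
    w2 = partial_bond_order - 1.0
    w1 = 1.0 - w2
    return w1, w2


def interpolate_k(k_bondorder1, k_bondorder2, partial_bond_order):
    """Torsion k for a given partial bond order."""
    w1, w2 = interpolation_weights(partial_bond_order)
    return w1 * k_bondorder1 + w2 * k_bondorder2


# Values from the example: k_bondorder1 = 1 kcal/mol, k_bondorder2 = 3 kcal/mol
k_mid = interpolate_k(1.0, 3.0, 1.5)
```

The weights are also the derivatives of the applied k with respect to k_bondorder1 and k_bondorder2, which is the information needed to propagate a change in either parameter back through the system.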
Should systems be combine-able?
what happens when two different parameters represent chemically identical things? this is bad
what happens if two FFs just so happen to describe them the same way? does the system have 1 or 2 unique parameters?
JW: It’s hard to compare parameters, because their application to a particular molecule depends not just on THEIR data, but also on the OTHER parameters in the FF that DIDN’T apply to this particular molecule
parameters have meaning when they are put into a system
“ridiculous idea”: prevent de-duplication by keeping track of a molecule’s SMILES and FF - convoluted but possible? For identity checking, some hash function of
CANONICAL SMILES and the FORCE FIELD, and INDIVIDUAL PARAMETER DATA (would tell you if they differ, but not how, and that’s probably fine)
This means that two bonds can be determined to be “identical” by comparing just their parent FF and the individual parameter they descend from in that FF, without knowing anything about the molecule
Angles, proper torsions, impropers follow the same rules as bonds (probably)
vdW assignment follows same rules as bonds (probably)
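The hash-based identity check proposed above could look roughly like the following. It hashes only the parent force field identifier and the individual parameter data, matching the observation that the molecule itself is not needed. The function name, the force field identifiers, and the layout of the parameter dictionary are all hypothetical.

```python
import hashlib
import json


def parameter_identity(force_field_id, parameter_data):
    """Return a hex digest identifying a parameter by its parent FF and its data.

    Equal digests mean "same parent FF and same parameter data"; differing
    digests say that something differs, but not what.
    """
    payload = json.dumps(
        {"force_field": force_field_id, "parameter": parameter_data},
        sort_keys=True,  # make the digest independent of key order
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


bond = {"smirks": "[#6:1]-[#6:2]", "k": 500.0, "length": 1.5}
same_bond = {"length": 1.5, "k": 500.0, "smirks": "[#6:1]-[#6:2]"}
other_bond = {"smirks": "[#6:1]-[#6:2]", "k": 510.0, "length": 1.5}

id_a = parameter_identity("ff-A", bond)
id_a_again = parameter_identity("ff-A", same_bond)
id_a_changed = parameter_identity("ff-A", other_bond)
id_b = parameter_identity("ff-B", bond)
```

Under this scheme two force fields that happen to describe a bond identically still yield two distinct parameters, which is one possible answer to the "1 or 2 unique parameters" question, not a settled one.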
ELECTROSTATICS, HOWEVER…
Library charges follow same rules as bonds, but may get complicated given possible overlaps
charge_from_molecules…. ?
geometry-based charge models (e.g. AM1-Mulliken)?
Could expose n_confs, rmsd cutoff, QM settings (grid res, convergence criteria, etc), QM basis, NONE of which are expected to be continuously differentiable
JW: We shouldn’t expose any of these in the system object, since they basically would all require re-running QM if changed, and they’re not expected to be differentiable.
Graph based charge models?
Could expose some featurization settings or model weights… But unclear if meaningfully differentiable
Charge increments?
Error correction terms applied on top of naive AM1 (QM method)-Mulliken (e- population analysis) charges.
Same rules as bonds (though, remember directionality!)
VirtualSites?
How does this account for completely unrelated changes in the FF? (like, having TIP4P water parameters ALSO loaded, but not applying the FF to any waters?)
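On the directionality remark for charge increments above: a small sketch, assuming the convention that an increment is added to the first matched atom and subtracted from the second so that total charge is conserved. The sign convention and the function name are assumptions made here for illustration, not something these notes specify.

```python
def apply_charge_increment(charges, first_atom, second_atom, increment):
    """Return new charges with `increment` added to first_atom and
    subtracted from second_atom (assumed convention).

    Swapping the two atom indices flips the sign of the correction,
    which is why the match order has to be recorded.
    """
    updated = list(charges)
    updated[first_atom] += increment
    updated[second_atom] -= increment
    return updated


base = [0.10, -0.10, 0.00]       # e.g. naive charges before correction
forward = apply_charge_increment(base, 0, 1, 0.05)
reverse = apply_charge_increment(base, 1, 0, 0.05)
```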
Some input from Simon (04-13-20):
Would like the system object to contain an Open FF topology, and be built off of a revamped OpenFF Molecule class
JW points to some conflicts between the desire to have the object be mutable and how easy it is to get into weird, self-inconsistent states
Where do virtual sites go? Are they part of the molecule? If you take a water smiles string (“O”) and use the molecule API to make it TIP5P, is it the same molecule?
“OpenMM view” vs. “GROMACS view”, i.e. is everything collapsed together (OpenMM) or can you make a molecule-like object and add it up (like an ITP file in GROMACS)
Sub-/related question: what can be mutable?
DD: think about building a system up (build up from molecules to topology to system) and once you build up something and add it, it’s locked down, i.e. once you construct a molecule and add it to a topology, you can’t modify the molecule.
OFFTK has a similar implementation with the “FrozenMolecule” class
a lot of API calls (in/out) should throw around copies or memory-efficient views
would be useful to scope out the things that we allow users to modify, and lock down the other things
FrozenMolecule is actually kinda mutable, but need to access internal methods (“buyer beware”)
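The "build up, then lock down" idea can be sketched as follows. MoleculeSketch and TopologySketch are hypothetical classes written for this illustration; they are not the toolkit's Molecule, FrozenMolecule, or Topology. Adding a molecule stores a frozen copy, so later edits to the caller's object do not leak into the topology.

```python
import copy


class MoleculeSketch:
    """Hypothetical molecule that can be edited until it is frozen."""

    def __init__(self, name, atoms):
        self.name = name
        self.atoms = list(atoms)
        self._frozen = False

    def add_atom(self, element):
        if self._frozen:
            raise RuntimeError("molecule is frozen and cannot be modified")
        self.atoms.append(element)

    def frozen_copy(self):
        """Return an independent, locked-down copy of this molecule."""
        clone = copy.deepcopy(self)
        clone.atoms = tuple(clone.atoms)
        clone._frozen = True
        return clone


class TopologySketch:
    """Hypothetical topology that only stores frozen copies."""

    def __init__(self):
        self._molecules = []

    def add_molecule(self, molecule):
        self._molecules.append(molecule.frozen_copy())

    @property
    def molecules(self):
        # Hand out a tuple so callers cannot reorder or replace entries.
        return tuple(self._molecules)


topology = TopologySketch()
water = MoleculeSketch("water", ["O", "H", "H"])
topology.add_molecule(water)
water.add_atom("X")              # the caller's object is still editable
stored = topology.molecules[0]   # the stored copy is unaffected
```

As with the "buyer beware" note above, nothing stops a determined user from flipping the private _frozen flag; the lock is a guard against accidental inconsistency rather than a hard guarantee.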
Scattered references:
Genesis of the idea:
https://github.com/openforcefield/openforcefield/issues/310
Differentiation things:
timemachine and associated paper
Units: pint
Interoperability stuff
InterMol:
https://github.com/shirtsgroup/InterMol
ParmEd:
https://github.com/ParmEd/ParmEd
GMSO:
https://github.com/mosdef-hub/gmso
MolSSI role: TBD
...