OpenFF System Object Specification

Welcoming feedback, specific or along the lines of

  • How does this align with our (internal) objectives?

  • What should be described more thoroughly/in more detail?

  • Does anything specified here point down an unfruitful or headache-filled path?

  • Searching for "?" should point you to the clear decision points I need feedback on

Aims

The aim of the OpenFF System is to enable the use of the SMIRNOFF specification in molecular simulation engines with minimal reliance on external converters and third-party libraries. Researchers will be able to apply force fields developed by the Open Force Field Initiative in their simulation workflows, with a few lines of Python code or through a CLI, by building parametrized systems that can be sent to molecular simulation engines.

The internal representation is designed to enable evaluation of the potential energy of a configuration of atoms as described by the SMIRNOFF specification. No single engine is designated as the target for these calculations, which avoids the limitations that such an assumption would impose. Much of the internals of the System class will be built on existing infrastructure: in particular, the Open Force Field Toolkit already has a mature ForceField class that manages force field parameters and a Topology class that describes the cheminformatics molecular topology. These components will be heavily inspectable by the user; for example, the source of each force field parameter, tagged with units, will be accessible from within the parametrized system.

Simple Usage

This snippet demonstrates how a SMIRNOFF force field and molecular topology (ForceField and Topology classes in the OpenFF Toolkit, respectively) can be used to populate an OpenFF System.

    from system import System
    from openforcefield.topology import Molecule, Topology
    from openforcefield.typing.engines.smirnoff import ForceField

    # Load Parsley and populate a dummy topology (ignoring positions for the moment)
    openff_forcefield = ForceField('openff-1.1.0.offxml')
    openff_topology = Topology.from_molecules(10 * [Molecule.from_smiles('CCO')])

    # Construct an OpenFF System with the force field and topology
    openff_system = System(openff_topology, openff_forcefield)

    # Write the system out; the file extension selects the format
    openff_system.to_file('ethanol.top')
    openff_system.to_file('ethanol.gro')

(I know this API differs from ForceField.create_openmm_system.doc(…), but it seems intractable to me for the toolkit to depend on the system object in the same way that it depends on OpenMM, since the system will likely contain, or be constructed from, the toolkit’s topology and force field. Unfortunately, this may mean that the actual SMARTS-based parametrization needs to be duplicated internally here. I would like to avoid a dependency loop in which the two packages depend on each other.)

Features

Things stored

  • Some graph-like representation of the topology

  • Complete description of how to compute the potential energy (stored as a ForceField object?)

  • Atomic positions

  • Box vectors

  • Element information

  • Other tags for metadata and provenance

Units

All parameters representing physical quantities will be tagged with units using pint, a popular Python library for managing units. Conversions to other unit systems (simtk.unit, implicit units, etc.) will take place at the appropriate interfaces.

Supported Simulation Engines

Highest priority

  • AMBER

  • GROMACS

  • OpenMM

  • CHARMM

Other options (pending user feedback and feasibility studies)

  • LAMMPS

  • TINKER, TINKER+, HIPPO

  • Desmond

  • NAMD

Wishlist

  • Monte Carlo engines

Engine compatibility checks will be run internally when exporting to a particular format or object. In general, the export will raise an error if the engine cannot faithfully implement the internal representation of the parametrized system (although there will be carefully written wiggle room for some things, like cutoffs). Appropriate, if often verbose, warnings will be shown to the user when approximations and guesses are made.
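The error-versus-warning behavior described above could be sketched roughly as follows. The engine names, the capability table, and the check_export function are all hypothetical placeholders; they do not describe what any real engine supports.

```python
import warnings


class UnsupportedExportError(Exception):
    """Raised when an engine cannot represent the system faithfully."""


# Hypothetical capability table (assumption for illustration only);
# real entries would have to be vetted engine by engine.
ENGINE_SUPPORT = {
    "engine_a": {"forms": {"lennard-jones", "harmonic-bond"}, "switching": True},
    "engine_b": {"forms": {"lennard-jones", "buckingham", "harmonic-bond"}, "switching": False},
}


def check_export(engine, forms_used, uses_switching):
    """Raise if the export would be unfaithful; warn on approximations.

    Returns the list of warning messages that were issued.
    """
    support = ENGINE_SUPPORT[engine]

    # Hard failure: a functional form the engine cannot represent at all.
    missing = sorted(set(forms_used) - support["forms"])
    if missing:
        raise UnsupportedExportError(f"{engine} cannot represent: {', '.join(missing)}")

    # Soft failure ("wiggle room"): an approximation the user must be told about.
    notes = []
    if uses_switching and not support["switching"]:
        notes.append(f"{engine}: switching function dropped; cutoff treatment is approximate")
    for note in notes:
        warnings.warn(note)
    return notes
```

Keeping the hard failures and the tolerated approximations in separate code paths makes it easy to audit exactly where the "wiggle room" lives.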

Potential object lookups

Individual forces will be represented by a subclass or construct of a base Potential object. It will be able to parse functional forms as mathematical expressions and store parameters explicitly tagged with units. Optional metadata-like fields may be used to do things like track the origin of a parameter or the SMIRKS pattern it describes.
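A minimal sketch of what such a Potential object might look like, using only the standard library. The class layout, the method names, and the numerical values are assumptions for illustration; units are carried as plain strings here, whereas the real object would tag them with pint.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class Potential:
    """Hypothetical container for a single force term."""

    expression: str  # functional form written as a Python-style expression
    parameters: dict  # parameter name -> (value, unit string)
    metadata: dict = field(default_factory=dict)  # e.g. SMIRKS pattern, source file

    def variables(self):
        """Names in the expression that are not stored parameters."""
        tree = ast.parse(self.expression, mode="eval")
        names = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
        return names - set(self.parameters)

    def evaluate(self, **variables):
        """Evaluate the expression on bare magnitudes (units are NOT checked here)."""
        values = {name: value for name, (value, _unit) in self.parameters.items()}
        values.update(variables)
        return eval(self.expression, {"__builtins__": {}}, values)


# Illustrative values only; not taken from any real force field.
bond = Potential(
    expression="0.5 * k * (r - length) ** 2",
    parameters={"k": (600.0, "kcal/mol/angstrom**2"), "length": (1.5, "angstrom")},
    metadata={"smirks": "[#6X4:1]-[#6X4:2]", "source": "example.offxml"},
)

print(bond.variables())       # the free variable(s) of the functional form
print(bond.evaluate(r=1.6))   # energy at r = 1.6, in the units implied above
```

Parsing the expression (rather than hard-coding each form) is what lets the same object hold a harmonic bond, a Buckingham term, or anything else expressible analytically.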

Supported functional forms

In principle, any functional form that can be represented by an analytical expression or tabulated data can be stored internally and converted at the relevant interfaces. The following functional forms are used most commonly in the field and will be supported as a higher priority:

Non-bonded potentials

  • Lennard-Jones

  • “Lennard-Jones-like” (e.g. 14-7)

  • Buckingham (Exp-6)

  • Mie

Electrostatics

  • Partial charges

    • (optional) formal charges

Valence potentials

  • Harmonic bonds

  • Harmonic angles

  • Proper torsions

  • Improper torsions

Exceptions

  • Explicitly track non-bonded exceptions (e.g. scaled 1-4 interactions) for all particle pairs
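As a rough illustration of explicit exception tracking, the sketch below enumerates 1-4 pairs from a bond list and attaches scale factors to each pair. The function names and the scale factors passed in are placeholders, not values prescribed by this document.

```python
from itertools import combinations


def one_four_pairs(bonds):
    """Return the set of (i, l) pairs separated by exactly three bonds."""
    neighbors = {}
    for i, j in bonds:
        neighbors.setdefault(i, set()).add(j)
        neighbors.setdefault(j, set()).add(i)

    # Pairs that are 1-2 (bonded) or 1-3 (share a neighbor) are not 1-4,
    # even if a three-bond path between them also exists (small rings).
    bonded = {tuple(sorted(bond)) for bond in bonds}
    angles = {
        tuple(sorted(pair))
        for nbrs in neighbors.values()
        for pair in combinations(nbrs, 2)
    }

    pairs = set()
    for j, k in bonds:  # treat each bond as the central bond of i-j-k-l
        for i in neighbors[j] - {k}:
            for l in neighbors[k] - {j}:
                if i != l:
                    pairs.add(tuple(sorted((i, l))))
    return pairs - bonded - angles


def build_exceptions(bonds, vdw_scale, coulomb_scale):
    """Map every 1-4 pair to its (illustrative) scale factors."""
    return {
        pair: {"vdw_scale": vdw_scale, "coulomb_scale": coulomb_scale}
        for pair in sorted(one_four_pairs(bonds))
    }


# Heavy-atom skeleton of butane: 0-1-2-3
print(build_exceptions([(0, 1), (1, 2), (2, 3)], vdw_scale=0.5, coulomb_scale=0.8))
```

Storing the result per pair, rather than only the global scale factors, is what "explicitly track" would mean in practice; the trade-off is memory in large systems.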

Constraints

Combining rules

  • Store instructions (e.g. just a "Lorentz-Berthelot" string) or explicitly calculate i-j cross-interactions?
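The two options can be contrasted in a few lines. Under the Lorentz-Berthelot rule, sigma_ij is the arithmetic mean of sigma_i and sigma_j, and epsilon_ij is the geometric mean of epsilon_i and epsilon_j. The registry and function names below are hypothetical, and the parameter values are illustrative.

```python
import math
from itertools import combinations


def lorentz_berthelot(sigma_i, epsilon_i, sigma_j, epsilon_j):
    """Arithmetic mean of sigma, geometric mean of epsilon."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(epsilon_i * epsilon_j)


# "Store instructions": keep only the rule name and resolve it when needed.
COMBINING_RULES = {"Lorentz-Berthelot": lorentz_berthelot}


def cross_terms(atom_types, rule):
    """'Store data': expand the rule into an explicit table of i-j parameters."""
    mix = COMBINING_RULES[rule]
    return {
        (a, b): mix(*atom_types[a], *atom_types[b])
        for a, b in combinations(sorted(atom_types), 2)
    }


# Illustrative (sigma, epsilon) pairs; units would be tagged in the real object.
atom_types = {"A": (3.0, 0.1), "B": (4.0, 0.4)}
print(cross_terms(atom_types, "Lorentz-Berthelot"))
```

The explicit table grows quadratically with the number of atom types, which is one argument for storing the instruction and expanding it only at export time.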

The following may not be implemented immediately, depending on updates to the SMIRNOFF specification and how particular engines treat them, but are on the roadmap in some capacity:

GBSA

Cross-valence terms

  • CMAPs

  • Urey-Bradleys

Virtual Sites

Polarizability, Dipoles, Multipoles

Serialization

Lossless serialization is provided through exporting a system object to Python dictionaries, from which JSON, messagepack, and other serialization formats are available.
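A sketch of the dictionary round trip, using the standard-library json module. The dictionary layout is hypothetical; the point is only that a plain-dict export can be dumped and reloaded without loss.

```python
import json

# Hypothetical dictionary export of a tiny system (layout is an assumption).
system_dict = {
    "positions": {"unit": "nanometer", "values": [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]},
    "box_vectors": {"unit": "nanometer", "values": [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]},
    "potentials": [
        {
            "expression": "0.5 * k * (r - length) ** 2",
            "parameters": {"k": [600.0, "kcal/mol/angstrom**2"], "length": [1.5, "angstrom"]},
            "atoms": [0, 1],
        }
    ],
}


def to_json(data):
    """Serialize the dictionary export to a JSON string."""
    return json.dumps(data, sort_keys=True)


def from_json(text):
    """Rebuild the dictionary export from a JSON string."""
    return json.loads(text)


restored = from_json(to_json(system_dict))
print(restored == system_dict)
```

The round trip is lossless only if the dictionary sticks to JSON-native types: tuples come back as lists and non-string keys come back as strings, so the export should use lists and string keys throughout.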

Serialization to engine-specific formats such as PDB, MOL2, GRO, and TOP is available through custom writers that will be adapted, in part, from InterMol.

In-memory conversion

Some limited support will exist for converting to objects in other packages. No conversion will be lossless, but only in edge cases should conversions be prohibitively lossy, and in many cases only a partial view of the object is the target. Target objects may include any of the following:

  1. OpenFF ForceField

  2. OpenFF Topology

  3. ParmEd Structure

  4. OpenMM System

  5. GMSO Topology

  6. MDAnalysis Universe

Representation of chemical topologies

(pending input from Bayly & Calabro and others)

  • A bare minimum amount of connectivity (not much more than what is bonded to what) is necessary for computing energies.

  • Residues should probably be tracked to faithfully encode force field data

  • Other metadata may not be crucial for describing energies, but valuable to users (tradeoff)

Rough sketch of a more involved topology representation

  • Single, high-level container object (System) that contains sufficient data to compute the potential energy

  • At a low level, things become Molecule objects

    • Molecule objects may be de-duplicated through some MoleculeType object

  • More specific Molecule subclasses can be used to (optionally?) encode physical meaning

    • Protein, Ion, Ligand

  • Biopolymers treated with existing conventions (residues and chains)

A simpler model would roughly mimic the existing structure of most engines, in which there is not usually a descriptive distinction between different types of molecules.

How much cheminformatics data should be stored? Some data (bond orders?) may be lightweight but we don’t want to duplicate efforts that already exist in the toolkit and are not useful for MD engines.

Manipulation of systems

  • Combining systems: Systems will be combine-able in a similar manner to the popular ParmEd feature (new_structure = structure1 + structure2)
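One way the + operator could work is sketched below on a deliberately tiny stand-in class. MiniSystem is hypothetical and holds only atoms and bonds; the essential step is offsetting the second operand's atom indices.

```python
from dataclasses import dataclass, field


@dataclass
class MiniSystem:
    """Hypothetical stand-in holding only element symbols and bonds."""

    atoms: list = field(default_factory=list)
    bonds: list = field(default_factory=list)

    def __add__(self, other):
        # Atom indices of the second operand are shifted past the first's atoms.
        offset = len(self.atoms)
        return MiniSystem(
            atoms=self.atoms + other.atoms,
            bonds=self.bonds + [(i + offset, j + offset) for i, j in other.bonds],
        )


water = MiniSystem(atoms=["O", "H", "H"], bonds=[(0, 1), (0, 2)])
dimer = water + water
print(dimer.bonds)
```

Returning a new object, rather than mutating either operand, keeps the operation compatible with an immutable design; a full implementation would also have to reconcile box vectors, positions, and potentially conflicting parameters.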

Interfaces with machine learning libraries

  1. A .parametrize() function that explicitly describes the mapping of force field parameters P to parametrized system parameters Q will enable researchers to use JAX, and potentially other autodiff libraries, to optimize loss functions using existing JAX infrastructure.

  2. Interfaces to other ML libraries can be achieved by returning a flattened 1-D view of system parameters.
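The P-to-Q mapping and the flattened view can be illustrated in plain Python. This is a stand-in only: the function names, the assignment format, and the values are assumptions, and in a JAX setting the lookup would presumably be expressed as array indexing so that gradients flow from Q back to P.

```python
def parametrize(force_field_params, assignments):
    """Map force field parameters P to per-term system parameters Q.

    `assignments` lists, for each term in the system, the key of the
    force field parameter that was matched to it.
    """
    return [tuple(force_field_params[key]) for key in assignments]


def flatten(system_params):
    """Return a flat 1-D list view of Q, term by term."""
    return [value for term in system_params for value in term]


# Illustrative P: two bond parameters, each (k, length).
P = {"b1": (600.0, 1.5), "b2": (700.0, 1.1)}

# Three bonds in the system: one matched b1, two matched b2.
Q = parametrize(P, ["b1", "b2", "b2"])
print(flatten(Q))
```

Because several entries of Q point back to the same entry of P, a loss computed on Q accumulates gradient contributions onto the shared force field parameter, which is the behavior an optimizer over P needs.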

Testing

Extensive unit tests of the core Python architecture will be written and regularly executed with GitHub Actions.

Application testing to ensure sufficient interoperability will be done by cannibalizing InterMol’s compute scripts to get energies of different molecular systems from each engine and checking that they agree to within some tolerance. Given that the state of the art is something like relative errors of 1e-5 between engines, this threshold is likely to be of a similar order of magnitude.
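A comparison of this kind might look like the sketch below, which checks per-term energies from two engines against a relative tolerance. The function name, the dictionary layout, and the absolute-tolerance fallback are assumptions.

```python
import math


def energies_agree(reference, candidate, rel_tol=1e-5, abs_tol=0.0):
    """Compare per-term energies from two engines within a relative tolerance."""
    if reference.keys() != candidate.keys():
        return False
    return all(
        math.isclose(reference[term], candidate[term], rel_tol=rel_tol, abs_tol=abs_tol)
        for term in reference
    )


# Illustrative energies (same units assumed on both sides).
engine_a = {"bond": 10.0, "angle": 5.0}
engine_b = {"bond": 10.00001, "angle": 5.0}
print(energies_agree(engine_a, engine_b))
```

A purely relative tolerance cannot be satisfied by terms whose reference energy is exactly zero unless the candidate is also exactly zero, so a small abs_tol is likely needed in practice.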

Documentation

Docs will be written and hosted with a theme common to other OpenFF software. Sections will include

  • Installation instructions for users and developers

  • High-level overview

  • Detailed architecture

  • Tutorials/examples demonstrating both internal features and interoperability

  • Full API documentation

Packaging

The OpenFF System object will be hosted on Anaconda, possibly on the omnia channel at first, but with the intention of using conda-forge when other external technical hurdles are resolved.

Out of scope

The OpenFF System will not be a wizard’s wand that can magically fix all interoperability issues. The primary objective is to enable the use of the SMIRNOFF specification in more molecular simulation workflows.

  • As such, the primary interface will be from the system object to various formats and objects, not the opposite direction. Reading input files is a desired feature but a low priority.

  • Important details about how molecular simulations are executed are not in scope. The OpenFF System object will fully describe the structure of the potential energy function, but not how it is used in the course of a molecular simulation, e.g. when propagating a molecular dynamics trajectory. For example, the choices of barostat, timestep, and ensemble are left to the researcher.

  • Internal data structures will be remarkably general, but not infinitely so. The primary use cases will be in the domain of organic chemistry, specifically implementing the SMIRNOFF format at the molecular scale. A number of scientifically interesting systems will not be supported initially, although efforts will be made to avoid prohibiting future extensions that support them. This includes things like coarse-grained models, multi-body potentials, anisotropic pair potentials, and rigid bodies.

Miscellaneous

  1. How much modification should be allowed? The software is much easier to implement if we force everything to be immutable, but user modifications (changing parameters, coordinates, connectivity, etc.) may be a valuable set of features. There are some options for a middle ground, like allowing mutability at some points but locking things down at certain API calls (e.g. writing out to disk).

    1. (My) general opinion is that some significant world-building should be enabled, but with clear guardrails in place.

  2. How to get an MM energy quickly? An internal evaluator would be tricky and would involve a lot of re-inventing the wheel, while writing to disk and calling an engine has some overhead.

    1. Exporting to and calling OpenMM is probably the path of least resistance, although exporting to other engines may be useful given other constraints. InterMol may be able to play a role here, if needed.

  3. Store data (OpenMM’s approach) or store instructions for getting data (just about every other engine out there)? Storing just the data arguably carries the richest information content, but most conversions then require guessing the instructions (or carrying them along as metadata).

    1. Majority currently seems to favor the “instructions” option

  4. Should systems be combine-able? This is a nice feature of ParmEd (big_structure = structure1 + structure2) but may be technically tricky to actually implement here. Re-phrased: how valuable a feature would this be? Can probably come back to it later.

    1. Yes!

  5. Should a ForceField object be tracked, as distinct from just tracking the parameters? This could enable features like writing a “just the parameters used in this study” OFFXML. There are likely some complexities to deal with, like information loss when actually applying a force field to a system.
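For the mutability question in item 1, the "middle ground" of locking at certain API calls could be sketched as follows. The class and method names are hypothetical, and whether a lock should be permanent or reversible is left open.

```python
class LockableParameters:
    """Hypothetical parameter store that is mutable until it is exported."""

    def __init__(self, **values):
        self._values = dict(values)
        self._locked = False

    def get(self, name):
        return self._values[name]

    def set(self, name, value):
        # Guardrail: refuse edits once the object has been written out.
        if self._locked:
            raise RuntimeError("parameters are locked after export")
        self._values[name] = value

    def export(self):
        """Stand-in for writing to disk; locks the object as a side effect."""
        self._locked = True
        return dict(self._values)


params = LockableParameters(k=600.0, length=1.5)
params.set("length", 1.52)     # allowed before export
snapshot = params.export()     # locks the object
print(snapshot)
```

Locking on export guarantees that what was written to disk still matches the in-memory object, at the cost of forcing users to copy the object if they want to keep editing.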