...

We should make this process easier by curating a data set of systems, described below, together with the energy we expect for each (both the total potential energy and the per-term contributions). This would make it easier for users to validate their implementations and would spare them from having to manage most of the details themselves. It also provides internal value: it can uncover errors in our own implementations and serve as a natural data set to run regression tests against. The toolkit already runs some regression tests against a non-exhaustive set of molecules in an effort to safeguard against critical bugs, but these leave many use cases uncovered (and could themselves be incorrect). The Interchange test suite has something similar; tracking reference energies from a single canonical data set will reduce duplication of effort and improve the quality and reliability of these tests.
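As a rough illustration of what a regression check against such reference energies might look like, the sketch below compares per-term energies with an absolute tolerance. The function name, the term keys, the tolerance, and every number in it are made up for illustration; none of them come from the toolkit or from Interchange.

```python
import math


def find_energy_mismatches(reference, computed, abs_tol=1e-3):
    """Return {term: (reference, computed)} for each term that disagrees.

    Both arguments map a term name to an energy in kJ/mol. A term missing
    from `computed` is reported with a computed value of None.
    """
    mismatches = {}
    for term, ref_value in reference.items():
        value = computed.get(term)
        if value is None or not math.isclose(
            value, ref_value, rel_tol=0.0, abs_tol=abs_tol
        ):
            mismatches[term] = (ref_value, value)
    return mismatches


# Made-up numbers, purely to exercise the function.
reference = {"total": -120.5, "bond": 10.2, "vdw": -30.1, "electrostatics": -100.6}
computed = {"total": -120.5004, "bond": 10.2, "vdw": -30.4, "electrostatics": -100.6}

print(find_energy_mismatches(reference, computed))
```

Reporting every mismatching term, rather than failing on the first one, makes it easier to see whether a discrepancy is confined to a single contribution. What tolerance is appropriate is an open question that this sketch does not try to answer.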

Curation

...

This data set should probably be distributed as a version-controlled set of files and scripts to process those files, supplemented with a table describing everything in detail. I think the result could look something like this, with potentially many (tens to hundreds of) rows:

| Molecule(s) (file(s) + other descriptors such that the organic chemistry is unambiguous) | Force field | Box vectors (implied by file) | Total energy | E_bond | E_vdW | E_electrostatics |
| --- | --- | --- | --- | --- | --- | --- |
| Single ligand: /path/to/ligand0.sdf | openff-1.0.0.offxml | (Implied as none by file?) | some number kJ/mol | | | |
| Box of organic molecules in liquid phase: /path/to/liquid_box.pdb + a list of SMILES strings for each compound | openff_unconstrained-1.0.0.offxml | (Read from PDB file?) | | | | |
| Protein in vacuo: /path/to/protein.pdb + protein SMILES | | (Read from PDB file?) | | | | |
| Protein in water with ions and a docked ligand: path/to/complex.pdb + protein SMILES + ligand SMILES + SMILES for water and each ion | | (Read from PDB file) | | | | |
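One way to picture a single row of this table in code is a small record type. The sketch below is only an assumption about how a row could be represented: the class name, field names, and the choice of nanometers for box vectors are hypothetical, the 3 nm cubic box is a placeholder, and the energies are left empty because no reference values exist yet.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Three box vectors, each with three components.
BoxVectors = Tuple[Tuple[float, float, float], ...]


@dataclass
class ReferenceRecord:
    """Hypothetical container for one row of the reference table."""

    description: str
    structure_file: str
    force_field: str
    smiles: Tuple[str, ...] = ()
    # None means the system is not periodic (units assumed to be nm).
    box_vectors_nm: Optional[BoxVectors] = None
    # Term name -> energy in kJ/mol, e.g. "total", "bond", "vdw".
    energies_kj_mol: Dict[str, float] = field(default_factory=dict)

    @property
    def is_periodic(self) -> bool:
        return self.box_vectors_nm is not None


ligand = ReferenceRecord(
    description="Single ligand",
    structure_file="/path/to/ligand0.sdf",
    force_field="openff-1.0.0.offxml",
)

liquid_box = ReferenceRecord(
    description="Box of organic molecules in liquid phase",
    structure_file="/path/to/liquid_box.pdb",
    force_field="openff_unconstrained-1.0.0.offxml",
    # Placeholder box; in practice this would be read from the PDB file.
    box_vectors_nm=((3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 0.0, 3.0)),
)

print(ligand.is_periodic, liquid_box.is_periodic)
```

Treating periodicity as "box vectors present or absent" keeps the two from contradicting each other, though an explicit flag would work just as well.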

  • Each record should specify, one way or another:

    • Force field (i.e. specific file/DOI)

    • Periodicity

    • Periodic box vectors, if any

    • Atomic coordinates

      • Specified in a file, not generated on the fly

    • Bond graph/connectivity

  • Each section of the SMIRNOFF spec should be covered. This includes

    • Multi-term torsions

    • Improper torsions

    • WBO-interpolated torsions and bonds

    • AM1-BCC charges

    • Library charges

    • Custom charge increments

    • Constrained and non-constrained force fields

    • Anything specific to biopolymers?

    • GBSA?

    • Virtual sites?

  • A reasonably broad range of chemistry should be covered

    • Single ligand(s) in vacuum

      • Molecules with charged (zwitterionic?) groups

      • Molecules that are perceived differently by different aromaticity models

      • Molecules with non-planar impropers (e.g. pyramidal nitrogens)

    • Molecule dimers in vacuum

    • Box of organic liquids

    • Box of water?

    • Protein(s) in vacuum

    • Protein(s) in water or saltwater

    • Ligand(s) bound to protein(s)?

    • Other biomolecules
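To keep track of whether the data set actually covers the items listed above, each record could be tagged with the features it exercises and the tags checked against a required set. The following is a minimal sketch under that assumption; the tag names, record names, and tagging scheme are all hypothetical, and the items marked with question marks above (biopolymers, GBSA, virtual sites) are left out because they are still open.

```python
# Hypothetical feature tags, one per item from the SMIRNOFF coverage list.
REQUIRED_FEATURES = {
    "multi_term_torsions",
    "improper_torsions",
    "wbo_interpolation",
    "am1bcc_charges",
    "library_charges",
    "charge_increments",
    "constraints",
}


def missing_features(records):
    """Return the required features that no record exercises.

    `records` maps a record name to the set of feature tags it covers.
    """
    covered = set()
    for features in records.values():
        covered |= set(features)
    return REQUIRED_FEATURES - covered


# Hypothetical tagging of two records, for illustration only.
records = {
    "ligand0": {"multi_term_torsions", "improper_torsions", "am1bcc_charges"},
    "liquid_box": {"library_charges", "constraints"},
}

print(sorted(missing_features(records)))
```

A check like this could run alongside the regression tests so that gaps in coverage show up as soon as a record is added or removed.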

...