Benchmarking Software Approach

Discovery

Workflow components

Each workflow component from the diagram above is numbered below, with options for software components indicated for each.

Separability of the required workflow components will allow for parallelism in development activity. The dev label on each workflow component indicates the qualitative development effort required for each.

  • dev (high): will require the most development, and so should be prioritized.

  • dev (medium): may require some development.

  • dev (low): has a well-known and heavily-used software pathway.

 

  1. Identifier assignment dev

    • new, include in benchmarking library

  2. Conformer generation (~10 conformers per molecule) dev

    • openff-toolkit{rdkit}

  3. Parameterization of molecules dev

    • openff-toolkit{rdkit}

  4. FF coverage report dev

    • Reach out to Trevor, Jessica, Pavan for existing implementations

    • QCSubmit can give a list of all parameters used; doesn’t do counts currently, but could be made to

      • we’ll want counts, as this is richer information and allows us to prioritize coverage gaps

  5. Energy minimization with Psi4 (QM), OpenMM (MM) dev

    • multiple options

      • QCSubmit->QCFractal(->QCEngine->GeomeTRIC->QCEngine->Psi4/OpenMM)

        • this path allows easier extension to torsiondrives; not directly possible without significant development work with other paths

        • orchestration mostly solved in this path compared to the others

      • QCEngine->GeomeTRIC->QCEngine->Psi4/OpenMM

      • GeomeTRIC->QCEngine->Psi4/OpenMM

    • each option requires different considerations for deployment on queueing systems

      • options that are simpler in terms of components may require additional development for deployment

  6. Analysis and report generation dev

    • can use components from benchmarkff; need to extract and fold into benchmarking library

    • no matter which approach is chosen for optimizations in (5), we will need extraction tooling for flat-file output and reports

Available software components for implementation

  1. QCSubmit

    • encoder of OpenFF's preferences for dataset submissions to QCArchive

    • no compute on its own; requires use of QCFractal if part of workflow

    • important to ensure CMILES metadata in place to allow seamless MM calculations

  2. QCFractal

    • client+worker+server for executing and storing procedures, such as optimizations

    • perhaps not strictly necessary, but may still be easiest path

    • a complex solution may present failure modes that are hard to pin down

  3. QCEngine {vital}

    • features a wrapper procedure around GeomeTRIC that takes a QCElemental OptimizationInput as input

    • no need for QCFractal

    • not certain of the value-add vs. using GeomeTRIC directly, unless it simplifies input

  4. GeomeTRIC {vital}

    • optimization protocol

    • can use QCEngine internally to optimize using gradients from a variety of programs (engines)

  5. benchmarkff

    • evaluation analyses are high value

    • not currently installable as a package; only scripts/notebooks

    • dependent on OpenEye Toolkit

    • will likely pull functionality out and give it an infrastructure home in openff-benchmark

  6. openff-toolkit {vital}

    • required for parameterization of molecules for OpenFF forcefields

  7. openmmforcefields

    • required for GAFF, but also usable as abstraction layer for OpenFF forcefields, others

    • used in QCEngine for OpenMM execution

  8. openff-spellbook

    • new functionality for working with QCArchive data; utility functions in service of Trevor Gokey's research and work

    • possible to pull some prototype functionality we don't yet have into an infrastructure package

Restricted components

  1. OpenEye Toolkit

    • cannot use for this purpose; must not be necessary for any part of the workflow

Packaging Options

openff-benchmark

Library components and entry points can be placed in openff.benchmark.geometry_optimization.

openff-cli

Could introduce an entrypoint in this package for distribution. (optional, and for later)

Proposal

Interface

A command-line interface executable from any shell is preferable.
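As a rough sketch of what such an interface could look like: the subcommand names below are illustrative placeholders, not part of the proposal.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical subcommand layout for an openff-benchmark CLI."""
    parser = argparse.ArgumentParser(prog="openff-benchmark")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # placeholder subcommands mirroring the workflow components below
    for command in ("assign-ids", "parameterize", "optimize", "report"):
        subparsers.add_parser(command)
    return parser


args = build_parser().parse_args(["optimize"])
print(args.command)  # optimize
```

Each subcommand would map onto one of the workflow components described in the sections that follow.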

Identifier assignment

Implemented in openff.benchmark.utils. Can be as simple as a function that takes as input a three-letter group/company code and all molecules (with predefined conformers, if present), then produces a mapping of identifiers to molecule objects of the form COM-XXXXX-YY:

  • three-letter company code (COM)

  • molecule-index (XXXXX)

  • numerical conformer-index (YY); 01, 02, 03,…

Note that with this approach, each molecule submitted in the dataset will have exactly one conformer. We would not be stacking multiple conformers into each Molecule.
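A minimal sketch of the scheme above; the function name and the input shape (a sequence of molecule objects paired with their conformer counts) are illustrative assumptions, not a settled API.

```python
def assign_identifiers(company_code, molecules):
    """Map each (molecule, conformer) pair to a COM-XXXXX-YY identifier.

    `molecules` is a sequence of (molecule, n_conformers) pairs; the
    molecule objects themselves are placeholders here. Molecule indices
    are zero-padded to five digits, conformer indices to two.
    """
    mapping = {}
    for mol_index, (molecule, n_conformers) in enumerate(molecules, start=1):
        for conf_index in range(1, n_conformers + 1):
            identifier = f"{company_code}-{mol_index:05d}-{conf_index:02d}"
            mapping[identifier] = molecule
    return mapping


ids = assign_identifiers("COM", [("mol_a", 2), ("mol_b", 1)])
print(sorted(ids))  # ['COM-00001-01', 'COM-00001-02', 'COM-00002-01']
```

Since each conformer gets its own identifier, this naturally yields the one-conformer-per-submitted-molecule structure noted above.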

Conformer generation

For molecules with fewer than 10 conformers predefined, additional conformers will be generated to give a total of 10. This can already be done with openforcefield.topology.Molecule.generate_conformers.

We will need the mapping from Identifier assignment to happen after this step, so it likely makes sense to switch the order of these two workflow components.

Remaining questions

  1. Do we care about easily distinguishing which conformers were provided vs. generated after the fact?

Parameterization of molecules

Parameterization of molecules will be performed with e.g.:

    from openforcefield.typing.engines.smirnoff import ForceField

    # Load the OpenFF "Parsley" force field
    forcefield = ForceField('openff-1.0.0.offxml')

    # Parametrize the topology and return parameters used
    off_topology = molecule.to_topology()
    molecule_labels = forcefield.label_molecules(off_topology)

The labels can then be fed directly to the Forcefield coverage report generator. An entry-point wrapping this and the coverage report can be placed in openff.benchmark.parameterization.

This step should be performed with each forcefield we are benchmarking.

Molecules that fail this step should be noted and left out of the energy minimization submission. We still want these in the coverage report that consumes the output of this step.
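One way to track such failures is to partition the identified molecules as they are parameterized; this is a hedged sketch, with `partition_molecules` and its inputs being hypothetical names rather than existing API.

```python
def partition_molecules(molecules, parameterize):
    """Split identified molecules into (succeeded, failed) dicts.

    `molecules` maps identifiers to molecule objects; `parameterize` is
    any callable that raises on failure. The failure reason is recorded
    per identifier so downstream reporting can still account for these
    molecules even though they are excluded from minimization.
    """
    succeeded, failed = {}, {}
    for identifier, molecule in molecules.items():
        try:
            succeeded[identifier] = parameterize(molecule)
        except Exception as err:
            failed[identifier] = str(err)
    return succeeded, failed


ok, bad = partition_molecules(
    {"COM-00001-01": "CCO", "COM-00002-01": None},
    parameterize=lambda smiles: smiles.upper(),  # stand-in; raises on None
)
print(sorted(bad))  # ['COM-00002-01']
```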

Forcefield coverage report

A function taking multiple sets of molecule labels from Parameterization of molecules to generate coverage reports should go into openff.benchmark.parameterization. This will then produce a report giving the counts for each parameter in the forcefield, aggregated over the molecules provided.

Although it is possible to provide a report for each molecule, to mitigate privacy concerns about the molecules used it is recommended to generate a single report for the whole dataset.
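The aggregation itself could be a simple count over labels. In the real workflow the labels come from forcefield.label_molecules, whose values are parameter objects rather than the plain-string parameter ids used as stand-ins in this sketch.

```python
from collections import Counter


def coverage_report(all_labels):
    """Aggregate parameter usage counts over a whole dataset.

    `all_labels` mimics the shape of per-molecule label output: a list of
    dicts mapping parameter tag (e.g. "Bonds") to {atom-index tuple:
    parameter id}. Returns a Counter keyed by (tag, parameter id).
    """
    counts = Counter()
    for molecule_labels in all_labels:
        for tag, assignments in molecule_labels.items():
            for param_id in assignments.values():
                counts[(tag, param_id)] += 1
    return counts


labels = [
    {"Bonds": {(0, 1): "b1", (1, 2): "b2"}, "Angles": {(0, 1, 2): "a1"}},
    {"Bonds": {(0, 1): "b1"}},
]
report = coverage_report(labels)
print(report[("Bonds", "b1")])  # 2
```

Because only aggregate counts leave this function, per-molecule structure is not exposed, which is the privacy property motivating a single whole-dataset report.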

Remaining questions

  1. Should we enforce that reports be aggregated? Can we show how feasible it is to back-calculate a molecular structure from the parameters used to parameterize it?

Energy minimization with Psi4 (QM), OpenMM (MM)

We propose a three-pronged approach.

  1. High-throughput (primary)

    • QCSubmit->QCFractal(->QCEngine->GeomeTRIC->QCEngine->Psi4/OpenMM)

    • output extraction executable at any time for pulling available data

    • need error cycling process

  2. High-throughput debug approach (secondary)

    • Trevor's local optimization executor

      • add this to QCSubmit; generally usable for OpenFF QCArchive users in debugging

    • components shared with (3)

    • GeomeTRIC->QCEngine->Psi4/OpenMM

    • output still usable for reporting

  3. Fully-local execution (alternative)

    • Like Horton's local TorsionDrive script

    • components shared with (2)

    • GeomeTRIC->QCEngine->Psi4/OpenMM

    • output still usable for reporting

In principle, (2) and (3) could be served via the same entrypoint.
(1) would make use of QCFractal with a persistent server to handle most of the compute orchestration.

These approaches should be given entry-points in openff.benchmark.geometry_optimization.

Errors that fail to clear in (1) and cannot be cleared in (2) or (3) should be noted as failures in a way consumable by Analysis and report generation.

Analysis and report generation

Outputs produced in Energy minimization with Psi4 (QM), OpenMM (MM) should be directly consumable via an entry-point in openff.benchmark.geometry_optimization. We need the following included for each ID from Identifier assignment:

  1. Relative energies (E_MM - E_QM)

  2. Geometry comparison (RMSD or TFD, MM vs. QM)

Existing implementations should be drawn from benchmarkff. Where implementations are dependent on OpenEye Toolkit, alternatives must be put in place.
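The two quantities above are straightforward to compute once geometries and energies are extracted. This is a minimal stdlib sketch; real implementations (e.g. in benchmarkff) typically align structures before computing RMSD and may use TFD instead, both of which this sketch omits.

```python
import math


def relative_energy(e_mm, e_qm):
    """ddE = E_MM - E_QM; both energies must be in the same units."""
    return e_mm - e_qm


def rmsd(coords_a, coords_b):
    """Plain coordinate RMSD (no alignment), coords as lists of (x, y, z).

    A stand-in for the aligned RMSD or TFD comparisons used in practice.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


qm = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
mm = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(relative_energy(-10.0, -12.5))  # 2.5
print(rmsd(qm, mm))  # 1.0
```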

Deployment

A document describing compute-stack installation, server stand-up, and worker submission to the queueing systems in use will need to be written and shared. This should include an upgrade pathway for the compute stack. It will likely draw on existing approaches for public QCArchive production compute.