Optimization Benchmarking Protocol - Season 1

This document details the execution procedure for performing geometry optimizations to benchmark OpenFF forcefields as described in https://openforcefield.atlassian.net/wiki/spaces/FF/pages/760086545.

All steps assume the deployed conda environment is active, e.g. with:

conda activate openff-benchmark-optimization

 

1. Validation and identifier assignment

The first step in the protocol performs validation of the molecules you have chosen. This step will check that the OpenFF Toolkit can sufficiently interpret each molecule, and for molecules that it can, a unique identifier of the form COM-XXXXX-YY is assigned to each, with:

  • COM: three-letter company code

  • XXXXX: molecule-index; 00000, 00001, 00002,…

  • YY: numerical conformer-index; 00, 01, 02,…

Perform this step using:

openff-benchmark preprocess validate -g <3-character-group-identifier> <input_molecules.sdf>

For the burn-in, use TST for <3-character-group-identifier>. For your production set, use your unique, assigned group identifier instead.

Note: If you have a directory of sdf files rather than a single input you can also use

openff-benchmark preprocess validate -g <3-character-group-identifier> <input_folder/*>

Validated molecules will be placed by default in the directory 1-validate_and_assign. Each will be placed in its own SD file, with its assigned identifier as its name, e.g. COM-00000-00.sdf. This identifier is also encoded within the SD file’s key-value pairs; the key-value pairs are the canonical source for this identifying information.

SD files for molecules that fail to pass validation are placed in 1-validate_and_assign/error_mols. For each SD file, a corresponding text file attempts to give a summary of the error and its context.

All metadata, including names, is stripped from input molecules during validation. A mapping from input molecule name, input file, and index in the input file to assigned workflow ID is provided in the output directory, in name_assignments.csv.

Conformers are aligned during this step, and atom indexing of input molecules may change. If a subsequently-loaded conformer has an RMSD of less than 0.1 Å from an existing conformer, it will be sorted into the error_mols directory.

2. Conformer generation

The second step in the protocol generates additional conformers beyond those you provided in the input set of molecules. This step will attempt to generate up to 10 conformers per molecule in total, with selection based on sufficient RMSD difference from the conformers provided by the user and from other generated conformers.

Perform this step using:
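
(A hedged sketch; the generate-conformers subcommand name and its argument are assumptions to be verified with openff-benchmark preprocess --help.)

openff-benchmark preprocess generate-conformers 1-validate_and_assign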

All molecule conformers will be placed by default in the directory 2-generate-conformers. As in the validation step, each will be placed in its own SD file, with its assigned identifier as its name, e.g. COM-00000-00.sdf, and key-value pairs encoding this identifying information.

Conformers provided by the user in the previous step will be preserved through this step, and their conformer IDs will remain unchanged.

3. Coverage report

The third step in the protocol creates a coverage report that gives the number of times each parameter in the openff_unconstrained-1.3.0.offxml forcefield is exercised by the molecules in the dataset.

Perform this step using:
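
(A hedged sketch; the coverage-report subcommand name and its argument are assumptions to be verified with openff-benchmark preprocess --help.)

openff-benchmark preprocess coverage-report 2-generate-conformers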

You may share the resulting file 3-coverage_report/coverage_report.json with OpenFF or others, as this data is aggregated across all molecules in the set and is informative for understanding the relative use of each parameter.

Any molecules that couldn’t be parameterized with this forcefield will be placed in 3-coverage_report/error_mols/. All molecule conformers that could successfully be parameterized are found as SDFs in 3-coverage_report. These successful molecule SDFs are used as inputs for geometry optimization.

4. Optimization execution

The fourth step in the protocol executes the optimizations required for benchmarking. There are two stages to this step:

  1. Perform QM optimizations using b3lyp-d3bj/dzvp via psi4.

  2. Perform MM optimizations using a variety of MM forcefields, in particular OpenFF, starting from the minimized QM structures for each molecule conformation.

We must perform QM then MM for each molecule in this way. There are several options for how to do this, and the choice of option depends on the compute resources at your disposal and your preferences for data persistence in a self-hosted QCFractal Server instance.

The compute options given here are ordered by preference from the protocol developers (most preferred to least preferred). We advise you to start with option (A), and if this won’t work for you, proceed to option (B), and so on. The options become progressively simpler in terms of software complexity, at the cost of becoming operationally more involved.

Multi-node approaches

A. Multi-node, persistent server, long-lived manager

For this option, you must have:

  1. A persistent QCFractal Server running on a host/port network-accessible from at least one node of your cluster.

  2. The node(s) of your cluster that can reach your QCFractal Server must be capable of submitting jobs to your cluster’s queueing system, e.g. SLURM, PBS.

  3. You must be allowed by your cluster admins to run a long-running process on the cluster node indicated in (2) above.

The commands below assume you are hosting the QCFractal Server on the host myserver.corp.local, with server URI myserver.corp.local:7777. Change this URI to one that resolves to your self-hosted QCFractal server on your network.

Set up your QCFractal Server if it is not already initialized and running.
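
(A hedged sketch using the standard QCFractal server commands; database location and other server settings are up to you.)

qcfractal-server init
qcfractal-server start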

You will be starting up a QCFractal Manager on the cluster node that you would normally submit jobs from. This will perform computations required by the Server, and will submit jobs to your cluster’s queue to accomplish this. You will need to set up a config.yml file for configuring the Manager to use your cluster. It should look something like this:
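
(The configuration below is a hedged sketch only: it assumes a SLURM cluster and the Dask adapter, and the key names follow the QCFractal Manager configuration schema; check the QCFractal documentation for the authoritative field names.)

common:
  adapter: dask
  tasks_per_worker: 1
  cores_per_worker: <cores_per_task>
  memory_per_worker: <memory_in_GiB_per_task>
  max_workers: <max_concurrent_tasks>
  scratch_directory: <scratch_path>

server:
  fractal_uri: myserver.corp.local:7777
  verify: False

manager:
  manager_name: <descriptive_manager_name>

cluster:
  scheduler: slurm
  walltime: "72:00:00"
  task_startup_commands:
    - conda activate openff-benchmark-optimization
  scheduler_options:
    - "#SBATCH --partition=<partition>"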

More detailed information about Manager configuration can be found in the QCFractal documentation. Fill in <items_in_brackets> according to your needs.

Start up the QCFractal Manager with the following (ideally in a separate terminal, wrapped in tmux, byobu, or screen):
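
(A hedged sketch, assuming the configuration above was saved as config.yml.)

qcfractal-manager --config-file config.yml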

To submit and execute the QM stage:
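
(The openff-benchmark optimize invocations below are hedged sketches: the subcommand names, the --fractal-uri and --season flags, and the season specs 1:1 (QM) and 1:2 (MM) are assumptions to verify with openff-benchmark optimize --help.)

openff-benchmark optimize submit-molecules --fractal-uri myserver.corp.local:7777 --season 1:1 3-coverage_report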

Periodically, you can export the data from this dataset with:
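
(Hedged sketch; -o names the export output directory.)

openff-benchmark optimize export --fractal-uri myserver.corp.local:7777 --season 1:1 -o 4-compute-qm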

Then, to submit and execute the MM stage using the QM stage data as input:
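
(Hedged sketch; the MM stage takes the exported QM structures as input.)

openff-benchmark optimize submit-molecules --fractal-uri myserver.corp.local:7777 --season 1:2 4-compute-qm/b3lyp-d3bj/dzvp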

As with the QM step, you can export the data from this dataset with:
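
(Hedged sketch:)

openff-benchmark optimize export --fractal-uri myserver.corp.local:7777 --season 1:2 -o 4-compute-mm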

Note that when you export, only complete optimizations will be exported to the output directory. You can use even a partial export of QM data as the input for the MM submission, and do so repeatedly, to get partial MM results for cases where the QM data has completed and been exported. In other words, the submission and export commands given above are idempotent, and should be run as desired until both QM and MM datasets are complete.

You can check the status of a dataset with:
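
(Hedged sketch; the status subcommand and its arguments are assumptions.)

openff-benchmark optimize status --fractal-uri myserver.corp.local:7777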

B. Multi-node, persistent server, short-lived managers

If you cannot run a long-running Manager process on a cluster node with access to your cluster’s queue, or if you encounter difficulties with option (A), you can instead launch short-lived managers as compute jobs on your cluster.

Create a submission script for e.g. SLURM called submit_manager.sh as:
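
(A hedged sketch for SLURM; partition, walltime, and resource values are placeholders, and the qcfractal-manager flags should be verified with qcfractal-manager --help.)

#!/bin/bash
#SBATCH -J qcfractal-manager
#SBATCH -p <partition>
#SBATCH -t <walltime>
#SBATCH --nodes=1
#SBATCH --cpus-per-task=<cores>
#SBATCH --mem=<memory>

# activate the deployed environment (you may need to source your conda setup first)
conda activate openff-benchmark-optimization

# with no config file, the Manager runs a local process pool on this compute node
qcfractal-manager --fractal-uri myserver.corp.local:7777 --verify False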

And submit with:
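
(For SLURM:)

sbatch submit_manager.sh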

No additional Manager configuration is necessary here, as the Manager does not interact with the queueing system as it does in option (A); it only runs a process pool within the compute node it launches on. Submit as many managers as you can to saturate available compute nodes.

To submit and execute the QM stage:
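
(As in option (A), a hedged sketch:)

openff-benchmark optimize submit-molecules --fractal-uri myserver.corp.local:7777 --season 1:1 3-coverage_report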

Periodically, you can export the data from this dataset with:
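
(As in option (A):)

openff-benchmark optimize export --fractal-uri myserver.corp.local:7777 --season 1:1 -o 4-compute-qm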

Then, to submit and execute the MM stage using the QM stage data as input:
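
(As in option (A):)

openff-benchmark optimize submit-molecules --fractal-uri myserver.corp.local:7777 --season 1:2 4-compute-qm/b3lyp-d3bj/dzvp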

As with the QM step, you can export the data from this dataset with:
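
(As in option (A):)

openff-benchmark optimize export --fractal-uri myserver.corp.local:7777 --season 1:2 -o 4-compute-mm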

Note that when you export, only complete optimizations will be exported to the output directory. You can use even a partial export of QM data as the input for the MM submission, and do so repeatedly, to get partial MM results for cases where the QM data has completed and been exported. In other words, the submission and export commands given above are idempotent, and should be run as desired until both QM and MM datasets are complete.

You can check the status of a dataset with:
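
(As in option (A):)

openff-benchmark optimize status --fractal-uri myserver.corp.local:7777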

C. Multi-node, no server/manager setup

If you have access to a cluster, but you cannot run a persistent Server process on your network in a way that is network-accessible from the cluster, then start with this option.

This is similar to approach (E), except you will execute each command on batches of files given as inputs to jobs submitted to your cluster’s queuing system.

If your cluster uses e.g. SLURM, create a submission script like the following, called submit_molecules.sh:
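
(A hedged sketch for the QM stage; the openff-benchmark optimize execute subcommand and its --season flag are assumptions (see option (E) below for the --nthreads and --memory flags), and the partition and walltime are placeholders. An analogous script, e.g. submit_molecules_mm.sh, would use --season 1:2, 1 thread, 6 GiB, and -o 4-compute-mm for the MM stage.)

#!/bin/bash
#SBATCH -J openff-benchmark-qm
#SBATCH -p <partition>
#SBATCH -t <walltime>
#SBATCH --nodes=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=24G

# activate the deployed environment (you may need to source your conda setup first)
conda activate openff-benchmark-optimization

# the SDF files to optimize are passed as arguments to this script
openff-benchmark optimize execute --season 1:1 --nthreads 4 --memory 24 -o 4-compute-qm "$@"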

To execute the QM stage, you could then submit one job per conformation, using 4 threads and 24 GiB of memory for each, e.g.:
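
(Hedged sketch, one file per job:)

ls 3-coverage_report/*.sdf | xargs -n 1 sbatch submit_molecules.sh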

And likewise for the MM stage, using 1 thread and 6GiB for each:
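
(Hedged sketch, using the analogous MM script described above:)

ls 4-compute-qm/b3lyp-d3bj/dzvp/*.sdf | xargs -n 1 sbatch submit_molecules_mm.sh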

If your queueing system limits the number of jobs you can submit at one time, you may consider giving multiple files as arguments at the end of the sbatch call instead of a single file. For example, you could batch files for the QM step in groups of 10 with xargs:
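
(Hedged sketch:)

ls 3-coverage_report/*.sdf | xargs -n 10 sbatch submit_molecules.sh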

and then submit jobs for the MM step with:
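
(Hedged sketch:)

ls 4-compute-qm/b3lyp-d3bj/dzvp/*.sdf | xargs -n 10 sbatch submit_molecules_mm.sh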

Single-node approaches

D. Single-node, persistent server

If you have only a single machine at your disposal and wish to use a self-hosted QCFractal Server instance, then choose this option.

The commands below assume you are hosting the QCFractal Server on the same host you are running commands on, with server URI localhost:7777. Change this URI to one that resolves to your self-hosted QCFractal server on your network if it is running on another machine.

Set up your QCFractal Server if it is not already initialized and running.

Start up a QCFractal Manager with the following (ideally in a separate terminal):
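
(A hedged sketch; with no config file the Manager runs a local process pool using the machine's cores. Verify flags with qcfractal-manager --help.)

qcfractal-manager --fractal-uri localhost:7777 --verify False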

To submit and execute the QM stage:
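
(Hedged sketch; the subcommand names, flags, and season specs 1:1 (QM) and 1:2 (MM) are assumptions to verify with openff-benchmark optimize --help.)

openff-benchmark optimize submit-molecules --fractal-uri localhost:7777 --season 1:1 3-coverage_report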

Periodically, you can export the data from this dataset with:
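
(Hedged sketch:)

openff-benchmark optimize export --fractal-uri localhost:7777 --season 1:1 -o 4-compute-qm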

Then, to submit and execute the MM stage using the QM stage data as input:
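
(Hedged sketch:)

openff-benchmark optimize submit-molecules --fractal-uri localhost:7777 --season 1:2 4-compute-qm/b3lyp-d3bj/dzvp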

As with the QM step, you can export the data from this dataset with:
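
(Hedged sketch:)

openff-benchmark optimize export --fractal-uri localhost:7777 --season 1:2 -o 4-compute-mm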

Note that when you export, only complete optimizations will be exported to the output directory. You can use even a partial export of QM data as the input for the MM submission, and do so repeatedly, to get partial MM results for cases where the QM data has completed and been exported. In other words, the submission and export commands given above are idempotent, and should be run as desired until both QM and MM datasets are complete.

You can check the status of a dataset with:
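
(Hedged sketch:)

openff-benchmark optimize status --fractal-uri localhost:7777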

E. Single-node, no server/manager setup

If you have only a single machine at your disposal and have no interest in calculation results stored in a self-hosted QCFractal Server instance, then choose this option.

Optimizations will be performed in series, one at a time. Use the --nthreads flag to set the number of cores on the machine you wish each optimization to use; use the --memory flag to set the memory limit for each optimization.

To execute the QM stage:
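
(A hedged sketch; the execute subcommand, the --season flag, and the season specs 1:1 (QM) and 1:2 (MM) are assumptions to verify with openff-benchmark optimize --help. The --nthreads and --memory flags are those described above.)

openff-benchmark optimize execute --season 1:1 --nthreads <threads> --memory <GiB> -o 4-compute-qm 3-coverage_report/*.sdf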

Then, to execute the MM stage using the QM stage data as input:
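
(Hedged sketch:)

openff-benchmark optimize execute --season 1:2 --nthreads <threads> --memory <GiB> -o 4-compute-mm 4-compute-qm/b3lyp-d3bj/dzvp/*.sdf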

Because thread parallelism may not scale as well as simply running many optimizations at once on your machine, consider running multiple execute commands simultaneously, giving each a batch of files to work on. You can do this in one shot with xargs. For example, for the QM step we can run 4 commands at any given time, each working through a batch of 10 files in series with 2 threads and 6 GiB of memory:
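
(Hedged sketch; xargs -P sets the number of parallel commands and -n the batch size:)

ls 3-coverage_report/*.sdf | xargs -P 4 -n 10 openff-benchmark optimize execute --season 1:1 --nthreads 2 --memory 6 -o 4-compute-qm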

And likewise for the MM step, we might choose 1 thread per command, and 8 commands:
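
(Hedged sketch:)

ls 4-compute-qm/b3lyp-d3bj/dzvp/*.sdf | xargs -P 8 -n 10 openff-benchmark optimize execute --season 1:2 --nthreads 1 --memory <GiB> -o 4-compute-mm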

4-2. Filtering problematic molecules

During Season 1 production runs, it was found that iodine-containing molecules in particular gave clearly wrong results under our choice of method/basis for QM, and should therefore be filtered from downstream analysis. In addition, silicon and boron are not currently supported by OpenFF forcefields, and though these should have been filtered during coverage report generation, a bug in that component allowed them to pass through (since fixed).

To prevent molecules containing iodine, silicon, or boron from passing through to analysis, we filter them here.

To filter the QM results:

To filter the MM results:

The directories 4-compute-qm-filtered and 4-compute-mm-filtered will then be used for the analysis below.

5. Schrodinger Optimization

General help about the Schrodinger command group:
openff-benchmark schrodinger --help

General comments

The schrodinger command group needs access to the Schrodinger binaries. The commands will try to locate them via the SCHRODINGER environment variable. If it is not set, the path to the Schrodinger binaries can be given with the --schrodinger-path command line option of the following commands.

Additionally, the OPLS force field makes use of custom parameters. The custom parameter directory can be set with the --opls-dir command line option.

The commands ffbuilder and optimize accept the additional options --host and --max-jobs. With the --host option, you can specify the host (queue) on which the ffbuilder or optimization jobs should be run. The name of the host has to be specified in your schrodinger.hosts configuration file. With --max-jobs, you can set the maximum number of parallel subjobs run on the host. If these options are not specified, default settings will be used.

OPLS3e vs OPLS4

Whether OPLS3e or OPLS4 is used is determined automatically by the Schrodinger version you are using: with Schrodinger 2020-4 or earlier, the force field will be OPLS3e; with Schrodinger 2021-1 or later, it will be OPLS4.
If you have already run the ffbuilder command for OPLS3e, you can use the ffbuilder output to fit OPLS4 (not included in the openff-benchmark commands).

Step 1: Build custom parameters

Start up a ffbuilder custom parameter calculation using default arguments and the QM optimized molecules:

openff-benchmark schrodinger ffbuilder [--opls-dir ~/.schrodinger/opls_dir] 4-compute-qm/b3lyp-d3bj/dzvp/

It takes into account custom parameters already available in the path given with the --opls-dir option. If --opls-dir is not given or the specified path does not exist (i.e. if you are a new user), all parameters will be calculated from scratch. If you are running the ffbuilder for the first time and do not have custom parameters yet, do not specify --opls-dir, as this will lead to an error. The same applies if you used OPLS3e before and are running the ffbuilder for the first time with OPLS4. The command will create a ffbuilder job ffb_openff_benchmark in the directory 7-schrodinger-ffb. If there are already output files in the output directory, they will be moved to a backup directory 7-schrodinger-ffb/ffb_openff_benchmark.bk[.x] and previously built custom parameters will be automatically merged into the custom parameter path. The output directory can be changed with the -o/--output-path option.

Now you have to wait until the ffbuilder job has finished. You can check the progress of the calculations in the file 7-schrodinger-ffb/ffb_openff_benchmark.log.

Note: You might notice that the ffbuilder input file ffb_input.sdf includes fewer conformers than you gave as input. That is expected, as the ffbuilder needs only one conformer per molecule.

Step 2: Merge parameters in custom parameter path

This command merges the newly built parameters to your custom parameter path.

openff-benchmark schrodinger ffmerge 7-schrodinger-ffb

The input path must be the output path of step 1. The custom parameters path can be specified with the option --opls-dir, which defaults to ~/.schrodinger/opls_dir/. If the specified or the default --opls-dir does not exist, the directory will be created and initialized. This could be the case if you are a new user or you want to create a separate custom parameter directory.

Note: For Schrodinger 2021-1 and later, the above command won’t work in the current version (10/06/2021). The quick workaround is to run instead of the above command the standard Schrodinger command:

$SCHRODINGER/utilities/custom_params merge 7-schrodinger-ffb/ffb_openff_benchmark_oplsdir ~/.schrodinger/opls_dir

Afterwards, you can continue the protocol with the next steps (3a or 3b, see below).

Step 3a: Run optimization without custom parameters

Optimization using OPLS3e with NO custom parameters:

openff-benchmark schrodinger optimize 4-compute-qm/b3lyp-d3bj/dzvp/

This will create a Schrodinger macromodel optimization run in the directory 8-schrodinger-mm-default. The latter directory can be changed with the -o/--output-path option. If the directory already exists and you want to replace the data, you need to run the command with the option --delete-existing.

Step 3b: Run optimization with custom parameters

By adding a custom parameter path to the optimize command, the optimization will use OPLS3e with custom parameters. Warning: If you have not built the custom parameters correctly and merged them into your custom parameter path (steps 1 and 2), the command will still use whatever custom parameters are available, but the newly built parameters will be missing.

openff-benchmark schrodinger optimize --opls-dir ${HOME}/.schrodinger/opls_dir 4-compute-qm/b3lyp-d3bj/dzvp/

This will create a Schrodinger macromodel optimization run in the directory 9-schrodinger-mm-custom. The latter directory can be changed with the -o/--output-path option. If the directory already exists and you want to replace the data, you need to run the command with the option --delete-existing.

Step 4: Postprocessing

The output files mmod_output.maegz in the respective directories will be postprocessed with the commands:

openff-benchmark schrodinger postprocess -o 4-compute-mm 8-schrodinger-mm-default/mmod_output.maegz

and

openff-benchmark schrodinger postprocess -o 4-compute-mm 9-schrodinger-mm-custom/mmod_output.maegz

This will save the postprocessed SDF files in 4-compute-mm/opls3e_default and 4-compute-mm/opls3e_custom (if you used Schrodinger versions 2020-4 or below) or in 4-compute-mm/opls4_default and 4-compute-mm/opls4_custom (if you used Schrodinger versions 2021-1 or above). If these directories already exist and you want to replace the data, you need to run the commands with the option --delete-existing.

Hint:
The last two commands can be run together:
openff-benchmark schrodinger postprocess -o 4-compute-mm 8-schrodinger-mm-default/mmod_output.maegz 9-schrodinger-mm-custom/mmod_output.maegz

6. Analysis of results

Once all optimizations have finished, there are two ways to analyze the data. Option A compares the molecular-mechanics (MM) conformations and energies with the initial quantum-mechanics (QM) minimized conformations and energies, which were used as inputs for the MM minimizations. Option B searches, for each QM conformation, for the closest MM conformation (based on RMSD), to ensure that similar conformations are compared. Optionally, an RMSD cut-off can be applied here.

A. Comparison based on same input conformation

The analysis is executed with openff-benchmark report compare-forcefields. The command accepts the paths of the optimized molecules obtained from the optimization step. Additionally, the reference method (b3lyp-d3bj by default) and an output directory can be specified.
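
(A hedged example invocation; the positional input paths, the -o output flag, and the directory name 5-compare-forcefields are assumptions to check against openff-benchmark report compare-forcefields --help.)

openff-benchmark report compare-forcefields -o 5-compare-forcefields 4-compute-qm-filtered 4-compute-mm-filtered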

The command creates one output csv file per method, named <method>.csv, e.g. openff-1.0.0.csv. These files look like this:
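
(The column names and values below are illustrative only, not actual output:)

name,group_name,molecule_index,conformer_index,rmsd,tfd,dde
TST-00000-00,TST,00000,00,0.152,0.034,0.87
TST-00000-01,TST,00000,01,0.201,0.041,1.12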

The first line is the header. Each subsequent row shows the data for one conformer, and the entries are:

  • the conformer identifier/name

  • the group_name

  • molecule_index

  • conformer_index

  • the RMSD to the respective reference (QM) conformer

  • the torsion deviation fingerprint with respect to the reference (QM) conformer

  • the relative energy deviation.

B. Comparison by matching conformers

The subcommand match-minima accepts the same arguments as compare-forcefields. This command runs longer than compare-forcefields because it compares the QM conformers to the MM conformers and matches them based on RMSD.

The output files are very similar. They only contain one more column giving the name of the MM conformer which was matched to the reference (QM) conformer, i.e. which was closest based on the RMSD.
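
(A hedged example invocation, mirroring compare-forcefields; the -o flag and the directory name 6-match-minima are assumptions.)

openff-benchmark report match-minima -o 6-match-minima 4-compute-qm-filtered 4-compute-mm-filtered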

6-2. Generate plots

The final openff-benchmark report plots command takes the directory containing the csv files as an input (either the output of compare-forcefields or match-minima).

For compare-forcefields:
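
(Hedged sketch, assuming the compare-forcefields csv files were written to a directory such as 5-compare-forcefields:)

openff-benchmark report plots 5-compare-forcefields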

And for match-minima:
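
(Hedged sketch, using the match-minima output directory:)

openff-benchmark report plots 6-match-minima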