Project Plan

Driver
Approver	@Lily Wang
Contributors	@Simon Boothroyd @Joshua Horton
Informed	@David Mobley @Michael Shirts @Michael Gilson @Daniel Cole
Objective	Train a set of generalised virtual sites for inclusion in a general molecular force field
Due date
Key outcomes
Status	status:in progress

Problem Statement

Atom centred charges do not accurately capture the anisotropy around certain moieties, such as the sigma hole present on halogens. The introduction of off-centre charges (virtual sites) can alleviate this alleviate this issue.

ESP of CBr at 1.4 vdW radius computed using HF/6-31G* exhibiting a sigma hole (left), with a FF with no v-sites showing no anisotropy (middle) and with a FF with a single v-site showing improved anisotropy (right)

Scope

Must have:

Must have:
Nice to have:
Not in scope:

Workplan

Open Software

The software required to carry out this project spans most of the OpenFF stack given that changing the charge model will likely have large impacts on everything downstream (i.e. vdW, valence, …). It is expected that this project will require maintenance of and extensions to:

OpenFF Recharge		Reconstruct ESP and EF data from QCA records Estimate ESP / EF using a FF model. Training BCC and v-site parameters and exporting these into a SMIRNOFF force field Generating RESP charges to serve as a ‘reference model’ SMIRKS representation of AM1BCC BCCs
splore		Easily visualise data sets of molecules either local or from QCA
molesp		Compute / visualise the ESP on the vdW surface of a molecule
OpenFF Evaluator		Used to compute the training set of properties while training the vdW parameters
nonbonded		Automate the set-up of training the vdW parameters against the phys-prop data
OpenFF Bespokefit		Automate the set-up of training the valence parameters against QC data
absolv		Benchmark FF with v-sites against solvation / hydration free energy data

Open Data

The project will at minimum need a diverse train and test set of ESP data that is made publicly available via QCA.

Level of theory

It was decided to compute the ESP at the HF/6-31G* level theory as is the current norm in the field. Although not perfect, it is not clear that another candidate that has the right balance of speed to compute and ‘accuracy' (defined in terms of how well do the final charges reproduce properties of interest e.g. G_solv, G_bind). See

https://openforcefieldgroup.slack.com/archives/CDR1P66Q2/p1635523613025700?thread_ts=1635406892.011100&cid=CDR1P66Q2 for details on this choice.

Generating ESP

The process by which the data is generated is expect to proceed by:

Generate a diverse set of ELF conformers for each molecule in the train (/test) set
Minimize each conformer at HF/6-31G* level theory (no wave functions stored to save storage space on QCA)
Discard any conformers that
1. Formed h-bonds after minimization
2. Underwent a connectivity change
Perform a single-point calculation on the minimized conformer to compute and store the wave function
ESP is recomputed from the data on QCA and stored locally using OpenFF recharge

Selecting the ESP train / test set

To ensure we can rapidly generate a large train set of diverse molecules without needing to wait months we chose to use smaller fragments (3-12 heavy atoms) of drug-like molecules. Namely, a dataset that contains a diverse set of fragments generated from the Enamine 10K and 50K diversity libraries, the curated ZINC and ChEMBL (eps=78) molecules provided by Riniker and Bleiziffer and the OpenFF Industry Benchmark Season 1 Public molecule set will be generated using the Recap decomposition scheme built in to RDKit

The OpenFF Industry Benchmark Season 1 v1.1 set of molecules was used as the ESP test set.

Selecting the physical property test set

The G_hyd test set contained the entries in FreeSolv that would be assigned a v-site parameter as described in the next section.

Open Science

Virtual sites for pyridine like and aliph(+arom)atic chlorine and bromine will be added to the Sage force field as an initial proof of concept study to show that the fitting pipeline is working well, and as a starting point for further expansion through introduction of additional chemistries and chemical perception splitting of the initial parameters. Both pyridine and halogens have been named in the literature many times as good candidates for v-sites - pyridine due to the localised lone pair of electrons and halogens with large sigma holes.

Initial extra parameters

type=DivalentLonePair, smirks="[#6X3H1a:2]1:[#7X2a:1]:[#6X3H1a:3]:[#6X3a]:[#6X3a]:[#6X3a]1", match="once" : captures ‘pyridine’ like groups, while excluding possible issues due to groups in the ortho position or effects from heteroatoms in the aromatic ring. This SMIRKS is made very specific for the first set of fits in attempt to simplify the problem as much as possible by removing possible complicating factors. The parameter should be made more general in future fits but this is then of course a chemical perception issue.
type=BondCharge, smirks="[#6a:2]-[#17:1]",
type=BondCharge, smirks="[#6A:2]-[#17:1]",
type=BondCharge, smirks="[#6a:2]-[#35:1]",
type=BondCharge, smirks="[#6A:2]-[#35:1]", only a single v-site is placed in contrast to the study by the Cole group (Exploration and Validation of Force Field Design Protocols through QM-to-MM Mapping ) which uses two v-sites. Early experimentation showed that the second v-site tended to optimize close to the first v-site and did not seem to improve the performance too drastically. Fitting v-site charges and geometry seems to be a multi-minima problem however, and so whether one or two v-sites should be preferred should be carefully considered in future fits.

Training procedure

Two force fields will be trained - one with v-sites and one without v-sites. In both cases:

the BCC parameters taken from AM1BCC will be trained to the same ESP data as described above. In the case of the force field with the v-sites, the v-sites (both charges and geometry) will be co-optimized with the BCCs.
The vdW parameters of the Sage force field augmented with the new charge model will be trained against the Sage physical property training set
The valence terms of the force field yielded in step 2. will be trained against the Sage QC training set.

We include the ‘no v-site’ fit to ensure that we do not degrade the BCCs by virtue of the fitting strategy. This allows us to better understand when the ‘v-site’ fit is doing poorly because of the ‘v-sites’ rather than the outlined protocol / data sets.

Test procedure

The performance of the new force field will be initially assessed against two main metrics:

deviation from the ESP of the test set defined above
hydration free energies from the FreeSolv data set, and compared using similar metrics as used in 10.26434/chemrxiv-2021-gsgr4-v3 .

Current Status

The files associated with this project can be found at and on lilac under /data/chodera/boothros/gnn-charge-models

Open Software

The software is continually evolving to meet the needs of this project, but at present meets the need of

generating v-sites, training these self-consistently with BCCs
re-training the vdW parameters of Sage on top of the new charge model
re-training the valence parameters of Sage on top of the new charge model + vdW parameters
benchmarking against G_hyd

The contains a lot of code that should likely be upstreamed as it is of general utility and can be used to more easily automate the performed fits.

Open Data

The training and test set are available on QCF:

train - OpenFF ESP Fragment Conformers v1.0
test - OpenFF ESP Industry Benchmark Set v1.1

The reconstructed ESP data is available on lilac in the data-set-labelling/qc-esp subdirectory.

Open Science

Training

The above described ‘v-site’ force field has been fully re-trained. See train-ff-terms/virtual-sites/vam1bcc-v1/v-sitesand train-ff-terms/virtual-sites/vam1bcc-v1/v-sites-valence

The ‘no-v-site’ force field has the BCC parameters re-trained but time did not permit to refit the vdW and valence terms. No problems are anticipated to fit these and the input files are ready to go. See train-ff-terms/virtual-sites/vam1bcc-v1/no-v-sites

Analysis of the vdw training set has shown encouraging early results. In the Sage training, a mixture of pyridine and pyrrole (at several concentrations) experimentally has a large negative enthalpy of mixing, but simulation with both OpenFF 1.3.0 and Sage gives an enthalpy of mixing near 0:

Likely this is due to the charge model being unable to capture the pyridine lone pair. However, in the train-ff-terms/virtual-sites/vam1bcc-v1/v-sites-vdw, (which includes a vsite for pyridine) we can see that this issue is corrected:

There’s also an improvement in the treatment of enthalpies of mixing for chlorides (aryl chlorides have a vsite in train-ff-terms/virtual-sites/vam1bcc-v1/v-sites-vdw). In Sage fitting, there is a systematic overprediction of enthalpies of mixing for chloride-containing compounds: