Project Plan
Driver |
|
Approver | @Lily Wang |
Contributors | @Simon Boothroyd @Joshua Horton |
Informed | @David Mobley @Michael Shirts @Michael Gilson @Daniel Cole |
Objective | Train a set of generalised virtual sites for inclusion in a general molecular force field |
Due date | |
Key outcomes | |
Status | in progress |
Problem Statement
Atom centred charges do not accurately capture the anisotropy around certain moieties, such as the sigma hole present on halogens. The introduction of off-centre charges (virtual sites) can alleviate this alleviate this issue.
Scope
Must have: | |
---|---|
Nice to have: | |
Not in scope: |
Workplan
Open Software
The software required to carry out this project spans most of the OpenFF stack given that changing the charge model will likely have large impacts on everything downstream (i.e. vdW, valence, …). It is expected that this project will require maintenance of and extensions to:
OpenFF Recharge |
|
|
---|---|---|
splore |
|
|
molesp |
|
|
OpenFF Evaluator |
|
|
nonbonded |
|
|
OpenFF Bespokefit |
|
|
absolv |
|
|
Open Data
The project will at minimum need a diverse train and test set of ESP data that is made publicly available via QCA.
Level of theory
It was decided to compute the ESP at the HF/6-31G* level theory as is the current norm in the field. Although not perfect, it is not clear that another candidate that has the right balance of speed to compute and ‘accuracy' (defined in terms of how well do the final charges reproduce properties of interest e.g. Gsolv, Gbind). See
https://openforcefieldgroup.slack.com/archives/CDR1P66Q2/p1635523613025700?thread_ts=1635406892.011100&cid=CDR1P66Q2 for details on this choice.
Generating ESP
The process by which the data is generated is expect to proceed by:
Generate a diverse set of ELF conformers for each molecule in the train (/test) set
Minimize each conformer at HF/6-31G* level theory (no wave functions stored to save storage space on QCA)
Discard any conformers that
Formed h-bonds after minimization
Underwent a connectivity change
Perform a single-point calculation on the minimized conformer to compute and store the wave function
ESP is recomputed from the data on QCA and stored locally using OpenFF recharge
Selecting the ESP train / test set
To ensure we can rapidly generate a large train set of diverse molecules without needing to wait months we chose to use smaller fragments (3-12 heavy atoms) of drug-like molecules. Namely, a dataset that contains a diverse set of fragments generated from the Enamine 10K and 50K diversity libraries, the curated ZINC and ChEMBL (eps=78) molecules provided by Riniker and Bleiziffer and the OpenFF Industry Benchmark Season 1 Public
molecule set will be generated using the Recap
decomposition scheme built in to RDKit
The OpenFF Industry Benchmark Season 1 v1.1
set of molecules was used as the ESP test set.
Selecting the physical property test set
The Ghyd test set contained the entries in FreeSolv that would be assigned a v-site parameter as described in the next section.
Open Science
Virtual sites for pyridine like and aliph(+arom)atic chlorine and bromine will be added to the Sage force field as an initial proof of concept study to show that the fitting pipeline is working well, and as a starting point for further expansion through introduction of additional chemistries and chemical perception splitting of the initial parameters. Both pyridine and halogens have been named in the literature many times as good candidates for v-sites - pyridine due to the localised lone pair of electrons and halogens with large sigma holes.
Initial extra parameters
type=DivalentLonePair, smirks="[#6X3H1a:2]1:[#7X2a:1]:[#6X3H1a:3]:[#6X3a]:[#6X3a]:[#6X3a]1", match="once"
: captures ‘pyridine’ like groups, while excluding possible issues due to groups in the ortho position or effects from heteroatoms in the aromatic ring. This SMIRKS is made very specific for the first set of fits in attempt to simplify the problem as much as possible by removing possible complicating factors. The parameter should be made more general in future fits but this is then of course a chemical perception issue.type=BondCharge, smirks="[#6a:2]-[#17:1]",
type=BondCharge, smirks="[#6A:2]-[#17:1]",
type=BondCharge, smirks="[#6a:2]-[#35:1]",
type=BondCharge, smirks="[#6A:2]-[#35:1]",
only a single v-site is placed in contrast to the study by the Cole group (https://doi.org/10.26434/chemrxiv-2021-hsf8l ) which uses two v-sites. Early experimentation showed that the second v-site tended to optimize close to the first v-site and did not seem to improve the performance too drastically. Fitting v-site charges and geometry seems to be a multi-minima problem however, and so whether one or two v-sites should be preferred should be carefully considered in future fits.
Training procedure
Two force fields will be trained - one with v-sites and one without v-sites. In both cases:
the BCC parameters taken from AM1BCC will be trained to the same ESP data as described above. In the case of the force field with the v-sites, the v-sites (both charges and geometry) will be co-optimized with the BCCs.
The vdW parameters of the Sage force field augmented with the new charge model will be trained against the Sage physical property training set
The valence terms of the force field yielded in step 2. will be trained against the Sage QC training set.
We include the ‘no v-site’ fit to ensure that we do not degrade the BCCs by virtue of the fitting strategy. This allows us to better understand when the ‘v-site’ fit is doing poorly because of the ‘v-sites’ rather than the outlined protocol / data sets.
Test procedure
The performance of the new force field will be initially assessed against two main metrics:
deviation from the ESP of the test set defined above
hydration free energies from the FreeSolv data set, and compared using similar metrics as used in 10.26434/chemrxiv-2021-gsgr4-v3 .
Current Status
The files associated with this project can be found at
and on lilac under /data/chodera/boothros/gnn-charge-models
Open Software
The software is continually evolving to meet the needs of this project, but at present meets the need of
generating v-sites, training these self-consistently with BCCs
re-training the vdW parameters of Sage on top of the new charge model
re-training the valence parameters of Sage on top of the new charge model + vdW parameters
benchmarking against Ghyd
The contains a lot of code that should likely be upstreamed as it is of general utility and can be used to more easily automate the performed fits.
Open Data
The training and test set are available on QCF:
train - OpenFF ESP Fragment Conformers v1.0
test - OpenFF ESP Industry Benchmark Set v1.1
The reconstructed ESP data is available on lilac
in the data-set-labelling/qc-esp
subdirectory.
Open Science
Training
The above described ‘v-site’ force field has been fully re-trained. See train-ff-terms/virtual-sites/vam1bcc-v1/v-sites
and train-ff-terms/virtual-sites/vam1bcc-v1/v-sites-valence
The ‘no-v-site’ force field has the BCC parameters re-trained but time did not permit to refit the vdW and valence terms. No problems are anticipated to fit these and the input files are ready to go. See train-ff-terms/virtual-sites/vam1bcc-v1/no-v-sites
Analysis of the vdw training set has shown encouraging early results. In the Sage training, a mixture of pyridine and pyrrole (at several concentrations) experimentally has a large negative enthalpy of mixing, but simulation with both OpenFF 1.3.0 and Sage gives an enthalpy of mixing near 0:
Likely this is due to the charge model being unable to capture the pyridine lone pair. However, in the train-ff-terms/virtual-sites/vam1bcc-v1/v-sites-vdw
, (which includes a vsite for pyridine) we can see that this issue is corrected:
There’s also an improvement in the treatment of enthalpies of mixing for chlorides (aryl chlorides have a vsite in train-ff-terms/virtual-sites/vam1bcc-v1/v-sites-vdw
). In Sage fitting, there is a systematic overprediction of enthalpies of mixing for chloride-containing compounds:
However, this is corrected in the force field that includes vsites:
Test
See also the same hydration free energy metric on a per-molecule basis:
QM Benchmark wrt Sage on the Industry benchmark set
Forcefield used is