Bayesian Exploration of BCC Parameters
Driver | Approver | Contributors | Stakeholder |
---|---|---|---|
@Owen Madin @Simon Boothroyd | | | |
Objective | To use Bayesian methods to explore the interdependence and correlations between bond charge correction parameters, and to quantify the performance of different combinations of atom types. |
Key outcomes | |
Status | NOT STARTED |
NOTE - This is just an initial brain dump so far. @Owen Madin will refine this into a full study.
Problem Statement
AM1BCC parameters have been converted to a SMIRNOFF-compatible format:
In general a BCC parameter is constructed by combinatorially concatenating two atom environments (defined by separate SMIRKS patterns) with a bond environment (single, double, or delocalised; also defined by a SMIRKS pattern).
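As a rough illustration of this construction, the sketch below concatenates two atom SMIRKS with a bond SMIRKS; the patterns and type names are hypothetical simplifications, not entries from the actual AM1BCC table:

```python
# A minimal sketch of building a BCC SMIRKS by concatenating two atom
# environments with a bond environment. Patterns are illustrative only.
ATOM_SMIRKS = {
    "carbon_aromatic": "[#6a]",
    "oxygen_divalent": "[#8X2]",
}
BOND_SMIRKS = {"single": "-", "double": "=", "aromatic": ":"}

def build_bcc_smirks(atom_1: str, bond: str, atom_2: str) -> str:
    # Tag the first atom as :1 and the second as :2 (sufficient for the
    # simple single-bracket environments above).
    a1 = ATOM_SMIRKS[atom_1].replace("]", ":1]")
    a2 = ATOM_SMIRKS[atom_2].replace("]", ":2]")
    return a1 + BOND_SMIRKS[bond] + a2

print(build_bcc_smirks("carbon_aromatic", "single", "oxygen_divalent"))
# [#6a:1]-[#8X2:2]
```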
Ideally we would like to refit these against at least QM electrostatic potential (ESP) data.
There are many different, and in some cases very specific, atom and bond types which don’t necessarily map well onto the SMIRKS language, e.g. (highly) delocalised lone pairs on nitrogens and dative bonds. Further, there are a total of 354 types - are all of these truly justified by the data we are trying to reproduce?
Bayesian methods give access to posterior distributions from which the correlations (including more complex, non-linear relationships) between BCC parameters can be easily discerned. Further, the computation of Bayes factors allows us to quantitatively measure whether extra model complexity (in this case extra ‘atom’ types or BCC ‘types’) is justified by the data. Combined, Bayesian methods should allow us to gain data-driven insight into where we perhaps have too many atom types, and where there are types missing.
In particular, N sets of different atom types could be proposed (e.g. one set may have the highly delocalised and delocalised N merged into a single atom type), their Bayes factors computed, and an inference drawn from those as to which set the ESP data supports the most.
(Possible extension to this project - can we just define atom types in terms of an element and its connectivity, and then use Wiberg bond order interpolation to scale the BCC? Is there a way to directly incorporate the Wiberg bond order when deciding which type a bond is in, i.e. to easily define ‘delocalised’ bonds?)
Scope
Start with C, O, H (reasonably low dimensionality), then extend to N and S (much higher dimensionality).
Train on at least some molecules, ideally taken from something like the NCI list or Enamine REAL.
Methodology
The optimisation of the BCC parameters against ESP data can be written as a Bayesian linear regression problem (or at least, using the same form for the loss function).
A given set of $n_b$ bond charge correction parameters can be collected into an $n_b \times 1$ vector $\boldsymbol{b}$ and applied to a particular molecule using an assignment matrix $T$ such that

$$\boldsymbol{v} = T \boldsymbol{b}$$

where, if $N$ is the number of atoms in the molecule, $T$ is $N \times n_b$ and $\boldsymbol{v}$ is the $N \times 1$ vector of partial charge corrections to apply to each atom. $T_{ij}$ represents the number of times that BCC parameter $j$ should be applied to atom $i$; e.g. in the case of methanol, which would only need a single BCC parameter, $T$ reduces to a single column.

The total partial charges on the molecule are then

$$\boldsymbol{q} = \boldsymbol{q}^{\mathrm{QM}} + T \boldsymbol{b}$$

where $\boldsymbol{q}^{\mathrm{QM}}$ are a set of partial charges computed directly from a QM method such as AM1. The ESP at a set of $M$ grid points can be computed (in atomic units) as

$$\phi_i = \sum_{j=1}^{N} \frac{q_j}{r_{ij}}, \qquad \text{i.e.} \qquad \boldsymbol{\phi} = R \boldsymbol{q}, \qquad R_{ij} = \frac{1}{r_{ij}}$$

where $r_{ij}$ is the distance between grid point $i$ and atom $j$.

$\boldsymbol{\phi}$ can be split into the contributions from the QM charges and the charge corrections:

$$\boldsymbol{\phi} = \boldsymbol{\phi}^{\mathrm{QM}} + X \boldsymbol{b}$$

where $\boldsymbol{\phi}^{\mathrm{QM}} = R \boldsymbol{q}^{\mathrm{QM}}$ and $X = R T$.

Here we denote $X_k = R_k T_k$ as the design matrix for molecule (or conformer) $k$.

Assuming a normal likelihood, the Bayesian likelihood function then becomes

$$p(\boldsymbol{y}_k | \boldsymbol{b}, \sigma) = \frac{1}{(2 \pi \sigma^2)^{M/2}} \exp\left(-\frac{\|\boldsymbol{y}_k - X_k \boldsymbol{b}\|^2}{2 \sigma^2}\right)$$

where

$$\boldsymbol{y}_k = \boldsymbol{\phi}_k^{\mathrm{ref}} - \boldsymbol{\phi}_k^{\mathrm{QM}}$$

and $\boldsymbol{\phi}_k^{\mathrm{ref}}$ is an $M \times 1$ vector of the target ESP data for molecule $k$.

For $K$ molecules / conformers the design matrices and residual vectors are stacked:

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_K \end{bmatrix} \qquad \text{and} \qquad \boldsymbol{y} = \begin{bmatrix} \boldsymbol{y}_1 \\ \vdots \\ \boldsymbol{y}_K \end{bmatrix}$$

None of the $X_k$ or $\boldsymbol{y}_k$ depend upon the values of $\boldsymbol{b}$ or $\sigma$, and so they can be pre-computed before the optimisation, making the likelihood function rapid to evaluate.
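A minimal numpy sketch of this precomputation (the function names and per-molecule inputs are illustrative assumptions, not an existing API):

```python
import numpy as np

# For each molecule k: an M_k x N_k inverse-distance matrix R_k, an
# N_k x n_b assignment matrix T_k, QM charges q_qm_k, and target ESP
# values phi_ref_k.
def precompute(R_list, T_list, q_qm_list, phi_ref_list):
    # Stack the per-molecule design matrices X_k = R_k @ T_k and the
    # residual targets y_k = phi_ref_k - R_k @ q_qm_k.
    X = np.vstack([R @ T for R, T in zip(R_list, T_list)])
    y = np.concatenate(
        [phi - R @ q for phi, R, q in zip(phi_ref_list, R_list, q_qm_list)]
    )
    return X, y

def log_likelihood(b, sigma, X, y):
    # Normal likelihood: sum_i log N(y_i | (X @ b)_i, sigma^2).
    residuals = y - X @ b
    n = len(y)
    return -0.5 * n * np.log(2.0 * np.pi * sigma**2) - (
        residuals @ residuals
    ) / (2.0 * sigma**2)
```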
Example implementations are given here:
Choice of Priors
Priors need to be chosen for each of the BCC parameters, as well as for $\sigma$.
For $\sigma$, most of the literature seems to suggest something weakly informative such as a half-Cauchy or half-Student-t distribution.
For the BCC parameters, a normal distribution with mean 0 and standard deviation 1 would seem to make sense. Given that the base QM method should capture the salient features of the electronic charge distribution, especially the formal charge distribution, the values are expected to be either positive or negative and likely less than 0.5 in magnitude.
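A minimal Pyro sketch of these choices (assuming the stacked $X$ and $\boldsymbol{y}$ from the methodology section are available as tensors; names are illustrative):

```python
import torch
import pyro
import pyro.distributions as dist

def bcc_model(X: torch.Tensor, y: torch.Tensor):
    """Bayesian linear model for the BCC fit: y ~ Normal(X @ b, sigma)."""
    n_b = X.shape[1]
    # Weakly informative half-Cauchy prior on the noise scale.
    sigma = pyro.sample("sigma", dist.HalfCauchy(1.0))
    # Independent Normal(0, 1) priors on each BCC parameter.
    b = pyro.sample("b", dist.Normal(0.0, 1.0).expand([n_b]).to_event(1))
    # Normal likelihood over the precomputed residual targets.
    with pyro.plate("grid_points", y.shape[0]):
        pyro.sample("obs", dist.Normal(X @ b, sigma), obs=y)
```

Posterior samples could then be drawn with, e.g., NUTS via `pyro.infer.MCMC`.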
Bayes Factors
These could be computed by performing RJMC, but could also be computed using ‘free-energy’-like methods such as MBAR, whereby a lambda value is introduced which transforms an analytically tractable distribution into the full posterior distribution.
The ‘free-energy’ approach may be better here - it should be easily implementable in current frameworks such as Pyro, and would also mean not having to worry about designing RJMC transition matrices.
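As a toy illustration of the lambda-scaling idea (a hand-rolled sketch with a made-up prior and likelihood, not an MBAR implementation):

```python
import numpy as np

# lam = 0 -> analytically tractable reference (known normalising constant);
# lam = 1 -> full unnormalised posterior. Samples evaluated at a ladder of
# lambda states give the energy matrix an MBAR implementation (e.g. pymbar)
# consumes to estimate ratios of normalising constants, i.e. Bayes factors.

def log_prior(theta):
    return -0.5 * np.sum(theta**2, axis=-1)            # Normal(0, 1) prior

def log_likelihood(theta):
    return -0.5 * np.sum((theta - 0.3) ** 2, axis=-1)  # toy data term

def log_reference(theta):
    return -0.5 * np.sum(theta**2, axis=-1)            # tractable reference

def log_density(theta, lam):
    return (1.0 - lam) * log_reference(theta) + lam * (
        log_prior(theta) + log_likelihood(theta)
    )

lambdas = np.linspace(0.0, 1.0, 5)
samples = np.random.default_rng(0).normal(size=(100, 3))  # stand-in samples
u_kn = np.array([-log_density(samples, lam) for lam in lambdas])
```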
ESP
The ESP data will be computed on an FCC grid (spacing TBD), using an aug-cc-pV(D+d)Z basis and the PW6B95 method, which were highlighted by the RESP2 publication as yielding a strong balance of performance and accuracy.
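A rough Psi4 sketch of this workflow (a minimal sketch with placeholder spacing/shell values and a water stand-in molecule; assumes PW6B95 is available in the installed Psi4 and that grid.dat follows the standard GRID_ESP convention):

```python
import numpy as np
import psi4

molecule = psi4.geometry("""
O  0.0000  0.0000  0.1173
H  0.0000  0.7572 -0.4692
H  0.0000 -0.7572 -0.4692
""")
coords = np.array(molecule.geometry()) * psi4.constants.bohr2angstroms

def fcc_grid(coords, spacing=0.7, inner=1.4, outer=2.8):
    """FCC lattice points within a shell around the molecule (Angstrom)."""
    lo, hi = coords.min(axis=0) - outer, coords.max(axis=0) + outer
    base = np.stack(
        np.meshgrid(*[np.arange(l, h, spacing) for l, h in zip(lo, hi)]),
        axis=-1,
    ).reshape(-1, 3)
    # The four lattice points of the conventional cubic FCC cell.
    offsets = spacing * np.array(
        [[0, 0, 0], [0, 0.5, 0.5], [0.5, 0, 0.5], [0.5, 0.5, 0]]
    )
    points = (base[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
    min_dist = np.linalg.norm(
        points[:, None, :] - coords[None, :, :], axis=-1
    ).min(axis=1)
    return points[(min_dist >= inner) & (min_dist <= outer)]

# grid.dat is written in Angstrom here; double-check the unit convention
# Psi4 expects for the GRID_ESP property.
np.savetxt("grid.dat", fcc_grid(coords))

# Request the ESP at each grid point; Psi4 writes the values (a.u.) to
# grid_esp.dat.
psi4.set_options({"basis": "aug-cc-pv(d+d)z"})
_, wfn = psi4.energy("pw6b95", return_wfn=True)
psi4.oeprop(wfn, "GRID_ESP")
esp = np.loadtxt("grid_esp.dat")
```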
Preliminary (non-Bayesian) studies
One of the first questions in using a Bayesian inference approach (or any optimization/parameter search) is whether the target we are using in our posterior is representative of the properties we want to get right.
Specifically, in this case, our posterior is computed by comparing BCC charges to electric field values. The advantage of this method is twofold:
This is the canonical method of training BCC parameters (and electrostatics more generally).
It is relatively computationally cheap, because the electric fields do not need to be recomputed every time the BCC parameters are changed.
However, this does raise the question of whether agreement with this RESP2-like electric field target is representative of our overall target (i.e. physical properties such as solvation free energies, mixture enthalpies, densities, etc.).
More granularly, the question is whether improvements in the ForceBalance optimization target correspond to improvements in physical property performance.
Does the ForceBalance target correspond to physical property accuracy?
In order to test how the ForceBalance target differentiates between parameter sets, I ran a set of 50 optimizations against the same target, but with perturbed initial BCC values (drawn from Norm(mu, 2*mu), where mu is the openff-1.2.0 value; see the sketch below). The optimizations have very different starting values, but all reach relatively similar final objective values.
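A sketch of my reading of that perturbation recipe (the mu values are placeholders, not the real openff-1.2.0 BCCs):

```python
import numpy as np

# Draw 50 perturbed starting BCC vectors, element-wise from
# Normal(mu, |2 * mu|). Placeholder mu values for illustration only.
rng = np.random.default_rng(0)
mu = np.array([0.05, -0.12, 0.01])
starts = rng.normal(loc=mu, scale=np.abs(2.0 * mu), size=(50, mu.size))
```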
Looking at the values of these BCC sets, we see a wide variety despite the similar objective functions:
In order to get a sense of the magnitude of the differences between these BCC sets, I identified the two most different sets and benchmarked them against some solvation free energy data for ethers and alcohols (two moieties known to have some problems with the current FF). Even though these sets have different BCC values, their final objective functions were extremely similar (2.157 vs 2.177):
In this benchmark, we observe a trade-off in quality of reproducing ethers and alcohols:
For the set that improved alcohols (perturb_32), we find that there is a systematic reduction of error in solvation free energies (perturb_32 overpredicts Gsolv less):
For the set that improved ethers (perturb_5), this is less clear, but still seems to happen (perturb_5 underpredicts Gsolv less):
This isn’t clear-cut evidence either way for how predictive the objective function is (additional benchmarks are ongoing), since these sets do better at different moieties. However, it does identify a trade-off in BCC values: alcohols and ethers seem to pull the parameters in opposite directions!