Driver

Approver

Contributors

Stakeholder

Owen Madin

Simon Boothroyd

Objective

To use Bayesian methods to explore the interdependence and correlations between bond charge correction parameters, and to quantify the performance of different combinations of atom types.

Key outcomes

Status

NOT STARTED

NOTE - This is just an initial brain dump so far. Owen Madin will refine this into a full study.

Problem Statement

AM1BCC parameters have been converted to a SMIRNOFF compatible format:

In general a BCC parameter is constructed by combinatorially concatenating two atom environments (defined by separate SMIRKS patterns) with a bond environment (single, double, delocalised, also defined by a SMIRKS pattern).

Ideally we would like to refit these against at least QM electrostatic potential (ESP) data.

There are many different, and in some cases very specific, atom and bond types which don’t necessarily map well into the SMIRKS language, e.g. (highly) delocalised lone pairs on nitrogens and dative bonds. Further, there are a total of 354 types - are all of these truly justified by the data we are trying to reproduce?

Bayesian methods give access to posterior distributions from which the correlations (including more complex, non-linear relationships) between BCC parameters can be easily discerned. Further, the computation of Bayes factors allows us to quantitatively measure where extra model complexity (in this case extra ‘atom’ types or BCC ‘types’) is justified by the data. Combined, Bayesian methods should allow us to gain data-driven insight into where we perhaps have too many atom types, and where there are types missing.

In particular, several candidate sets of atom types could be proposed (e.g. one set may have the highly delocalised and delocalised N merged into a single atom type), the Bayes factors between them computed, and an inference made as to which set the ESP data supports the most.

(Possible extension to this project - can we just define atom types in terms of an element and its connectivity, and then use Wiberg bond order interpolation to scale the BCC? Is there a way to directly incorporate the Wiberg bond order when deciding which type a bond is in? I.e. easily define ‘delocalised’ bonds?)

Scope

Start with C, O, H (reasonably low dimensionality), then extend to N and S (much higher dimensionality).

Train on at least some molecules ideally taken from something like the NCI list or Enamine REAL.

Methodology

The optimisation of the BCC parameters against ESP data can be written as a Bayesian linear regression problem (or at least, using the same form for the loss function).

A given set of bond charge correction parameters $\mathbf{p}$ can be applied to a particular molecule using an assignment matrix $T$ such that

$$\mathbf{q}_{\mathrm{corr}} = T \mathbf{p}$$

where $N$ is the number of atoms in the molecule and $\mathbf{q}_{\mathrm{corr}}$ is the $N$ x 1 vector of partial charge corrections to apply to each atom.

$T_{ij}$ represents the number of times that BCC parameter $j$ should be applied to atom $i$. E.g. in the case of methanol, which would only need a single BCC parameter, $T$ has a single column with one entry for each of the $N = 6$ atoms:

$$T = \begin{bmatrix} T_{11} \\ T_{21} \\ \vdots \\ T_{61} \end{bmatrix}$$
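As a rough illustration of how an assignment matrix could be built, the sketch below fills $T$ from a list of bonds and the index of the BCC parameter matched to each bond. The function name, the water-like toy topology, the parameter value and the sign convention (+1 on the first atom of a bond, -1 on the second, so that the net charge is conserved) are all assumptions made for the example, not taken from an existing implementation.

```python
import numpy as np


def build_assignment_matrix(n_atoms, bonds, bond_parameters, n_parameters):
    """Build T, where T[i, j] is the signed number of times BCC parameter j
    is applied to atom i.

    Assumed convention: the first atom of each bond receives +p_j and the
    second atom receives -p_j.
    """
    T = np.zeros((n_atoms, n_parameters), dtype=int)
    for (atom_a, atom_b), parameter in zip(bonds, bond_parameters):
        T[atom_a, parameter] += 1
        T[atom_b, parameter] -= 1
    return T


# Hypothetical water-like toy molecule: atom 0 bonded to atoms 1 and 2,
# with both bonds matched by the same (single) BCC parameter.
bonds = [(0, 1), (0, 2)]
bond_parameters = [0, 0]
T = build_assignment_matrix(3, bonds, bond_parameters, n_parameters=1)

p = np.array([0.1])  # illustrative value only
q_corr = T @ p       # N x 1 vector of charge corrections
```

With this convention every column of $T$ sums to zero, so the corrections never change the total charge of the molecule.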

The total partial charges $\mathbf{q}$ on the molecule are then

$$\mathbf{q} = \mathbf{q}_{\mathrm{QM}} + T \mathbf{p}$$

where $\mathbf{q}_{\mathrm{QM}}$ are a set of partial charges computed directly from a QM method such as AM1, $T$ is the assignment matrix and $\mathbf{p}$ is the vector of BCC parameters. The ESP $\mathbf{V}$ at a set of $M$ grid points can be computed (in atomic units) as

$$V_m = \sum_{i=1}^{N} \frac{q_i}{r_{mi}}$$

where $r_{mi}$ is the distance between grid point $m$ and atom $i$.

Writing $R$ for the $M$ x $N$ matrix with elements $R_{mi} = 1 / r_{mi}$, $\mathbf{V} = R \mathbf{q}$ can be split into the contributions from the QM charges and the charge corrections:

$$\mathbf{V} = \mathbf{V}_{\mathrm{QM}} + \mathbf{V}_{\mathrm{corr}}$$

where

$$\mathbf{V}_{\mathrm{QM}} = R \mathbf{q}_{\mathrm{QM}}$$

and

$$\mathbf{V}_{\mathrm{corr}} = R T \mathbf{p}$$

Here we denote

$$X_k = R_k T_k$$

as the design matrix for molecule (or conformer) $k$, where $R_k$ is the matrix of inverse grid-point-to-atom distances and $T_k$ is the assignment matrix for that molecule.

Assuming a normal likelihood, the Bayesian likelihood function then becomes

$$\mathcal{L}(\mathbf{p}, \sigma \mid \mathbf{y}_k) = \prod_{m=1}^{M} \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{\left(y_{k,m} - (X_k \mathbf{p})_m\right)^2}{2 \sigma^2}\right)$$

where

$$\mathbf{y}_k = \mathbf{V}_{\mathrm{ref},k} - \mathbf{V}_{\mathrm{QM},k}$$

and $\mathbf{V}_{\mathrm{ref},k}$ is an M x 1 vector of the target ESP data for molecule $k$.

For K molecules / conformers

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_K \end{bmatrix}$$

and

$$\mathbf{y} = \begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_K \end{bmatrix}$$

Neither $X$ nor $\mathbf{y}$ depends upon the values of $\mathbf{p}$, and so both can be pre-computed before the optimisation, making the likelihood function rapid to evaluate.
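The pre-computation and likelihood evaluation described above can be sketched with NumPy as follows. This is only a minimal sketch: the function names, the dictionary layout used to describe each conformer, and the two-atom toy data are assumptions made for illustration, and atomic units are assumed throughout.

```python
import numpy as np


def inverse_distance_matrix(grid, coords):
    """Return the (M, N) matrix R with R[m, i] = 1 / r_mi."""
    diff = grid[:, None, :] - coords[None, :, :]
    return 1.0 / np.linalg.norm(diff, axis=-1)


def precompute(conformers):
    """Stack the per-conformer design matrices X_k = R_k T_k and residual
    targets y_k = V_ref,k - R_k q_QM,k into a single X and y."""
    X_blocks, y_blocks = [], []
    for conf in conformers:
        R = inverse_distance_matrix(conf["grid"], conf["coords"])
        X_blocks.append(R @ conf["T"])
        y_blocks.append(conf["v_ref"] - R @ conf["q_qm"])
    return np.vstack(X_blocks), np.concatenate(y_blocks)


def log_likelihood(p, sigma, X, y):
    """Normal log-likelihood with a single noise scale sigma."""
    resid = y - X @ p
    return (-0.5 * y.size * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * resid @ resid / sigma**2)


# Hypothetical two-atom toy system in two conformations, one BCC parameter.
T = np.array([[1], [-1]])
q_qm = np.array([0.1, -0.1])
p_true = np.array([0.25])
grid = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0],
                 [0.0, 0.0, 3.0], [-3.0, 0.0, 0.0]])

conformers = []
for bond_length in (1.0, 1.2):
    coords = np.array([[0.0, 0.0, 0.0], [bond_length, 0.0, 0.0]])
    R = inverse_distance_matrix(grid, coords)
    # Synthetic "target" ESP generated from the known parameter value.
    v_ref = R @ (q_qm + T @ p_true)
    conformers.append({"grid": grid, "coords": coords, "T": T,
                       "q_qm": q_qm, "v_ref": v_ref})

X, y = precompute(conformers)  # done once, before any sampling
p_fit = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because `X` and `y` are fixed, each likelihood evaluation reduces to one matrix-vector product and a sum of squares.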

Example implementations are given here:

Choice of Priors

Priors need to be chosen for each of the BCC parameters, as well as for sigma.

For sigma, most literature seems to suggest something weakly informative like a half-Cauchy or a half-Student-t.

For the parameters, a Normal distribution with mean 0 and standard deviation 1 would seem to make sense. Given that the base QM method should capture the salient features of the electronic charge distribution, especially the formal charge distribution, the values are expected to be either positive or negative, and likely less than 0.5 in magnitude.
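A minimal sketch of the corresponding log-prior is given below, assuming independent Normal priors on each BCC parameter and a half-Cauchy prior on sigma. The function names and the default half-Cauchy scale of 1 are placeholders, not decided values.

```python
import math

import numpy as np


def log_prior(p, sigma, p_std=1.0, sigma_scale=1.0):
    """Log-density of independent Normal(0, p_std) priors on the BCC
    parameters plus a half-Cauchy(sigma_scale) prior on sigma."""
    if sigma <= 0.0:
        return -math.inf
    p = np.asarray(p, dtype=float)
    log_normal = np.sum(-0.5 * (p / p_std) ** 2
                        - math.log(p_std)
                        - 0.5 * math.log(2.0 * math.pi))
    log_half_cauchy = (math.log(2.0 / (math.pi * sigma_scale))
                       - math.log1p((sigma / sigma_scale) ** 2))
    return float(log_normal + log_half_cauchy)


def normal_mass_within(bound, std=1.0):
    """Prior probability that |p| < bound under Normal(0, std)."""
    return math.erf(bound / (std * math.sqrt(2.0)))


mass_below_half = normal_mass_within(0.5)
```

Under a Normal(0, 1) prior only about 38% of the prior mass lies within a magnitude of 0.5, so `normal_mass_within` may be useful for checking how strongly a given choice of standard deviation encodes the expectation stated above.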

Bayes Factors

These could be computed either by performing reversible jump Monte Carlo (RJMC), or by using ‘free-energy’ like methods such as MBAR, whereby a lambda value is introduced which transforms an analytically tractable distribution into the full posterior distribution.

The ‘free-energy’ approach may be better here - this should be easily implementable in current frameworks such as Pyro, and would also mean not having to worry about designing the transition matrices.
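To make the lambda idea concrete, the sketch below uses thermodynamic integration (a simpler relative of MBAR, swapped in here for brevity) along the path prior x likelihood^lambda, for which the log evidence is the integral over lambda from 0 to 1 of the expected log-likelihood. It is a toy under strong assumptions: sigma is held fixed, the prior is Normal(0, 1), and because this linear-Gaussian model is conjugate the expectation at each lambda is evaluated in closed form rather than from MCMC samples, which also allows a check against the exact evidence. All names and the synthetic data are hypothetical.

```python
import numpy as np


def expected_log_likelihood(X, y, sigma, lam):
    """E[log L(p)] under the power posterior  prior(p) * L(p)**lam,
    which is Gaussian for a Normal(0, I) prior and Normal likelihood."""
    n, d = X.shape
    cov = np.linalg.inv(np.eye(d) + lam * X.T @ X / sigma**2)
    mean = cov @ (lam * X.T @ y / sigma**2)
    resid = y - X @ mean
    expected_sq = resid @ resid + np.trace(X @ cov @ X.T)
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * expected_sq / sigma**2)


def log_evidence_ti(X, y, sigma, n_lambda=2001):
    """Trapezoidal thermodynamic integration over lambda in [0, 1]."""
    # Cubic spacing concentrates points near lambda = 0, where the
    # integrand changes most rapidly.
    lambdas = np.linspace(0.0, 1.0, n_lambda) ** 3
    values = np.array([expected_log_likelihood(X, y, sigma, lam)
                       for lam in lambdas])
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(lambdas)))


def log_evidence_exact(X, y, sigma):
    """Closed-form log evidence: y ~ Normal(0, sigma^2 I + X X^T)."""
    n = y.size
    cov = sigma**2 * np.eye(n) + X @ X.T
    _, logdet = np.linalg.slogdet(cov)
    return float(-0.5 * (n * np.log(2.0 * np.pi) + logdet
                         + y @ np.linalg.solve(cov, y)))


# Synthetic data: a model with two separate types versus one in which the
# two types are merged (their design matrix columns are summed).
rng = np.random.default_rng(0)
sigma = 0.2
X_split = rng.normal(size=(30, 2))
y = X_split @ np.array([0.3, -0.2]) + sigma * rng.normal(size=30)
X_merged = X_split.sum(axis=1, keepdims=True)

log_z_split = log_evidence_ti(X_split, y, sigma)
log_z_merged = log_evidence_ti(X_merged, y, sigma)
log_bayes_factor = log_z_split - log_z_merged
```

In the real problem sigma would be sampled rather than fixed and the expectation at each lambda would have to be estimated from samples, at which point an estimator such as MBAR could replace the trapezoidal rule.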

ESP

The ESP data will be computed on an FCC grid (spacing TBD) using the PW6B95 method and an aug-cc-pV(D+d)Z basis, a combination highlighted by the RESP2 publication as yielding a strong balance of computational cost and accuracy.
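For reference, generating an FCC grid is straightforward; a possible sketch is shown below. The function names, the cubic block of cells, and the single minimum-distance cutoff used to discard points too close to atoms are assumptions for illustration only, since the spacing and the selection shells are still to be decided.

```python
import numpy as np


def fcc_grid(n_cells, spacing):
    """Return FCC lattice points for an n_cells^3 block of cubic cells,
    centred on the origin. `spacing` is the nearest-neighbour distance,
    so the cubic lattice constant is spacing * sqrt(2)."""
    lattice_constant = spacing * np.sqrt(2.0)
    basis = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.0],
                      [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]])
    cells = np.array([[i, j, k]
                      for i in range(n_cells)
                      for j in range(n_cells)
                      for k in range(n_cells)], dtype=float)
    points = (cells[:, None, :] + basis[None, :, :]).reshape(-1, 3)
    points *= lattice_constant
    return points - points.mean(axis=0)


def filter_grid(points, coords, min_distance):
    """Keep only the grid points at least min_distance from every atom."""
    dist = np.linalg.norm(points[:, None, :] - coords[None, :, :], axis=-1)
    return points[dist.min(axis=1) >= min_distance]


grid = fcc_grid(n_cells=3, spacing=1.0)
atom_coords = np.array([[0.0, 0.0, 0.0]])  # hypothetical single atom
outer_grid = filter_grid(grid, atom_coords, min_distance=1.5)
```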
