GNN Charge Models
Driver | |
Approver | @Lily Wang |
Contributors | @Simon Boothroyd Yuanqing Wang @Joshua Horton |
Informed | @David Mobley @Michael Shirts @Michael Gilson @Daniel Cole |
Objective | Develop a graph convolutional charge model that can produce AM1(BCC*) charges to within an accuracy comparable to the original AM1BCC model *BCCs need not be the original Bayly et al parameters |
Due date | |
Key outcomes | |
Status | in progress |
Problem Statement
Applying charge models to larger molecules, especially biopolymers, is currently intractable due to the cost of the QC calculations required. This leads to force fields often needing to define one charge model for proteins, another for small molecules and so on. Ideally a single, self consistent, charge model could be applied to all molecules within a simulation box. Further, charges for proteins are typically defined only for the standard amino-acids, and non-standard residues or post-translational modifications require a splicing, capping, and re-combination scheme which can be cumbersome to implement. A graph convolutional model on the other hand would simply be able to ingest the full protein, complete with any modifications, and yield a set of charges that are self-consistent with both the other residues and any ligands and solvent.
Scope
Must have: | |
---|---|
Nice to have: | |
Not in scope: |
Workplan
Open Software
The software required to carry out this project spans most of the OpenFF stack given that changing the charge model will likely have large impacts on everything downstream (i.e. vdW, valence, …). It is expected that this project will require maintenance of and extensions to:
nagl |
|
|
---|---|---|
OpenFF Recharge |
|
|
splore |
|
|
molesp |
|
|
OpenFF Evaluator |
|
|
nonbonded |
|
|
OpenFF Bespokefit |
|
|
absolv |
|
|
Open Data
The project will at minimum need a diverse train set of precomputed AM1(BCC) charges to train and test the model against as well as a test set of ESP data that is made publicly available via QCA.
We will also generate a data set of RESP charges using data in the OpenFF ESP Fragment Conformers v1.0
QCA data set.
Selecting the AM1(BCC) train / test set
The GNN charge model will be initially trained on the training sets of molecules assembled by Riniker and Bleiziffer (esp=78): https://doi.org/10.3929/ethz-b-000230799
The test set will be composed of molecules from the OpenFF Industry Benchmark Season 1 Public
set available on the QCA.
A validation set composed of the molecules found in the Enamine 10K diversity set will additionally be used, especially when performing hyperparameter sweeps
All molecule sets are additionally augmented with up to two protomers enumerated using the nagl prepare enumerate
command.
See data-set-curation/qc-charges/submit-curate-partial-charge-set.sh
and data-set-labelling/label-am1-charges.sh
for additional details
ESP Level of theory
It was decided to compute the ESP at the HF/6-31G* level theory as is the current norm in the field. Although not perfect, it is not clear that another candidate that has the right balance of speed to compute and ‘accuracy' (defined in terms of how well do the final charges reproduce properties of interest e.g. Gsolv, Gbind). See
https://openforcefieldgroup.slack.com/archives/CDR1P66Q2/p1635523613025700?thread_ts=1635406892.011100&cid=CDR1P66Q2 for details on this choice.
Generating ESP
The process by which the data is generated is expect to proceed by:
Generate a diverse set of ELF conformers for each molecule in the train (/test) set
Minimize each conformer at HF/6-31G* level theory (no wave functions stored to save storage space on QCA)
Discard any conformers that
Formed h-bonds after minimization
Underwent a connectivity change
Perform a single-point calculation on the minimized conformer to compute and store the wave function
ESP is recomputed from the data on QCA and stored locally using OpenFF recharge
Selecting the ESP test set
The OpenFF Industry Benchmark Season 1 v1.1
set of molecules was used as the ESP test set.
Selecting the RESP train / test set
Train / test RESP charges will be generated using data in the OpenFF ESP Fragment Conformers v1.0
and OpenFF ESP Industry Benchmark Set v1.1
QCA data sets respectively using the OpenFF Recharge framework.
Selecting the physical property test set
The Ghyd test set contained the entries in FreeSolv that would be assigned a v-site parameter as described in the next section.
Open Science
The main focus of this project should be to produce a MVP GNN charge model that performs with a similar accuracy as the popular AM1BCC model. This may either be by learning AM1 charges and training a new set of BCCs on top of this, or be training to AM1BCC directly. Learnining AM1 charges and then correcting the model with BCCs may yield a better charge model as the GNN model does not itself need to be quite so accurate, but rather must just capture the salient features of the electron distribution within a molecule.
A secondary goal for the project would then be to assess whether we can instead train the GNN model to more accurate RESP charges.
The performance of all models will be assessed against deviations of the GNN model vs the original AM1BCC model, the ESP, and a set of physical property measurements likely Ghyd and Gsolv
We opt to test against the ESP given the correlation between improving performance on ESPs and improving the performance on free energies. This metric is likely more meaningful than deviations between original AM1(BCC) charges and model predicted charges as it is not clear how ‘big’ of a deviation should be concerning, and the maginitude of a 'concerning deviation’ will likely depend on the chemical environment the atom is in (i.e. large difference in H charges likely more important than deviations of a similar magnitude of Br given the higher number of Hs in a molecule)
GNN architecture
Building off of the earlier work of Yuanqing and Josh, we intend to use a NN architecture that begins with a series of GCN layers that yield a continuous feature vector for each atom in the molecule, followed by a series of dense readout layers that predict the hardness and softness of each atom in the molecule, which can then be used to compute the charge on each atom in the molecule.
At present we intend to also use GraphSAGE GCN layers given their robustness and widespread use in the field, but consideration for switching to a GCN type that supports edge features such as E-GraphSAGE should be considered in the future.
A hyperparameter sweep should be performed to determine:
The number of GCN layers and the number of hidden features for each layer
The number of NN layers and the number of hidden features for each layer
The batch size
The learning rate
where the model with the lowest validation RMSE should likely be preferred. See train-charge-models/gnn-charge-models/submit-train-am1-model-hparams.sh
for scripts to run the sweep.
Atom features
The core features should include the total charge on a molecule, and:
The connectivity of the atom
The element of the atom
Whether the atom is in a ring of X size where X in [3, 4, 5, 6] at minimum
An additional feature to consider is the average formal charge on each atom, where the average is computed over the relevant resonance forms. Here we define ‘relevant’ as the stable resonance forms that differ by an electron transfer between two heteroatoms, and don’t consider different kekule forms of e.g. benzene. See the Resonance section for details.
We likely should not include features such as aromaticity which tie the GCN to a likely flawed aromaticity model, and direct formal charge on atoms as this will depend on the input resonance structure.
Resonance
A GCN will typically not consider atoms that are more than 4-5 atoms away from the atom being considered due to vanishing gradient issues. This means that important phenomenology, such as resonance atoms separated by several bonds
will not be accurately captured. To ensure that resonance information is correctly incorporated into the GCN, we propose applying a modified version of the resonance form enumeration method proposed by Gilson et al (10.1021/ci034148o) to enumerate over all relevant resonance forms of a molecule, and then average the formal charge on each atom across those forms.
The algorithm requires some modification however due to the combinatoric nature of resonance forms. As an exaggerated example, consider a peptide composed of 300 deprotonated glutamic acid residues. In this case there would be 2^300 resonance forms to enumerate!
We propose to instead:
Remove all sp3 carbons and hydrogens from the molecule as these cannot be involved in electron transfers
For every isolated sub-graph
apply the V-Charge resonance enumeration algorithm
compute the average formal charge across sub-resonance forms
This approach reduces the above example then to simply computing average formal charges across 2*300 sub-graphs which is significantly more tractable.
There is some question of how useful it is to include this resonance sensitive feature - i.e. would we have a model with comparable accuracy if we skipped this somewhat cumbersome enumeration step? This should likely be explored as part of this project.
Training procedure
Three different GCN models should be trained once relevant hyperparameters have been selected:
a GCN trained to AM1 charges directly and
a GCN trained to AM1BCC charges
For the model trained only to AM1 charges a set of BCC parameters (taken from the original AM1BCC model) will then be trained to the ESP training data as described above.
An additional set of BCCs should also be trained on top to the GCN-AM1BCC model with initial values of 0.0. This will again allow the GNN some margin for error rather than needing to produce ‘perfect’ AM1BCC charges as we can correct these against ESP data as was the original intent of the BCCs in the AM1BCC model (i.e. we know AM1 is not perfect but we can greatly improve it’s accuracy through BCCs).
In all above cases:
The vdW parameters of the Sage force field augmented with the new charge model will be trained against the Sage physical property training set
The valence terms of the force field yielded in step 2. will be trained against the Sage QC training set.
Test procedure
The performance of the new force field will be initially assessed against two main metrics:
deviation from the ESP of the test set defined above, were the ESP produced by RESP charges will be used as a reference
hydration free energies from the FreeSolv data set, and compared using similar metrics as used in 10.26434/chemrxiv-2021-gsgr4-v3 .
Current Status
The files associated with this project can be found at and on lilac under /data/chodera/boothros/gnn-charge-models
Open Software
The software is continually evolving to meet the needs of this project, but at present meets the need of
training GNN charge models, and performing parameter sweeps
training BCCs on top of a GNN charge model.
re-training the vdW parameters of Sage on top of the new charge model
re-training the valence parameters of Sage on top of the new charge model + vdW parameters
benchmarking against Ghyd
The contains a lot of code that should likely be upstreamed as it is of general utility and can be used to more easily automate the performed fits.
Open Data
The reconstructed ESP data is available on lilac
in the data-set-labelling/qc-esp
subdirectory while the AM1(BCC) charges are located in data-set-labelling/am1-charges
.
Open Science
Training the GNN models
As an initial pass to ensure the fitting infrastructure works correctly we proceed with hyperparameters selected in the original espaloma preprint. Namely, 3 GCN layers with 128 hidden features each, 4 NN layers with 128 hidden features each, a learning rate of 0.001 and a batch size of 256.
Training the BCCs
Test
* Espaloma appears to assign charges that sum to 0.0 even when the molecule has a non-zero total charge. Here we filter any charged molecules to give a fairer comparison.