GNN Charge Models

Driver

Approver

@Lily Wang

Contributors

@Simon Boothroyd Yuanqing Wang @Joshua Horton

Informed

@David Mobley @Michael Shirts @Michael Gilson @Daniel Cole

Objective

Develop a graph convolutional charge model that can produce AM1(BCC*) charges with an accuracy comparable to that of the original AM1BCC model.

*The BCCs need not be the original Bayly et al. parameters.

Due date

Key outcomes

Status

in progress

 

Problem Statement

Applying charge models to larger molecules, especially biopolymers, is currently intractable due to the cost of the QC calculations required. As a result, force fields often need to define one charge model for proteins, another for small molecules, and so on. Ideally, a single, self-consistent charge model could be applied to all molecules within a simulation box. Further, charges for proteins are typically defined only for the standard amino acids, so non-standard residues or post-translational modifications require a splicing, capping, and re-combination scheme that can be cumbersome to implement. A graph convolutional model, on the other hand, could simply ingest the full protein, complete with any modifications, and yield a set of charges that is self-consistent with the other residues and with any ligands and solvent.

Scope

Must have:

  •  

Nice to have:

  •  

Not in scope:

  •  

 

Workplan

Open Software

The software required to carry out this project spans most of the OpenFF stack, given that changing the charge model will likely have large impacts on everything downstream (e.g. vdW, valence, …). It is expected that this project will require maintenance of and extensions to:

nagl

  • Define and train GNN charge models with flexible atom features
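As a rough illustration of what such a model computes, the sketch below applies one mean-aggregation graph convolution to per-atom features and then reads out partial charges that sum to the molecule's total charge. It does not use the nagl API. The function names, the one-hot element features, the hand-picked (untrained) weights, and the uniform-shift readout are all assumptions made for illustration only.

```python
def matvec(matrix, vector):
    """Multiply a small dense matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def graph_conv_layer(features, bonds, w_self, w_neigh):
    """One graph convolution: h_i' = relu(W_self h_i + W_neigh mean_j h_j).

    features: one feature vector per atom
    bonds:    list of (i, j) atom index pairs
    """
    n_atoms = len(features)
    neighbours = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    updated = []
    for i in range(n_atoms):
        dim = len(features[i])
        if neighbours[i]:
            # Average the feature vectors of all bonded neighbours.
            mean = [
                sum(features[j][k] for j in neighbours[i]) / len(neighbours[i])
                for k in range(dim)
            ]
        else:
            mean = [0.0] * dim
        pre = [
            a + b
            for a, b in zip(matvec(w_self, features[i]), matvec(w_neigh, mean))
        ]
        updated.append([max(0.0, x) for x in pre])  # ReLU activation
    return updated


def readout_charges(hidden, w_out, total_charge):
    """Map hidden features to charges, then shift uniformly to conserve charge."""
    raw = [sum(w * h for w, h in zip(w_out, h_i)) for h_i in hidden]
    shift = (total_charge - sum(raw)) / len(raw)
    return [r + shift for r in raw]


# Toy example: water, with one-hot [is_O, is_H] features (hypothetical choice).
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
bonds = [(0, 1), (0, 2)]
w_self = [[1.0, 0.0], [0.0, 1.0]]   # untrained, hand-picked weights
w_neigh = [[0.5, 0.0], [0.0, 0.5]]
w_out = [-1.0, 0.5]

hidden = graph_conv_layer(features, bonds, w_self, w_neigh)
charges = readout_charges(hidden, w_out, total_charge=0.0)
print(charges)
```

Because the layer depends only on the bond graph and the atom features, topologically equivalent atoms (here the two hydrogens) receive identical charges, and the same code applies unchanged to a molecule of any size.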

OpenFF Recharge

  • Reconstruct ESP and EF data from QCA records

  • Estimate ESP / EF using a FF model

  • Train BCC parameters on top of the GNN charge model and export them into a SMIRNOFF force field

  • Generate RESP charges to serve as a ‘reference model’

  • Provide a SMIRKS representation of the AM1BCC BCCs
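The core arithmetic behind several of these tasks can be sketched in a few lines. The example below is not the OpenFF Recharge API. It assumes that BCCs are supplied as a hypothetical mapping from bonded atom index pairs to a correction value (in practice they would be assigned by SMIRKS matching), that a correction is added to the first atom and subtracted from the second, and that coordinates are in bohr so that the ESP comes out in atomic units.

```python
import math


def apply_bccs(base_charges, bond_corrections):
    """Add bond charge corrections on top of a set of base charges.

    bond_corrections maps (i, j) -> value; the value is added to atom i and
    subtracted from atom j, so the total charge is unchanged.
    """
    charges = list(base_charges)
    for (i, j), value in bond_corrections.items():
        charges[i] += value
        charges[j] -= value
    return charges


def esp_from_point_charges(charges, coords, grid):
    """ESP at each grid point from atom-centred point charges: sum_i q_i / r_i."""
    esp = []
    for point in grid:
        esp.append(
            sum(q / math.dist(point, atom) for q, atom in zip(charges, coords))
        )
    return esp


def rmse(predicted, reference):
    """Root-mean-square error between two equally sized lists of values."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Toy water-like example with made-up base charges and corrections.
base = [-0.6, 0.3, 0.3]
corrected = apply_bccs(base, {(0, 1): -0.05, (0, 2): -0.05})

# Toy two-site example: +0.5 e at the origin, -0.5 e at x = 2 bohr.
dipole_esp = esp_from_point_charges(
    [0.5, -0.5],
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(-2.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
)
print(corrected, dipole_esp)
```

Comparing the output of `esp_from_point_charges` against a reference QC ESP with `rmse` gives the kind of objective that BCC parameters could be trained against.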

splore

  • Easily visualise data sets of molecules stored either locally or on QCA

molesp

  • Rapidly compute / visualise the ESP on the vdW surface of a molecule

OpenFF Evaluator

  • Used to compute the training set of properties while training the vdW parameters

nonbonded

  • Automate the set-up of training the vdW parameters against the phys-prop data

OpenFF Bespokefit

  • Automate the set-up of training the valence parameters against QC data

absolv

  • Benchmark FF with v-sites against solvation / hydration free energy data

Open Data

The project will at minimum need a diverse training set of precomputed AM1(BCC) charges to train and test the model against, as well as a test set of ESP data that is made publicly available via QCA.

We will also generate a data set of RESP charges using data in the OpenFF ESP Fragment Conformers v1.0 QCA data set.
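For orientation, the least-squares core of an ESP charge fit can be sketched as follows. This is not the full RESP procedure: the hyperbolic restraints and staged fitting are deliberately omitted, leaving only a fit of atom-centred charges to ESP values subject to a total-charge constraint enforced with a Lagrange multiplier. All function names are hypothetical, coordinates are assumed to be in bohr, and the toy 'reference' ESP is generated from known point charges rather than taken from QCA.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in reversed(range(n)):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_esp_charges(coords, grid, esp, total_charge):
    """Least-squares fit of point charges to ESP values with fixed total charge."""
    n = len(coords)
    # design[k][i] = 1 / |grid_k - atom_i|
    design = [[1.0 / math.dist(p, a) for a in coords] for p in grid]
    size = n + 1  # n charges plus one Lagrange multiplier
    lhs = [[0.0] * size for _ in range(size)]
    rhs = [0.0] * size
    for i in range(n):
        for j in range(n):
            lhs[i][j] = sum(row[i] * row[j] for row in design)
        rhs[i] = sum(row[i] * v for row, v in zip(design, esp))
        lhs[i][n] = 1.0  # constraint column
        lhs[n][i] = 1.0  # constraint row: sum of charges = total_charge
    rhs[n] = total_charge
    return solve_linear(lhs, rhs)[:n]


# Toy check: recover known charges from the ESP they generate.
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
grid = [(-3.0, 0.0, 0.0), (5.0, 0.0, 0.0), (1.0, 3.0, 0.0),
        (0.0, -3.0, 1.0), (2.0, 2.0, 2.0)]
true_charges = [0.4, -0.4]
esp = [sum(q / math.dist(p, a) for q, a in zip(true_charges, coords))
       for p in grid]
fitted = fit_esp_charges(coords, grid, esp, total_charge=0.0)
print(fitted)
```

A production fit would use QC-derived ESP values on a much denser grid and add the RESP restraint terms, which make the problem non-linear and require an iterative solve.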

Selecting the AM1(BCC) train / test set

The GNN charge model will initially be trained on the training sets of molecules assembled by Riniker and Bleiziffer (eps=78): https://doi.org/10.3929/ethz-b-000230799

The test set will be composed of molecules from the OpenFF Industry Benchmark Season 1 public set available on QCA.

A validation set composed of the molecules found in the Enamine 10K diversity set will additionally be used, especially when performing hyperparameter sweeps.

All molecule sets are additionally augmented with up to two protomers enumerated using the nagl prepare enumerate command.

See data-set-curation/qc-charges/submit-curate-partial-charge-set.sh and data-set-labelling/label-am1-charges.sh for additional details.

ESP Level of theory

It was decided to compute the ESP at the HF/6-31G* level of theory, as is the current norm in the field. Although this choice is not perfect, it is not clear that there is another candidate with the right balance of computational speed and ‘accuracy’ (defined in terms of how well the final charges reproduce properties of interest, e.g. Gsolv, Gbind). See