GNN Charge Models
Driver | |
Approver | @Lily Wang |
Contributors | @Simon Boothroyd Yuanqing Wang @Joshua Horton |
Informed | @David Mobley @Michael Shirts @Michael Gilson @Daniel Cole |
Objective | Develop a graph convolutional charge model that can produce AM1(BCC*) charges with an accuracy comparable to the original AM1BCC model. *The BCCs need not be the original Bayly et al. parameters. |
Due date | |
Key outcomes | |
Status | in progress |
Problem Statement
Applying charge models to larger molecules, especially biopolymers, is currently intractable due to the cost of the QC calculations required. As a result, force fields often define one charge model for proteins, another for small molecules, and so on. Ideally a single, self-consistent charge model could be applied to all molecules within a simulation box. Further, charges for proteins are typically defined only for the standard amino acids, and non-standard residues or post-translational modifications require a splicing, capping, and re-combination scheme that can be cumbersome to implement. A graph convolutional model, on the other hand, could simply ingest the full protein, complete with any modifications, and yield a set of charges that is self-consistent with the other residues and with any ligands and solvent.
Scope
Must have: | |
|---|---|
Nice to have: | |
Not in scope: |
Workplan
Open Software
The software required to carry out this project spans most of the OpenFF stack, given that changing the charge model will likely have a large impact on everything downstream (e.g. vdW, valence, …). It is expected that this project will require maintenance of, and extensions to:
nagl | | |
|---|---|---|
OpenFF Recharge | | |
splore | | |
molesp | | |
OpenFF Evaluator | | |
nonbonded | | |
OpenFF Bespokefit | | |
absolv | | |
Open Data
The project will at minimum need a diverse data set of precomputed AM1(BCC) charges to train and test the model against, as well as a test set of ESP data that is made publicly available via QCA.
We will also generate a data set of RESP charges using data in the OpenFF ESP Fragment Conformers v1.0 QCA data set.
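The RESP charges mentioned above rest on fitting atom-centred point charges to ESP values sampled on a grid. The following is a minimal sketch of the unrestrained part of such a fit, assuming NumPy and atomic units throughout. The function name `fit_esp_charges` and the toy geometry are hypothetical, the hyperbolic restraints that distinguish RESP from a plain ESP fit are deliberately omitted, and none of this is the OpenFF Recharge API.

```python
import numpy as np


def fit_esp_charges(atom_xyz, grid_xyz, esp, total_charge=0.0):
    """Least-squares fit of point charges to ESP values (atomic units).

    Minimises |A q - esp|^2 subject to sum(q) == total_charge, where
    A[i, j] = 1 / |grid_i - atom_j|. No RESP restraints are applied.
    """
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    grid_xyz = np.asarray(grid_xyz, dtype=float)
    esp = np.asarray(esp, dtype=float)

    # Inverse distances between every grid point and every atom.
    distances = np.linalg.norm(
        grid_xyz[:, None, :] - atom_xyz[None, :, :], axis=-1
    )
    design = 1.0 / distances

    # Normal equations bordered by one Lagrange multiplier row / column
    # that enforces the total-charge constraint.
    n_atoms = atom_xyz.shape[0]
    kkt = np.zeros((n_atoms + 1, n_atoms + 1))
    kkt[:n_atoms, :n_atoms] = design.T @ design
    kkt[:n_atoms, n_atoms] = 1.0
    kkt[n_atoms, :n_atoms] = 1.0
    rhs = np.concatenate([design.T @ esp, [total_charge]])

    solution = np.linalg.solve(kkt, rhs)
    return solution[:n_atoms]  # drop the Lagrange multiplier


# Toy check: build an ESP from known charges, then recover them.
rng = np.random.default_rng(0)
atoms = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
true_charges = np.array([0.4, -0.7, 0.3])

directions = rng.normal(size=(50, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
grid = directions * rng.uniform(4.0, 6.0, size=(50, 1))

reference_esp = (
    1.0 / np.linalg.norm(grid[:, None, :] - atoms[None, :, :], axis=-1)
) @ true_charges
fitted = fit_esp_charges(atoms, grid, reference_esp, total_charge=0.0)
```

Because the toy ESP is generated from point charges, the fit should recover them almost exactly; with real QC ESP data the residual is non-zero and the restraint terms start to matter.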
Selecting the AM1(BCC) train / test set
The GNN charge model will initially be trained on the training sets of molecules assembled by Riniker and Bleiziffer (eps=78): https://doi.org/10.3929/ethz-b-000230799
The test set will be composed of molecules from the OpenFF Industry Benchmark Season 1 public set available on QCA.
A validation set composed of the molecules found in the Enamine 10K diversity set will additionally be used, especially when performing hyperparameter sweeps.
All molecule sets are additionally augmented with up to two protomers, enumerated using the nagl prepare enumerate command.
See data-set-curation/qc-charges/submit-curate-partial-charge-set.sh and data-set-labelling/label-am1-charges.sh for additional details.
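Since the train, validation, and test sets come from three different sources, a simple sanity check is to look for molecules that appear in more than one of them. The sketch below assumes each molecule has already been reduced to a canonical identifier string (for example a canonical SMILES). The function name `find_overlap` and the example entries are hypothetical, and the scripts referenced above may or may not perform an equivalent check.

```python
def find_overlap(named_sets):
    """Return the identifiers shared between each pair of named sets.

    `named_sets` maps a set name to an iterable of canonical identifiers.
    Only pairs with at least one shared identifier are reported.
    """
    names = list(named_sets)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(named_sets[first]) & set(named_sets[second])
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Placeholder identifiers, not taken from the real data sets.
splits = {
    "train": ["CCO", "c1ccccc1", "CC(=O)O"],
    "validation": ["CCN", "CC(=O)O"],
    "test": ["CCOC", "c1ccncc1"],
}
overlaps = find_overlap(splits)
```

An empty result would indicate that no identifier is shared between any two sets; here the placeholder data intentionally shares one entry between train and validation.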
ESP Level of theory
It was decided to compute the ESP at the HF/6-31G* level of theory, as is the current norm in the field. Although this choice is not perfect, it is not clear that another candidate has the right balance of computational speed and 'accuracy' (defined in terms of how well the final charges reproduce properties of interest, e.g. Gsolv, Gbind). See