Objective | Develop a procedure to extend a general small molecule force field to also model proteins self-consistently. | ||||||
Primary Driver | |||||||
Approvers | |||||||
Supporting Drivers | |||||||
Stakeholders (these people will be tagged in project update notifications on Slack) | Jeffrey Wagner Pavan Behara Iván Pulido Lily Wang Joshua Horton David Mobley | ||||||
Project Manager | |||||||
Page Owner (only this person can edit this page) | |||||||
Decision authority | Unanimity Majority of Primary Driver and all Approvers (absences are vetos), only in “Biopolymer FF call” meetings Veto authority: Primary Driver, any Approver | ||||||
Discussion/notification venue | Fortnightly “Biopolymer FF call” meetings (decision forum). NOT “FF release call” meetings. #ff-biopolymers channel on OpenFF Slack and “FF release call” meetings (notification and discussion only; no major decisions are allowed here. Meeting attendees are not assumed to have read the Slack discussions, so these must be summarized during meetings to be considered in decisions) | ||||||
Meeting notes | |||||||
Due date | 2022-01-01 | ||||||
Key outcomes |
| ||||||
Status |
|
Overview of strategy
Generate protein QC datasets for training and validation
Fit multiple models to the same training dataset
Models vary in types of torsion parameters
All models optimize valence parameters and torsion amplitudes
Benchmark protein models
Tier 1 benchmarks for all models
Tier 2 benchmarks for models that perform well in Tier 1. Specific failures in Tier 1 may lead to new models to address problems.
Tier 3 benchmarks for release candidate
SMIRNOFF format
We need SMARTS strings that can specify protein-specific terms for general amino acids. To summarize the full discussion here: https://openforcefield.atlassian.net/wiki/pages/createpage.action?spaceKey=MEET&title=2020-04-01%20AMBER%20FF%20porting%20meeting%20notes
Amber ff14SB was ported to SMIRNOFF format by using SMARTS strings that capture an entire amino acid, differentiating between main chain and terminal residues and between protonation/tautomeric states
Previous approach is not extensible for modified or synthetic amino acids
Need general SMARTS strings for backbone and side chain torsions in polypeptide chains
[#6X3](=O)-[#7X3:1]-[#6X4:2]-[#6X3:3](=O)-[#7X3H1:4]-[#6X4]
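will tag ψ for all residues except those immediately followed by proline, because atom 4 must carry one hydrogen (H1) and a proline nitrogen has none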
Training datasets
This is a list of datasets that could be used to train protein-specific force field parameters.
Relevant small molecule datasets from Parsley
Needs to be determined. In particular, did Parsley train on dipeptides or tripeptides for any of the 20 canonical amino acids?
...
Dataset name
...
Dataset type
...
QC method
...
Molecules
...
QCA submission
...
Dipeptides
...
Tripeptides
Protein-specific datasets
Cerutti tetrapeptides are a set of 185 tetrapeptides with sidechains X-Y-X, X in [Ala, Gly, Ser, Val], and Y in [Ala, Arg, Ash, Asn, Asp, Cys, Glh, Gln, Glu, Gly, Hid, Hie], excluding (X == Ser && Y == Glu). David Cerutti selected multiple conformers for each tetrapeptide.
2022-06-30 Protein FF meeting note
Start with a general SMARTS for backbone and sidechain torsions, then overwrite with residue-specific SMARTS for exceptions
Backbone SMARTS should include the alpha carbon of adjacent residues to exclude uncapped termini
Sidechain SMARTS should not include atoms in adjacent residues to include both capped and uncapped termini
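The "general first, then overwrite" strategy above relies on the SMIRNOFF precedence rule that, when several parameters match the same atoms, the parameter listed last is the one applied. The following is a minimal sketch of that ordering only. The predicate functions stand in for real SMARTS matching, and the parameter ids and the dictionary describing a torsion are hypothetical names invented for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A torsion is described by a small dict here; a real implementation would
# match SMARTS patterns against a molecule graph instead (assumption).
Torsion = Dict[str, str]
Rule = Tuple[str, Callable[[Torsion], bool]]

BETA_BRANCHED = ("ILE", "THR", "VAL")

# Hypothetical parameter list, ordered from general to residue-specific.
RULES: List[Rule] = [
    ("t-backbone-general", lambda t: t["kind"] in ("phi", "psi")),
    ("t-backbone-gly",
     lambda t: t["kind"] in ("phi", "psi") and t["residue"] == "GLY"),
    ("t-sidechain-general", lambda t: t["kind"] in ("chi1", "chi2")),
    ("t-sidechain-beta-branched",
     lambda t: t["kind"] in ("chi1", "chi2") and t["residue"] in BETA_BRANCHED),
]


def assign(torsion: Torsion, rules: List[Rule] = RULES) -> Optional[str]:
    """Return the id of the last rule that matches, or None if none match."""
    assigned = None
    for parameter_id, matches in rules:
        if matches(torsion):
            assigned = parameter_id  # later matches overwrite earlier ones
    return assigned
```

Because the last match wins, adding a residue-specific exception only requires appending a new rule after the general one; the general rule itself does not need to be edited.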
Protein QC datasets
Training QC datasets
These QC datasets will be used to supplement Sage QC training datasets.
Failures in TorsionDrives are caused by failure to converge the rotational component of the gradient. This will be fixed in the upcoming release of geomeTRIC.
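Several datasets in this section are built from the Cerutti tetrapeptides. As a sanity check on the size of that set, the sketch below enumerates the X-Y-X sidechain combinations from the residue lists given earlier on this page. It assumes that the two flanking positions vary independently and that a combination is excluded when Y is Glu and either flanking residue is Ser; neither assumption is stated explicitly in the source, but under this reading the enumeration yields the 185 peptides quoted.

```python
from itertools import product

FLANKING = ["Ala", "Gly", "Ser", "Val"]
CENTRAL = ["Ala", "Arg", "Ash", "Asn", "Asp", "Cys",
           "Glh", "Gln", "Glu", "Gly", "Hid", "Hie"]


def cerutti_sidechain_triples():
    """Enumerate (X1, Y, X2) sidechain triples under the stated assumptions."""
    triples = []
    for x1, y, x2 in product(FLANKING, CENTRAL, FLANKING):
        # Assumed reading of "excluding (X == Ser && Y == Glu)":
        # drop the triple if Y is Glu and either flanking residue is Ser.
        if y == "Glu" and "Ser" in (x1, x2):
            continue
        triples.append((x1, y, x2))
    return triples


print(len(cerutti_sidechain_triples()))  # 185
```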
Dataset name | Dataset type | QC method | Molecules | QCA submission | Status
---|---|---|---|---|---
OpenFF Protein Fragments Initial Optimization | Optimization | B3LYP-D3BJ/def2-TVPP | 16 tetrapeptides with sidechains X-Ala-X and X in [Ala, Gly, Ser, Val] | |
OpenFF Protein Capped 1-mers 3-mers Optimization | Optimization | B3LYP-D3BJ/DZVP | Ace-X-Nme and Ace-Y-X-Y-Nme; X is 26 canonical amino acids and protomers; Y is Ala or Val | OpenFF Protein Capped 1-mers 3-mers Optimization Dataset v1.0 | 756/759 complete
OpenFF Protein Capped 1-mers Backbones | TorsionDrive | B3LYP-D3BJ/DZVP | Ace-X-Nme; X is 26 canonical amino acids and protomers; 2-D scan of phi and psi; chi1 and chi2 constrained to most populated rotamer | | 25/26 complete
OpenFF Protein Capped 1-mers Sidechains | TorsionDrive | B3LYP-D3BJ/DZVP | Ace-X-Nme; X is 26 canonical amino acids and protomers; 2-D scan of chi1 and chi2; phi and psi constrained to values in alpha helix or beta strand | | 42/46 complete
OpenFF Protein Peptide Fragments constrained v1.0 | Optimization | B3LYP-D3BJ/DZVP | Cerutti tetrapeptides with constraints to avoid hydrogen bonds | |
OpenFF Protein Peptide Fragments unconstrained v1.0 | Optimization | B3LYP-D3BJ/DZVP | Cerutti tetrapeptides with no constraints | |
OpenFF Protein Fragments TorsionDrives v1.0 | TorsionDrives on ϕ, ψ, ω, χ1, and χ2 | B3LYP-D3BJ/DZVP | Cerutti tetrapeptides | (22 / 845 errored) |
OpenFF PEPCONF OptimizationDataset v1.0 | Optimization | B3LYP-D3BJ/DZVP | | (6000 / 7560 errored) |
OpenFF Benchmark Ligands | | | | |
Validation QC datasets
These datasets will be used to choose between models for protein-specific parameters (see below).
Dataset name | Dataset type | QC method | Molecules | QCA submission | Status |
---|---|---|---|---|---|
OpenFF Protein Capped 3-mers Backbones | TorsionDrive | B3LYP-D3BJ/DZVP | Ace-Y-X-Y-Nme; X is 27 canonical amino acids and protomers including cisPro and transPro; Y is Ala or Val; 2-D scan of phi and psi; chi1 and chi2 constrained to most populated rotamer | | 6/54 complete
Protein-specific parameter models
We envision several tiers of models, presented below in order of increasing number of parameter types. We will generate and benchmark simpler models first and use the results of the benchmarks to inform how to prioritize fitting and benchmarking of more complex models. Benchmarking results for simpler models may also inspire new models not listed below to address specific benchmarking failures.
The same set of training datasets will be used for each model: Sage QC training dataset and protein QC datasets described above. Protein-specific datasets will be weighted equally for each of the twenty canonical sidechains (i.e. weights of all protomers for the same sidechain will sum to one), and the total weight for protein-specific datasets and small molecule datasets will be equal.
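One way to read the weighting scheme above is sketched below. Each protomer first receives a weight of one divided by the number of protomers of its parent sidechain, so the weights for each sidechain sum to one. All protein weights are then rescaled by a common factor so that their total equals the small molecule total, which keeps the sidechains equally weighted relative to one another. The function name, the input layout, and the choice to rescale the protein side rather than the small molecule side are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def protein_weights(
    protomers: List[Tuple[str, str]],
    small_molecule_total: float,
) -> Dict[str, float]:
    """Assign weights to protomers given as (protomer name, parent sidechain).

    Protomer names are assumed to be unique.
    """
    # Number of protomers that share each parent sidechain.
    counts = Counter(sidechain for _, sidechain in protomers)
    # Within one sidechain the raw weights sum to one.
    raw = {name: 1.0 / counts[sidechain] for name, sidechain in protomers}
    # Rescale so the protein total matches the small molecule total.
    scale = small_molecule_total / sum(raw.values())
    return {name: weight * scale for name, weight in raw.items()}


# Hypothetical example: Asp has two protomers (ASP, ASH), Ala has one.
weights = protein_weights(
    [("ASP", "Asp"), ("ASH", "Asp"), ("ALA", "Ala")],
    small_molecule_total=4.0,
)
```

In this example the two Asp protomers together carry the same weight as the single Ala entry, and the three weights sum to the small molecule total.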
All valence parameters and torsion amplitudes will be optimized for each model, while Lennard-Jones parameters will be identical to Sage 2.0.0. The only difference between the models is the set of SMIRNOFF parameter types in the model. Initially, models will differ only in torsion parameter types. Additional changes, e.g. to Lennard-Jones types, will be considered only if major failures are observed in benchmarks with these models.
Charges will be modeled using ELF10 library charges derived by averaging over the flanking residues X and Z in Ace-Val-X-Y-Z-Nme. Library charges will be derived for the caps Ace and Nme and for the main chain, charged terminal, and uncharged terminal positions of the 26 canonical amino acids and common protomers. See additional details here: 2022-08-11 Protein FF meeting note
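The averaging step can be sketched as follows, assuming per-atom charges for the central residue Y have already been computed in each flanking context (X, Z). The data layout and the numerical values are hypothetical, and the sketch performs only the averaging; any further processing, such as adjusting the averaged charges to give an integer residue charge, is not described in the source and is not attempted here.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, Tuple


def average_library_charges(
    context_charges: Dict[Tuple[str, str], Dict[str, float]],
) -> Dict[str, float]:
    """Average each atom's charge over all flanking contexts (X, Z)."""
    per_atom = defaultdict(list)
    for charges in context_charges.values():
        for atom_name, charge in charges.items():
            per_atom[atom_name].append(charge)
    return {atom_name: fmean(values) for atom_name, values in per_atom.items()}


# Hypothetical charges for two atoms of a central residue in two contexts.
example = {
    ("Ala", "Ala"): {"N": -0.40, "H": 0.30},
    ("Gly", "Val"): {"N": -0.44, "H": 0.28},
}
library = average_library_charges(example)
```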
Null Model
The null model is that the small molecule force field already describes proteins well and needs no protein-specific parameters. Valence parameters (bonds and angles), torsions, and Lennard-Jones parameters will be copied from the small molecule force field. Parsley was trained on compounds that resemble protein backbone and sidechain analogs, so these parameters are likely a good first pass at describing polypeptide chains. Partial charges could be derived in several ways:
Copy library charges from existing protein force field: Amber ff14SB (RESP, unchanged from Amber ff99) or Amber ff15ipq (IPolQ)
Pro: easy to implement
Pro: we know these work pretty well in the Amber context
Con: we are no longer in the Amber context
Generate library charges for the 20 canonical amino acids (main chain and terminal) using AM1-BCC (RESP2)
Pro: consistent with Parsley; for example, we want the parameters of a serine side chain to look a lot like those of ethanol, since we have reason to believe these parameters play well with the other parameters in the FF
Pro: Lily Wang has evidence that AM1-BCC charges of fragments are similar (< 0.1 e) to the charges from a larger polymer (see https://zenodo.org/record/4977401#.YNuCk34pCpp )
Con: more effort to generate
Generate charges on-the-fly using graph convolutional networks (see https://openforcefield.atlassian.net/wiki/pages/createpage.action?spaceKey=MEET&title=2020-04-01%20AMBER%20FF%20porting%20meeting%20notes)
Maybe don’t take this approach unless/until it is also being used for the small molecules
Protein-specific Torsion (PST) Model
The Protein-specific Torsion Model includes protein-specific torsion terms, while other valence parameters (bonds and angles) and Lennard-Jones parameters will be copied from the small molecule force field. Charges will be derived using one of the methods described in the Null Model. Protein-specific torsions for the backbone torsions (ϕ, ψ, and ω) and sidechain torsions (χ1, maybe χ2) will be derived by fitting to QC datasets using dipeptides, tripeptides, and tetrapeptides. The dihedral angles with protein specific terms will be a subset of:
Flexible backbone dihedrals (ϕ and ψ)
Peptide bond dihedral (ω)
First sidechain dihedral (χ1)
Second sidechain dihedral (χ2)
A major decision for this model is which sidechains should have unique torsions that override the general peptide backbone torsion. We envision using the Protein Fragments Optimization and TorsionDrive datasets as the primary training data. Then, other datasets such as PEPCONF can be used as validation data to make decisions about the model. Alternatively, automated chemical perception (Chemical Perception) may be used to identify dihedrals that are not described well in the Parsley training set. The resulting model will likely be the candidate for the first protein force field release.
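For reference, the torsion amplitudes being fit are the k values in the periodic cosine series used for SMIRNOFF proper torsions, E(θ) = Σ k_n (1 + cos(n θ − phase_n)). The sketch below evaluates that series for one dihedral. It assumes the torsion barrier is not divided among multiple paths (idivf = 1), and the function name and tuple layout are illustrative rather than taken from any OpenFF API.

```python
import math
from typing import Iterable, Tuple


def torsion_energy(
    theta_deg: float,
    terms: Iterable[Tuple[float, int, float]],
) -> float:
    """Sum k * (1 + cos(n * theta - phase)) over (k, periodicity, phase_deg)."""
    theta = math.radians(theta_deg)
    return sum(
        k * (1.0 + math.cos(n * theta - math.radians(phase_deg)))
        for k, n, phase_deg in terms
    )


# Hypothetical two-term torsion: amplitudes in arbitrary energy units.
terms = [(1.0, 1, 0.0), (0.5, 2, 180.0)]
energy_at_trans = torsion_energy(180.0, terms)
```

A residue-specific torsion in the models below would reuse this functional form and differ only in its amplitudes and in the SMARTS pattern that selects it.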
CMAP model
The CMAP model includes a protein-specific correction map (CMAP) that fits a 2D potential in ϕ and ψ to QC datasets. This model uses the same valence parameters, Lennard-Jones parameters, and charges as the PST Model. The largest obstacle to generating this model is specifying the CMAP potential in the SMIRNOFF format.
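To illustrate what a correction map evaluates, the sketch below looks up an energy on a square, periodic grid of (ϕ, ψ) values. It uses bilinear interpolation to stay short, whereas CMAP implementations in existing force fields commonly use bicubic interpolation. The grid layout (rows indexed by ϕ, columns by ψ, both starting at -180°) and the function name are assumptions, and the sketch does not address how such a term would be encoded in SMIRNOFF.

```python
import math
from typing import List


def cmap_energy(grid: List[List[float]], phi_deg: float, psi_deg: float) -> float:
    """Bilinear lookup on a periodic n x n grid covering [-180, 180) degrees."""
    n = len(grid)
    spacing = 360.0 / n
    # Convert angles to fractional grid coordinates in [0, n).
    u = ((phi_deg + 180.0) % 360.0) / spacing
    v = ((psi_deg + 180.0) % 360.0) / spacing
    fu, fv = u - math.floor(u), v - math.floor(v)
    # Neighbouring grid indices wrap around the periodic boundary.
    i0, j0 = int(math.floor(u)) % n, int(math.floor(v)) % n
    i1, j1 = (i0 + 1) % n, (j0 + 1) % n
    return (
        (1 - fu) * (1 - fv) * grid[i0][j0]
        + fu * (1 - fv) * grid[i1][j0]
        + (1 - fu) * fv * grid[i0][j1]
        + fu * fv * grid[i1][j1]
    )


# Hypothetical 4 x 4 grid (90 degree spacing) with made-up energies.
demo_grid = [[10.0 * i + j for j in range(4)] for i in range(4)]
value = cmap_energy(demo_grid, -135.0, -180.0)
```

The wrap-around indexing is what distinguishes this from ordinary 2D interpolation: a point at ϕ = 135° on the demo grid is interpolated between the last row and the first row.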
Protein-specific Torsions & Lennard-Jones (PSTLJ) Model
The Protein-specific Torsions & Lennard-Jones Model includes protein-specific terms for torsions (as in the PST Model) and protein-specific Lennard-Jones parameters. Valence parameters (bonds and angles) will be copied from the small molecule force field. Charges will be derived using one of the methods described in the Null Model. Protein-specific torsions will be derived in the same way as the PST Model or the CMAP model. Protein-specific Lennard-Jones parameters will be derived by fitting to experimental data for dipeptides or small molecule analogs. It is likely that some curation will be necessary to learn what data is available, but examples include:
Free energies of solvation in water
Enthalpies of mixing (see Binary Mixture Data Feasibility Study)
Dissociation constants for salt bridges (see https://pubs.acs.org/doi/abs/10.1021/jp500958r)
...
Amber ff99SB typed model
Backbone torsions
General backbone torsions for phi and psi
Residue-specific backbone torsions for Gly
Sidechain torsions
No protein-specific sidechain torsions
Beta-branched sidechains model
Backbone torsions
General backbone torsions for phi and psi
Residue-specific backbone torsions for Gly
Sidechain torsions
General sidechain torsions for chi1 and chi2
Residue-specific sidechain torsions for beta-branched sidechains (Ile, Thr, and Val)
Amber ff14SB typed model
Backbone torsions
General backbone torsions for phi and psi
Residue-specific backbone torsions for Gly
Sidechain torsions
General sidechain torsions for chi1 and chi2
Residue-specific sidechain torsions for 11 groups of sidechains from Amber ff14SB
Beta-branched backbone model
Backbone torsions
General backbone torsions for phi and psi
Residue-specific backbone torsions for Gly, Pro, and beta-branched sidechains (Ile, Thr, and Val)
Sidechain torsions
General sidechain torsions for chi1 and chi2
Residue-specific sidechain torsions for beta-branched sidechains (Ile, Thr, and Val)
Beta-branched backbones and Amber ff14SB typed sidechains model
Backbone torsions
General backbone torsions for phi and psi
Residue-specific backbone torsions for Gly, Pro, and beta-branched sidechains (Ile, Thr, and Val)
Sidechain torsions
General sidechain torsions for chi1 and chi2
Residue-specific sidechain torsions for 11 groups of sidechains from Amber ff14SB
Benchmarks
Experimental datasets are being curated to evaluate protein force fields. These datasets will be published as a LiveCoMS review, described here: /wiki/spaces/COMMS/pages/1927413777. It will be useful to identify a small number of key benchmarks that can interrogate distinct physical properties of proteins and that can be completed relatively quickly (~1 month). These key benchmarks will be used to evaluate force field models and make decisions about more complex models.
Protein force field benchmarks will target NMR observables. Three tiers of systems will be used to progressively assess force field models. Models that perform well in Tier 1 will progress to Tier 2, and only the release candidate will progress to Tier 3. Amber ff14SB will be benchmarked alongside OpenFF force fields for all tiers.
NMR observables
Tier 1
19 capped 1-mers (no Pro)
11 uncapped 3-mers (AAA, GGG, VVV, GAG, GEG, GFG, GKG, GLG, GMG, GSG, GVG)
1 uncapped 4-mer (AAAA)
1 uncapped 5-mer (AAAAA)
K19 peptide (alpha helix)
CLN025 peptide (beta hairpin)
Tier 2
4 folded proteins (Ubiquitin, Lysozyme, GB3, BPTI)
10 disordered proteins (a99SB-disp benchmark dataset)
Tier 3
Additional 40 folded proteins (Mao benchmark dataset)
Milestones and deadlines
Milestone | Owner | Deadline | Status | Notes | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Generate Null Model with Amber ELF10 library charges | 20212022-0706-0901 | Choose key benchmarks to quickly evaluate force field models
| |||||||||||||||||||
Generate Null Model with AM1-BCC library charges | 2021-08-01 |
| Waiting on infrastructure for getting polymer charges from fragments | ||||||||||||||||||
| |||||||||||||||||||||
Choose benchmark systems | Biopolymer FF group | 20212022-0804-01 |
| In parallel with LiveCoMS reviewRun key benchmarks for Null Models with library charges | |||||||||||||||||
Generate protein QC datasets | Maybe others | 2021-102022-04-01 | 2021-08
| Decide on QC data for PST Model | Biopolymer FF group |
| |||||||||||||||
Fit parameters for at least two models (null and Amber ff99SB typed) | 2022-10-01 |
| |||||||||||||||||||
Started by Dave Cerutti | Run QC calculations for PST ModelSoftware for NMR observable benchmarks | 20212022-10-01 |
| Started by David Dotson and Trevor Gokey | |||||||||||||||||
Fit PST Model with one general term for all sidechains | Tier 1 NMR benchmarks | 20212022-11-01 |
| Decide on sidechain-specific terms for PST Model | |||||||||||||||||
Biopolymer FF group | 2021-11-01 |
| Fit PST Model with sidechain-specific terms | Tier 2 NMR benchmarks | 20222023-01-01 |
| |||||||||||||||
Run key benchmarks for PST ModelsTier 3 NMR benchmarks | Maybe others | 2022-03-01 |
| ||||||||||||||||||
Generate charges using graph convolutional networks | 2022 |
| Need update on feasibility from Chodera group | ||||||||||||||||||
Fit CMAP model, if necessary | 2022 | 2023-07-01 |
| Waiting on CMAP infrastructure | Fit PSTLJ model, if necessary | 2022 |
| Manual or automated LJ typing |