Objective	Derive force field parameters for proteins consistent with the OpenFF small molecule force field.
Primary Driver	Chapin Cavender
Approvers	Michael Gilson, Michael Shirts
Supporting Drivers
Stakeholders (these people will be tagged in project update notifications on Slack)	Jeffrey Wagner, Pavan Behara, Iván Pulido, Lily Wang, Joshua Horton, David Mobley
Project Manager	Diego Nolasco
Page Owner (only this person can edit this page)	Chapin Cavender
Decision authority	Majority of Primary Driver and all Approvers, only in “Biopolymer FF call” meetings. Veto authority: Primary Driver, any Approver.
Discussion/notification venue	Fortnightly “Biopolymer FF call” meetings (decision forum), NOT “FF release call” meetings. The #ff-biopolymers channel on OpenFF Slack and the “FF release call” meetings are for notification and discussion only; no major decisions are allowed there. Meeting attendees are not assumed to have read the Slack discussions, so those discussions must be summarized during meetings to be considered in decisions.
Meeting notes	2022 Protein FF meeting notes
Due date	2022-01-01
Key outcomes	Extensible SMIRNOFF format for amino acid residues; training datasets for the 20 canonical amino acids; selection of FF model for proteins; one or more sets of OpenFF parameters for the 20 canonical amino acids; identification of key benchmark systems
Status	PLANNING PHASE

SMIRNOFF format

We need SMARTS strings that can specify protein-specific terms for general amino acids. To summarize the discussion here: https://openforcefield.atlassian.net/wiki/pages/createpage.action?spaceKey=MEET&title=2020-04-01%20AMBER%20FF%20porting%20meeting%20notes

Amber ff14SB was ported to SMIRNOFF format using SMARTS strings that each capture an entire amino acid, differentiating between main chain and terminal residues and between protonation/tautomeric states.
That approach is not extensible to modified or synthetic amino acids.
We need general SMARTS strings for backbone and sidechain torsions in polypeptide chains.
For example, [#6X3](=O)-[#7X3:1]-[#6X4:2]-[#6X3:3](=O)-[#7X3H1:4]-[#6X4] will tag ψ (N-Cα-C-N) for all residues except those followed by proline, because the H1 on atom 4 excludes the proline nitrogen of the next residue.
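As a rough illustration of how such a general pattern could be carried in a SMIRNOFF file, the sketch below uses Python's standard xml.etree.ElementTree to build a single Proper entry for the ψ pattern above. The parameter id and the k1, periodicity1, phase1, and idivf1 values are hypothetical placeholders, not fitted parameters, and a real force field file would also need section-level attributes that are omitted here.

```python
import xml.etree.ElementTree as ET

# General psi pattern from the discussion above (atoms 1-4 tag N-CA-C-N).
PSI_SMIRKS = "[#6X3](=O)-[#7X3:1]-[#6X4:2]-[#6X3:3](=O)-[#7X3H1:4]-[#6X4]"


def make_proper(smirks, param_id, k, periodicity, phase_degrees):
    """Build one single-term <Proper> element with unit-bearing attributes."""
    return ET.Element(
        "Proper",
        {
            "smirks": smirks,
            "id": param_id,
            "k1": f"{k} * kilocalories_per_mole",
            "periodicity1": str(periodicity),
            "phase1": f"{phase_degrees} * degree",
            "idivf1": "1.0",
        },
    )


# Hypothetical placeholder values; section-level attributes are omitted.
torsions = ET.Element("ProperTorsions")
torsions.append(make_proper(PSI_SMIRKS, "t-psi-general", 0.5, 2, 180.0))

xml_text = ET.tostring(torsions, encoding="unicode")
print(xml_text)
```

Sidechain-specific overrides would be further Proper entries with more specific patterns, placed after the general entry so that they take precedence.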

Training datasets

This is a list of datasets that could be used to train protein-specific force field parameters.

Relevant small molecule datasets from Parsley

Needs to be determined. In particular, did Parsley train on dipeptides or tripeptides for any of the 20 canonical amino acids?

Dataset name	Dataset type	QC method	Molecules	QCA submission
Dipeptides
Tripeptides

Protein-specific datasets

Cerutti tetrapeptides are a set of 185 tetrapeptides with sidechains X-Y-X, X in [Ala, Gly, Ser, Val], and Y in [Ala, Arg, Ash, Asn, Asp, Cys, Glh, Gln, Glu, Gly, Hid, Hie] excluding (X == Ser && Y == Glu). David Cerutti selected multiple conformers for each tetrapeptide.

Dataset name	Dataset type	QC method	Molecules	QCA submission	Status
OpenFF Protein Fragments Initial	Optimization	B3LYP-D3BJ/def2-TZVPP	16 tetrapeptides with sidechains X-Ala-X and X in [Ala, Gly, Ser, Val]	2020-07-06-OpenFF-Protein-Fragments-Initial	COMPLETE
~~OpenFF Protein Fragments version 2~~	Optimization	B3LYP-D3BJ/DZVP	Cerutti tetrapeptides with constraints to avoid hydrogen bonds	2020-08-12-OpenFF-Protein-Fragments-version2	SUPERSEDED
OpenFF Protein Peptide Fragments constrained v1.0	Optimization	B3LYP-D3BJ/DZVP	Cerutti tetrapeptides with constraints to avoid hydrogen bonds	2020-08-12-OpenFF-Protein-Fragments-version2	COMPLETE
OpenFF Protein Peptide Fragments unconstrained v1.0	Optimization	B3LYP-D3BJ/DZVP	Cerutti tetrapeptides with no constraints	2020-10-27-OpenFF-Protein-Fragments-Unconstrained	ERRORED
OpenFF Protein Fragments TorsionDrives v1.0	TorsionDrives on ϕ, ψ, ω, χ1, and χ2	B3LYP-D3BJ/DZVP	Cerutti tetrapeptides	2020-09-16-OpenFF-Protein-Fragments-TorsionDrives	ERRORED (22 / 845 errored)
OpenFF PEPCONF OptimizationDataset v1.0	Optimization	B3LYP-D3BJ/DZVP	PEPCONF dataset	2020-10-26-PEPCONF-Optimization	ERRORED (6000 / 7560 errored)
OpenFF Benchmark Ligands	TorsionDrive	B3LYP-D3BJ/DZVP	OpenFF FEP benchmark	2020-07-27-OpenFF-Benchmark-Ligands	COMPLETE

Model

We envision several tiers of models, presented below in order of increasing anticipated effort. We will generate and benchmark lower-effort models first and use the results of the benchmarks to inform decisions about higher-effort models.

Null Model

The null model is that the small molecule force field already describes proteins well and needs no protein-specific parameters. Valence parameters (bonds and angles), torsions, and Lennard-Jones parameters will be copied from the small molecule force field. Parsley was trained on compounds that resemble protein backbone and sidechain analogs, so these parameters are likely a good first pass at describing polypeptide chains. Partial charges could be derived in several ways:

Copy library charges from existing protein force field: Amber ff14SB (RESP, unchanged from Amber ff99) or Amber ff15ipq (IPolQ)
- Pro: easy to implement
- Pro: we know these work pretty well in the Amber context
- Con: we are no longer in the Amber context
Generate library charges for the 20 canonical amino acids (main chain and terminal) using AM1-BCC (RESP2)
- Pro: consistent with Parsley; for example, we want the parameters of a serine side chain to look a lot like those of ethanol, since we have reason to believe these parameters play well with the other parameters in the FF
- Pro: Lily Wang has evidence that AM1-BCC charges of fragments are similar (< 0.1 e) to the charges from a larger polymer (see https://zenodo.org/record/4977401#.YNuCk34pCpp )
- Con: more effort to generate
Generate charges on-the-fly using graph convolutional networks (see https://openforcefield.atlassian.net/wiki/pages/createpage.action?spaceKey=MEET&title=2020-04-01%20AMBER%20FF%20porting%20meeting%20notes)
- Con: this approach should probably wait unless/until it is also being used for the small molecules

Protein-specific Torsion (PST) Model

The Protein-specific Torsion Model includes protein-specific torsion terms, while other valence parameters (bonds and angles) and Lennard-Jones parameters will be copied from the small molecule force field. Charges will be derived using one of the methods described in the Null Model. Protein-specific terms for the backbone torsions (ϕ, ψ, and ω) and sidechain torsions (χ1, maybe χ2) will be derived by fitting to QC datasets of dipeptides, tripeptides, and tetrapeptides. The dihedral angles with protein-specific terms will be a subset of:

Flexible backbone dihedrals (ϕ and ψ)
Peptide bond dihedral (ω)
First sidechain dihedral (χ1)
Second sidechain dihedral (χ2)

A major decision for this model is which sidechains should have unique torsions that override the general peptide backbone torsion. We envision using the Protein Fragments Optimization and TorsionDrive datasets as the primary training data; other datasets such as PEPCONF can then serve as validation data to inform decisions about the model. Alternatively, automated chemical perception (see Chemical Perception) may be used to identify dihedrals that are not described well by the Parsley training set. The resulting model will likely be the candidate for the first protein force field release.
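For reference, each of the dihedrals listed above is defined by four consecutive atoms, and a torsion term adds a periodic energy in that angle. The sketch below computes a dihedral from Cartesian coordinates and evaluates one periodic term of the form k*(1 + cos(n*θ - phase)), the functional form used for SMIRNOFF proper torsions. The function names, coordinates, and parameter values are illustrative assumptions only.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _scale(a, s):
    return tuple(x * s for x in a)


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) for atoms p0-p1-p2-p3."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b1 = _scale(b1, 1.0 / math.sqrt(_dot(b1, b1)))  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, _scale(b1, _dot(b0, b1)))
    w = _sub(b2, _scale(b1, _dot(b2, b1)))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def periodic_torsion_energy(theta_degrees, k, periodicity, phase_degrees):
    """One periodic torsion term: k * (1 + cos(n * theta - phase))."""
    theta = math.radians(theta_degrees)
    phase = math.radians(phase_degrees)
    return k * (1.0 + math.cos(periodicity * theta - phase))


# Hypothetical coordinates (angstrom) for four atoms defining one dihedral.
atoms = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
angle = dihedral_degrees(*atoms)
# Hypothetical parameters: k in kcal/mol, twofold term with a 180 degree phase.
energy = periodic_torsion_energy(angle, k=1.0, periodicity=2, phase_degrees=180.0)
print(angle, energy)
```

A fitted torsion would typically sum several such terms with different periodicities for the same four atoms.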

CMAP Model

The CMAP Model includes a protein-specific correction map (CMAP) that fits a 2D potential in ϕ and ψ to QC datasets. This model uses the same valence parameters, Lennard-Jones parameters, and charges as the PST Model. The largest obstacle to generating this model is specifying the CMAP potential in the SMIRNOFF format.
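To make the idea of a 2D potential in ϕ and ψ concrete, the sketch below stores correction energies on a periodic grid and interpolates between grid points. It uses bilinear interpolation for brevity, which is a simplification; production CMAP implementations typically use a smoother (bicubic) scheme. The grid layout, function name, and energy values are all hypothetical.

```python
import math


def cmap_energy(grid, phi_degrees, psi_degrees):
    """Bilinear interpolation on a periodic n x n grid of correction energies.

    grid[i][j] holds the energy at phi = -180 + i * spacing and
    psi = -180 + j * spacing, with spacing = 360 / n degrees.
    """
    n = len(grid)
    spacing = 360.0 / n
    # Map angles onto fractional grid coordinates in [0, n).
    x = ((phi_degrees + 180.0) % 360.0) / spacing
    y = ((psi_degrees + 180.0) % 360.0) / spacing
    i0 = int(math.floor(x)) % n
    j0 = int(math.floor(y)) % n
    i1 = (i0 + 1) % n  # wrap around the periodic boundary
    j1 = (j0 + 1) % n
    tx = x - math.floor(x)
    ty = y - math.floor(y)
    return (
        (1.0 - tx) * (1.0 - ty) * grid[i0][j0]
        + tx * (1.0 - ty) * grid[i1][j0]
        + (1.0 - tx) * ty * grid[i0][j1]
        + tx * ty * grid[i1][j1]
    )


# Hypothetical 4 x 4 grid (90 degree spacing) with made-up energies.
grid = [[float(10 * i + j) for j in range(4)] for i in range(4)]
print(cmap_energy(grid, -135.0, -180.0))
```

Because the indices wrap around, ϕ = -180 and ϕ = 180 return the same energy, as required for a periodic angle.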

Protein-specific Torsions & Lennard-Jones (PSTLJ) Model

The Protein-specific Torsions & Lennard-Jones Model includes protein-specific terms for torsions (as in the PST Model) and protein-specific Lennard-Jones parameters. Valence parameters (bonds and angles) will be copied from the small molecule force field. Charges will be derived using one of the methods described in the Null Model. Protein-specific torsions will be derived in the same way as in the PST Model or the CMAP Model. Protein-specific Lennard-Jones parameters will be derived by fitting to experimental data for dipeptides or small molecule analogs. Some curation will likely be necessary to learn what data are available, but candidate data types include:

Free energies of solvation in water
Enthalpies of mixing (see Binary Mixture Data Feasibility Study)
Dissociation constants for salt bridges (see https://pubs.acs.org/doi/abs/10.1021/jp500958r)

A major decision for this model is which set of Lennard-Jones types should be fit. This choice will be helped immensely by automated Lennard-Jones typing using Monte Carlo sampling, which is still under development.
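As a reminder of what each Lennard-Jones type contributes, the sketch below evaluates the 12-6 Lennard-Jones energy for a pair of atoms after combining per-type parameters with Lorentz-Berthelot mixing rules. The function names and the sigma and epsilon values are hypothetical, and the mixing rule is stated here as an assumption about the target functional form.

```python
import math


def lorentz_berthelot(sigma_i, epsilon_i, sigma_j, epsilon_j):
    """Arithmetic mean for sigma, geometric mean for epsilon."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(epsilon_i * epsilon_j)


def lj_energy(r, sigma, epsilon):
    """12-6 Lennard-Jones energy: 4 * epsilon * ((sigma/r)^12 - (sigma/r)^6)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


# Hypothetical per-type parameters (sigma in angstrom, epsilon in kcal/mol).
sigma_ij, epsilon_ij = lorentz_berthelot(3.0, 0.1, 4.0, 0.4)
r_min = 2.0 ** (1.0 / 6.0) * sigma_ij  # separation at the energy minimum
print(sigma_ij, epsilon_ij, lj_energy(r_min, sigma_ij, epsilon_ij))
```

Each additional Lennard-Jones type adds one sigma and one epsilon to the fit, which is why the choice of types matters for how much experimental data is needed.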

Benchmarks

Experimental datasets are being curated to evaluate protein force fields. These datasets will be published as a LiveCoMS review, described here: /wiki/spaces/COMMS/pages/1927413777. It will be useful to identify a small number of key benchmarks that can interrogate distinct physical properties of proteins and that can be completed relatively quickly (~1 month). These key benchmarks will be used to evaluate force field models and make decisions about more complex models.

Milestones and deadlines

Milestone	Owner	Deadline	Status	Notes
Generate Null Model with Amber library charges	Chapin Cavender	2021-07-09	NOT STARTED
Generate Null Model with AM1-BCC library charges	Chapin Cavender	2021-08-01	NOT STARTED	Waiting on infrastructure for getting polymer charges from fragments
Choose key benchmarks to quickly evaluate force field models	Biopolymer FF group	2021-08-01	IN PROGRESS	In parallel with LiveCoMS review
Run key benchmarks for Null Models with library charges	Chapin Cavender (maybe others)	2021-10-01	NOT STARTED
Decide on QC data for PST Model	Biopolymer FF group	2021-08-01	IN PROGRESS	Started by David Cerutti
Run QC calculations for PST Model	Chapin Cavender	2021-10-01	IN PROGRESS	Started by David Dotson and Trevor Gokey
Fit PST Model with one general term for all sidechains	Chapin Cavender	2021-11-01	NOT STARTED
Decide on sidechain-specific terms for PST Model	Biopolymer FF group	2021-11-01	NOT STARTED
Fit PST Model with sidechain-specific terms	Chapin Cavender	2022-01-01	NOT STARTED
Run key benchmarks for PST Models	Chapin Cavender (maybe others)	2022-03-01	NOT STARTED
Generate charges using graph convolutional networks	Chapin Cavender	2022	NOT STARTED	Need update on feasibility from Chodera group
Fit CMAP Model, if necessary	Chapin Cavender	2022	NOT STARTED	Waiting on CMAP infrastructure
Fit PSTLJ Model, if necessary	Chapin Cavender	2022	NOT STARTED	Manual or automated LJ typing

Protein Force Field Project Plan