Participants

Goals

SMIRKS for protein-specific torsions

...

Item

Presenter

Notes

Strategy for protein FF torsion fits

Chapin Cavender

ELF10 library charges and Sage LJ parameters
QC data
- Sage optimization datasets and TorsionDrives
- Optimization dataset for capped 1-mers (Ace-X-Nme)
- 2-D TorsionDrives on (phi, psi) for capped 1-mers
- TorsionDrives on (chi1) or (chi1, chi2) for capped 1-mers
Outstanding questions
- Fit torsion and valence parameters sequentially or simultaneously?
  - CC – I think our conclusion was that we’d do both and see which did better in validation sets
- Fit torsion parameters to TorsionDrives only or TorsionDrives and optimizations?
  - CC – PB had been looking into this; it seems like things get better when torsion parameters are informed by optimization datasets.
  - PB – I think we should use the optimization datasets.
  - CC – We’ll fit torsions to the Sage training set, plus protein-specific TorsionDrives and optimizations.
  - CC – I wasn’t planning on using the enumerated Sage protomers set that Pavan had made.
    - PB – I’d recommend that you do use it, it’s a pretty small set
    - CC – Ok, I’ll include it.
      - GitHub link: https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2021-12-21-OpenFF-Gen2-Optimization-Set-Protomers
- Weight all sidechains equally (total weight 20) or all protomers equally (total weight 26)?
  - Alanine has one protomer. Its weight is 1.
  - Glutamate has two protomers. Should each protomer be weighted 1/2 or 1?
  - Histidine has three protomers. Should each protomer be weighted 1/3 or 1?
  - PB – When we fit optimized conformers, we just use 10 conformers each, and we weight all conformers equally.
  - CC – What about protomers?
  - PB – For torsion scans, we usually have distinct molecules, so we don’t have 5 torsions of the same molecule. So we don’t have precedent here.
  - CC – Do we have training mols where we’re double-counting them when there’s an acid group that can be protonated?
  - JW – I think the target weights from each AA should add up to 1, and the total weight should be 20. So molecules with multiple protonation states should have to share their
  - CC – Kinda agree, but will this come into conflict with small molecule targets?
  - JW – Will there be cases where a single molecule gets both protein params and small molecule params, so that the optimizer needs to “overfit” one to improve the objective and effectively “pick a loser” between the small molecule and protein params? If that never happens, the relative weights won’t come into conflict.
  - CC –
  - DM – In terms of keeping these datasets from “swamping” each other, we should try to carefully design the protein-specific torsion SMIRKS so they don’t overlap with small molecules. Because if we don’t do this, then subsequent dataset expansions/SMIRKS modification will need to be micromanaged to keep from throwing off the balance.
  - CC – Consensus seems to be that we will weight each of the 20 canonical AAs equally, no matter how many protomers they have. We want to make our protein SMIRKS specific enough that they won’t apply to any small molecules.
  - DM (chat) – Or at least protein backbone SMIRKS. Sidechains may be general enough… (?)
  - (General) – It would probably be good to make sure the total weights of the small molecule datasets and protein/biopolymer datasets are equal.
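
A minimal sketch of the weighting consensus above, assuming weights are simple per-target multipliers: each amino acid gets a total weight of 1, split evenly across its protomers. The protomer counts for alanine, glutamate, and histidine come from the notes; the function names and the `rescale_to_total` helper (for matching the total weight of another dataset) are hypothetical and not part of any fitting tool.

```python
from fractions import Fraction


def protomer_weights(protomer_counts, per_residue_weight=1):
    """Split each residue's weight evenly across its protomers.

    protomer_counts maps a residue name to its number of protomers.
    Returns a dict keyed by (residue, protomer_index).
    """
    weights = {}
    for residue, n_protomers in protomer_counts.items():
        if n_protomers < 1:
            raise ValueError(f"{residue} needs at least one protomer")
        share = Fraction(per_residue_weight, n_protomers)
        for index in range(n_protomers):
            weights[(residue, index)] = share
    return weights


def rescale_to_total(weights, target_total):
    """Scale all weights so that they sum to target_total.

    Hypothetical helper for balancing the protein targets against the
    total weight of the small molecule targets.
    """
    current_total = sum(weights.values())
    if current_total == 0:
        raise ValueError("cannot rescale an empty or zero-weight set")
    factor = Fraction(target_total) / current_total
    return {key: value * factor for key, value in weights.items()}


# Only the three residues whose protomer counts are stated in the notes.
example_counts = {"ALA": 1, "GLU": 2, "HIS": 3}
example_weights = protomer_weights(example_counts)
```

With all 20 canonical amino acids in `protomer_counts`, the weights sum to 20 regardless of how many protomers each residue has, which is the "weight all sidechains equally" option.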

SMIRKS for protein-specific torsions

Chapin Cavender

Backbone torsions (written assuming that lower SMIRKS overwrite upper SMIRKS)
- General protein backbone: [#6X4]-[#6X3](=O)-[#7X3]-[#6X4]-[#6X3](=O)-[#7X3]-[#6X4]
- Glycine: [#6X4]-[#6X3](=O)-[#7X3]-[#6X4H2]-[#6X3](=O)-[#7X3]-[#6X4]
- Proline: [#6X4]-[#6X3](=O)-[#7R1X3]-[#6X4R1]-[#6X3](=O)-[#7X3]-[#6X4]
  - JW – Since there’s a chance that the protein-specific torsions could end up being really high-magnitude in order to correct for issues with the small molecule FF, I think that this should be strictly proline-specific.
- Beta-branched (Ile, Thr, Val): [#6X4]-[#6X3](=O)-[#7X3]-[#6X4H1](-[#6X4H1])-[#6X3](=O)-[#7X3]-[#6X4]
- CC – Right now I use the X decorator instead of counting Hs in a lot of cases. Do we think this is safe? This would mean that a chlorinated amino acid would get these parameters applied instead of getting general small molecule parameters.
  - PB – I think it’d be fine to use the same parameters for chlorinated AAs
  - JW – It will depend on how generalizable the parameters would be. If the protein-specific torsion ks end up similar in magnitude to the generic torsions, we'll have some confidence that they'll generalize.
  - DM – Agree, let’s fit using the current SMIRKS and then see how much they deviate from the “General” parameter to determine how safe it would be to allow their more general use.
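
The ordering assumption in the backbone list ("lower SMIRKS overwrite upper SMIRKS") can be sketched as a last-match-wins lookup. This is an illustration only: the SMIRKS strings are copied from the list above, but the match table is hand-written for four capped residues and stands in for a real substructure matcher, and all names are hypothetical.

```python
# Backbone SMIRKS in the order given in the notes: general first,
# more specific patterns later.
BACKBONE_SMIRKS = [
    ("general", "[#6X4]-[#6X3](=O)-[#7X3]-[#6X4]-[#6X3](=O)-[#7X3]-[#6X4]"),
    ("glycine", "[#6X4]-[#6X3](=O)-[#7X3]-[#6X4H2]-[#6X3](=O)-[#7X3]-[#6X4]"),
    ("proline", "[#6X4]-[#6X3](=O)-[#7R1X3]-[#6X4R1]-[#6X3](=O)-[#7X3]-[#6X4]"),
    (
        "beta_branched",
        "[#6X4]-[#6X3](=O)-[#7X3]-[#6X4H1](-[#6X4H1])-[#6X3](=O)-[#7X3]-[#6X4]",
    ),
]

# ASSUMPTION: hand-written table of which patterns match each residue.
# A real workflow would compute this with a cheminformatics toolkit.
ASSUMED_MATCHES = {
    "ALA": {"general"},
    "GLY": {"general", "glycine"},
    "PRO": {"general", "proline"},
    "VAL": {"general", "beta_branched"},
}


def assign_backbone_label(matched_labels, ordered_smirks=BACKBONE_SMIRKS):
    """Return the label of the last pattern in the list that matches.

    Walking the list top to bottom and overwriting on every match means
    that lower (more specific) entries take precedence over upper ones.
    """
    assigned = None
    for label, _smirks in ordered_smirks:
        if label in matched_labels:
            assigned = label
    return assigned


assignments = {
    residue: assign_backbone_label(labels)
    for residue, labels in ASSUMED_MATCHES.items()
}
```

Under this scheme, a molecule that matches none of the protein-specific patterns receives no protein label at all, which is the behavior the group wants for small molecules.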
Sidechain torsions
- General protein chi1: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4]-[!#1X4]
- General protein chi2: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4]-[#6]~[!#1X4]
- Beta-branched chi1: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4H1]-[!#1X4]
- Beta-branched chi2: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4H1]-[#6]~[!#1X4]
- CC – For aromatic side chains, should we refit the CB - CG torsion? Or will it be sufficient to use the general CX4-CX3 torsion for that?
  - JW – That’s a really hard question. The aromatic sidechain rotation landscape is probably dominated by sterics anyway
  - MT – Probably best to keep it simple at this stage, since that would be a really hard question to investigate. So I’d be in favor of not trying to add new torsion parameters for that.
  - CC – Agree. Some modern FFs are looking at fitting unique terms for each sidechain, but older FFs have sidechains sharing a lot of torsions. So right now we’ll try to use a few strategies for fitting/including sidechains, including
    - The ILDN approach (4 unique sidechain treatments)
    - The 14sb approach (12 unique sidechain treatments)
    - Treating everything generically as discussed above (2 unique sidechain treatments, discussed in this meeting)
- CC – Are we OK with not extending the sidechain SMIRKS over the adjacent peptide bond? This means that we’ll treat terminal residue side chains the same as main chain residue side chains.
  - PB – Are there smaller changes that could be made to ensure that these sidechains only apply to proteins?
  - CC – This is the most specific that we can get without extending into capping atoms.
JW – How are the benchmark datasets looking?
- CC – We’ve pretty much got those settled, so now it’s a matter of software.
JW – Also, FYI, we’re deciding whether to have the F@H interface replicate DHahn’s work precisely (like, using the exact same input structures and edges, which will be hard), or whether to just rerun the previously benchmarked FFs using the new infrastructure. The new numbers may be systematically worse, since the earlier sims had a lot of expert intervention.
