2022-06-30 Protein FF meeting note

Participants

  • @Chapin Cavender

  • @Diego Nolasco (Deactivated)

  • @Pavan Behara

  • @Jeffrey Wagner

  • @Matt Thompson

  • @David Mobley

Goals

  • SMIRKS for protein-specific torsions

Discussion topics

Item

Presenter

Notes

Strategy for protein FF torsion fits

@Chapin Cavender

  • ELF10 library charges and Sage LJ parameters

  • QC data

    • Sage optimization datasets and TorsionDrives

    • Optimization dataset for capped 1-mers (Ace-X-Nme)

    • 2-D TorsionDrives on (phi, psi) for capped 1-mers

    • TorsionDrives on (chi1) or (chi1, chi2) for capped 1-mers

  • Outstanding questions

    • Fit torsion and valence parameters sequentially or simultaneously?

      • CC – I think our conclusion was that we’d do both and see which did better in validation sets

    • Fit torsion parameters to TorsionDrives only or TorsionDrives and optimizations?

      • CC – PB had been looking into this; it seems like things get better when torsion parameters are informed by optimization datasets.

      • PB – I think we should use the optimization datasets.

      • CC – We’ll fit torsions to the Sage training set plus the protein-specific TorsionDrives and optimizations.

      • CC – I wasn’t planning on using the enumerated Sage protomers set that Pavan had made.

        • PB – I’d recommend that you do use it; it’s a pretty small set.

        • CC – Ok, I’ll include it.

    • Weight all sidechains equally (total weight 20) or all protomers equally (total weight 26)?

      • Alanine has one protomer. Its weight is 1.

      • Glutamate has two protomers. Should each protomer be weighted 1/2 or 1?

      • Histidine has three protomers. Should each protomer be weighted 1/3 or 1?

      • PB – When we fit optimized conformers, we just use 10 conformers each, and we weight all conformers equally.

      • CC – What about protomers?

      • PB – For torsion scans, we usually have distinct molecules, so we don’t have 5 torsions of the same molecule. So we don’t have precedent here.

      • CC – Do we have training molecules that we’re double-counting because there’s an acid group that can be protonated?

      • JW – I think the weights of the targets from each AA should add up to 1, and the total weight should be 20. So molecules with multiple protonation states should have to share their

      • CC – Kinda agree, but will this come into conflict with small molecule targets?

      • JW – Will there be cases where a single molecule gets both protein params and small molecule params, so that the optimizer needs to “overfit” one to improve the objective and effectively “pick a loser” between the small molecule and protein params? If that never happens, then the relative weights won’t come into conflict.

      • CC –

      • DM – In terms of keeping these datasets from “swamping” each other, we should try to carefully design the protein-specific torsion SMIRKS so they don’t overlap with small molecules. Because if we don’t do this, then subsequent dataset expansions/SMIRKS modification will need to be micromanaged to keep from throwing off the balance.

      • CC – Consensus seems to be that we will weight each of the 20 canonical AAs equally, no matter how many protomers they have. We want to make our protein SMIRKS specific enough that they won’t apply to any small molecules.

      • DM (chat) – Or at least protein backbone SMIRKS. Sidechains may be general enough… (?)

      • (General) – It would probably be good to make sure the total weights of the small molecule datasets and protein/biopolymer datasets are equal.
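The weighting consensus above can be illustrated with a minimal sketch. It assumes each amino acid's weight of 1 is split evenly across its protomers, and it only uses the three protomer counts quoted in the notes (Ala 1, Glu 2, His 3); the remaining residues are left out rather than guessed. The function names and the totals passed to `balance_scale` are hypothetical and are not taken from any fitting software.

```python
from fractions import Fraction

# Protomer counts quoted in the notes; other residues are omitted, not guessed.
PROTOMER_COUNTS = {"ALA": 1, "GLU": 2, "HIS": 3}


def per_protomer_weights(protomer_counts, weight_per_residue=Fraction(1)):
    """Split each residue's weight evenly across its protomers."""
    return {
        residue: weight_per_residue / n_protomers
        for residue, n_protomers in protomer_counts.items()
    }


def balance_scale(small_molecule_total, protein_total):
    """Factor to multiply protein weights by so both dataset totals match."""
    return Fraction(small_molecule_total) / Fraction(protein_total)


weights = per_protomer_weights(PROTOMER_COUNTS)

# Each residue's protomer weights should sum back to the per-residue weight.
residue_totals = {
    residue: weights[residue] * n_protomers
    for residue, n_protomers in PROTOMER_COUNTS.items()
}

print(weights)
print(sum(residue_totals.values()))

# Hypothetical totals: a small molecule set with total weight 100 against
# the 20 canonical amino acids at weight 1 each.
print(balance_scale(100, 20))
```

Under this scheme the total protein weight equals the number of amino acids regardless of how many protomers are enumerated, so adding a protomer later does not shift the balance against the small molecule targets.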

SMIRKS for protein-specific torsions

@Chapin Cavender

  • Backbone torsions (written assuming that lower SMIRKS overwrite upper SMIRKS)

    • General protein backbone: [#6X4]-[#6X3](=O)-[#7X3]-[#6X4]-[#6X3](=O)-[#7X3]-[#6X4]

    • Glycine: [#6X4]-[#6X3](=O)-[#7X3]-[#6X4H2]-[#6X3](=O)-[#7X3]-[#6X4]

    • Proline: [#6X4]-[#6X3](=O)-[#7R1X3]-[#6X4R1]-[#6X3](=O)-[#7X3]-[#6X4]

      • JW – Since there’s a chance that the protein-specific torsions could end up being really high-magnitude in order to correct for issues with the small molecule FF, I think that this should be strictly proline-specific.

    • Beta-branched (Ile, Thr, Val): [#6X4]-[#6X3](=O)-[#7X3]-[#6X4H1](-[#6X4H1])-[#6X3](=O)-[#7X3]-[#6X4]

    • CC – Right now I use the X decorator instead of counting Hs in a lot of cases. Do we think this is safe? This would mean that a chlorinated amino acid would get these parameters applied instead of getting general small molecule parameters.

      • PB – I think it’d be fine to use the same parameters for chlorinated AAs

      • JW – It will depend on how generalizable the parameters would be. If the protein-specific torsion ks end up similar in magnitude to the generic torsions, we'll have some confidence that they'll generalize.

      • DM – Agree, let’s fit using the current SMIRKS and then see how much they deviate from the “General” parameter to determine how safe it would be to allow their more general use.

  • Sidechain torsions

    • General protein chi1: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4]-[!#1X4]

    • General protein chi2: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4]-[#6]~[!#1X4]

    • Beta-branched chi1: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4H1]-[!#1X4]

    • Beta-branched chi2: [#7X3]-[#6X4](-[#6X3]=O)-[#6X4H1]-[#6]~[!#1X4]

    • CC – For aromatic side chains, should we refit the CB - CG torsion? Or will it be sufficient to use the general CX4-CX3 torsion for that?

      • JW – That’s a really hard question. The aromatic sidechain rotation landscape is probably dominated by sterics anyway

      • MT – Probably best to keep it simple at this stage, since that would be a really hard question to investigate. So I’d be in favor of not trying to add new torsion parameters for that.

      • CC – Agree. Some modern FFs are looking at fitting unique terms for each sidechain, but older FFs have sidechains sharing a lot of torsions. So right now we’ll try to use a few strategies for fitting/including sidechains, including

        • The ILDN approach (4 unique sidechain treatments)

        • The 14sb approach (12 unique sidechain treatments)

        • Treating everything generically as discussed above (2 unique sidechain treatments, discussed in this meeting)

    • CC – Are we OK with not extending the sidechain SMIRKS over the adjacent peptide bond? This means that we’ll treat terminal residue side chains the same as main chain residue side chains.

      • PB – Are there smaller changes that could be made to ensure that these sidechains only apply to proteins?

      • CC – This is the most specific that we can get without extending into capping atoms.

  • JW – How are the benchmark datasets looking?

    • CC – We’ve pretty much got those settled, so now it’s a matter of software.

  • JW – Also, FYI, we’re deciding whether to have the F@H interface replicate DHahn’s work precisely (using the exact same input structures and edges, which will be hard) or to just rerun the previously benchmarked FFs using the new infrastructure. If we use the new infrastructure, a likely outcome is that all results may be systematically worse, since the earlier sims had a lot of expert intervention and the F@H infrastructure will do everything in an automated way. So we’re looking at how hard it would be to force DHahn’s exact settings into the OpenFE infrastructure, but if it’s difficult I’m going to propose that we just rerun everything, with the understanding that an automated solution may return worse results but that at least we’ll be comparing apples to apples.
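The backbone SMIRKS earlier in these notes are written assuming that lower SMIRKS overwrite upper SMIRKS, i.e. the last matching pattern wins. The toy sketch below illustrates only that ordering rule. It does not parse SMIRKS: each pattern is replaced by a hand-written predicate on a hypothetical residue description (hydrogen counts on the alpha and beta carbons, ring membership of N and CA). A real assignment would use a cheminformatics toolkit for substructure matching.

```python
# Ordered from most general to most specific, mirroring the list in the notes.
# Each predicate is a hand-written stand-in for a real SMIRKS match.
BACKBONE_RULES = [
    ("general", lambda r: True),
    ("glycine", lambda r: r["ca_hydrogens"] == 2),
    ("proline", lambda r: r["n_in_ring"] and r["ca_in_ring"]),
    (
        "beta_branched",
        lambda r: r["ca_hydrogens"] == 1 and r["cb_hydrogens"] == 1,
    ),
]


def assign_backbone_type(residue):
    """Return the name of the last rule that matches (later rules overwrite)."""
    assigned = None
    for name, matches in BACKBONE_RULES:
        if matches(residue):
            assigned = name
    return assigned


# Hypothetical residue descriptions; cb_hydrogens is None when there is no CB.
RESIDUES = {
    "ALA": {"ca_hydrogens": 1, "cb_hydrogens": 3, "n_in_ring": False, "ca_in_ring": False},
    "GLY": {"ca_hydrogens": 2, "cb_hydrogens": None, "n_in_ring": False, "ca_in_ring": False},
    "PRO": {"ca_hydrogens": 1, "cb_hydrogens": 2, "n_in_ring": True, "ca_in_ring": True},
    "VAL": {"ca_hydrogens": 1, "cb_hydrogens": 1, "n_in_ring": False, "ca_in_ring": False},
}

assignments = {name: assign_backbone_type(res) for name, res in RESIDUES.items()}
print(assignments)
```

Every residue matches the general rule first, so the order of the list is what lets the glycine, proline, and beta-branched rules take over where they apply.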

Action items

Decisions