2021-10-21 Meeting notes

Date

Oct 21, 2021

Participants

  • @Chapin Cavender

  • @Lily Wang

  • @Pavan Behara

  • @Simon Boothroyd

  • @David Mobley

Goals

  • Update on LiveCoMS review article

  • Discussion of OpenFF validation datasets for proteins

Discussion topics

Item

Presenter

Notes

LiveCoMS review

@Chapin Cavender

External contributors are drafting text for sections on observables and on specific datasets

  • Room-temperature (RT) crystal sections were submitted last Friday (2021-10-15)

  • Solution NMR sections are expected this coming Friday (2021-10-22)

  • OpenFF PIs: please block off some time in the next week or two to provide comments/edits on this draft (Google Drive link)

TODOs

  • The introduction will be drafted by @Chapin Cavender

  • The section on non-Bragg diffraction is incomplete and will be drafted by Michael Wall’s postdoc (and former Mobley lab member) David Wych

  • Plan for presenting experimental datasets in review article

    • For crystal datasets, is a PDB ID sufficient?

      • DM – Some data are re-refined, so some entries may need to be curated

      • CC – Clone the data in the PDB to keep the version consistent

    • For NMR datasets, what metadata is desired?

      • CC – This will be trickier; these data are less standardized. One question is “what metadata do we need for NMR datasets?” We definitely need observables + molecules. But how do we handle cases where chemical shifts are only available for a subset of residues?

      • CC – I’d mentioned this at the last meeting with the NMR subgroup. The conclusion was that different datasets are standardized to different extents.

      • JW – We did this before with NIST data; it would be a good reference

      • SB – This model would be good, but I wouldn’t copy it too closely (it was a bit rushed). What I’ve found is that, instead of just scripts, it’s better to write a Python library that is used by scripts to do the filtering. This can be a really high-value thing to get right the first time. Let’s gather JW + SB + CC to discuss in detail. (A sketch of this pattern follows this list.)

      • SB – It’s easy to lose track of rationale, so important to document clearly.
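
A minimal sketch of the library-plus-scripts pattern SB describes above. Everything here (the curation module, the NMRDataset record, the filter functions) is a hypothetical placeholder rather than existing OpenFF code; the point is that each filter lives in a reusable Python library with its rationale documented once, and thin scripts compose the filters.

    # curation.py: hypothetical sketch, not existing OpenFF code
    from dataclasses import dataclass

    @dataclass
    class NMRDataset:
        """Minimal stand-in for one curated NMR dataset record."""
        name: str
        observables: dict          # e.g. {"chemical_shift": {...}}
        assigned_fraction: float   # fraction of residues with assignments

    def filter_has_observable(datasets, observable):
        """Keep datasets reporting a given observable type.
        Rationale: we can only validate against observables we have a
        forward model for."""
        return [d for d in datasets if observable in d.observables]

    def filter_min_coverage(datasets, min_fraction=0.5):
        """Keep datasets with assignments for at least this fraction of
        residues (the partial chemical-shift case discussed above)."""
        return [d for d in datasets if d.assigned_fraction >= min_fraction]

A script would then just compose the filters, e.g. filter_min_coverage(filter_has_observable(raw_datasets, "chemical_shift")), so the selection rationale stays in one documented place.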

OpenFF protein validation datasets

@Chapin Cavender

Terminology

  • Validation datasets are used to choose between models and should be quick to compute

  • Test datasets are used to evaluate model performance and can be expensive to compute

Protein validation datasets

  • QC datasets

    • Could use subset of TorsionDrives for molecules not used to train parameters

    • E.g. train on dipeptide TorsionDrives for all 26 sidechains, then use Ace-Val-X-Val-Nme tetrapeptides for validation

    • No additional infrastructure is needed (see the QCArchive sketch after this list)

  • Experimental datasets

    • Crystal simulations are likely too expensive for validation, so focus on NMR observables for small peptides

      • 1H, 13C, and 15N chemical shifts

      • 3J scalar couplings

      • Residual dipolar couplings

      • NOESY intensities

    • Specific datasets are expected from the LiveCoMS review

    • Infrastructure needs: implement a forward model in Evaluator, or prepare input for external software?

      • JW – Alternatives here?

      • CC – There are existing tools of varying quality for predicting e.g. chemical shifts

      • SB – The first step to implementing this in Evaluator is figuring out the workflow: what exactly is the input, and what steps are needed to predict the observable? Once we know this, we’ll have a better idea of what the implementation will look like. After that we can wrap arbitrary software to accomplish the task. But it’s not unprecedented; a good example would be the host-guest targets in Evaluator. Whether this happens inside or outside Evaluator isn’t that important; it can be really flexible with plugins.

      • DM – Totally agree that wrapping existing stuff is the way to go. To begin with, we don’t want to be inventing things, just doing whatever is normal/standard/best practice (or “good practice”). Later we can push the envelope.

      • DM – And totally agree that crystal simulations are super expensive. These are things where one might simulate hundreds of thousands to millions of atoms for hundreds of nanoseconds to microseconds. Definitely not validation.

      • CC – Two things:

        • For the validation sets I’m thinking of, the systems will be solvated 2-4 residue peptides. So the workflow will be “take an ensemble of PDB structures and get a trajectory, then predict the chemical shift”. We won’t need to do docking or anything like that, and the simulations should be relatively short (we just need to sample the conformational landscape of these small peptides). A minimal forward-model sketch appears after this list.

          • SB – Yeah, it’d be good to have a sort of sketch of exactly which tools we’d use and how the data flow would look.

        • We’ll need to do similar types of things for the more expensive test datasets. Those would be bigger proteins + ligands. So I wonder if there’s some way we can be forward-thinking about the design so that we could reuse components

          • JW – I think there are enough unknowns that we shouldn’t aim to reuse designs/components off the bat. We’ll be hunting down edge cases for months. Also, validation studies will likely be needed

          • SB – Agree

          • SB – Also keep in mind that, if we do try a bunch of different analysis methods, we may want to design a way to reuse the trajectory info so we don’t have to rerun expensive sims.

      • CC – Andrew White published an NN method for determining chemical shifts. I don’t know how this will stack up against other options like ShiftX.

        • JW – I’d be cautious about the precision/accuracy of NMR validation – It’s possible that none of the tools are accurate enough to improve the FF in the regime we care about.

        • CC – It’s not clear that these are designed for trajectory analysis.

        • SB – What NMR observables/analyses do the AMBER folks use? We should at least match that standard.

        • CC – Agree. There are parameter sets for the NMR models that other folks have used, so we should use those as well to ensure our work is comparable.

  • Kirkwood-Buff integrals

    • Datasets from Table 3 of https://pubs.acs.org/doi/10.1021/acs.jctc.1c00075

    • Infrastructure needs: calculation of pairwise radial distribution functions

    • CC – In theory you need a grand canonical ensemble. In practice you can run a big simulation box and analyze only a subset of it. Then you can just compute an RDF that gets passed on to subsequent analysis steps.

    • JW – This should be doable.

    • CC – There may be implementations already available in MDAnalysis/MDTraj, so this shouldn’t be a huge lift.

    • LW – I can confirm that RDFs are available in MDA (see the RDF sketch below).
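
A minimal sketch of checking a TorsionDrive validation set on QCArchive with the 2021-era qcportal collections API, in support of the “no additional infrastructure” point above. The dataset name is hypothetical; substitute whichever tetrapeptide set is eventually submitted.

    # Hypothetical sketch using the qcportal collections API (circa 2021)
    import qcportal as ptl

    client = ptl.FractalClient()  # public MolSSI QCArchive server
    ds = client.get_collection(
        "TorsionDriveDataset",
        "OpenFF Protein Tetrapeptide TorsionDrives",  # hypothetical name
    )
    print(ds.status(["default"]))  # completion status per compute spec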
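
The forward-model sketch referenced above: trajectory in, NMR observable out, here using MDTraj’s Karplus-equation calculator for 3J(HN,HA) couplings. File names are hypothetical, and this is one possible data flow of the kind SB asked for, not a chosen implementation; the model argument selects a published Karplus parameter set, which connects to CC’s point about matching the parameter sets other groups use.

    # Forward-model sketch: ensemble-averaged 3J(HN,HA) from a trajectory
    import mdtraj as md

    # Hypothetical input files for a solvated tetrapeptide simulation
    traj = md.load("tetrapeptide.dcd", top="tetrapeptide.pdb")

    # indices: the four atoms defining each backbone phi angle;
    # J: couplings in Hz with shape (n_frames, n_phi)
    indices, J = md.compute_J3_HN_HA(traj, model="Bax2007")

    # Average over frames to get the ensemble value compared to experiment
    print(J.mean(axis=0))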
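
The RDF sketch referenced above: computing a pairwise RDF with MDAnalysis (which LW confirmed is available) and integrating it to a Kirkwood-Buff integral, G_ij = 4π ∫ (g_ij(r) − 1) r² dr, truncated at a cutoff the box supports, per the big-box approach CC describes. File names and atom selections are hypothetical.

    # Sketch: RDF via MDAnalysis, then a truncated Kirkwood-Buff integral
    import numpy as np
    import MDAnalysis as mda
    from MDAnalysis.analysis.rdf import InterRDF

    u = mda.Universe("mixture.pdb", "mixture.dcd")      # hypothetical files
    solute = u.select_atoms("resname GLY and name CA")  # hypothetical selection
    water_O = u.select_atoms("name OW")

    rdf = InterRDF(solute, water_O, nbins=150, range=(0.0, 15.0)).run()
    r, g = rdf.results.bins, rdf.results.rdf  # r in Angstrom

    # Kirkwood-Buff integral (Angstrom^3), truncated at the RDF cutoff
    G = 4.0 * np.pi * np.trapz((g - 1.0) * r**2, r)
    print(G)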

Action items

Decisions