Participants

Goals

  • Protocol for benchmark simulations of proteins in dilute solution

    • System setup

    • Dynamics

Discussion topics

Chapin Cavender

Item

Presenter

Notes

Protocol for protein benchmark simulations

Admin stuff

  • Change to “majority” instead of “unanimity” for decisions

    • MG + CC - Approve

  • MS is added as an approver to the project.

    • MG + CC - Approve

System setup

Chapin Cavender

  • Protein starting coordinates

    • Folded proteins

      • 1) PDB structure

      • 2) Randomly sample from published MD trajectory

      • JW – Concerned that sampling would be biased by the original simulation’s force field. A PDB structure would be better.

      • MS – I like the idea of starting with existing MD trajectories, since we’ll really need the diversity of starting points.

      • CC – That would work, then in cases where trajectories aren’t available we could still start from a PDB structure

      • MG - Is this for folded proteins?

      • CC – So, for things like lysozyme that have frequently been simulated, we could randomly sample from those sims, but for other things we’d start from the PDB

      • MS – We probably wouldn’t escape the initial well in that case

      • MG – At an extreme, what if another trajectory visited an “unfolded” state? That would be another sort of “irrecoverable” situation

      • MS - Why not an ensemble of conformations with one of them being the PDB?

      • MS – It seems unlikely that other people would have published unfolding sims. Especially well-validated datasets that reproduce experiments.

      • MG – I’m warmer to the idea of trajectories that reproduce experimental (like NMR) data.

      • MS – Some questions about access/openness of DE Shaw datasets. We’d need to ask them for permission to publish ~50 frames or something.

      • MG – Could publish some derivative set of snapshots from this traj (like, after a few steps of our own MD).

      • CC – I think it’ll be important to make the starting structures available, however we select them.

      • JW – Concerned that we might wind up in a situation where the sims don’t overlap. Would the data be usable?

      • MS – That would be a significant discovery in itself

      • CC – For reference, I’m looking at running more like 10 sims rather than 50

      • DM - If it’s hard to find data sets like this, I think it’s fine using PDB structures.

      • MG – That IS a typical usage scenario, whether or not it’s actually correct. At least it’s what everyone does.

      • CC – I’m in favor of using PDB structures.

      • MS – I’m not gonna hold you up on this - since it’s 2 to 1 let’s do the PDB starting structure.

      • Decision – We will start from PDB structure for folded protein, extended structures for disordered proteins.

    • Disordered proteins

      • 1) Extended structure (phi, psi = 180 deg, 180 deg)

      • 2) Randomly sample from PED ensemble

      • MG – How long are these peptides?

        • CC – Longest is 150 residues. Smaller ones are on the order of 50ish residues.

        • MS + MG – Both 50 and 150 are pretty big, will be hard to converge.

        • CC – DE Shaw folks run these for 30 µs and say that’s sufficient.

        • MG – Could we run an early experiment on using both of these as a starting point?

        • CC – It would burn a good amount of our compute time to do this.

        • MS – Not sure about this. It’s a tradeoff between compute time and wall time. Important to remember that for a first draft, we only need to do what other people have done.

        • MG – What have other people in the literature done?

        • CC – There’s not a ton of “best practices” here. Robustelli did PDB starting points, Shaw and Simmerling use extended confs as starting points. Some other folks use a PDB structure of a partially-resolved protein, and use modeling software for the rest.

        • CC – Some folks use NMR/SAXS observables to set constraints, others basically start by modeling things as random coils.

        • MG – Advantage of starting from a fully extended structure is that it’s reproducible.

        • CC – Largely agree. I think the extended structure is probably better.

        • DM – Only other thing to watch out for might be a project checkpoint where “these sims aren’t converging” - So basically we formalize a plan to bail out if these aren’t converging.

        • MG – How will we know that these aren’t converging?

        • DM – I don’t know how we’d define this, but I think it will be clear when we see it. I just want to avoid being in a situation where we don’t have a plan so we don’t WANT to look for possible problems.

        • CC – So, in the case where things are clearly not converging, we mention that loudly in the paper, but still publish the figures.

  • Thermodynamic state

    • Pressure, temperature, pH, ionic strength

    • 1) Same for all systems: 1 atm, 298 K, pH 7, 150 mM NaCl

    • 2) Match conditions of NMR experiment

      • Provenance of observables difficult to track

      • Observables for the same protein measured in experiments with different conditions

      • Some NMR experimental conditions difficult to model, e.g. D2O

      • CC – I’m in favor of option 1, otherwise we have to deal with a lot of complexity.

      • MG – Example of “provenance is difficult to track”?

      • CC – I tried to find the temperature of some lysozyme NMR expts. Took me 4 hours, details are buried in different places, lots of details sourced from personal communications and other hard-to-track sources.

      • CC – Saw some different pHs (as low as 4), temps up to 320 K, and buffers switched to sodium phosphate instead of sodium chloride

      • MG – I think the buffer change isn’t too bad, but the other ones are significant.

      • CC – Seems like most protein simulations just do standard conditions (option 1).

      • MG – Can we directly contact the LiveCOMS coauthors to ask about their experimental conditions? Since they’re the ones that are suggesting these datasets?

      • MS – Lindorff-Larsen especially would know this.

      • JW – Could we exclude points that don’t fall within standard conditions, and not try to replicate nonstandard conditions?

      • CC – That wouldn’t save most of the work - We’d still need to find the conditions for each experiment to know if we should exclude it. It’s nasty work but someone in our field needs to do it.

      • MG – What do you mean by “same measurement under different experimental conditions?”

      • CC – Like, the backbone NMR and the sidechain NMR would be taken under different conditions.

      • MG + MS – The most straightforward route would be to ask the coauthors about their experimental conditions. Then we can exclude systems with different or unknown conditions.

      • CC – Agree. I think the current practices in the field are based on things like “we know TIP3P is not accurate, these other sources of error are less significant than that, let’s just use standard conditions for everything.”

      • JW – I’m concerned that one branch of this plan is “most people can’t provide experimental conditions and it takes a month to learn that, and then we don’t have a way to move forward”

      • CC – I’ll need to do benchmarking for about a month anyway.

      • MG – Ok, so let’s email the coauthors, and consider the plan of action here once we have their responses and our benchmarks

      • Decision – CC – I will pick a particular state to start running the timing benchmarks, and I will reach out to coauthors to collect info on experimental conditions

  • Solvent

    • Rhombic dodecahedron with 10 Å padding for folded proteins and 12 Å padding for disordered proteins

    • Neutralize with counterions then add ion pairs to match bulk salt concentration using SLTCAP

    • Decision – No objections to this plan - We will do this
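As a rough illustration of the solvent setup agreed above, the sketch below estimates the volume of a rhombic dodecahedron box and the ion counts given by the SLTCAP expression, N± = N0·sqrt(1 + (Q/2N0)²) ∓ Q/2, where N0 is the number of ion pairs expected in the solvent volume at the bulk concentration and Q is the solute net charge. The 40 Å protein extent and the +8 net charge are hypothetical inputs, not values from the meeting. Using the whole box volume as the solvent volume is a simplification, and the formula is written from memory of the published method, so it should be checked against the SLTCAP reference before use.

```python
import math

AVOGADRO = 6.02214076e23   # particles per mole
A3_TO_LITER = 1e-27        # 1 cubic angstrom = 1e-27 liter


def rhombic_dodecahedron_volume(image_distance_A):
    """Volume (A^3) of a rhombic dodecahedron whose nearest periodic
    images are image_distance_A apart: V = (sqrt(2) / 2) * d^3."""
    return (math.sqrt(2) / 2) * image_distance_A ** 3


def sltcap_ion_counts(solvent_volume_A3, salt_molar, solute_charge):
    """Return (n_cation, n_anion) for monovalent salt.

    n0 is the number of ion pairs the solvent volume would hold at the
    bulk concentration. The anion count is derived from the rounded
    cation count so that the system is exactly neutral.
    """
    n0 = salt_molar * AVOGADRO * solvent_volume_A3 * A3_TO_LITER
    half_q = solute_charge / 2
    root = math.sqrt(n0 ** 2 + half_q ** 2)  # n0 * sqrt(1 + (Q / 2 n0)^2)
    n_cation = round(root - half_q)
    n_anion = n_cation + solute_charge
    return n_cation, n_anion


if __name__ == "__main__":
    # Hypothetical folded protein: 40 A across, 10 A padding on each side.
    image_distance = 40.0 + 2 * 10.0
    volume = rhombic_dodecahedron_volume(image_distance)
    n_cation, n_anion = sltcap_ion_counts(volume, 0.150, solute_charge=8)
    print(f"box volume: {volume:.0f} A^3")
    print(f"Na+: {n_cation}, Cl-: {n_anion}")
```

A real setup would subtract the protein volume (or count waters directly) before computing N0, which lowers the ion counts somewhat.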

  • Protomers and tautomers

    • 1) Most populated protonation state at the simulation pH, choose HIE for neutral HIS

    • 2) Assign protomers from H++

      • JW – If we’re starting from e.g. extended states of the proteins, then geometry-based protonation methods could mislead us/bias us toward a known “bad” starting point.

      • MG + MS – Not sure how “good” H++ is

      • MS – Different conformations can give different protonation states, but H++ has been consistent within 1 pH unit

      • MG – It’s easy to do protonation calcs that give protonation shifts that are much larger than what’s been seen in expts.

      • DM – My experience with these protonation tools is that they do lead to substantially better results than just assigning default protonation states. And these are widely-used tools, so the choice will be defensible.

      • MG – Then I’d vote that we sanity-check H++, and go ahead with it if there are no obvious deficiencies

      • Decision – CC will use H++ or an equivalent tool at his discretion. CC will not assign standard protonation states to all AAs.
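For comparison with the H++ route, option 1 above ("most populated protonation state") reduces to a Henderson-Hasselbalch comparison of the pH against a reference pKa for each titratable side chain. The sketch below shows that arithmetic. The pKa table holds approximate model-compound values and is an assumption, not something agreed in the meeting. The calculation ignores the environment-dependent pKa shifts that tools like H++ try to estimate.

```python
# Approximate model-compound side-chain pKa values.
# ASSUMPTION: textbook-style numbers chosen for illustration only.
MODEL_PKA = {
    "ASP": 3.9,
    "GLU": 4.3,
    "HIS": 6.5,
    "CYS": 8.3,
    "TYR": 10.1,
    "LYS": 10.5,
    "ARG": 12.5,
}


def protonated_fraction(pka, ph):
    """Henderson-Hasselbalch: fraction of a site that carries the proton."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def most_populated_states(ph, pka_table=MODEL_PKA):
    """Label each residue type by its majority protonation form at this pH."""
    states = {}
    for residue, pka in pka_table.items():
        fraction = protonated_fraction(pka, ph)
        states[residue] = "protonated" if fraction > 0.5 else "deprotonated"
    return states


if __name__ == "__main__":
    for ph in (7.0, 4.0):
        print(f"pH {ph}:")
        for residue, state in most_populated_states(ph).items():
            fraction = protonated_fraction(MODEL_PKA[residue], ph)
            print(f"  {residue}: {state} ({fraction:.2f} protonated)")
```

With these assumed values, histidine is mostly neutral at pH 7 but mostly charged at pH 4, which is one reason the low-pH NMR conditions mentioned earlier matter for protomer assignment.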

Dynamics

  • Nonbonded model - match SMIRNOFF spec for Sage

    • PME electrostatics with 9 Å cutoff and 5/6 1-4 scaling

      • JW – I’m not sure if PME + X cutoff makes sense

      • CC – I think this is where the method switches from a direct space to a reciprocal space sum. For biological systems I don’t think it matters much. This is just a performance consideration.

      • JW – This is the subject of an ongoing discussion about the SMIRNOFF spec, I think that PME cutoff doesn’t make any difference to the system energy. The

    • 12-6 LJ with 9 Å cutoff, 1 Å switch width, and 1/2 1-4 scaling
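To make the "9 Å cutoff, 1 Å switch width" setting concrete, the sketch below applies a switching function to a 12-6 Lennard-Jones pair energy. The notes do not state the functional form of the switch. This example assumes the quintic form used by OpenMM, S(x) = 1 - 6x^5 + 15x^4 - 10x^3, with x running from 0 at the switch distance (cutoff minus switch width, here 8 Å) to 1 at the cutoff. The sigma and epsilon values are hypothetical.

```python
def switch(r, cutoff=9.0, switch_width=1.0):
    """Quintic switching function (ASSUMED form, as used by OpenMM).

    Returns 1 below the switch distance, 0 at or beyond the cutoff,
    and a smooth interpolation in between. Distances in angstroms.
    """
    r_switch = cutoff - switch_width
    if r <= r_switch:
        return 1.0
    if r >= cutoff:
        return 0.0
    x = (r - r_switch) / (cutoff - r_switch)
    return 1.0 - 6.0 * x ** 5 + 15.0 * x ** 4 - 10.0 * x ** 3


def lennard_jones(r, sigma, epsilon):
    """Plain 12-6 Lennard-Jones pair energy."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def switched_lennard_jones(r, sigma, epsilon, cutoff=9.0, switch_width=1.0):
    """LJ energy smoothly brought to zero between 8 A and 9 A by default."""
    return lennard_jones(r, sigma, epsilon) * switch(r, cutoff, switch_width)


if __name__ == "__main__":
    sigma, epsilon = 3.4, 0.1  # hypothetical values (angstrom, kcal/mol)
    for r in (4.0, 8.0, 8.5, 9.0):
        energy = switched_lennard_jones(r, sigma, epsilon)
        print(f"r = {r:4.1f} A  S = {switch(r):.3f}  E = {energy:.6f} kcal/mol")
```

The 1/2 scaling of 1-4 pairs is a separate multiplicative factor applied to bonded-neighbor pairs and is not shown here.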

  • Integrator

    • Langevin with frequency 1 ps⁻¹

    • Time step 2 fs with SHAKE

  • Barostat

    • Monte Carlo barostat with frequency 100 steps

  • Equilibration

    • 1 ns equilibration with 1 fs timestep, 5 ps⁻¹ Langevin frequency, 20 step barostat frequency, and 1 kcal mol⁻¹ Cartesian restraints on protein atoms

    • Post-process production trajectory to determine burn-in time
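The notes do not name a method for determining burn-in time from the production trajectory. One common idea, sketched below under that caveat, is to pick the discard point t0 that maximizes the number of effectively uncorrelated samples remaining, N_eff = (T - t0) / g, where g is the statistical inefficiency of the retained data. This is a simplified pure-Python illustration, not the protocol's chosen implementation. Established packages such as pymbar's timeseries module offer more careful versions. The toy time series is synthetic.

```python
def statistical_inefficiency(x):
    """Estimate g = 1 + 2 * sum of autocorrelations, truncated at the
    first non-positive autocorrelation value."""
    n = len(x)
    mean = sum(x) / n
    dx = [value - mean for value in x]
    variance = sum(d * d for d in dx) / n
    if variance == 0.0:
        return 1.0
    g = 1.0
    for lag in range(1, n - 1):
        cov = sum(dx[i] * dx[i + lag] for i in range(n - lag)) / (n - lag)
        c = cov / variance
        if c <= 0.0:
            break
        g += 2.0 * c * (1.0 - lag / n)
    return max(g, 1.0)


def detect_burn_in(x, stride=1):
    """Return (t0, n_eff): the discard index that maximizes the number
    of effectively uncorrelated samples left in x[t0:]."""
    n = len(x)
    best_t0, best_n_eff = 0, 0.0
    for t0 in range(0, n - 2, stride):
        tail = x[t0:]
        n_eff = len(tail) / statistical_inefficiency(tail)
        if n_eff > best_n_eff:
            best_t0, best_n_eff = t0, n_eff
    return best_t0, best_n_eff


if __name__ == "__main__":
    # Synthetic observable: a relaxing ramp followed by stationary fluctuation.
    ramp = [10.0 - t for t in range(10)]
    stationary = [1.0 if t % 2 == 0 else -1.0 for t in range(90)]
    t0, n_eff = detect_burn_in(ramp + stationary)
    print(f"discard first {t0} frames; about {n_eff:.0f} effective samples remain")
```

In practice the analysis would be applied to an observable that relaxes slowly, and a stride greater than 1 keeps the scan affordable for long trajectories.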

Action items

Decisions