2024-06-13 Protein FF meeting note

Participants

  • @Pavan Behara

  • @Chapin Cavender

  • @Anika Friedman

  • @Michael Gilson

  • Adam Hogan

  • @Alexandra McIsaac

  • @David Mobley

  • Louis Smith

  • @Lily Wang

  • @Jeffrey Wagner

  • @Brent Westbrook (Unlicensed)

  • @Michael Shirts

Goals

  • Update on QM fits using ForceBalance objective function with pairwise energy differences

  • Update on supplementing QM training data with PDB structures

Recording

2024-06-13-biopolymer-ff-meeting.mp4

Discussion topics

Item

Presenter

Notes

Pairwise energy differences

 

@Chapin Cavender

  • JW – Does it seem like the actual direction of the gradients is wrong, or that the magnitude is way off (by like a factor of 100)?

    • CC – I tried setting the scaling factor from 100 to 1 and still saw this issue.

  • LS – Could you clarify “grid point”? Might be good to try this change on a simpler system to help debug.

    • CC – Grid points are individual values in a torsion scan (steps of 15 degrees).

    • CC – After seeing this problem on the whole dataset, I’m now debugging using a single peptide. Could possibly make it even smaller.

  • CC – Overall, I don’t think there are scientific issues here, just needs debugging.
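A minimal sketch of the pairwise idea discussed above, assuming the objective compares every MM energy difference between two grid points of a torsion scan against the corresponding QM difference. The function name, the unweighted mean-squared form, and the example energies are illustrative assumptions, not the ForceBalance implementation or the code being debugged.

```python
import numpy as np


def pairwise_energy_objective(e_mm, e_qm):
    """Mean squared error over all pairwise energy differences in one scan.

    e_mm, e_qm: 1D sequences of MM and QM energies, one value per grid point.
    (Hypothetical helper for illustration only.)
    """
    e_mm = np.asarray(e_mm, dtype=float)
    e_qm = np.asarray(e_qm, dtype=float)
    if e_mm.ndim != 1 or e_mm.shape != e_qm.shape:
        raise ValueError("e_mm and e_qm must be 1D arrays of equal length")
    if e_mm.size < 2:
        raise ValueError("need at least two grid points to form a pair")

    # Per-point residual r_i = E_MM(i) - E_QM(i).
    resid = e_mm - e_qm
    # Error in the (i, j) energy difference is r_i - r_j.
    diff = resid[:, None] - resid[None, :]
    # Keep each unordered pair once (i < j).
    i, j = np.triu_indices(resid.size, k=1)
    return float(np.mean(diff[i, j] ** 2))


# Example with made-up energies (kcal/mol) for a 4-point scan.
qm = [0.0, 1.2, 3.5, 0.8]
mm = [2.0, 3.0, 5.9, 2.6]
print(pairwise_energy_objective(mm, qm))
```

Because only differences enter, adding a constant to either energy set leaves the value unchanged, so no reference grid point has to be chosen.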

PDB training data

@Anika Friedman

  • MG – Looks like there are more points on the wrong torsion plot (slide 9) than on the right-hand side.

    • AF – All sidechain scans are concentrated on single points.

  • MG – Are sidechain scans that appear to be overweighting alpha and beta basins affecting the backbone parameter fits?

    • CC – The fit is fitting all tagged parameters to all data (backbone AND sidechain scans). However, the sidechain scans have the backbone torsions fixed to the same value every time, so the variation in those angles shouldn’t affect the result, especially in an ab initio fit with no optimizations.

  • MG – What’s the arc/curve of points going across these plots?

    • AF + CC – we think it’s from CYX scans, where there’s a second residue present that ISN’T constrained.

    • MG – The curves seem to have 400ish points, which is as many points as I see in the grid.

    • AF – Each of the grid “points” on the 2d plot actually has many points (molecule dihedral scan points) at exactly the same coordinates.

  • MS – If we were overfitting to alpha, I’m surprised we were having trouble getting that right in simulation.

    • AF – Point distribution is about 1/3rd alpha, 1/3rd beta, and 1/3rd other (?)

    • MG – Relative energy is the important thing, so we need to have a balance of training points… Where do our sims go now?

    • CC – Sims largely go to alpha, delta, beta, and p2.

    • …?

    • MG – Do we want to sample confs from protein structures evenly or with some bias? Do we want to include dipeptides, tripeptides? Could make more QM data and try throwing it in training.

    • AF – Could do single point

    • DM – Are there cheaper benchmarks than protein stability experiment?

    • CC – I’ve been thinking about this. This has been a big blocker for the project. I’d hoped that short peptide benchmarks would suffice in this capacity but they don’t. So we could use a new type of benchmark.

    • DM – Maybe something with alphafold?

      • CC – Maybe… not sure how the details of this would work.

    • CC – Maybe something with reweighting a reference sim?

      • DM – That’s a good thought. MS, would reweighting work in this capacity?

      • PB (chat) – I have another idea: can we also check the AMBER ff14SB training set distribution in these regions? I don't know if the ff14SB training data is out there, but I could find the FB15 training data here: https://github.com/leeping/forcebalance/tree/master/Products/AMBER-FB15

      • MS – For small changes it should be OK. Eg you could make a small change in the FF and see how the weights change. Would need to have a good reference simulation that visits the confs of interest… You could try reweighting from a trajectory generated with ff14SB but it might be different in important ways. This might be a good idea. CSimmerling said ff14SB used reweighting for NMR fitting. Possibly worth trying.

      • LS – I like this idea a lot. If we’re worried about having the right confs in the dataset, you could bias them in eg using native contacts.

      • DM – Big picture we need to get unstuck by making this experiment where we check things much faster. So I’d advocate generating data for reweighting as a top priority.

      • MG – And this would be sampling GB3?

      • (General) – Yes, and possibly more proteins.

      • LS – Umbrella sampling in the native contacts would be great.

      • MS – Defining unfolded state gets really complicated.

      • CC – DM is talking about making tests run faster. LS is proposing fitting torsions to the data we have. What I’m proposing is, if we take a trajectory that unfolds, and use it to rescore, then we already have that data. To generate new data would be more work.

      • DM – I think this works as long as you have enough good structures in the current trajectory. If you don’t have enough good structures then the umbrella sampling thing could help.

      • MG – It’s hard since we don’t have reference results.

      • CC – We have NMR data.

      • MS – NMR data might have insufficient resolution, so MG might be right.

    • PB (chat) Tangential, if Chapin wants to test out more fitting ideas and is compute limited, or out of time to run all his ideas, I think we should have a system where he can delegate running different fits to the team by passing input files so that he can iterate faster

    • (General) – Can delegate sims to AF and PB

    • CC – I think implementing the pairwise fits should come first.

    • DM – That makes sense, but if there’s downtime the reweighting work would be a great thing to try.

    • MG – How does reweighting speed up our iterations?

      • DM – We wouldn’t need to rerun the simulations each time.

      • CC – Right, and we can get some measure of agreement with NMR data.

      • MG – Factor of speedup if we do this?

      • CC – Factor of ~1000x, since we’re just re-evaluating spaced frames. And since we’re only refitting dihedrals we don’t even need whole trajectory, could just do reevaluation of dihedral energies.

      • MS – Make sure to calculate number of effective samples, should be around… (technical details, see recording ~46 mins)

  • CC – Ok, so I’ll get the pairwise objective function working, and meanwhile will reach out to AF and PB to distribute simulation work.

  • DM – And work with AF to implement reweighting if that’s helpful.
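A minimal sketch of the reweighting idea discussed above, assuming frames from a reference trajectory are re-scored with a modified force field and an observable is averaged with Boltzmann weights. The function name, the 300 K temperature, the kcal/mol units, and the example numbers are illustrative assumptions. The effective sample count here is the Kish estimate, one common choice that is not necessarily the one described in the recording.

```python
import numpy as np

# k_B * T in kcal/mol at an assumed temperature of 300 K.
KT_KCAL_PER_MOL = 0.0019872041 * 300.0


def reweight(u_ref, u_new, observable, kT=KT_KCAL_PER_MOL):
    """Reweight a per-frame observable from a reference FF to a new FF.

    u_ref, u_new: per-frame potential energies under each force field.
    observable:   per-frame value to average (e.g. a back-calculated NMR quantity).
    Returns (reweighted mean, Kish effective sample size).
    (Hypothetical helper for illustration only.)
    """
    u_ref = np.asarray(u_ref, dtype=float)
    u_new = np.asarray(u_new, dtype=float)
    observable = np.asarray(observable, dtype=float)
    if not (u_ref.shape == u_new.shape == observable.shape) or u_ref.ndim != 1:
        raise ValueError("inputs must be 1D arrays of equal length")

    # Log-weights: w_i is proportional to exp(-(U_new - U_ref) / kT).
    log_w = -(u_new - u_ref) / kT
    log_w -= log_w.max()  # shift for numerical stability
    w = np.exp(log_w)
    w /= w.sum()

    mean = float(np.sum(w * observable))
    n_eff = float(1.0 / np.sum(w ** 2))  # Kish effective sample size
    return mean, n_eff


# Example with made-up energies (kcal/mol) and observable values.
u_ref = [-10.0, -9.5, -10.2, -9.8]
u_new = [-10.1, -9.4, -10.0, -9.9]
obs = [6.8, 7.4, 5.9, 7.0]
mean, n_eff = reweight(u_ref, u_new, obs)
print(mean, n_eff)
```

If only dihedral parameters change, `u_new - u_ref` reduces to the difference in dihedral energies per frame, which is why the re-evaluation is cheap. A small effective sample count relative to the number of frames signals that the reference trajectory does not cover the conformations the new force field favors.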

Action items

Decisions