2022-03-15 Protein-ligand benchmarks meeting notes

Participants

@Diego Nolasco
@David Dotson
David Swenson
Irfan Alibay
@John Chodera
@Lorenzo D'Amore
@Richard Gowers
@Chapin Cavender

Goals

DD: Review fah-alchemy user stories #9 and #10 - discuss and gather live feedback
DD: Propose candidate data model - discuss and gather feedback
DD: Questions for JC – FAH project onboarding process?

Discussion topics

Item	Notes
Procedural items	DN – No procedural items today
User story #9	CC – Want a small number of long equilibrium trajectories.
JC – 10 microseconds or milliseconds?
CC – Microseconds.
CC – My priority would be to do a diversity of large systems rather than a few.
JC – This is already pretty well-supported. Would you be starting with a prepared structure? Would this need anything beyond what’s already offered?
CC – We want a large amount of aggregate sampling, like 7 ms total.
JC – Could be outside the scope of this particular project, since FAH already satisfies this use case.
CC – Also need to engineer a pipeline for analysis, but I was planning on doing that outside of this framework.
JC – Have you seen trust-but-verify?
CC – Yes.
JC – Wonder if this may be a better match for one of the other FAH software projects; could get you running tomorrow if so (can contact “Sukrit?” in my lab): Sukrit Singh sukrit.singh@choderalab.org. There are also other software projects that are trying to automate MSM generation, etc.; may be a better fit. Sukrit can help connect you with those groups if you like.
CC – Still in a planning stage; putting together a proposal.
JC – As the FF is intended to be self-consistent between proteins and small molecules, computing affinities would make sense.
CC – So it seems like this is already solved.
JW – Could I see the input format for this?
JC – The code isn’t public; you’ll need to get access to the GitHub org.
IA – MDAnalysis is looking to make analysis packages.
CC – Will reach out to Irfan and John when we decide what the protein systems are; can gather options for how to proceed with FAH.
JW – Other reason I might want this to be supported by current infrastructure: what does the workflow look like?
JC – Suppose CC is doing large-scale PL benchmarks; should that happen the same way as large-scale protein benchmarks? Does it make sense to have him set these up manually for a start? The workflow engine within this project may be able to just support this use case, so it could be done with the same system.
JW – Less concerned about workflow nature; can we re-use that token for another system?
JC – Can train him to use FAH directly via a Chodera Lab work server; SSH access is all that’s needed for auth, plus training on use.
DD – CC, does this satisfy your requirements?
CC – I think so. I’ll reach out to John and Sukrit.
DD – I’m somewhat concerned about download throughput. We’ll have to pay a lot for downloading large trajectories from S3, so we’ll want to think about where we do the analysis, since this could be a bottleneck. If we do the analysis on AWS, then we don’t have to pay to download the trajectories.
JC – We’ll need to do a similar thing for free energy calcs. It’ll be most efficient to analyze the data on AWS and just download the results.
User story #10	(see issue #10) Need to be able to support relative free energies between point mutations of a protein; basically, instead of transforming the ligand, you transform the protein.
RG – Like that this use case forces us to think about the network data model more broadly than we might otherwise.
JW – Thinking about what a node “spec” would need for this, like maybe a tuple of (protein chemical identity, protein positions, ligand chemical identity, ligand positions).
JC – Would be careful not to go too strict; need to be able to support e.g. protein dimers, cofactors, etc. Thinking of this in terms of object models: an openff topology object can feature multiple components; perhaps you can tag these with labels that indicate what they are, and these are consumed by protocol executors. Think we have a lot of the objects we need to contain this information.
JW – Depends on what we hope to do with node data: defining equality, what does it mean to hash them, etc. How do we make sure the node label has what it needs to achieve the comparisons?
JC – Comes down to what our definition of a node is; is it a binding pose, for example?
Propose candidate data model - discuss and gather feedback	Slides
DS – “Data model” slide – The choice to put FF in the edge spec instead of the node spec is interesting.
DD – The transitions between nodes are free energy differences that need to be computed with an FF.
JC – To do an analysis of a network, you’ll need to pull down all the edges that use the same force field. Also, for example, if you used different atom mappings to go between the same nodes, you’d need to be able to pull those out separately too to compare the mapping algorithms.
DD – So, I expect that this will be a complex network, and users will need to filter out which nodes/edges they want to use.
IA – A comment on what’s available: if we’re going to do something like bookending (a term from QM-MM where they treat different parts of a system differently), how would that be recorded?
JC – Good question. This would change the meaning of a node.
DD – JC and I discussed this yesterday. One way we could do this would be to allow “self loops” (probably edges that go from one node to another).
JW – Can I have two identical edges between a pair of nodes?
DD – We should avoid cases where edges are identical, but we should allow some cosmetic way to “break the hash”.
DS – So the question is kinda “can one edge point to several transformations?”
DD – The question is whether we need to enforce a one-to-one mapping between edges and submitted calculations.
IA –
DS – I’d think that we should have a one-to-many.
RG – Maybe we should define an edge as “one attempt to measure this value”. Then we could also throw experimental results into one of these networks.
DD – So an edge would have one or more repeats?
RG – …
DD – …
RG – An edge should result in a value and an uncertainty.
IA – Another thing to consider is that initial conditions are important. So you may want to have different edges for different restraints.
JW – Given that we’ll be writing specific methods to operate on the information content of nodes and edges, but we keep thinking of more things that we’ll need to add to nodes and edges, perhaps we should take an approach of assuming that we don’t know what all will be needed, and allow just about everything to be metadata.
“Self looping transformations” slide
IA – I have some questions about how I’d access the data from a self-loop, or how I’d know I need to access it.
DD – Good question.
IA – This was similar to JC’s earlier proposal of running a “sanity check” simulation before running the full calc.
DD – We might consider whether this is a “need” for the architecture, since this can be worked around.
IA – This sort of simulation could add up to be a lot of compute cost.
DD – Good question.
DS – I’m thinking about how I’ll interact with this network with self-loops.
DD – Downstream estimators that are estimating delta-Gs should ignore self-loops. Like, by default, a shortest-path algorithm would skip self-loops. But that doesn’t work if you NEED to use them for some sort of bookending.
JW – I’m wondering if we should pack ALL the info into the nodes, and then when users need to get a simplified representation of the graph, they just say what they think defines a node and we return to them a graph with the nodes grouped that way.
DD – That has a lot of implications. What are the nice features of having FFs on nodes?
DS – You could have a transformation between two FFs with exactly the same protein and ligand, and get the deltaG of the FF change.
RG – I’d prefer to keep FF off the nodes, and solve this sort of thing with self-loops.
JC – FF info could go on nodes, edges, or graph. We want to target the cleanest mix of implementation and usability. It’s possible that views could handle this complexity for you.
DS – A more detailed graph with lots of info on nodes can always be collapsed into a simpler graph by just pruning/grouping things. I advocate for putting FF on nodes. I think it’s easier to move info from nodes to edges than vice versa.
DD – One problem is that, if I wanted to use a node that does a representation of protein + openff-1.2.0 ligand and protein + openff-2.0.0 ligand, these would be separate nodes.
DS – This would be helpful because this would let us reuse approaches elsewhere.
DD – Are there advantages to putting FFs on edges?
IA – It buys the process into an MD engine early on. So for example an ff19SB transformation could never use GROMACS to compute an edge (because CMAPs are not supported).
DD – I see the FF and engine as a kind of “implementation detail” of the transformations.
JW – What would the world look like if we put everything conceivable on nodes, and just let edges only contain the info that absolutely doesn’t make sense to put on nodes?
IA – Use of HMR? Where does that go?
RG – It’s easier to reuse nodes if they’re simpler.
JW will come up with arguments for “FFs on nodes”.
JW and DD will discuss pros and cons of different information contents in different locations.
Next meeting we will discuss and make a decision on “where do FFs go in this graph?”
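
For reference, the “FF on edges” option discussed above could look roughly like the sketch below. It is only an illustration to support the discussion, not the fah-alchemy data model: the class names `Component`, `Node`, and `Edge`, the `salt` field used to “break the hash”, the helper `select_edges`, and the placeholder identities are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Component:
    """One labeled piece of a chemical system (hypothetical)."""
    label: str     # e.g. "protein", "ligand", "cofactor"
    identity: str  # chemical identity, e.g. a name or SMILES string


@dataclass(frozen=True)
class Node:
    """A chemical system as a set of labeled components; carries no FF."""
    components: FrozenSet[Component]


@dataclass(frozen=True)
class Edge:
    """One attempt to measure a free energy difference between two nodes."""
    start: Node
    end: Node
    forcefield: str           # FF lives on the edge in this variant
    mapping: str = "default"  # atom-mapping algorithm (placeholder name)
    salt: str = ""            # cosmetic field to "break the hash"

    @property
    def is_self_loop(self) -> bool:
        return self.start == self.end


def select_edges(edges: Iterable[Edge], forcefield: str,
                 include_self_loops: bool = False) -> List[Edge]:
    """Return edges computed with one FF, skipping self-loops by default."""
    return [
        e for e in edges
        if e.forcefield == forcefield
        and (include_self_loops or not e.is_self_loop)
    ]


# Placeholder systems: one protein with two different ligands.
protein = Component("protein", "PROTEIN_A")
lig_a = Component("ligand", "c1ccccc1")
lig_b = Component("ligand", "Cc1ccccc1")
complex_a = Node(frozenset({protein, lig_a}))
complex_b = Node(frozenset({protein, lig_b}))

e1 = Edge(complex_a, complex_b, "openff-1.2.0")
e2 = Edge(complex_a, complex_b, "openff-2.0.0")
e3 = Edge(complex_a, complex_a, "openff-2.0.0")                   # self-loop
e4 = Edge(complex_a, complex_b, "openff-2.0.0", salt="repeat-2")  # distinct from e2
edges = [e1, e2, e3, e4]

default_view = select_edges(edges, "openff-2.0.0")
bookending_view = select_edges(edges, "openff-2.0.0", include_self_loops=True)
```

Because the dataclasses are frozen, equality and hashing follow directly from the field values. Two edges with identical content compare equal unless something like `salt` differs, and a downstream estimator can ignore self-loops by default while still opting in when bookending needs them.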

Action items

@David Dotson and @Jeffrey Wagner will work together to assemble a list of pros/cons for FF information on nodes vs. on edges
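
As one possible input to that comparison, the “FF on nodes” option raised in the discussion can be pictured as a detailed graph that is collapsed into a simpler view by grouping nodes on a user-chosen key. The sketch below is an assumption-laden illustration, not a proposed implementation: detailed nodes are plain `(system, forcefield)` tuples, and `collapse`, the system names, and the `calc-N` labels are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

# A detailed node packs everything in: (chemical system, force field).
DetailedNode = Tuple[str, str]
DetailedEdge = Tuple[DetailedNode, DetailedNode, str]  # (start, end, calculation label)


def collapse(
    edges: Iterable[DetailedEdge],
    key: Callable[[DetailedNode], Hashable],
) -> Dict[Tuple[Hashable, Hashable], List[str]]:
    """Group detailed nodes by key(node) and merge the edges between groups."""
    grouped: Dict[Tuple[Hashable, Hashable], List[str]] = defaultdict(list)
    for start, end, calc in edges:
        grouped[(key(start), key(end))].append(calc)
    return dict(grouped)


detailed_edges: List[DetailedEdge] = [
    (("protein+ligA", "openff-1.2.0"), ("protein+ligB", "openff-1.2.0"), "calc-1"),
    (("protein+ligA", "openff-2.0.0"), ("protein+ligB", "openff-2.0.0"), "calc-2"),
    # Same system, different FF: the deltaG of the FF change.
    (("protein+ligA", "openff-1.2.0"), ("protein+ligA", "openff-2.0.0"), "calc-3"),
]

# The user says "a node is just the chemical system" and gets a simpler graph.
by_system = collapse(detailed_edges, key=lambda node: node[0])

# Keeping the full node as the key leaves the detailed graph unchanged.
unchanged = collapse(detailed_edges, key=lambda node: node)
```

In the collapsed view, the two ligand transformations become a single edge with one-to-many calculations, and the FF-change transformation turns into a self-loop on `protein+ligA`. This suggests the two options differ mainly in which view is stored and which is derived.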
