User story #9
| CC – Want a small number of long trajectories, equilibrium.
JC – 10 microseconds or milliseconds?
CC – My priority would be to do a diversity of large systems rather than a few.
JC – This is already pretty well-supported. Would you be starting with a prepared structure? Would this need anything beyond what’s already offered?
CC – We want a large amount of aggregate sampling. So like 7 ms total.
JC – Could be outside the scope of this particular project, since FAH already satisfies this use case.
CC – Also need to engineer a pipeline for analysis, but I was planning on doing that outside of this framework.
JC – Wonder if this may be a better match for one of the other FAH software projects; could get you running tomorrow if so (can contact Sukrit in my lab: Sukrit Singh, sukrit.singh@choderalab.org). There are also other software projects that are trying to automate MSM generation, etc., which may be a better fit. Sukrit can help connect you with those groups if you like.
CC – Still in a planning stage; putting together a proposal.
JC – As the FF is intended to be self-consistent between proteins and small molecules, computing affinities would make sense.
CC – So it seems like this is already solved.
JW – Could I see the input format for this?
IA – MDAnalysis is looking to make analysis packages.
CC – Will reach out to Irfan and John when we decide what the protein systems are; can gather options for how to proceed with FAH.
JW – Other reason I might want this to be supported by current infrastructure: what does the workflow look like?
JW – Less concerned about workflow nature; can we re-use that token for another system?
DD – CC, does this satisfy your requirements?
DD – I’m somewhat concerned about download throughput. We’ll have to pay a lot for downloading large trajectories from S3, so we’ll want to think about where we do the analysis, since this could be a bottleneck. If we do the analysis on AWS, then we don’t have to pay to download the trajectories.
|
User story #10 | (see issue #10)
RG – Like that this use case forces us to think about the network data model more broadly than we might otherwise.
JW – Thinking about what a node “spec” would need for this, like maybe a tuple of (protein chemical identity, protein positions, ligand chemical identity, ligand positions).
JC – Would be careful not to go too strict; need to be able to support e.g. protein dimers, cofactors, etc.
JW – Depends on what we hope to do with node data: defining equality, what it means to hash them, etc. How do we make sure the node label has what it needs to achieve the comparisons?
JC – Comes down to what our definition of a node is; is it a binding pose, for example?
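To make the equality and hashing question concrete, here is a minimal sketch of a node spec, assuming a node is an ordered collection of components rather than a fixed four-element tuple, so that dimers and cofactors fit. The names `Component`, `NodeSpec`, and `chemical_key` are hypothetical and not part of any agreed design; the identities and coordinates are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

Position = Tuple[float, float, float]


@dataclass(frozen=True, order=True)
class Component:
    """One chemical component of a node (hypothetical class)."""
    role: str                              # e.g. "protein", "ligand", "cofactor"
    identity: str                          # e.g. a sequence or SMILES string
    positions: Tuple[Position, ...] = ()   # tuples so the component stays hashable


@dataclass(frozen=True)
class NodeSpec:
    """A node as a sorted tuple of components (hypothetical class)."""
    components: Tuple[Component, ...]

    @classmethod
    def from_components(cls, *components):
        # Sorting makes equality and hashing independent of argument order.
        return cls(tuple(sorted(components)))

    def chemical_key(self):
        """Identity-only key: two poses of the same system share this key."""
        return tuple((c.role, c.identity) for c in self.components)


# Two binding poses of the same protein-ligand system (placeholder data).
pose_a = NodeSpec.from_components(
    Component("protein", "MKT", ((0.0, 0.0, 0.0),)),
    Component("ligand", "CCO", ((1.0, 0.0, 0.0),)),
)
pose_b = NodeSpec.from_components(
    Component("ligand", "CCO", ((2.0, 0.0, 0.0),)),
    Component("protein", "MKT", ((0.0, 0.0, 0.0),)),
)
```

Under this sketch, `pose_a == pose_b` is false because the ligand positions differ, while `pose_a.chemical_key() == pose_b.chemical_key()` is true. Whether a node "is a binding pose" then becomes a choice of which key to compare on.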
|
Propose candidate data model - discuss and gather feedback | Slides
DS – “Data model” slide – The choice to put FF in the edge spec instead of the node spec is interesting.
DD – The transitions between nodes are free energy differences that need to be computed with an FF.
JC – To do an analysis of a network, you’ll need to pull down all the edges that use the same force field. Also, for example, if you used different atom mappings to go between the same nodes, you’d need to be able to pull those out separately too, to compare the mapping algorithms.
DD – So, I expect that this will be a complex network, and users will need to filter out which nodes/edges they want to use.
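The filtering JC and DD describe can be sketched as follows, assuming the force field and atom mapping are stored as plain attributes on each edge. The `Edge` class, the `select` helper, and the mapper names are hypothetical illustrations, not the proposed data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A transformation between two nodes (hypothetical class)."""
    start: str
    end: str
    forcefield: str
    mapping: str


edges = [
    Edge("ligA", "ligB", "openff-1.2.0", "mapper-1"),
    Edge("ligA", "ligB", "openff-1.2.0", "mapper-2"),
    Edge("ligA", "ligB", "openff-2.0.0", "mapper-1"),
    Edge("ligB", "ligC", "openff-2.0.0", "mapper-1"),
]


def select(edges, **criteria):
    """Return the edges whose attributes match every keyword criterion."""
    return [
        edge for edge in edges
        if all(getattr(edge, name) == wanted for name, wanted in criteria.items())
    ]


# All edges computed with one force field, for a network analysis.
same_ff = select(edges, forcefield="openff-2.0.0")

# Same pair of nodes and force field, different mappings, for comparing mappers.
by_mapper = select(edges, start="ligA", end="ligB", forcefield="openff-1.2.0")
```

Here `same_ff` holds two edges and `by_mapper` holds the two `openff-1.2.0` edges that differ only in their mapping.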
IA – A comment on what’s available: if we’re going to do something like bookending (a term from QM/MM where they treat different parts of a system differently), how would that be recorded?
JC – Good question. This would change the meaning of a node.
DD – JC and I discussed this yesterday. One way we could do this would be to allow “self loops” (probably edges that go from one node to another
JW – Can I have two identical edges between a pair of nodes?
DD – We should avoid cases where edges are identical, but we should allow some cosmetic way to “break the hash”.
DS – So the question is kind of “can one edge point to several transformations?”
DD – The question is whether we need to enforce a one-to-one mapping between edges and submitted calculations.
IA –
DS – I’d think that we should have a one-to-many.
RG – Maybe we should define an edge as “one attempt to measure this value”. Then we could also throw experimental results into one of these networks.
DD – So an edge would have one or more repeats?
RG – …
DD – …
RG – An edge should result in a value and an uncertainty.
IA – Another thing to consider is that initial conditions are important. So you may want to have different edges for different restraints.
JW – We’ll be writing specific methods to operate on the information content of nodes and edges, but we keep thinking of more things that we’ll need to add to them. Perhaps we should assume that we don’t know everything that will be needed, and allow just about everything to be metadata.
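One way to read RG’s “one attempt to measure this value” together with DD’s “break the hash” idea is sketched below. It assumes identity is defined by the end points, the method, and a cosmetic label, while results and free-form metadata are excluded from equality and hashing. The `Attempt` class, its field names, and all numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Attempt:
    """One attempt to measure a free energy difference (hypothetical class)."""
    start: str
    end: str
    method: str      # e.g. "calculation" or "experiment"
    label: str = ""  # cosmetic field, used only to "break the hash"
    # Results and metadata are excluded from equality and hashing.
    value: Optional[float] = field(default=None, compare=False)
    uncertainty: Optional[float] = field(default=None, compare=False)
    metadata: dict = field(default_factory=dict, compare=False)


first = Attempt("ligA", "ligB", "calculation")
duplicate = Attempt("ligA", "ligB", "calculation", value=-1.2, uncertainty=0.3)
restrained = Attempt("ligA", "ligB", "calculation", label="with-restraints",
                     metadata={"restraint": "placeholder"})
measured = Attempt("ligA", "ligB", "experiment", value=-1.0, uncertainty=0.1)

# Identical attempts collapse in a set; a different label keeps them apart.
attempts = {first, duplicate, restrained, measured}

# One-to-many: every pair of nodes maps to all attempts made on it.
by_pair = {}
for attempt in attempts:
    by_pair.setdefault((attempt.start, attempt.end), []).append(attempt)
```

In this sketch `first` and `duplicate` count as the same edge, so the set keeps three attempts for the `("ligA", "ligB")` pair: one plain calculation, one labelled calculation, and one experimental result.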
“Self looping transformations” slide
IA – I have some questions about how I’d access the data from a self-loop, or how I’d know I need to access it.
DD – Good question.
IA – This was similar to JC’s earlier proposal of running a “sanity check” simulation before running the full calc.
DD – We might consider whether this is a “need” for the architecture, since this can be worked around.
IA – This sort of simulation could add up to be a lot of compute cost.
DD – Good question.
DS – I’m thinking about how I’ll interact with this network with self-loops.
DD – Downstream estimators that are estimating delta-Gs should ignore self-loops. Like, by default, a shortest-path algorithm would skip self-loops. But that doesn’t work if you NEED to use them for some sort of bookending.
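The default behaviour DD describes can be sketched as follows, assuming edges are simple `(start, end, delta_g)` tuples and that traversing an edge backwards flips the sign of delta-G. The function names and the numbers are hypothetical; a breadth-first search stands in for whatever shortest-path method an estimator would actually use.

```python
from collections import deque

# (start, end, delta_g); placeholder values.
edges = [
    ("A", "B", -1.0),
    ("B", "C", 0.5),
    ("B", "B", 0.25),  # self-loop on node B
]


def split_self_loops(edges):
    """Separate ordinary edges from self-loops so each can be handled explicitly."""
    regular = [e for e in edges if e[0] != e[1]]
    loops = [e for e in edges if e[0] == e[1]]
    return regular, loops


def path_delta_g(edges, source, target):
    """Sum delta-G along a fewest-edges path, or return None if unreachable."""
    neighbours = {}
    for start, end, delta_g in edges:
        neighbours.setdefault(start, []).append((end, delta_g))
        neighbours.setdefault(end, []).append((start, -delta_g))  # reverse direction
    queue = deque([(source, 0.0)])
    seen = {source}
    while queue:
        node, total = queue.popleft()
        if node == target:
            return total
        for nxt, delta_g in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, total + delta_g))
    return None


regular, loops = split_self_loops(edges)
a_to_c = path_delta_g(regular, "A", "C")
```

A path search that visits each node once never traverses a self-loop in any case, which matches the caveat above: any bookending contribution held in `loops` would have to be applied by a separate, explicit step.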
JW – I’m wondering if we should pack ALL the info into the nodes, and then when users need to get a simplified representation of the graph, they just say what they think defines a node and we return to them a graph with the nodes grouped that way.
DD – That has a lot of implications. What are the nice features of having FFs on nodes?
DS – You could have a transformation between two FFs with exactly the same protein and ligand, and get the delta-G of the FF change.
RG – I’d prefer to keep FF off the nodes, and solve this sort of thing with self-loops.
JC – FF info could go on nodes, edges, or graph. We want to target the cleanest mix of implementation and usability. It’s possible that views could handle this complexity for you.
DS – A more detailed graph with lots of info on nodes can always be collapsed into a simpler graph by just pruning/grouping things. I advocate for putting FF on nodes. I think it’s easier to move info from nodes to edges than vice versa.
DD – One problem is that, if I wanted a node that represents protein + openff-1.2.0 ligand and another that represents protein + openff-2.0.0 ligand, these would be separate nodes.
DS – This would be helpful because this would let us reuse approaches elsewhere.
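JW’s and DS’s idea of collapsing a detailed graph into a simpler view can be sketched as below, assuming each node is a `(system, forcefield)` tuple and the user supplies a function saying what defines a node. The `collapse` helper and the calculation names are hypothetical.

```python
# Detailed graph: the force field is part of each node's identity.
detailed_edges = [
    (("protein+ligA", "openff-1.2.0"), ("protein+ligB", "openff-1.2.0"), "calc-1"),
    (("protein+ligA", "openff-2.0.0"), ("protein+ligB", "openff-2.0.0"), "calc-2"),
    # Same system, different force fields: the delta-G of the FF change.
    (("protein+ligA", "openff-1.2.0"), ("protein+ligA", "openff-2.0.0"), "calc-3"),
]


def collapse(edges, node_key):
    """Group nodes by node_key and collect the edge data of the simplified graph."""
    grouped = {}
    for start, end, data in edges:
        grouped.setdefault((node_key(start), node_key(end)), []).append(data)
    return grouped


# The user decides that only the chemical system defines a node.
by_system = collapse(detailed_edges, node_key=lambda node: node[0])
```

In the collapsed view, the two ligA to ligB calculations become parallel attempts on one edge, and the force-field change becomes a self-loop on `"protein+ligA"`, which is the representation RG prefers. The sketch only shows that the detailed form can be reduced to it, not which form should be stored.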
DD – Are there advantages to putting FFs on edges?
IA – It buys the process into an MD engine early on. So, for example, an ff19SB transformation could never use GROMACS to compute an edge (because CMAPs aren’t supported).
DD – I see the FF and engine as a kind of “implementation detail” of the transformations.
JW – What would the world look like if we put everything conceivable on nodes, and just let edges only contain the info that absolutely doesn’t make sense to put on nodes?
RG – It’s easier to reuse nodes if they’re simpler.
JW will come up with arguments for “FFs on nodes”.
JW and DD will discuss pros and cons of different information contents in different locations.
Next meeting we will discuss and make a decision on “where do FFs go in this graph?”
|