Notes |
---|
DD – The third option above is a new idea that may help us identify the effects of connections. I want to give the working group the raw material for a decision, so our goal today isn’t to make the decision but to list the pros and cons. Then we can make a decision at the meeting (even if we say “this is what we’ll assume for now”).
JW – I’m going to try to argue that any choice of putting info on edges is a SUBSET of what could be achieved by putting
What about the case where we have the SAME (to a human) protein+ligand combination, but there are minor differences (atom types, particle numbers, etc.) that are required for it to be used with different FFs? Can one node represent multiple PDB files (or other structure definitions)? It probably can, and so in any design, a human must be able to record in the data “these two unequal input files are the same thing”.
JW – Picking a good or bad design can add complexity on its own, but there’s some complexity that will be in here regardless of our design. What is that complexity, and how can we force it to live in one place so we can reason about it?
Nonidentical systems that humans consider equal (“this atom-typed protein is the same as that chemical protein”, or “this protein-ligand system with explicit water is the same as that protein-ligand system that says ‘add a solvent box to me before running’”)
Rerunning the same transformation, even though it ran successfully the first time
Force fields can be represented in different ways (a name, a string containing all contents, a pickled object, …)
Choice of alchemical protocol, and the mapping of the protocol to 1 or more simulations
Choice of free energy estimator in evaluating data on a single transformation
Choice of graph free energy estimator in evaluating data for transformations in the graph
|
forcefield information on nodes of an alchemical graph
Pros
JW would argue that every other representation can be distilled down as a VIEW of a graph with FF info on nodes
Node definition more closely matches the concept of a “microstate”; includes atom identities, coordinates X and potential U(X)
If FF info is NOT on nodes, then all molecular representations in the graph MUST be chemical representations (no atom-typed representations!)
Explicitly records human intent (“here’s a fake edge saying that these two input structures are the same” or “I’m giving these two nodes the same label because they represent the same chemistry, to me”)
Cons
Node equality operator and view mechanism would need to learn how to compare FFs (which could be represented in a variety of formats)
For a network intended to benchmark many forcefields on its own, requires many more nodes instead of edges
Would we really ever model the transformation for the same chemical system transitioning from one FF to another? This use case is not high-priority. Does it ever make sense to draw an edge between e.g. a node with gaff-2.11 and a node with openff-1.3.0?
If we did the same network of transformations in two FFs, there would be no edges between the nodes with one FF and the nodes with another. We’d need to make assumptions about how the networks relate to each other
DD – effectively becomes option (3) implicitly for the use case of FF comparison
JW – Counterpoint: I could imagine saying network.view(identity=('protein','ligand')), and the nodes that use DIFFERENT FFs but the SAME protein+ligand (handwaving) would get merged in the final view I receive.
DD – I see. In form it looks similar to the third option. But this option allows the two graphs to be connected.
DD – Would it be helpful/necessary to have edges that represent “identity”?
forcefield information on edges of an alchemical graph
Pros
JW – Human intent (e.g. “these two nonidentical input files (one with atom types and the other with elements+bond orders) are actually equal as far as I’m concerned”) is unambiguously recorded when the user adds two different structure files to the same node
Because we’ve chosen to put the FF on the edges, it affords us more flexibility in defining the system on the node
If there are multiple structures on a node, the edges need to know which structure they should use to start simulations
Ties the choice of FF to the alchemical protocol, and therefore the dynamics, between two systems
Because it puts the FF alongside the protocol, allows for protocols that morph between more than one FF
Cons
forcefield information on the alchemical graph itself
|
DD brainstorming notes
I am still leaning toward FF being a property of an edge. Despite the challenges this presents for working generally across FFs, MD engines, and chemical systems, I think it’s worth driving in this direction because it avoids a lot of the contortions we have to do with the other approaches (contortions that indicate to me we’ve chosen a less optimal abstraction than we could have). I’m happy to be challenged on this though.
|
JW brainstorming notes
Let’s define two terms:
We have a choice. Either:
A “starting point” complex must only ever be represented by one “structure object”, OR
a “starting point” complex can be represented by multiple nonidentical “structure objects”, and either
many “structure objects” are assumed to be identical if they belong to the same node, OR
a node can only ever hold one “structure object”, AND either
special edges must be added to the network to indicate that two nodes have identical “structure objects”, OR
each “structure object” has metadata which can be used to establish its identity and relation to other “structure objects”
|
Will resume at 11 AM Pacific (1 hour)
Summary: So, when we talk about “where should the FF go in the graph”, we’re really talking about something bigger: “Where does all the logic that prepares a system for a simulation using a specific FF go in the graph?”. The options are:
In the nodes (multiple nodes per “starting point”, each making assumptions for a different simulation engine/protocol):
In the edges (DD and JW advocate this choice):
Pro
Con
This doesn’t allow someone to isolate the effects of choosing a particular FF, since assumptions about structure prep/conversion are also included in the edge definition
We want a solution that minimizes and consolidates complexity. There are two sources of complexity: “inherent complexity” due to the science of what we’re doing, and “design complexity” that arises from the choice of architecture.
If edges can be seen as a function like pmx_workflow(node1, node2, ff, ...), how many kwargs should that function have? A generous "put the complexity in the kwargs" philosophy puts the "scientific complexity" on the shoulders of developers (which it should!), but it also puts some amount of "design complexity" on them by requiring everything to squeeze through the lens of "a python function with kwargs which are specific to a package version, and require us to start maintaining an API and worrying about reverse compatibility..."
It will be extremely hard to manually intervene to resolve the very probable, very complex structure processing issues that occur in the edges. If manual intervention is allowed and becomes commonplace, there will be large data consistency/reproducibility issues. We must commit fully to the concept that technical issues ARE scientific issues if we move forward with this design. So we may need to release entirely new versions of workflows to solve very simple bugs.
Implicit in the graph itself
Pro:
Con:
The relationship between two networks, each calculating dGs between the same complexes, but using different protocols, will not have a clear mapping from the nodes of one to the nodes of another. That mapping will need to be stored externally.
One idea is to have clusters of nodes (imagine planet + moons) where a central node encodes the general form of a chemical system and the “moon” nodes encode the more engine/protocol-specific versions. Moon nodes from different clusters could be connected by transformation edges, which carry the protocol details that operate on the moon nodes to execute the alchemical transformation.
We think this could be a useful concept to try if we get stuck with the simpler approach, in which all engine/protocol-specific processing happens as part of the edge protocols themselves
Could also function as an optimization of the simpler approach, since multiple transformations could take advantage of the same preprocessed system instead of reprocessing it
Constitutes a bit of a hybrid approach to the question of FF/engine on nodes vs. edges
Comes at the cost of multiplied graph complexity
DD – can make a diagram of this to communicate it and explore it more easily
|
Conclusion: We considered separately the scenarios of placing the choice of forcefield as a property of the nodes, edges, or entire graph of an alchemical network. Despite all three cases being workable in principle, the most natural fit for this information appeared to be the edges, as part of the protocol an edge defines. This places the burden of choosing how and when to parameterize the systems represented by two nodes on the edge protocol. This also avoids a proliferation of nodes with differing forcefields, or unconnected but largely identical graphs with corresponding nodes that differ only in choice of forcefield. It also yields a more natural result for connecting nodes with edges featuring experimental binding free energy data, for which force field choices on the nodes have no meaning. |
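
To make the conclusion concrete, here is a minimal sketch of a network where the force field lives on the edge, inside the protocol. Every name here (`Node`, `Protocol`, `Edge`, `forcefields_used`, `build_example`, and the `engine` and `experimental_ddg` fields) is hypothetical and was not defined in the meeting; the force field is stored as a plain name only for simplicity, and the experimental value is a placeholder, not real data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A chemical system with no force field attached (hypothetical class)."""
    protein: str
    ligand: str


@dataclass(frozen=True)
class Protocol:
    """What an edge needs to run a transformation (hypothetical class)."""
    forcefield: str  # assumption: stored as a name; other representations exist
    engine: str      # assumption: a label for the simulation engine


@dataclass(frozen=True)
class Edge:
    """A connection between two nodes.

    `protocol` is None for an edge that only carries experimental data,
    where a force field choice has no meaning.
    """
    start: Node
    end: Node
    protocol: Optional[Protocol] = None
    experimental_ddg: Optional[float] = None


def forcefields_used(edges):
    """Return the sorted force field names found on computed edges."""
    return sorted({e.protocol.forcefield for e in edges if e.protocol is not None})


def build_example():
    """Build two nodes joined by two computed edges and one experimental edge."""
    a = Node(protein="protein_X", ligand="ligand_1")
    b = Node(protein="protein_X", ligand="ligand_2")
    edges = [
        Edge(a, b, protocol=Protocol(forcefield="openff-1.3.0", engine="engine_A")),
        Edge(a, b, protocol=Protocol(forcefield="gaff-2.11", engine="engine_A")),
        Edge(a, b, experimental_ddg=0.0),  # placeholder value, not real data
    ]
    nodes = {e.start for e in edges} | {e.end for e in edges}
    return nodes, edges
```

In this sketch, comparing two force fields adds edges between the same two nodes rather than duplicating the nodes, and the experimental edge attaches to those same nodes without needing any force field. It does not address the open questions above, such as how an edge picks among multiple structures on a node.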