2022-03-18 JW | DD : FF on nodes vs. edges, pros and cons

Participants

Goals

For the following options, brainstorm pros and cons:
- forcefield information on nodes of an alchemical graph
- forcefield information on edges of an alchemical graph
- forcefield information on the alchemical graph itself

Discussion topics

Notes

DD – The third option above is a new idea that may help us identify the effects of connections. I want to give the working group the raw material for a decision. So our goal today isn’t to make the decision, but rather to list the pros and cons. Then we can make a decision at the meeting (even if we say “this is what we’ll assume for now”).
JW – I’m going to try to argue that any choice of putting info on edges is a SUBSET of what could be achieved by putting it on nodes.
What about the case where we have the SAME (to a human) protein+ligand combination, but there are minor differences (atom types, particle numbers, etc.) that are required for it to be used with different FFs? Can one node represent multiple PDB files (or other structure definitions)? It probably can, so in any design, a human must be able to record in the data “these two unequal input files are the same thing”.
JW – Picking a good or bad design can add complexity on its own, but there’s some complexity that will be in here regardless of our design. What is that complexity, and how can we force it to live in one place so we can reason about it?
- Nonidentical systems that humans consider equal (“this atom-typed protein is the same as that chemical protein”, or “this protein-ligand system with explicit water is the same as that protein-ligand system that says ‘add a solvent box to me before running’”)
- Rerunning the same transformation, even though it ran successfully the first time
- Force fields can be represented in different ways (a name, a string containing all contents, a pickled object, …)
- Choice of alchemical protocol, and the mapping of the protocol to 1 or more simulations
- Choice of free energy estimator in evaluating data on a single transformation
- Choice of graph free energy estimator in evaluating data for transformations in graph
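
The point about force field representations can be made concrete with a minimal sketch, assuming a hypothetical registry that maps force field names to their full contents. Nothing here is an existing API: `KNOWN_FORCEFIELDS`, `fingerprint`, and the placeholder name and contents are invented for illustration, and pickled objects are not handled.

```python
import hashlib

# Hypothetical registry (assumption): force field name -> full contents.
# The name and contents are placeholders, not a real force field.
KNOWN_FORCEFIELDS = {
    "example-ff-1.0": "<ForceField>placeholder contents</ForceField>",
}


def fingerprint(forcefield: str) -> str:
    """Reduce a name or a contents string to one comparable hash.

    A string that is not a registered name is treated as contents.
    """
    contents = KNOWN_FORCEFIELDS.get(forcefield, forcefield)
    # Collapse whitespace so formatting differences do not change the hash.
    normalized = " ".join(contents.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


by_name = fingerprint("example-ff-1.0")
by_contents = fingerprint("<ForceField>placeholder   contents</ForceField>\n")
```

With something like this, a name and a contents string for the same force field compare equal, which is the kind of normalization any equality operator over FFs would need wherever the FF ends up living.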

forcefield information on nodes of an alchemical graph
- Pros
  - JW would argue that every other representation can be distilled down as a VIEW of a graph with FF info on nodes
  - Node definition more closely matches the concept of a “microstate”; includes atom identities, coordinates X and potential U(X)
  - If FF info is NOT on nodes, then all molecular representations in the graph MUST be chemical representations (no atom typed representations!).
  - Explicitly records human intent (“here’s a fake edge saying that these two input structures are the same” or “I’m giving these two nodes the same label because they represent the same chemistry, to me”)
- Cons
  - Node equality operator and view mechanism would need to learn how to compare FFs (which could be represented in a variety of formats)
  - A single network intended to benchmark many forcefields requires many more nodes, rather than just more edges
  - Would we really ever model the transformation for the same chemical system transitioning from one FF to another? This use case is not high-priority.
    - that could also be supported by an edge protocol that does exactly this
  - Does it even/ever make sense to draw an edge between e.g. a node with gaff-2.11 and openff-1.3.0?
  - If we did the same network of transformations in two FFs, there would be no edges between the nodes with one FF to the nodes with another. We’d need to make assumptions about how the networks relate to each other
    - DD – effectively becomes option (3) implicitly for the use case of FF comparison
    - JW – Counterpoint: I could imagine saying network.view(identity=('protein','ligand')), and the nodes that use DIFFERENT ffs but the SAME protein+ligand (handwaving) would get merged in the final view I receive.
    - DD – I see. In form it looks similar to the third option. But this option allows the two graphs to be connected.
    - DD – Would it be helpful/necessary to have edges that represent “identity”?
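
JW’s network.view(identity=('protein','ligand')) idea could look roughly like the sketch below. This is only an illustration under assumptions: the `Node` fields, the free function `view`, and the protein/ligand labels are hypothetical and do not come from any existing package.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical node that carries its own forcefield (option 1)."""
    protein: str
    ligand: str
    forcefield: str


def view(nodes, identity=("protein", "ligand")):
    """Merge nodes that agree on the `identity` attributes, ignoring the rest."""
    merged = defaultdict(list)
    for node in nodes:
        key = tuple(getattr(node, name) for name in identity)
        merged[key].append(node)
    return dict(merged)


# Placeholder systems: the same protein+ligand appears with two different FFs.
nodes = [
    Node("protein_x", "lig_a", "gaff-2.11"),
    Node("protein_x", "lig_a", "openff-1.3.0"),
    Node("protein_x", "lig_b", "openff-1.3.0"),
]
merged = view(nodes)
```

Here the two "lig_a" nodes collapse into one entry of the view, while adding 'forcefield' to the identity tuple would keep all three nodes separate.
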
forcefield information on edges of an alchemical graph
- Pros
  - JW – Human intent (eg “these two nonidentical input files (one with atom types and the other with elements+bond orders) are actually equal as far as I’m concerned”) is unambiguously recorded when the user adds two different structure files to the same node
    - because we’ve chosen to put the FF on the edges, it affords us more flexibility in defining the system on the node
    - If there are multiple structures on a node, the edges need to know which structure they should use to start simulations
  - Ties the choice of FF to the alchemical protocol, and therefore the dynamics, between two systems
  - Because it puts the FF alongside the protocol, allows for protocols that morph between more than one FF
- Cons
  - Due to separation of FF from chemical system on node, chemical system on node must be sufficiently general for parameterization into different FFs and engines as part of execution
    - this isn’t necessarily a bad thing, but may present challenges not present with associating the FF with the node
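
A rough sketch of option 2, assuming hypothetical `Protocol` and `Edge` classes and made-up file names: the FF lives in the edge’s protocol, a node may hold several structures that a human has declared equivalent, and the edge records which structure it starts from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protocol:
    """Hypothetical protocol: the FF is recorded here, not on the node."""
    name: str
    forcefield: str
    structure_key: str  # which of the node's structures to start from


@dataclass(frozen=True)
class Edge:
    start: str
    end: str
    protocol: Protocol


# Placeholder nodes, each holding several structure files declared equivalent.
node_structures = {
    "complex_a": {"chemical": "complex_a.sdf", "atom_typed": "complex_a.prmtop"},
    "complex_b": {"chemical": "complex_b.sdf", "atom_typed": "complex_b.prmtop"},
}

# Two edges between the same pair of nodes, differing only in protocol.
edges = [
    Edge("complex_a", "complex_b", Protocol("relative_binding", "openff-1.3.0", "chemical")),
    Edge("complex_a", "complex_b", Protocol("relative_binding", "gaff-2.11", "atom_typed")),
]


def starting_structure(edge: Edge) -> str:
    """Look up the structure file this edge should start simulations from."""
    return node_structures[edge.start][edge.protocol.structure_key]
```

Benchmarking a second FF adds an edge rather than a second copy of every node.
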
forcefield information on the alchemical graph itself
- Pros
  - None of the cons of either idea above, because FF info isn’t on either nodes or edges
- Cons
  - If we’re comparing FFs, then we can’t have a data structure that connects the two graphs; we have to make assumptions about how the nodes/edges of one graph map to the nodes/edges of another.
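
To illustrate the con for option 3, here is a minimal sketch in which each graph carries one FF and the correspondence between graphs has to be kept in a separate, hand-maintained mapping. The `Graph` class, node names, and `node_map` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Hypothetical graph-level FF: one forcefield for every node and edge."""
    forcefield: str
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)


gaff = Graph("gaff-2.11", {"g_a", "g_b"}, {("g_a", "g_b")})
openff = Graph("openff-1.3.0", {"o_a", "o_b"}, {("o_a", "o_b")})

# The assumption that lives OUTSIDE both graphs: which nodes correspond.
node_map = {"g_a": "o_a", "g_b": "o_b"}


def corresponding_edges(source: Graph, target: Graph, mapping: dict) -> list:
    """Pair each source edge with its counterpart in target, if one exists."""
    pairs = []
    for start, end in sorted(source.edges):
        mapped = (mapping[start], mapping[end])
        if mapped in target.edges:
            pairs.append(((start, end), mapped))
    return pairs
```

Any FF comparison depends on `node_map` being correct, and nothing inside either graph records or checks it.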

DD brainstorming notes

I am still leaning toward FF being a property of an edge; despite the challenges this presents for working generally across FFs, MD engines, and chemical systems, I think it’s worth driving in this direction because it avoids a lot of the contortions we have to do with the other approaches (contortions that indicate to me we’ve chosen a less optimal abstraction than we could have). I’m happy to be challenged on this though.

JW brainstorming notes

Let’s define two terms:

A “starting point” - This is the concept of “a particular protein bound to a particular ligand in a particular pose”
A “structure object” is a specific data object that has one representation

We have a choice - Either

A “starting point” complex must be represented by one “structure object” ever, and
- all possible edges that could involve that node must be able to load that structure
  - Consequence: All “structure objects” must be chemical representations, because OpenFF can’t read anything else
  - Consequence: It will be extremely hard to manually intervene to resolve the very probable, very complex structure processing issues that occur in the edges. If manual intervention is allowed and becomes commonplace, there will be large data consistency/reproducibility issues. We must commit fully to the concept that technical issues ARE scientific issues if we move forward with this design.
A “starting point” complex can be represented by multiple nonidentical “structure objects”, and
- many “structure objects” are assumed to be identical if they belong to the same node, OR
- a node can only ever hold one “structure object” AND
  - Special edges must be added to the network to indicate that two nodes have identical “structure objects”, OR
  - each “structure object” has metadata which can be used to establish its identity and relation to other “structure objects”
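
The “special edges” variant above can be sketched as follows, assuming one “structure object” per node and hypothetical identity edges between nodes. The “starting points” then fall out as connected components. The node names are placeholders, and the union-find helper is a generic technique, not something decided in this meeting.

```python
def find(parent: dict, x: str) -> str:
    """Return the representative of x's group, compressing the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def starting_points(nodes, identity_edges):
    """Group nodes connected by identity edges into "starting points"."""
    parent = {n: n for n in nodes}
    for a, b in identity_edges:
        parent[find(parent, a)] = find(parent, b)
    groups = {}
    for n in nodes:
        groups.setdefault(find(parent, n), set()).add(n)
    return sorted(groups.values(), key=min)


# Placeholder nodes, each holding exactly one "structure object".
nodes = ["a_sdf", "a_prmtop", "a_pdb", "b_sdf"]
# A human asserts these pairs are the same "starting point".
identity_edges = [("a_sdf", "a_prmtop"), ("a_prmtop", "a_pdb")]

groups = starting_points(nodes, identity_edges)
```

Identity is transitive here: "a_sdf" and "a_pdb" end up in the same group even though no edge links them directly.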

Will resume at 11 AM Pacific (1 hour)

(General) – There’s information that we hadn’t talked about: the assumptions we make during the transformation from a node’s single “canonical” structure representation to its “input file for a particular engine”.
- This info can live on the node (basically each node has multiple structures which are input files for different MD engines, OR each node has exactly one structure, and each MD engine might need to talk to one of many nodes that all represent the same chemical system)
- This info can live on the edge (each edge is a complete soup-to-nuts workflow)
- This info can live in the graph, if we allow edges that document assumptions made during structure prep, and nodes that are very similar to each other (differing only in structure prep, for example)
- DD – envision that an edge “has a” protocol, and that protocol encodes a DAG of operations that each have their own input parameters
  - if you want to run the same protocol on two nodes but with one parameter changed, you can do that.
  - should also be able to “diff” edges of the same protocol type for what input param choices differ

Summary: So, when we talk about “where should the FF go in the graph”, we’re really talking about something bigger: “Where does all the logic that prepares a system for a simulation using a specific FF go in the graph?”. The options are:

In the nodes (multiple nodes per “starting point”, each making assumptions for a different simulation engine/protocol):
- Pro
  - System prep assumptions are extremely specific, which makes it easier to compare the effect of switching only the FF and nothing else
- Con
  - There will need to be additional information stored somewhere to say which nodes represent the same system, like labels on nodes saying “I’m actually one representation of this starting point” or special edges indicating “these nodes represent the same starting point”
In the edges (DD and JW advocate this choice):
- Pro
  - This truly measures the performance of an entire workflow
- Con
  - This doesn’t allow someone to isolate the effects of choosing a particular FF, since assumptions about structure prep/conversion are also included in the edge definition
  - We want a solution that minimizes and consolidates complexity. There are two sources of complexity: “inherent complexity” due to the science of what we’re doing, and “design complexity” that arises from the choice of architecture. If edges can be seen as a function like pmx_workflow(node1, node2, ff, ...), how many kwargs should that function have? A generous “put the complexity in the kwargs” philosophy puts the “scientific complexity” on the shoulders of developers (which it should!), but it also puts some amount of “design complexity” on them, by forcing the workflow to squeeze through the lens of “a Python function with kwargs that are specific to a package version, requiring us to start maintaining an API and worrying about backward compatibility...”
  - It will be extremely hard to manually intervene to resolve the very probable, very complex structure processing issues that occur in the edges. If manual intervention is allowed and becomes commonplace, there will be large data consistency/reproducibility issues. We must commit fully to the concept that technical issues ARE scientific issues if we move forward with this design. So we may need to release entirely new versions of workflows to solve very simple bugs.
Implicit in the graph itself
- Pro:
  - Reduces the complexity stored in the nodes and edges themselves; one can analyze a whole graph with the understanding that certain assumptions were made throughout.
- Con:
  - Two networks, each calculating dGs between the same complexes but using different protocols, will not have a clear mapping from the nodes of one to the nodes of the other. That mapping will need to be stored externally.
One idea is to have clusters of nodes (imagine planet + moons) where a central node encodes the general form of a chemical system and the “moon” nodes encode the more engine/protocol specific version. Moon nodes from different clusters could be connected by transformation edges, which feature the protocol details that will operate on the moon nodes to execute the alchemical transformation.
- we think this could be a useful concept to try if we get stuck with the simpler approach, in which all engine/protocol-specific processing happens as part of the edge protocols themselves.
- could also function as an optimization of the simpler approach, since multiple transformations could take advantage of the same preprocessed system instead of reprocessing it
- constitutes a bit of a hybrid approach to the question of FF/engine on nodes vs. edges
- comes at the cost of multiplied graph complexity
- DD – can make a diagram of this to communicate it and explore it more easily
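
Until there is a diagram, the planet + moons idea can be sketched in code. This assumes a hypothetical `Cluster` class in which the planet is the general chemical system and each moon is keyed by (engine, forcefield). The engine and system names are placeholder labels, and “preparing” a moon is reduced to building a string.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical planet + moons cluster for one chemical system."""
    planet: str                                 # general form of the system
    moons: dict = field(default_factory=dict)   # (engine, forcefield) -> prepared system
    preparations: int = 0                       # how many times prep actually ran

    def moon(self, engine: str, forcefield: str) -> str:
        """Return the engine/FF-specific system, preparing it only once."""
        key = (engine, forcefield)
        if key not in self.moons:
            self.preparations += 1
            self.moons[key] = f"{self.planet}|{engine}|{forcefield}"
        return self.moons[key]


a, b, c = Cluster("complex_a"), Cluster("complex_b"), Cluster("complex_c")
settings = ("openmm", "openff-1.3.0")

# Transformation edges connect moons from different clusters.
edges = [
    (a.moon(*settings), b.moon(*settings)),
    (a.moon(*settings), c.moon(*settings)),
]
```

Both edges start from the same moon of complex_a, so that system is prepared once and reused, which is the optimization mentioned above. The cost is the extra moon nodes in the graph.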

Conclusion:

We considered separately the scenarios of placing the choice of forcefield as a property of the nodes, edges, or entire graph of an alchemical network. Despite all three cases being workable in principle, the most natural fit for this information appeared to be the edges, as part of the protocol an edge defines.

This places the burden of choosing how and when to parameterize the systems represented by two nodes on the edge protocol. This also avoids a proliferation of nodes with differing forcefields, or unconnected but largely identical graphs with corresponding nodes that differ only in choice of forcefield. It also yields a more natural result for connecting nodes with edges featuring experimental binding free energy data, for which force field choices on the nodes have no meaning.
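
As a small illustration of the last point, here is a sketch in which computed and experimental edges connect the same two nodes, and only the computed edges carry a force field. The `Edge` fields and node names are hypothetical, and no free energy values are included.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Edge:
    start: str
    end: str
    kind: str                   # "computed" or "experimental"
    forcefield: Optional[str]   # None for experimental edges


edges = [
    Edge("complex_a", "complex_b", "computed", "gaff-2.11"),
    Edge("complex_a", "complex_b", "computed", "openff-1.3.0"),
    Edge("complex_a", "complex_b", "experimental", None),
]

# Nodes are FF-free, so all three edges share the same two nodes.
nodes = {name for edge in edges for name in (edge.start, edge.end)}

by_pair = defaultdict(list)
for edge in edges:
    by_pair[(edge.start, edge.end)].append(edge)
```

Because the nodes carry no FF, the experimental edge attaches to them directly, with no need to decide which FF-specific copy of a node it belongs to.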

Action items

Decisions