DD – Alchemiscale 104 - result path conversion and upload - Worked with DS and spotted a slight structural mismatch between alchemiscale and GUFE: alchemiscale only passes REFERENCES to objects in the object store between server and workers, whereas GUFE passes the actual objects around. So I think it makes sense to refactor alchemiscale to follow the GUFE conventions. Opened alchemiscale 180 with a plan for this.
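To illustrate the mismatch DD describes, here is a minimal sketch contrasting the two styles. All names (`OBJECT_STORE`, `ResultRef`, `run_by_reference`, `run_by_value`) are hypothetical and are not taken from the alchemiscale or GUFE APIs; the object store is assumed to be a plain in-memory dict.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for an object store.
OBJECT_STORE = {}


@dataclass(frozen=True)
class ResultRef:
    """A pointer to a stored result, not the result itself."""
    key: str


def run_by_reference(payload):
    """Worker stores the result and hands back only a reference."""
    key = f"results/{len(OBJECT_STORE)}"
    OBJECT_STORE[key] = payload
    return ResultRef(key)


def run_by_value(payload):
    """Worker hands back the result object itself."""
    return payload


ref = run_by_reference({"dG": -5.0})   # the caller must resolve this later
obj = run_by_value({"dG": -5.0})       # the caller already holds the data
resolved = OBJECT_STORE[ref.key]
```

In the reference style, the consumer needs access to the store to get at the data; in the value style it does not, which is the convention the proposed refactor would move toward.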
JC – All of this complexity could be resolved by making the protocol units short.
DS – DD isn’t quite right - We don’t upload as you go by default - instead we save all the files and only send a summary JSON around. The reason we do this is so that we can later do things in parallel.
JC – Two things:
DS – We decided that a single DAG in GUFE was going to be capable of statistics, so a single DAG can have 3 copies.
JC – So a DAG can have 3 parts, why would those need to be ordered sequentially? Why can’t you run it in parallel?
DS – This was discussed internally a long time ago and this has already been decided.
JW – I’d appreciate a description of this from the beginning.
DD – A protocol defines a ProtocolDAG by way of its settings. Executing a single ProtocolDAG should be able to get you a dG (DS: and uncertainty). For example, the PersesProtocol is enough to give you a dG, but it doesn’t guarantee that it’s converged. So there you’d run many ProtocolDAGs and stack them together to get a more accurate dG and uncertainty. I know that’s different from what OpenFE does. So the alchemiscale model is closer to the perses model, and the OpenFE model is different.
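As a rough illustration of "stacking" several ProtocolDAG results, the sketch below combines independent dG estimates into a mean and a standard error of the mean. This is an assumed, simple combination rule chosen for illustration, not necessarily what alchemiscale, perses, or OpenFE does; `combine_estimates` is a hypothetical helper, and the input values are made up.

```python
import math
import statistics


def combine_estimates(dgs):
    """Combine dG estimates from independent runs.

    Returns (mean, standard error of the mean). Assumes the runs are
    independent and equally weighted.
    """
    n = len(dgs)
    if n < 2:
        raise ValueError("need at least two estimates for an uncertainty")
    mean = statistics.fmean(dgs)
    # Sample standard deviation divided by sqrt(n).
    sem = statistics.stdev(dgs) / math.sqrt(n)
    return mean, sem


# Made-up values standing in for three ProtocolDAG results.
dg, err = combine_estimates([-5.1, -4.8, -5.3])
```

With this rule, adding more repeats shrinks the reported uncertainty roughly as one over the square root of the number of runs, provided the runs really are independent.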
JC – It sounds like, in OpenFE, a single DAG needs to run in a single task…. Is checkpointing related to running things in parallel?
DD –
… (recording, ~20 minutes)
JC – Some ways to do this - Keep a massive checkpoint file of everything, OR keep the state of everything at the end of each sim to initiate the next one. I think the latter is better, and requires just a little refactoring of how we do storage.
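The second option JC mentions can be sketched as follows: each simulation segment stores only its end state, and an extension starts from the last stored one. This is a toy model under stated assumptions; `EndState`, `run_segment`, and `extend` are hypothetical names, and the "simulation" is just an arithmetic stand-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndState:
    """Minimal state kept at the end of a segment (hypothetical)."""
    step: int
    positions: tuple


def run_segment(start, n_steps):
    """Toy stand-in for a simulation: shift each coordinate by 0.5 per step."""
    positions = tuple(x + 0.5 * n_steps for x in start.positions)
    return EndState(step=start.step + n_steps, positions=positions)


def extend(stored_states, n_steps):
    """Append one segment that starts from the last stored end state."""
    return stored_states + [run_segment(stored_states[-1], n_steps)]


states = [EndState(step=0, positions=(0.0, 1.0))]
states = extend(states, 100)   # first segment
states = extend(states, 100)   # extension, no full checkpoint needed
```

Only the small per-segment end states need to be kept in storage, rather than one large checkpoint covering everything.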
DD – I don’t recall that being a priority on the OpenFE end.
DS – Right, it’s something that’s possible for the roadmap next year.
JC – This would be useful for Perses - we’re having trouble converging some edges and being able to do this would really help.
DD – For actionable conclusions, I think alchemiscale 180 would be an improvement. The big points are that keeping these files will enable better error analysis, inspection of trajectories (like for science) and eventually extension for some protocols.
DD – I’d hoped to put this into 0.2…
JW – I approve adding this refactor to the 0.3 milestone, but think the F@H interface is more important.
JC – Agree and approve accordingly
DS – No veto from me - I’d just recommend not going too fast on this, since I need more time to work on the pieces that will interface with this/want to avoid duplicate work.