2020-06-01 Core Devs Meeting notes

Date

Jun 1, 2020

Participants

  • @Jeffrey Wagner

Discussion topics

Notes

  • Roundtable check in

  • SB – The RESTful API and public hosting point for the nonbonded optimization are in good shape. They still need tests, polish, and provenance/dependency versioning, but are pretty mature.

  • DD – QCFractal maintenance mostly. PR reviews. Working on WBOs.

  • JS – Fixed a few bugs in taproom and paprika. Met with TSCC; they gave a few suggestions, but they don’t totally make sense to me.

    • SB – A heartbeat is when workers and managers periodically message each other to prove they’re alive. We do check whether workers go down and respawn them. If a job errors, we typically don’t restart it. We could add a “retries” argument, but that is risky because repeated failures may indicate that the job itself is bad.

    • DD – If jobs get stuck in I/O waits, they may hang for a long time. I see John had suggested using partd as an in-memory key-value store (basically a virtual file system). (A small partd sketch appears at the end of this thread.)

    • SB – We end up having a ton of data around, like half a terabyte of trajectories, so building our own virtual file system is going to be trouble. Sometimes workers have to pass files around, and it’s hard to control how they get connected.

    • DD – Is it possible to make the parallelization less granular, i.e. can we make interdependent tasks run on the same worker? And is there a way to do less aggressive checkpointing?

    • SB – Reducing granularity is possible. Protocols are one node in the execution graph, but there’s also a “ProtocolGroup”, which we could use to run things on the same node. However, Dask can be flaky with the output of tasks: if it loses workers, it will try to regenerate their data. I put in a hack to work around this. (A small sketch of the grouping idea appears at the end of this thread.)

    • SB – We should identify the architectures that we want to target for multi-site parallelization, since data locality will be important. If it will all be on PRP/OSG, then we can build according to certain constraints; but if it will be on a bunch of individual lab clusters, then we need a different solution. There are also some questions about simulation caching.
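
    • A rough sketch of the partd suggestion above (illustrative only, not Evaluator code): partd exposes an append/get key-value interface over bytes, with in-memory (partd.Dict) and file-backed (partd.File) backends that could stand in for passing files between workers. The key name is made up for illustration.

      import partd

      store = partd.Dict()                      # in-memory backend
      store.append({"frame-0": b"coords "})     # append bytes under a key
      store.append({"frame-0": b"more coords"}) # appends accumulate
      print(store.get("frame-0"))               # b'coords more coords'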
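
    • A rough sketch of the coarser-granularity idea above (illustrative only, not Evaluator code): wrap a group of interdependent protocols in one function so Dask schedules them on a single worker and only the final result leaves that worker. The protocol functions here are hypothetical stand-ins, not real protocols.

      from dask.distributed import Client, LocalCluster

      # Hypothetical stand-ins for individual protocols (coordinate building,
      # minimisation, simulation); real protocols would do actual work.
      def build(system):
          return {"system": system, "coords": "built"}

      def minimise(state):
          state["coords"] = "minimised"
          return state

      def simulate(state):
          state["trajectory"] = "traj.dcd"
          return state

      def run_group(system):
          # All steps execute inside one Dask task, so intermediates never
          # cross the network; only the final dictionary is shipped back.
          return simulate(minimise(build(system)))

      if __name__ == "__main__":
          client = Client(LocalCluster(n_workers=2))
          print(client.submit(run_group, "ethanol-in-water").result())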

  • JH – Worked on QCSubmit. One of the first datasets just went in; working today on fixing some spelling mistakes in it. Setting up testing: my GHA tests now create a snowflake server.

    • DD – Would love to talk about this. We could integrate retry logic and a few other things from my submission script. (A small retry sketch appears below.)
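
    • A minimal sketch of the retry idea (hypothetical helper, not DD’s actual submission script): wrap whatever call performs the submission and re-attempt it a few times with a back-off before giving up.

      import time

      def submit_with_retries(submit_fn, *args, max_retries=3, delay=5.0, **kwargs):
          """Call submit_fn, retrying on failure with exponential back-off."""
          for attempt in range(1, max_retries + 1):
              try:
                  return submit_fn(*args, **kwargs)
              except Exception as error:
                  if attempt == max_retries:
                      raise
                  print(f"Attempt {attempt} failed ({error}); retrying in {delay}s")
                  time.sleep(delay)
                  delay *= 2

      # Usage with any callable that performs a submission, e.g.:
      # submit_with_retries(dataset.submit, client, max_retries=5)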

  • MT – Worked a bit on QCSubmit; enjoyed testing JH's script that gets MM energies. There are some dependencies on the MM energy evaluator. Learned a lot about GHA. Getting pulled in a lot of directions, so I’m not sure about the API yet. Made the openff-system repo on GitHub to start prototyping. The constructor takes in an OpenFF Topology and ForceField, but I’m not sure how exactly we should interface with/expand on those objects. (A rough constructor sketch appears at the end of this item.)

    • All – We’re happy to iterate quickly on prototypes/APIs right now, even if they’re not perfect.

    • JW – I’m happy to be involved in debugging and extending/replacing current Toolkit Molecule and Topology classes.

    • SB – Should reach out to Andrea, Levi, and Bas to get requirements for FE setups. Hannah Bruce Macdonald and Dominic Rufa can also share some ideas, but they’re doing relative rather than absolute free energy calculations.
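
    • A rough sketch (illustrative only, not the actual openff-system code) of a prototype whose constructor takes a Topology and ForceField, as described above; the class name, fields, and from_toolkit method are assumptions for illustration.

      from dataclasses import dataclass, field
      from typing import Any, Dict

      @dataclass
      class System:
          """Prototype container; names and fields are illustrative only."""

          topology: Any = None      # would hold an OpenFF Topology
          forcefield: Any = None    # would hold an OpenFF ForceField
          handlers: Dict[str, Any] = field(default_factory=dict)

          @classmethod
          def from_toolkit(cls, topology, forcefield):
              # Parametrisation would happen here, e.g. applying the force
              # field's parameter handlers to the topology.
              return cls(topology=topology, forcefield=forcefield)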

  • JW – Fixed the proline SMARTS issue by removing the chemper file, which unblocked AMBER FF porting. Almost done with the chargeIncrementHandler (the PR is too large).

  • Sprint planning

    • New tasks

      • 1.2.0 release

      • TSCC/Lustre/pAPRika debugging

      • Dask errors where workers lose their data? (somewhat nebulous, hard to reproduce)

        • SB will make an Evaluator issue for this

Action items

Decisions