Participants
Goals
- alchemiscale.org
  - DD : new proposed stack versions for alchemiscale.org; awaiting results of QA tests on gufe + openfe 1.0rc1
- JS : automated restarts at the alchemiscale/OpenFE level
- JW : benchmark OpenFF release candidates; what are the steps to do this well?
Discussion topics
Notes
alchemiscale.org
DD : new proposed stack versions for alchemiscale.org; awaiting results of QA tests on gufe + openfe 1.0:
- alchemiscale : 0.4.0
- neo4j : 5.16
- gufe : 1.0
- openfe : 1.0
- feflow : main
- openmmforcefields : 0.12.0
DD – Some issues; IK and I are working on it.
IA – Which QA tests are you running?
MH – Tests we're running to make sure the RC is good.
IA – I was under the impression that we'd run testing on alchemiscale with the whole PLB.
MH – Right, but that's blocked because of the issues that DD is working on: https://github.com/openforcefield/alchemiscale/pull/254. So we can't test on alchemiscale because we can't deploy on alchemiscale. But once that's resolved we can run the whole benchmark.
DD – Right, once MH has deployment figured out, he'll tell me and I'll spin up workers using that image.
JS – Automated restarts at the alchemiscale/OpenFE level. We've been running a good amount of compute through alchemiscale, both in low and high volume. I've noticed that 5% of tasks end up in error status. It's always the same error (something with OpenMM, a random numerical error).
IA – Is this the NaN error?
JS – Yes. We've built tooling around this to restart NaNs using the CLI, and this generally results in the tasks completing successfully. I remember we've talked about built-in restarts already; is there machinery to handle this, or could I request it?
DD – I believe, the way we're using compute services, when we call execute_DAG we have n_retries set to 3 (and that's in the deployment). That means, at the gufe layer, execute_DAG will retry 3 times, though that's in the same process on the same worker, so if it's a problem with the machine it will trip all three retries.
IA – Generally we don't see a single worker/process consistently failing. There are cases where, if you have really bad nodes, you'll get runs failing consistently, but that's really rare.
JS – From what I see, resetting the tasks from error status generally succeeds immediately, so I think this means the node had a bad GPU or something else wrong with the machine.
IA – That makes sense. It could be good to track failures from different nodes and stop sending them jobs if they keep failing. Also, are you setting the platform directly? If you don't set it, a node with a misconfigured CUDA install could be selecting OpenCL. Is it possible to kill a worker if it fails multiple times?
JS – Is platform selection handled by alchemiscale?
DD – No.
IP – Platform selection should be on the openmm/openmmtools side. But maybe what we want to do on the alchemy side is to force a platform.
JS – Hm, do we want to force CUDA then?
IP – Unsure; then you can't use clusters without CUDA.
MH – Maybe it should be set in the protocol settings.
IA – For us, our default is that anything with solvent gets CUDA, and anything in vacuum gets CPU.
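For reference, a minimal sketch of the retry behaviour DD describes at the gufe layer. It assumes gufe's execute_DAG takes shared_basedir, scratch_basedir, and n_retries arguments as in recent releases, and that protocol_dag is a ProtocolDAG built elsewhere (e.g. from a Protocol.create(...) call). Note the retries happen in-process on the same worker, which is why a bad GPU exhausts all attempts.

```python
from pathlib import Path

from gufe.protocols import execute_DAG

# Working directories for the execution; execute_DAG places unit
# subdirectories under these.
shared = Path("shared"); shared.mkdir(exist_ok=True)
scratch = Path("scratch"); scratch.mkdir(exist_ok=True)

# `protocol_dag` is assumed to be a gufe ProtocolDAG created elsewhere.
result = execute_DAG(
    protocol_dag,
    shared_basedir=shared,
    scratch_basedir=scratch,
    raise_error=False,  # record failures in the result instead of raising
    n_retries=3,        # the in-process retry count used in deployment
)
print(result.ok())  # True only if every unit eventually succeeded
```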
JS – OK, I'll work with JH to change our submissions so that things run on CUDA. And if I can get more data, I'll share it on the issue tracker.
DD – If this continues being troublesome, we could work on adding functionality to handle task errors more elegantly, and to try various workers before entering error state.
JS – That sounds good. Also, it could be good to monitor failure rates on different nodes and avoid sending tasks to bad ones.
DD – We could make a registry in the central server to keep a history of nodes and not send jobs to bad nodes.
IA – Another issue that's been affecting folks is that the forward/reverse analysis can fail if you have a large tau… This should be fixed in openFE 1.0, in case you've been encountering that.
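A sketch of what the restart tooling JS described could look like against the alchemiscale Python client rather than the CLI. The client method names (get_network_tasks, set_tasks_status) and the ScopedKey string are assumptions/placeholders and may differ from the deployed client API.

```python
from alchemiscale import AlchemiscaleClient, ScopedKey

# Placeholder credentials and network key -- replace with real values.
asc = AlchemiscaleClient(
    "https://api.alchemiscale.org", "my-identity", "my-key"
)
network_sk = ScopedKey.from_str(
    "AlchemicalNetwork-abc123-my_org-my_campaign-my_project"
)

# Gather tasks that hit the sporadic OpenMM NaN error and set them back to
# 'waiting' so compute services pick them up again, hopefully on a
# different node.
errored = asc.get_network_tasks(network_sk, status="error")
if errored:
    asc.set_tasks_status(errored, "waiting")
```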
JS will experiment with directly setting the platform and will report back; success here could motivate server-side retry logic. https://github.com/choderalab/asapdiscovery/issues/905
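If the platform is forced per MH's suggestion, one place to do it is the protocol settings. A sketch assuming openfe's RelativeHybridTopologyProtocol exposes an engine_settings.compute_platform field (the exact attribute path may differ between openfe versions):

```python
from openmm import Platform
from openfe.protocols.openmm_rfe import RelativeHybridTopologyProtocol

# Sanity-check that CUDA is actually available on this node; raises if not.
Platform.getPlatformByName("CUDA")

# Pin the protocol to CUDA instead of letting OpenMM auto-select, so a node
# with a misconfigured CUDA install fails loudly rather than silently falling
# back to OpenCL/CPU. (Vacuum-only calculations would still want CPU, per
# IA's note above.)
settings = RelativeHybridTopologyProtocol.default_settings()
settings.engine_settings.compute_platform = "CUDA"
protocol = RelativeHybridTopologyProtocol(settings)
```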
JW : benchmark OpenFF release candidates; what are the steps to do this well?
MO : was going to run the 2.2 benchmark part; Hannah is running 2.1; want to keep settings the same.
JW : unsure if this is the right forum; however, I want to make sure that we don't end up with an old openff-forcefields package deployed on alchemiscale.org.
IA : sounds like we also want to get gufe + openfe 1.0 deployed first?
IP : do you plan to use the Protein+Ligand benchmarks?
JW : on a technical level, is it possible for Meghan to run Sage with the latest 2.1, 2.2, etc.?
MH + DD : you can inject custom FFs into a Protocol's settings; we have used this functionality before (see the sketch below).
IA : we also need to determine which systems in the PLB are suitable for this; that's what Hannah is finding. There's a bit of manual intervention required for the old PLB, so it may not be as easy as pushing a button and throwing compute at it.
JW : will resolve this with James and Meghan.
IA : I think your bottleneck is choosing which systems out of the PLB.
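On MH + DD's point about injecting custom force fields: a sketch of how a release candidate could be selected through a protocol's force field settings. The attribute path (forcefield_settings.small_molecule_forcefield) and the force field string are assumptions; whatever string is used must be resolvable by the openmmforcefields/openff-forcefields versions actually deployed on alchemiscale.org, which is exactly JW's concern about stale packages.

```python
from openfe.protocols.openmm_rfe import RelativeHybridTopologyProtocol

settings = RelativeHybridTopologyProtocol.default_settings()

# Hypothetical release-candidate name; it must be shipped by the deployed
# openff-forcefields package (or be a path to an .offxml file) to resolve.
settings.forcefield_settings.small_molecule_forcefield = "openff-2.2.0-rc1"

protocol = RelativeHybridTopologyProtocol(settings)
```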
DD : can we use #free-energy-benchmarks for discussion on this?
JS – Could be fun to have a leaderboard of who’s run the most on alchemiscale.
Action items
- David Dotson will perform a research cycle on a public leaderboard; consider the papermill approach used by ASAP
Decisions