2024-04-23 alchemiscale : user group meeting notes

Participants

  • @David Dotson

  • Ian Kenney

  • @James Eastwood

  • Jenke Scheen

  • Meghan Osato

  • @Jeffrey Wagner

  • @Mike Henry

  • @Matt Thompson

Goals

  • alchemiscale.org

    • compute resources status

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • DD : working on testing MIG splitting on Lilac A100

  • JE – Soft announcement that OpenFE 1.0 is released. Still needs a little testing but the release is out in the wild.

  • JW – Feasibility of trying to run PLB set before annual meeting to show “hey, look what we can do”? I have low certainty about correctness and want to see if this would be more trouble than it’s worth.

Discussion topics

Notes

  • alchemiscale.org

    • user questions / issues / feature requests

      • JS – Could we have this segment pre-populated/have folks add their questions ahead of time?

        • DD – Yes, could do that.

      • JS – Update on charging issue? https://openfreeenergy.slack.com/archives/C042D5STZV1/p1712900331935479

        • DD – Right, still looking into this. I haven’t been able to reproduce it: the calc passes on vulkan, and it passes when I run the worker image locally. My suspicion (yet to be confirmed) is that NRP has some weird hosts, and docker containers on those odd nodes misbehave. So I’m trying to tease apart which clusters/nodes the bad runs are happening on. If that turns out to be the case, I’ll restrict the deployment to avoid those nodes. I do have an open issue (alchemiscale #258) to keep bad nodes from erroring out all the tasks in a network.

        • JS – In a network of 270 tasks, only 13 succeeded and the rest went to an error state with that specific error. Then we resubmitted, and 90 succeeded while the rest errored.

        • DD – Ok. This is consistent with having several decent workers but one bad one erroring out lots of jobs. I’ll error cycle the current runs and see how things behave (a rough sketch of error cycling is at the end of this thread).

        • JS – One other observation I’ve made is that JAK2 isn’t getting this error at all, though that target has a different chemotype than usual.

        • ….

        • (see recording ~25 mins)

        • DD – I’ll try restricting this to not run on bad nodes and will share results.

        • JS – That works for me, if we can resolve in the next two weeks that’d be ideal.

        • DD – Yes, that sounds good. I can prioritize getting your calcs done as well since I know they’re important.
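
        • Not from the meeting: a minimal sketch of what “error cycling” a network could look like from the Python client. It assumes the AlchemiscaleClient methods below (query_networks, get_network_tasks, set_tasks_status) behave as in the current docs; the endpoint, credentials, and network choice are placeholders.

```python
# Hedged sketch: re-queue a network's errored tasks so that healthy workers
# can pick them up again. All identifying values here are placeholders.
from alchemiscale import AlchemiscaleClient

asc = AlchemiscaleClient(
    "https://api.alchemiscale.org",  # API endpoint
    "my-user-identifier",            # placeholder user identifier
    "my-user-key",                   # placeholder user key
)

# take the first network visible in our scopes (placeholder choice)
an_sk = asc.query_networks()[0]

# pull only this network's errored tasks and set them back to 'waiting'
errored = asc.get_network_tasks(an_sk, status="error")
asc.set_tasks_status(errored, "waiting")

print(f"re-queued {len(errored)} tasks on {an_sk}")
```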

    • compute resources status

      • Mostly working on a large living network for ASAP, and a few confidential ones.

      • Almost at 100k complete tasks in the whole lifetime of alchemiscale.org (just turned one year old!). Used about 500k GPU-hours.

      • Iris is still down for maintenance, but we’re running on Lilac and NRP.

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

    • Future stack versions

      • Will update OpenFE and gufe. If possible, will deprecate perses and integrate feflow (need to coordinate with the devs).

  • DD : working on testing MIG splitting on Lilac A100s. A100s can be split up to 7 ways; I’m working with MSKCC HPC staff and we’re going to test performance soon. I’ll report on results and we can make a decision on whether we want to do more MIG splitting on Iris (they don’t want to do it on Lilac since it’s being decommissioned). See the sketch below for the kind of split being tested.
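
    • Not from the meeting: a hypothetical sketch of the kind of 7-way split being tested, driving nvidia-smi from Python. The GPU index and the 1g.5gb profile ID (19 on an A100-40GB) are assumptions that depend on the card and driver.

```python
# Hedged sketch: split A100 GPU 0 into seven 1g.5gb MIG instances by shelling
# out to nvidia-smi. Requires root on the node; profile IDs vary by card.
import subprocess


def run(cmd: list[str]) -> None:
    # echo each admin command before executing it
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)


# enable MIG mode on GPU 0 (may need a GPU reset to take effect)
run(["nvidia-smi", "-i", "0", "-mig", "1"])

# create seven 1g.5gb GPU instances (profile 19 on an A100-40GB);
# -C also creates the default compute instance inside each GPU instance
run(["nvidia-smi", "mig", "-i", "0", "-cgi", ",".join(["19"] * 7), "-C"])

# list the resulting GPU instances to confirm the split
run(["nvidia-smi", "mig", "-lgi"])
```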

  • JE – Soft announcement that OpenFE 1.0 is released. Still needs a little testing but the release is out in the wild.

    • DD – Gotcha, we can start rolling this out into runner images in the next week. No estimate for difficulty yet: it might be easy, or it might be hard and require lots of fixes/rework for previous work.

    • JW – IIRC, we’ve decided that backward compatibility isn’t a guarantee during this early development. So if it’d be a lot of work to integrate old results into the new server, it’d be fine to dump the current state to something read-only and start from a clean slate moving forward.

    • DD – Right, we don’t have a data lifecycle defined yet. I think we have an open issue for archival-style export (#246). We’re planning to address this in the next major release (it might just be a recommendation/method for result archival); a sketch of the general idea is at the end of this item.

    • JW – Ok, as you move forward, if it looks like backward compatibility would be hugely painful, OpenFF is willing to cut a lot of slack in the interest of keeping forward velocity; just let us know.

    • DD – Could keep an old server hot in a read-only mode. I’ll let you know how this goes.

    • MH + JE – Full release announcement (on social media, etc.) may come some time next week. Doing final polishing/checking, like making sure the tutorial works.
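
    • Not from the meeting: a hypothetical sketch of the kind of archival-style export discussed around #246, walking every visible network and dumping each transformation’s results to disk. The client method names follow the current docs; the pickle-based on-disk layout is purely an assumption, not a planned format.

```python
# Hedged sketch: read-only archival pass over all networks visible to a user.
# Endpoint, credentials, and the on-disk layout are placeholders.
import pathlib
import pickle

from alchemiscale import AlchemiscaleClient

asc = AlchemiscaleClient(
    "https://api.alchemiscale.org",  # API endpoint
    "my-user-identifier",            # placeholder user identifier
    "my-user-key",                   # placeholder user key
)

archive = pathlib.Path("alchemiscale-archive")
archive.mkdir(exist_ok=True)

for an_sk in asc.query_networks():
    net_dir = archive / str(an_sk)
    net_dir.mkdir(exist_ok=True)
    for tf_sk in asc.get_network_transformations(an_sk):
        # results for completed work on this transformation; pickle is just
        # a stand-in serialization for the sketch
        results = asc.get_transformation_results(tf_sk)
        with open(net_dir / f"{tf_sk}.pkl", "wb") as f:
            pickle.dump(results, f)
```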

  • JW – Feasibility of trying to run the PLB set before the annual meeting to show “hey, look what we can do”? I have low certainty about how the upgrade will go and want to see if this would be more trouble than it’s worth. Right now, when we put together presentations/posters, the FE calcs cover only a limited set of FFs/targets. But I’d love to have a single massive study that shows all targets on all force fields.

    • MH – Might be possible with a single target, but not with all targets.

    • DD – Once we have OpenFE and gufe 1.0 integrated and compute on F@H, this might be really feasible.

    • JW – Thanks everyone for the feedback, I’m not going to push for action on this before the annual meeting, but I’d love to restart this conversation once we’re humming along later in the year.

    • JE – In the OpenFE benchmarking project, one big goal we’re discussing would be to have a NEW PLB set with contributions from industry partners. Lots of little steps involved, I’ll update you as this progresses.


Action items

Decisions