2024-05-07 alchemiscale : user group meeting notes

Participants

  • @David Dotson

  • Ian Kenney

  • @Irfan Alibay

  • @Mike Henry

  • @Jeffrey Wagner

  • Jenke Scheen

  • @Matt Thompson

  • Meghan Osato

  • @James Eastwood

Goals

  • alchemiscale.org

    • major milestone: we’ve crossed the 100,000 completed FEC mark!

    • compute resources status

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • IK : performing QA tests on openfe + gufe 1.0

  • DD : working on testing MIG splitting on Lilac A100

    • likely have to wait until after Leiden

Discussion topics

Notes

  • alchemiscale.org

    • major milestone: we’ve crossed the 100,000 completed FEC mark!

      • DD – About 0.5 million GPU-hours

      • JS – Another cool way to show the Y axis could be milliseconds of simulated time. Maybe some day in the future.

      • IA – Could assume each HREX sim is 5 or 6 ns (see the back-of-envelope sketch below).
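
      A back-of-envelope sketch of that conversion, in Python. The numbers are assumptions, not measurements: 5.5 ns/FEC averages IA’s 5–6 ns guess, and a single FEC may in practice comprise several simulations.

      ```python
      # Aggregate simulated time implied by the milestone (assumed numbers).
      n_fec = 100_000   # completed free energy calculations (the milestone)
      ns_per_fec = 5.5  # assumed average simulated ns per FEC (IA's 5-6 ns guess)

      total_ns = n_fec * ns_per_fec  # 550,000 ns
      total_ms = total_ns / 1e6      # 1 ms = 1,000,000 ns -> 0.55 ms

      print(f"{total_ns:,.0f} ns = {total_ms:.2f} ms of simulated time")
      ```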

    • compute resources status

      • DD – Not much in the queue currently.

      • JS – I just submitted a public set.

        • DD – I don’t see any jobs waiting, maybe they’re already running.

        • JS – Ah, right, I don’t see any waiting either; they’re all running. I do see a lot of errors from AmberTools about invalid characters in a file. The workers are on Iris. Maybe it’s the same problem as before?

        • DD – That problem was on NRP. Basically, once a single job failed, all subsequent jobs on that worker would fail too. I’ve changed the NRP workers to run a single task and then shut down to prevent this contagion (see the config sketch at the end of this thread). I think you can get through this by resubmitting the jobs repeatedly; some fraction will succeed each time. I’ve been unable to reproduce it locally.

        • JW+IA – Could using user-provided charges avoid this?

          • JS – That’s what I did with another set, and I still saw a 50% failure rate.

        • JW – Have the compute images been updated since the openmmforcefields/AmberTools change last week?

          • DD – No

        • MT (chat) – I agree with Irfan & Jeff’s assessment of what “should” happen, but you can never bet 100% that what should happen does happen. We could add an integration test that covers this case: when AmberTools is not installed (or is mocked to be broken/missing) and users bring their own charges, make sure the OpenMM system can still be built without antechamber/sqm/etc. being called (see the test sketch at the end of this thread).

        • DD – JS, please send me your error messages once you retrieve them.

      • IA – I’ll be submitting some networks to run with somewhat high urgency for a workshop demo. I need the results by next Thursday, ideally as soon as possible.
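
      A sketch of the single-task worker setup DD describes for NRP. The import paths, settings class, and field names here are all assumptions based on alchemiscale’s compute service; verify against the deployment docs before using this.

      ```python
      # Hypothetical single-task worker: claim one Task, execute it, then exit,
      # so one bad host state cannot poison subsequent Tasks. All names below
      # are assumptions to check against the alchemiscale deployment docs.
      from alchemiscale.compute.service import SynchronousComputeService
      from alchemiscale.compute.settings import ComputeServiceSettings  # assumed path

      settings = ComputeServiceSettings(
          api_url="https://compute.alchemiscale.org",
          identifier="my-compute-identity",  # hypothetical credentials
          key="my-compute-key",
          name="nrp-single-task-worker",
          shared_basedir="./shared",
          scratch_basedir="./scratch",
          max_tasks=1,    # execute a single Task, then shut down
          max_time=None,  # no wall-clock cap; the task limit triggers shutdown
      )

      SynchronousComputeService(settings).start()
      ```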
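
      And a rough pytest sketch of the integration test MT proposes, using the openff-toolkit API: deregister AmberTools, attach user charges, and confirm an OpenMM system can be built via charge_from_molecules so antechamber/sqm are never invoked. The placeholder molecule and zero charges are illustrative only.

      ```python
      import numpy as np
      from openff.toolkit import ForceField, Molecule
      from openff.toolkit.utils.toolkits import (
          GLOBAL_TOOLKIT_REGISTRY,
          AmberToolsToolkitWrapper,
      )
      from openff.units import unit


      def test_user_charges_without_ambertools():
          # Simulate a worker image where AmberTools is broken or missing.
          try:
              GLOBAL_TOOLKIT_REGISTRY.deregister_toolkit(AmberToolsToolkitWrapper)
          except Exception:
              pass  # already absent, which is the condition we want

          mol = Molecule.from_smiles("CCO")  # placeholder ligand
          # User-provided charges (placeholder zeros; sum matches formal charge).
          mol.partial_charges = np.zeros(mol.n_atoms) * unit.elementary_charge

          ff = ForceField("openff-2.1.0.offxml")
          # charge_from_molecules short-circuits AM1-BCC assignment, so
          # antechamber/sqm should never be called.
          system = ff.create_openmm_system(
              mol.to_topology(), charge_from_molecules=[mol]
          )
          assert system.getNumParticles() == mol.n_atoms
      ```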

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • DD – IK is working on QA tests with OpenFE and GUFE 1.0

    • IK – Using the RC builds (assuming they’re the same as the stable full release), deployment was clean and nothing broke. It looks like there were some schema changes, e.g. AlchemicalSamplerSettings was renamed. But most things are clean mappings from old to new, so we might just need a migration script (see the sketch at the end of this topic). I anticipate it’ll be pretty straightforward to fix. I think there were ~4 of these changes.

    • IA – That sounds right. See Swenson’s migration guide.

    • DD – (details of the migration workflow; see recording, ~20 minutes in)

    • IK – There might be more than 4 changes, I’m only seeing the first failure.

    • IA – There were a lot of changes. If you’re doing migrations manually it may be a lot of work.

    • DD – Worth mentioning that when we upgrade, we don’t plan to maintain the old database.

    • MH – Could leave a legacy server up for a few weeks. Could add…

    • DD – Yeah, a temporary legacy server would work.

    • JW – Could we get a “just add water” tarball that we upload to Zenodo?

      • DD + MH – That’s harder than it sounds.

    • IA – After the alchemistry / OMSF workshop, we should have a call where the appropriate stakeholders work out how to make sure we avoid this issue going forward. I suspect Swenson’s migration idea might work here, but it’ll need discussion.

    • DD – Any objections to us NOT migrating previous records forward?

      • (No objections)

    • DD – OK, we’ll leave the old data in place for a while, but note that this is temporary. Eventually the URL will point to the “new” server without the old data. This won’t happen in the next two weeks, so continue using the current server as normal until then.
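
    A minimal sketch of the kind of migration script discussed above: walk a serialized GUFE object and rewrite renamed classes to their 1.0 names. The __qualname__ key follows GUFE’s dict serialization; the RENAMES table is illustrative, and the real mapping should come from the openfe/gufe 1.0 changelog (Swenson’s migration guide).

    ```python
    # Illustrative old-name -> new-name table; populate from the 1.0 changelog.
    RENAMES = {
        "AlchemicalSamplerSettings": "MultiStateSimulationSettings",  # assumed
    }


    def migrate(obj):
        """Recursively apply RENAMES to class names in a serialized GUFE dict."""
        if isinstance(obj, dict):
            out = {}
            for key, value in obj.items():
                if key == "__qualname__" and value in RENAMES:
                    out[key] = RENAMES[value]
                else:
                    out[key] = migrate(value)
            return out
        if isinstance(obj, list):
            return [migrate(item) for item in obj]
        return obj
    ```

    This would be applied to, e.g., a json.load()-ed network or settings dump before re-submission.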

  • DD : working on testing MIG splitting on Lilac A100

    • DD – likely have to wait until after Leiden

  • JS – Update on F@H deployment?

    • DD – Basically in the same state as last week. Working on integration tests for the compute service itself.



Action items

@David Dotson try creating a throughput plot in ns of simulation time (see the plotting sketch below)
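
A minimal sketch of what this plot could look like, assuming a CSV export of task completion times; the file name, column name, and 5.5 ns/FEC factor are all hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

NS_PER_FEC = 5.5  # assumed average simulated ns per completed FEC

# Hypothetical export: one row per completed Task, with a completion timestamp.
df = pd.read_csv("completed_tasks.csv", parse_dates=["completed_at"])
daily = df.set_index("completed_at").resample("D").size()

plt.plot(daily.index, daily.cumsum() * NS_PER_FEC)
plt.xlabel("date")
plt.ylabel("cumulative simulated time (ns)")
plt.title("alchemiscale.org throughput")
plt.tight_layout()
plt.savefig("throughput_ns.png")
```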

Decisions