2024-06-04 alchemiscale : user group meeting notes

Participants

  • @David Dotson

  • Jenke Scheen

  • @Irfan Alibay

  • @James Eastwood

  • @Joshua Horton

  • @Jeffrey Wagner

 

Meeting recording: https://drive.google.com/file/d/166FfwOQq9Z6KoODLk3h5idx_JNlBRdwn/view?usp=sharing

Goals

  • alchemiscale.org

    • compute resources status

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • JS – user-settable error restart policies

  • JS – Charge setting through OpenFE

  • DD: working with IK on finishing out alchemiscale release 0.5.0

    • includes:

      • openfe + gufe 1.0 compatibility

      • Folding@Home compute support

      • feflow inclusion, drop of perses

    • will be deployed on a new host, new database as api.alchemiscale.org with advance notice to users

      • current api.alchemiscale.org instance will be moved to api.legacy.alchemiscale.org, kept around for some time, but with no new compute provisioned

  • DD : working on testing MIG splitting on Lilac A100

Discussion topics

Notes

  • alchemiscale.org

    • compute resources status

      • DD – Plenty of capacity available. Compute resources are idling, nothing running currently.

      • JS – I’ll submit lots of stuff later this week. Will be public.

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • JS – Charge setting through OpenFE

    • JS – Core issue about charging is here:

      • https://openforcefieldgroup.slack.com/archives/C02GG1D331P/p1716285193420149

      • JS – This seems to primarily be an issue on NRP. HMO expects that this will happen as much, if not more, on F@H. From a user perspective, about 20-30% of our tasks fail on NRP. There’s plenty of compute, but 20-30% of calcs fail each time, so I’m having to repeatedly restart jobs to approach completion. My proposal would be to have a metadata file that maps errors to resolutions; for example, hitting the AMBER error would cause a compute node to be blacklisted. Are there plans for this?

      • DD – We do indeed have some open issues on this. I think I’ve triaged them for an upcoming release: not the OpenFE/GUFE 1.0-supporting release, but the one after it (0.6). Alchemiscale #258 is to track compute servers that consistently fail.

      • JS – I think this is slightly different - I’d propose explicitly checking the contents of an error and mapping it to a resolution.

      • DD – I have another issue for that (mapping regex error matches to retries). It’s hard to track the identity of compute nodes… right now their IDs are UUID-based, which gives them an element of randomness… We could make them deterministic, but that would be tricky on container orchestrators like NRP, where the pods themselves kinda don’t know the machine they’re running on.

      • JS – It’s good that this batch of issues is going to have about the same timing as F@H deployment - There will be new issues on F@H so this will give us flexibility to handle them as they come up.

      • DD – This conversation is giving me ideas on how to expand on alchemiscale #211 / create a new issue. (will probably be alchemiscale #277)

    • IA – Re: charge setting,

      • DD – GUFE #322 discusses cases where pickled RDMols don’t by default keep charges. It also presents a resolution where we tell RDKit to include charges in pickling.

      • IA – What confuses me here is that we’ve already validated user charges - This is how we did the NAGL benchmarks.

      • DD – In alchemiscale-fah, we use multiprocessing, which requires things to get pickled to be distributed to workers, whereas in standard operations we don’t pickle. JH did something different.

      • JH – We just do multiprocessing in the alchemiscale client… I don’t think this affects the calcs, but it DOES …

      • JS – The multiprocessing in the alchemiscale client was introduced AFTER the OpenFE NAGL benchmark.

      • … (See recording, ~26 mins)

      • …

      • (resuming at ~30 mins)

      • IA – This seems like a reasonable fix. Does anyone see issues?

      • JW – I'm moderately experienced with RDKit/molecule representations and don’t see any obvious issues with this approach.

      • IA – Any suggestions for how to test this?

        • DD – I’ll open a PR to GUFE and try to put in a test.

      • IA – Timeline?

        • DD – I have an immediate workaround (the code block in GUFE #322), so this isn’t super urgent.

        • IA – I’ll plan to put this in our next GUFE release.

        • (DD made a milestone for GUFE 1.1 and added this)

        • IA – We should also get JH’s other fix (atomicproperties being lost) into this release.

    • JS – Also seems to be a kind of issue with OpenFE overwriting/losing some user info.

  • DD: working with IK on finishing out alchemiscale release 0.5.0

    • includes:

      • openfe + gufe 1.0 compatibility

      • Folding@Home compute support

      • feflow inclusion, drop of perses

        • IA – It looks like there may be some settings that would need to be changed, which might break the schema and require migrations.

        • DD – how important is it to change the schema?

        • (technical discussion, see recording ~39-44 minutes)

        • DD – There are some coordinated changes that need to happen between feflow and other packages. I could take this over for IP.

        • IA – I’ll also take a look.

        • DD – Could you take the lead on feflow #38?

        • IA – Can’t take the lead, can do some work on feflow #38 if time permits but don’t anticipate having a ton of time available.

    • will be deployed on a new host, new database as api.alchemiscale.org with advance notice to users

      • current api.alchemiscale.org instance will be moved to api.legacy.alchemiscale.org, kept around for some time, but with no new compute provisioned

  • DD : working on testing MIG splitting on Lilac A100 - Haven’t been able to spend a ton of time on this so not much to report.
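
Illustration for the error restart policy discussion above: a minimal sketch of what JS's proposed mapping from errors to resolutions could look like, assuming regex matching on the error text as DD described. Everything here is hypothetical and not part of alchemiscale: the names Resolution, ErrorPolicy, POLICIES and resolve, the retry limits, and the patterns, which are placeholders rather than real error strings.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    """Hypothetical actions a restart policy could request."""
    RETRY = "retry"            # resubmit the task unchanged
    AVOID_NODE = "avoid_node"  # retry, but keep the task off the node that failed
    FAIL = "fail"              # stop retrying and surface the error to the user


@dataclass(frozen=True)
class ErrorPolicy:
    pattern: str            # regex searched for in the error text
    resolution: Resolution  # action to take while retries remain
    max_retries: int = 0    # attempts allowed before falling back to FAIL


# Assumed policy table; the patterns are illustrative placeholders only.
POLICIES = [
    ErrorPolicy(r"amber", Resolution.AVOID_NODE, max_retries=3),
    ErrorPolicy(r"cuda.*out of memory", Resolution.RETRY, max_retries=2),
]


def resolve(error_text, attempts, policies=POLICIES):
    """Return the resolution of the first policy whose pattern matches.

    `attempts` is the number of times the task has already been retried.
    Unmatched errors, and matched errors with no retries left, resolve to FAIL.
    """
    for policy in policies:
        if re.search(policy.pattern, error_text, re.IGNORECASE):
            if attempts < policy.max_retries:
                return policy.resolution
            return Resolution.FAIL
    return Resolution.FAIL
```

In this sketch the first matching pattern wins, so more specific patterns would need to be listed before general ones. How AVOID_NODE would be enforced is left open, since compute node identity is currently hard to track, as noted in the discussion.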

DD + IA – FYI, DD will be in transit in 2 weeks, so the Alchemiscale user group meeting may be cancelled. Expect updates closer to then.

Action items

Decisions