2024-04-30 alchemiscale : dev group meeting notes

Participants

  • Ian Kenney

  • @David Dotson

  • @Irfan Alibay

  • @John Chodera

  • @Mike Henry

  • @Jeffrey Wagner

  • @Matt Thompson

Goals

  • DD : alchemiscale roadmap

    • Q1 : complete “living networks” performance improvements

    • Q1 : Folding@Home compute services deployed in production

      • finish MVP, with integration test suite by 2024.05 (originally 2024.03)

        • this is delayed; need an additional 2 weeks to finish this out

      • perform FAH tests with volunteers during 2024.05 (originally 2024.04)

        • public work server up by 2024.05.10 (originally 2024.03.15)

        • confidential work server up by 2024.06.01 (originally 2024.04.01)

    • Q2 : develop Strategy structure, initial implementations

    • Q3 : enable automated Strategy execution by end of Q3, 2024 (2024.10.01)

  • IP: feflow needs

  • DD : technical questions on SQM charging, avoiding and restricting core count

  • alchemiscale development : new sprint spanning 5/1 - 5/10

    • aim is to complete 0.4.1, deploy to alchemiscale.org, including openfe + gufe 1.0:

    • architecture overview : PL Benchmarks on FAH - Architecture v6.drawio

    • coordination board : alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors

  • JW – I’d love to have some chart of throughput, to make a statement in our annual talk like “based on the openff-2.2.0 vs. 2.1.0 benchmarking, here’s how many FFs we can benchmark in a month” (though I understand if it’s not that simple)

Discussion topics

Notes

  • DD : alchemiscale roadmap

    • Q1 : complete “living networks” performance improvements

      • DD – Release 0.5 will have several improvements for living networks. E.g., IK is working on ProtocolDAG caching.

    • Q1 : Folding@Home compute services deployed in production

      • finish MVP, with integration test suite by 2024.05 (originally 2024.03)

        • this is delayed; need an additional 2 weeks to finish this out

        • JW – The demo a few weeks ago where there was data going to a work server and being delegated to a local worker - Were those test messages or actual simulation data? I spoke to the ad board the other day and said that the major complexity right now is testing.

          • DD – Actual simulation data. Right now we have a way to delegate a work unit to an F@H work host and get it converted into an F@H-native work unit… (see recording ~10 mins)… But yeah, right now we have a big mocked work server for testing that has realistic lag times and other things to ensure that our execution works in realistic contexts. There are also some changes being made to alchemiscale itself to enable compatibility with F@H. And IA, could you take a look at feflow 451 (IP just went off on vacation)?

          • IA – Can do.

          • MH – This will just increase the disk space a little? Looks like it’s just dumping the system state to xml?

          • DD – Yeah, we need to, e.g., send over the simulation with velocities.

          • JC – Used to validate the CPU. We send over this info and then check it against the CPU and GPU to ensure the volunteers aren’t cheating.

          • IA – Glancing at this, the PR looks good and I can merge it soon

          • DD – The PR is still WIP but I’ll turn that over ASAP.

        • DD – So I’ll be working with IK on debugging communication between services.

      • perform FAH tests with volunteers during 2024.05 (originally 2024.04)

        • public work server up by 2024.05.10 (originally 2024.03.15)

          • DD – When we met with Joseph about F@H computation a few weeks ago, he told us not to worry about points yet. That can come later.

        • confidential work server up by 2024.06.01 (originally 2024.04.01)

          • JC – We’ve communicated this to the F@H leaders, and I can make a broader slack/discord announcement about this when it’s getting close.

          • DD – Might be good to not delay the announcement in case things get delayed.

          • JC – Ok, I’ll announce on slack and discord soon and tell them to expect the encrypted work units in the upcoming months.

    • Q2 : develop Strategy structure, initial implementations

    • Q3 : enable automated Strategy execution by end of Q3, 2024 (2024.10.01)

  • IP: feflow needs

    • (Skipped since IP is offline)

    • DD – There are 3 feflow PRs on the board. IA, could I

    • IA – My question for IP, if he were here, would be whether feflow 38 is blocking the sims he wants to run for the annual meeting. DD, what are your needs?

    • DD – If possible, for the annual meeting we want to have alchemiscale 0.4.1 (which uses OpenFE and GUFE 1.0), have feflow 0.1, and drop perses.

    • IA – Target date for 0.4.1?

    • DD – Before I travel, hopefully (before 5/10). It’s also ok if we fall short of this - We’ll also be doing deployment tests of gufe and openfe 1.0 before then, but we’ll need feflow 0.1.

    • IA – I need to work out what I’ll do for the openfe demo, and I want to have some calcs submitted that I can pull down for that.

      • DD – You should submit those calcs now, with the current stack. There are a lot of things that could turn into blockers to deployment of the new stack before the annual meeting.

      • IA – Ok, will do.

    • DD – Open PR for enabling extensions for noneqcyc protocols (feflow 44) - IK, any update?

      • IK – Waiting on OpenFE and GUFE 1.0 releases

      • DD – Understood. This isn’t super critical, just want to make sure we don’t lose track of it.

    • DD – The F@H support CAN exist without an feflow release

  • DD : technical questions on SQM charging, avoiding and restricting core count. The test suite in alchemiscale-fah is being slowed down by charge assignment. If charges aren’t already specified on the small molecule components they’ll be sent to sqm. For CI we don’t really care about charge quality, so I’m using “formal_charges” charge assignment. But I’m still seeing sqm running.

    • IA – If you have partial charges assigned ahead of time, it should skip sqm. I’m looking at the feflow logic now. At a glance I can’t tell whether some charges being exactly 0 would skip things. I’d recommend defining the charges ahead of time in the SDF file.

    • JC – It looks like OMMFFs might not override charge assignment for OpenFF charge assignment if user charges are present.

    • IA – I’m pretty sure that we tested this exact thing at OpenFE

    • (some rooting around different code paths to check this out)

    • JC – I think that adding print statements all throughout the call stack would help here to figure out what’s going on

    • JW – I recommend trying gasteiger first

    • IA – Could also try using nagl, unless OpenFF objects

      • JW – Yeah, for non-production work, absolutely DO use nagl. This will run very fast and help future proof you for the day that we DO recommend mainline nagl use.

    • MH – Then I think we should add espaloma-charge and nagl to docker image

      • DD – Could someone drop me a tip here?

      • MH – I’ll open a PR. There’s some complexity with py311 and DGL missing some upstreams on mac.

      • JC – Can we talk about nagl deployment issues at some point? Could export models to other formats to avoid need for DGL.

      • MH – … (see recording ~50 mins)

    • JW (chat) – Also I see you wrote in the agenda about restricting core count to make sqm happy - We’ve also found that things get slow if there are 4+ cores available to sqm. This can be controlled by setting the OMP_NUM_THREADS env variable.
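A minimal sketch of the thread cap JW describes, assuming the sqm step runs in a child process that inherits its environment from the Python process that launches it. The helper names `env_with_thread_cap` and `run_with_thread_cap` are hypothetical and are not part of alchemiscale, feflow, or the OpenFF toolkit; only the `OMP_NUM_THREADS` variable comes from the discussion.

```python
import os
import subprocess
import sys


def env_with_thread_cap(n_threads: int) -> dict:
    """Return a copy of the current environment with OMP_NUM_THREADS set."""
    if n_threads < 1:
        raise ValueError("n_threads must be >= 1")
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(n_threads)
    return env


def run_with_thread_cap(cmd: list, n_threads: int = 1) -> subprocess.CompletedProcess:
    """Run `cmd` in a child process whose OpenMP thread count is capped."""
    return subprocess.run(
        cmd,
        env=env_with_thread_cap(n_threads),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    # Stand-in child process that only echoes the variable it sees;
    # in CI this would be the test command that ends up calling sqm.
    probe = [sys.executable, "-c", "import os; print(os.environ['OMP_NUM_THREADS'])"]
    print(run_with_thread_cap(probe, n_threads=2).stdout.strip())
```

In a CI job the same effect is usually simpler to get by exporting `OMP_NUM_THREADS` in the job environment before the test suite starts; the wrapper above is only useful when the cap should apply to a single command.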

    • JC – OE charging should be much faster.

    • MT – charge_from_molecules

    • IA – IMHO the best approach here is to have your SDF files have the partial charges already.
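One way to confirm that test inputs already carry partial charges, as IA suggests, is to scan the SDF files for a per-atom charge data item before the suite runs. The sketch below uses only the standard library and assumes the charges are stored under the `atom.dprop.PartialCharge` tag as whitespace-separated values (the RDKit atom-property-list convention); if the files use a different tag, pass it as `tag`. The function name `partial_charges_by_record` is hypothetical.

```python
def partial_charges_by_record(sdf_text: str, tag: str = "atom.dprop.PartialCharge") -> list:
    """Return one entry per SDF record: a list of charges, or None if the tag is absent."""
    # Split the file into records on the "$$$$" delimiter lines.
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append(current)  # tolerate a missing final delimiter

    results = []
    for lines in records:
        charges = None
        for i, line in enumerate(lines):
            # Data item headers look like ">  <tag>", possibly with trailing text.
            if line.startswith(">") and f"<{tag}>" in line:
                values = []
                for data_line in lines[i + 1:]:
                    if not data_line.strip():
                        break  # a blank line ends the data item
                    values.extend(float(v) for v in data_line.split())
                charges = values
                break
        results.append(charges)
    return results


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        for index, charges in enumerate(partial_charges_by_record(handle.read())):
            status = "missing" if charges is None else f"{len(charges)} charges"
            print(f"record {index}: {status}")
```

A record reported as missing is a candidate for falling through to sqm. Note that this check does not address IA's open question about whether charges that are all exactly 0 are treated as unset.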

  • alchemiscale development : new sprint spanning 5/1 - 5/10

    • aim is to complete 0.4.1, deploy to alchemiscale.org, including openfe + gufe 1.0:

    • architecture overview : PL Benchmarks on FAH - Architecture v6.drawio

    • coordination board : alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors

  • JW – I’d love to have some chart of throughput, to make a statement in our annual talk like “based on the openff-2.2.0 vs. 2.1.0 benchmarking, here’s how many FFs we can benchmark in a month” (though I understand if it’s not that simple)

    • DD – It’s not that simple: not all edges take the same amount of time, for a variety of reasons, and there are other issues.

    • MH – I could tell you how many hours it takes on a single GPU

    • IA – One of the other useful things to talk about is that the OpenFE+OpenFF lines have been

    • JW – Could I say “excluding ligands with charge changes, we can get through all the targets in our protein-ligand benchmark set in under two weeks”?

      • IA + DD – Yes

    • IA – And we could do the five targets for MO in about 3 days going full tilt.

    • DD – And we’ve been able to get 200+ GPUs at a time on NRP.
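The kind of back-of-envelope statement JW is after can be sketched as simple arithmetic: total GPU-hours divided by the number of GPUs running concurrently. As DD notes, real throughput is not this simple (edges vary in cost and GPU availability fluctuates), so this is at best an optimistic estimate. The function names and the `utilization` parameter are hypothetical, and the GPU-hours figure in the demo is a placeholder rather than a measured value; only the 200-GPU count comes from the discussion.

```python
def wall_clock_days(total_gpu_hours: float, n_gpus: int, utilization: float = 1.0) -> float:
    """Optimistic wall-clock days to burn through `total_gpu_hours` on `n_gpus` GPUs."""
    if n_gpus < 1:
        raise ValueError("n_gpus must be >= 1")
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    return total_gpu_hours / (n_gpus * utilization) / 24.0


def campaigns_per_month(
    gpu_hours_per_campaign: float,
    n_gpus: int,
    utilization: float = 1.0,
    days_per_month: float = 30.0,
) -> float:
    """How many equally sized benchmark campaigns fit in one month, back to back."""
    return days_per_month / wall_clock_days(gpu_hours_per_campaign, n_gpus, utilization)


if __name__ == "__main__":
    # PLACEHOLDER: replace with the measured GPU-hours for one force field benchmark.
    placeholder_gpu_hours = 10_000.0
    n_gpus = 200  # GPU count mentioned in the discussion (NRP)
    print(f"days per campaign: {wall_clock_days(placeholder_gpu_hours, n_gpus):.2f}")
    print(f"campaigns per month: {campaigns_per_month(placeholder_gpu_hours, n_gpus):.1f}")
```

Feeding in the per-GPU hours MH offered to provide, together with a realistic `utilization` below 1.0, would turn this into a defensible bound for the talk.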



Action items

@David Dotson will get Jeff source throughput data and a notebook snippet for the plot

Decisions