2023-08-01 alchemiscale Working Group meeting notes

Participants

  • @David Dotson

  • Jenke Scheen

  • @David W.H. Swenson

  • @Irfan Alibay

  • Levi Naden

  • Meghan Osato

  • @Mike Henry

  • @Jeffrey Wagner

  • @John Chodera

  • @Iván Pulido

  • @Joshua Horton

 

Recording: https://drive.google.com/file/d/1Nw96XZ5dBOo-HaxrnEP76DmMV4MFHyR4/view?usp=drive_link

Goals

  • alchemiscale.org user group

    • user questions / issues / feature requests

    • compute resources status

    • call for new users

    • current stack versions:

      • alchemiscale: 0.1.3

      • gufe: 0.9.1

      • openfe: 0.11.0

      • perses: protocol-neqcyc

  • JC : how a Protocol could be launched on a node with multiple GPUs available, such that it can launch multiple processes via e.g. MPI to reduce wall-clock time

  • JC : protein-mutation Protocol in perses

  • DD : alchemiscale-k8s:

    • been working with Ian Kenney on expanding the footprint of alchemiscale to commercial entities, adding to the long-term sustainability of alchemiscale and the broader openfe ecosystem

  • IP : Protein-ligand benchmarks working group update

  • alchemiscale development : sprint started 7/26 runs till 8/7

    • architecture overview : PL Benchmarks on FAH - Architecture v6.drawio

    • coordination board : alchemiscale : Phase 2 - User Feedback and Documentation

    • alchemiscale 0.2.0 milestone:

    • alchemiscale 0.1.4 milestone:

    • updates on In Review, In Progress, and Available cards

  • new discussion items from ASAP roadmap: ROADMAP: Computational Chemistry Core alchemiscale-related roadmap | Notion

Discussion topics

Notes

  • JC – Support for protein mutations

    • DD – I talked with Sukrit a few months ago about this; I think he was going to implement it.

    • JC – I don’t think he’ll have time. Ivy Zhang has a protocol in Perses that we may be able to adapt. This could be something we could share in GUFE.

  • DD – It sounds like Sukrit isn’t in a position to take this on - is there someone else who can drive it?

    • JC – Sukrit won’t be able to do it. Maybe we can tackle it with the perses dev team. The concern for this group would be to ensure that the protocol interface can support mutation protocols.

    • IP – I think there’s a lot of overlap - We’ll be building on top of the GUFE objects. But if we need things faster we may go outside this.

    • IA – Is IZ using the hybrid topology factory?

    • IA – From an OpenFE standpoint, beyond passing an atom mapping that indicates a protein mutation, is there additional work/infra needed?

    • JC – There isn’t much difference between this and the existing protocol. We just have to make sure the process knows what the mutating region is, and that this translates well from one side to the other. We had an issue before where we inadvertently had thousands of atoms getting transformed. The program needs to know which parts are being transformed, otherwise we’ll have way too many atoms getting interpolated/soft-cored. So we need to figure out where the responsibility lies for systems like proteins where most of the atoms are the same.

    • IA – This will be up to RG - We’d planned to do protein mutations at some point but it may not be soon.

    • JC – Yeah, it’s hard to know how difficult this will be until someone tries it.

    • IA – I’ll discuss with OFE team and will bring it back at next week’s meeting.

    • MH – Yeah, if our pharma partners aren’t clamoring for this it won’t get huge priority. But if there’s a simple update to the GUFE objects that would enable support for this, that may be a reasonable small work item to let other folks build on before it’s an OFE priority.

    • DS – One thought is that we don’t have to handle this all in GUFE - we can just say “there’s a mapping object that gets passed in, and this gets handed to the protocol”, and let other people handle the details (see the sketch at the end of this thread).

    • MH – Right, this will let us start finding sharp edges early on, and upstream the changes into GUFE.

    • JC – This sounds OK - Just wanted to make sure we can try this now and avoid “we can’t do that” situations later.

    • DD – Do we have a driver?

    • JC – Sukrit can’t do this. IP, could you make an issue and have it assigned to yourself?

      • IP – Sure, can do. I’ll also see whether IZ can work on this.

      • DD – Is this in line with IZ’s doctoral objectives?

      • JC – I think she’s basically completed that work but could do this.
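
    • A minimal sketch of the “mapping object gets handed to the protocol” idea DS describes above. The call pattern follows GUFE’s Protocol.create, but ProteinMutationMapping and the variable names are hypothetical, for illustration only:

        # hypothetical mapping object carrying a protein mutation; the
        # protocol receives it and decides which atoms to transform
        mapping = ProteinMutationMapping(
            componentA=wt_protein,        # wild-type ProteinComponent
            componentB=mutant_protein,    # mutant ProteinComponent
            componentA_to_componentB=residue_atom_map,  # atom-index dict
        )
        dag = protocol.create(stateA=wt_system, stateB=mutant_system, mapping=mapping)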

  • alchemiscale.org user group

    • user questions / issues / feature requests

      • MO – When I’m trying to rerun a transformation locally, I’ve tried lots of CUDA drivers on my local HPC - is there a specific one I should be using?

        • DD – CUDA > 11.7 has some weird behavior with some older CUDA drivers.

        • IP – On lilac, we have an issue where the installed CUDA drivers don’t support > 11.7. If you run nvidia-smi on the node, you can check which CUDA version the driver supports.

        • DD – In alchemiscale-compute, we’re pinning to CUDA 11.7

        • MH (chat) – This is what I use to test "python -m openmm.testInstallation"

          • and if you get errors, this one will give you more information:

            python -c "import openmm; print(openmm.Platform.getPluginLoadFailures())"

        • JC – And feel free to post in the free-energy-benchmarking channel with more questions

        • DD – Overall, this would be a good feature request - basically a way to query the installed versions of everything for debugging/reporting purposes (a sketch follows below).
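
        • A minimal sketch of what such a version query could look like on the client side, using only the standard library. The package list is taken from the stack versions above; this is not an existing alchemiscale feature:

            import importlib.metadata

            # report installed versions of the core stack for debugging
            for pkg in ("alchemiscale", "gufe", "openfe", "perses", "openmm"):
                try:
                    print(pkg, importlib.metadata.version(pkg))
                except importlib.metadata.PackageNotFoundError:
                    print(pkg, "not installed")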

      • IA – Having an issue with the thing Hannah reported: somehow there weren’t results for some records. This has happened twice now when submitting networks.

        • DD – Could you drop a snippet into #free-energy-benchmarks?

        • IA – I’ve DM’ed it to you. Basically she submitted a network, tried to get results, and got “this doesn’t exist” errors. Then she had the same problem in a different scope.

        • DD – Thanks for the report, this is helpful to have as user feedback. Sounds like somehow we’re reaching an inconsistent state between the object and the results store.

    • compute resources status

      • DD – Running on PRP, getting throttled on Lilac and also a little on PRP. We have about 60 waiting jobs from the OpenFF org, 47 running.

      • JC – I’ll work on getting more compute resources. I need to work on navigating the ACCESS system. I think we’d identified which NSF/NIH systems would be most useful for us - Could someone send me those notes? I could really use a list of the specific systems that we should get time on. https://allocations.access-ci.org/resources

        • MH – Ah, I remember, SDSC’s Comet would be a good fit. I’ll spearhead compiling this list.

      • JW – I chatted with Greg Bowman, who runs F@H - he’s enthusiastic about our benchmarking, and one of his postdocs reached out to me. Not sure if they’re programming-focused enough to really help, but we’ll have an internal advocate.

        • JC – I’m also paying their lab annually for support, so we should be able to lean on them when we need it.

        • DD – F@H work will be targeted in 0.3 release, later this year.

    • call for new users

      • JS – Could you add Marcus Wieder and Chris Iacovella?

        • DD – Yes, will do

    • current stack versions:

      • alchemiscale: 0.1.3

      • gufe: 0.9.1

        • IA – Anticipating a GUFE release soon, will come around the same time as openfe 0.12. On the scale of days/weeks.

        • DS – Basically the big difference is in network planning - improvements in how we use LOMAP to leverage a common core, which makes planning go faster. This is just a performance change; it shouldn’t break the API.

      • openfe: 0.11.0

      • perses: protocol-neqcyc

  • JC : how a Protocol could be launched on a node with multiple GPUs available, such that it can launch multiple processes via e.g. MPI to reduce wall-clock time. The protocol would need to declare how many processes it can run, and the alchemiscale worker would need to read this and make a scheduling decision. This isn’t a huge priority, but it would give us big speedups.

    • DD – Potential speedup? Single node?

    • JC – Linear with nGPUs. Single node.

    • DD – This would require a compute service that can sit on top of a node and count how many GPUs it has access to.

    • JC – It could be a commandline argument, like “use this many processes per task”

    • DD – How would it spread itself across multiple GPUs?

    • JC – We could make it so each process gets a different CUDA_VISIBLE_DEVICES mask (see the sketch at the end of this thread).

    • IA – There’s a lot of stuff in OMMTools that calls mpiplus - is this already baked into OMMTools?

    • JC – Yes, you’d initiate the run inside of the mpi process (like inside mpirun)

    • IA – Could I have a look at an example?

    • JC – Yes, right now this just uses GPU device 0. There’s an option to use a config file. It’s complicated because it’s meant to run on nodes with different numbers of available GPUs. We can also simplify this in the future using something like Dask.

    • IA – From an OpenFE perspective, we’d be keen for this support. Our current compute resources …

    • JW – Can we run single-GPU jobs currently? And just do multiple of them on the same node, with each using one GPU?

    • DD – Yes. But there’s an advantage here where some protocols can oversubscribe GPUs (i.e. submit multiple sims to one GPU and …) – the OpenFE repex protocol consumes all of whatever GPU it’s given, while the noneq cycling protocol in perses doesn’t fully utilize a GPU, so it can oversubscribe successfully. I think what JC’s asking for is “how can we take advantage of multiple GPUs on one node?”

    • IP – For example, in IZ’s work on protein mutations, we’re using two GPUs, and we want to fully utilize those. So she was running two replicas at the same time on two GPUs.

    • DD – OK, so that wouldn’t work currently. But the alchemiscale architecture would let us define a new type of compute service that sits on an entire node (or multiple GPUs) and does things cleverly. We’d discussed this before but hadn’t made plans to act on it.

    • IP – …

    • IA – Could we move this to a power hour? I’m not sure where this would live.

    • DD – Agree, I’d like to determine where the complexity would live. RG is out this week and DS is out next week?

    • IA – When would we be moving on this?

    • JW – I wouldn’t really approve this before the F@H work completes, so this isn’t very urgent.

    • IA – I’ll bring this up as a discussion topic in the Aug 17 OpenFE power hour, once everyone’s back.
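
    • A minimal sketch of the per-process CUDA_VISIBLE_DEVICES masking JC describes above, using a plain subprocess fan-out rather than MPI. The GPU count and run_replica.py are hypothetical placeholders, not part of alchemiscale:

      import os
      import subprocess

      n_gpus = 4  # assumed; a node-level compute service would detect this
      procs = []
      for gpu in range(n_gpus):
          # each worker process sees exactly one device, exposed as device 0
          env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
          procs.append(subprocess.Popen(
              ["python", "run_replica.py", "--replica", str(gpu)], env=env))
      for p in procs:
          p.wait()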

  • DD : alchemiscale-k8s:

    • been working with Ian Kenney on expanding the footprint of alchemiscale to commercial entities, adding to the long-term sustainability of alchemiscale and the broader openfe ecosystem. While we could previously just use Kubernetes Jobs to host workers on PRP, this will also let people run alchemiscale servers in Kubernetes.

    • IP – I may contact you with someone who is interested in this.

    • JC – Plugged this at the CADD GRC as well.

    • IA – Wearing my MDAnalysis hat, let’s discuss staffing on that - I’ve sent a DM.

  • JC – Is there a “how to join” page?

    • DD – Not yet. I’ll start an action item for this. Can put it on alchemiscale.org landing page.

  • IP : Protein-ligand benchmarks working group update - meeting Thursday.

    • JS – Please loop me in.

  • JW – I saw a new issue about a mistaken DOI - could someone review my PR if I open one to fix it?

    • IP – Will do.

  • updates on In Review, In Progress, and Available cards

    • OMMFFs 288 allows parsing of offxml strings

    • IA – PLB 93 – Needs discussion. Will discuss at power hour

    • DD – Alchemiscale 157 - deployment docs - is waiting on review from me. MH, could you update the branch protection settings?

      • MH – Will do

    • IP – Perses 1066 – Blocked by other perses PRs. It’s getting there.

    • DD – Alchemiscale 28 – User guide - on my plate.

    • DD – Alchemiscale 30 – deployment docs - largely overlapping with 157, but there’s a bit of additional work that I’ll do.

    • DD – Alchemiscale 130 – I’ll take this over (said I would last week, but now I’m actually assigning it to myself)

    • IP – “Test noneq protocol against repex protocol” – blocked by perses as well. Will need to check whether it’s worth running all of PLBenchmarks. Will discuss Thursday.

  • End of meeting

 

Action items

@Mike Henry and @David Dotson will identify clusters for NSF ACCESS proposal; give recommendations for @John Chodera
@David Dotson will issue user identities for alchemiscale.org to Marcus Wieder and Chris Iacovella
@David Dotson will make PR for landing page for alchemiscale.org to direct interested parties for how to get help/involved

Decisions