2024-04-09 alchemiscale : user group meeting notes

Participants

  • @David Dotson

  • @Matt Thompson

  • @Irfan Alibay

  • @James Eastwood

  • Jenke Scheen

  • @John Chodera

  • @Jeffrey Wagner

  • @Joshua Horton

  • Ian Kenney

Recording: https://drive.google.com/file/d/1aiogYhnn1jzz4F33eOFUnqRn-US2xj4r/view?usp=sharing

Goals

  • alchemiscale.org

  • JS : networks set to higher priority don't seem to be actioned more than networks with lower priority

  • JC : ways to split a single GPU into multiple virtual GPUs on Lilac + Iris?

    • 7 MIG GPUs possible with A100s

      • i.e., a larger number of less powerful GPUs

Discussion topics

Notes

  • alchemiscale.org

    • DD : alchemiscale 0.4.0 released and deployed Friday 2024.04.05: Release v0.4.0 · OpenFreeEnergy/alchemiscale

      • DD – Many improvements for ASAP - things should be much faster now.

      • DD – Also adds network state - if set to inactive, the network/tasks won't be returned when some API calls are made.

      • DD – Various other improvements mean central server load is now way down, thanks to IK.

      • DD – Upgraded to neo4j v5 and the new python-neo4j driver. Many improvements from that upgrade as well.

      • JS – Nice work, I noticed the difference almost immediately when the update was rolled out.

      • JS – JH asked re: living networks whether we'll be able to add edges to an existing network.

        • DD – Fundamental to GUFE is that we can't add edges to existing networks. Instead you can make a NEW network with a superset of the edges of the original one. If submitted to the same scope, the already-existing edges will be deduplicated.

        • JS – For the superset functionality, is that something that you'll provide or that we should plan to build?

        • DD – I acknowledge that retrieval of a big network is slow (~20 mins) and that submission also takes a while. But yeah, you'd do this client-side - retrieve an existing network, make a superset of it locally, and then submit it.

        • (JS, BRies, and DD will have a session to whiteboard out how network planning with multiple target structures would look)
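The client-side workflow DD describes can be sketched conceptually. The `Transformation` and `AlchemicalNetwork` classes below are hypothetical stand-ins for the real gufe objects, and in practice retrieval/submission would go through the alchemiscale client (method names such as `get_network`/`create_network` are assumptions to verify against the docs); the sketch only shows how building a superset via set union deduplicates overlapping edges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transformation:
    """Hypothetical stand-in for a gufe Transformation edge."""
    state_a: str
    state_b: str


@dataclass(frozen=True)
class AlchemicalNetwork:
    """Hypothetical stand-in for a gufe AlchemicalNetwork."""
    edges: frozenset


# "retrieved" existing network (in reality via the alchemiscale client;
# DD notes retrieval of a big network can take ~20 mins)
existing = AlchemicalNetwork(frozenset({
    Transformation("ligA", "ligB"),
    Transformation("ligB", "ligC"),
}))

# new edges to add; one overlaps with the existing network
new_edges = {
    Transformation("ligB", "ligC"),
    Transformation("ligC", "ligD"),
}

# build a NEW superset network locally; set union deduplicates the overlap
superset = AlchemicalNetwork(existing.edges | frozenset(new_edges))
assert len(superset.edges) == 3  # ligB->ligC appears only once

# `superset` would then be submitted to the same scope, where the server
# also deduplicates already-existing edges.
```

The key point from the discussion is that nothing is mutated in place: the original network stays as-is, and deduplication (locally via set semantics, and server-side on submission) keeps the overlap cheap.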

    • user questions / issues / feature requests

    • compute resources status

      • DD – Using as much of NRP as we can for ASAP-public compute (about 150 GPUs). A smaller number of private jobs are running on Iris and Lilac.

      • (DD writes API query to see where jobs are running, see recording ~16 mins)

      • (34 minutes in) JC – Has anyone experimented with fragmenting A100s (having them run multiple jobs)? Since each process is only using 8GB there should be room for several sims. This could be a way to get more mileage from Lilac.

        • IA – I asked MH to look into this but haven't heard back.

        • DD – Last year I tried submitting multiple jobs to a single GPU. Would this be different?

        • JC – A100s have additional support for partitioning into logical sub-blocks. But this requires an admin to set it up. This might be beneficial given the current lack of GPUs.

        • DD – Do you think they'd be interested in doing this?

        • JC – Since Lilac is being slowly dismantled, it would be good for them to know that people can do useful things with larger numbers of smaller GPUs. Otherwise we risk underutilizing powerful GPUs.

        • DD – Last year we saw that OpenFE's repex protocol saturated GPUs quite well, though I can't recall if that was with A100s. So it'd be good to start a conversation with MSK HPC about how our utilization looks and whether they'd be interested in trying this.

        • JC – Sure. I'll start this conversation.

        • DD – Cool, and to the scheduler this would just look like several small GPUs and we wouldn't need to do anything special?

        • JC – Yes, I think that's how it works.

        • IA – You might see 100% GPU utilization on paper but still get better throughput by using a smaller slice of the GPU - we saw this a lot with gmx in the past.
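For reference, MIG partitioning on an A100 is configured by an administrator with nvidia-smi, which is why this needs MSK HPC's involvement. The commands below are a sketch of that admin workflow, not something users can run themselves; profile IDs vary by GPU model (profile 19 is believed to be the 1g.5gb slice on the 40 GB A100, which is what allows up to seven instances).

```shell
# enable MIG mode on GPU 0 (admin only; may require draining jobs and a GPU reset)
sudo nvidia-smi -i 0 -mig 1

# list the MIG GPU-instance profiles this card supports
sudo nvidia-smi mig -lgip

# create seven 1g.5gb GPU instances (profile 19 on a 40 GB A100);
# -C also creates a compute instance inside each GPU instance
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C

# the seven MIG devices now enumerate individually, so a scheduler
# can treat them as seven small GPUs with no special handling
nvidia-smi -L
```

This matches DD's point that, once partitioned, the scheduler just sees several small GPUs.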

    • current stack versions:

      • alchemiscale: 0.4.0

      • neo4j: 5.18

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • JS – Register for alchemiscale

  • JS : networks set to higher priority don't seem to be actioned more than networks with lower priority



Action items

@David Dotson will schedule a whiteboarding session with Jenke, Josh, Irfan, and Ben for network planning with multiple protein target structures

Decisions