2024-09-10 alchemiscale : user group meeting notes

Participants

  • Irfan Alibay

  • Hannah Baumann

  • John Chodera

  • @Iván Pulido

  • @Matt Thompson

  • @Mike Henry

  • Ian Kenney

  • Jenke Scheen

 

Meeting recording: https://drive.google.com/file/d/1wSTokzb8fONzKJrVRut_Jrkw_UmF5ZXb/view?usp=sharing

Goals

  • alchemiscale.org

    • user questions, issues, requests

    • compute resources status

    • current stack versions:

      • python 3.12

      • alchemiscale: 0.5.0

      • neo4j: 5.22

      • gufe: 1.0.0

      • openfe: 1.0.1

      • feflow: 0.1.0

      • openmmforcefields: 0.14.1

      • openmm: 8.1.2

  • JS : feflow.NonEquilibriumCyclingProtocol - make it tolerant of `ProteinComponent`s with same topology, different coordinates between `ChemicalSystem` in `stateA` and `stateB`?

  • IA: Status of zero value ddGs from JS

  • IA: Gathering performance data (low priority)

Discussion topics

Notes

  • alchemiscale.org

    • user questions, issues, requests

      • JS – Tagged DD in an issue a few days ago and would love feedback (https://github.com/asapdiscovery/asapdiscovery/issues/1196). We have a CLI call to stop a network, but I noticed that only tasks with waiting status were being de-actioned; the running ones don’t stop. I also used the alchemiscale API to manually go through all the tasks with running status and set them to waiting. Interestingly, those tasks DO go to the waiting state, but the workers don’t get cleared. Is this something that isn’t implemented, or am I doing something wrong?

        • DD – Right now what we deploy on conventional compute is the SynchronousComputeService. It’s called this because it runs ProtocolDAGs in-process, in the same process as the worker itself; the worker only has a separate thread that does heartbeats. So it will continue running that ProtocolDAG until it completes or fails, and there’s no way for the server to stop the worker from finishing the task. There are ways we could go about changing that behavior, but it might not make sense for it to go into SynchronousComputeService. In the longer run I’m planning for us to have more types of compute services, but I don’t think this is really possible with our current approach until we roll those out.

        • JS – That makes sense. Glad I wasn’t missing something. It’s a bit messy as-is since I’ll submit a large number of jobs and if I have to un-action them it will take a day-ish to clear. So it would be great to have a way to stop these tasks earlier.

        • DD – Could implement a separate thread to occasionally check in with server about the status of its current task. I’ll have to give this more thought but we may be able to do something here. Responsiveness on F@H may be a different story as well. (details about possible implementations)

        • IP (chat) – I don't know if it's related to this, but just wanted to note that if you try to stop openmm when it's minimizing (or potentially in some other stages) it does not respond until it's finished with it

        • DD – … Could still have SynchronousComputeService run things in series, but …

        • IP – OpenMM minimization doesn’t respond to SIGTERM. But SIGKILL should work.

        • DD – So JS, if we have compute services periodically query the server to check that the task status hasn’t changed, would that solve your use case?

        • JS – Yes

        • DD – I’ll add as an action item.
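
The idea DD describes above (a separate thread that occasionally asks the server whether the current task is still claimed) could look roughly like the sketch below. This is an illustration only, not the alchemiscale API: `run_units_with_status_check` and `task_still_claimed` are hypothetical names, and the "server" is just a callable. Because the work runs in-process, the sketch can only stop between units; it cannot interrupt a unit that is already running, which matches IP's point about OpenMM minimization.

```python
import threading
from typing import Callable, Iterable


def run_units_with_status_check(
    units: Iterable[Callable[[], None]],
    task_still_claimed: Callable[[], bool],
    poll_interval: float = 0.05,
) -> bool:
    """Run units in series; stop between units if the task was withdrawn.

    Returns True if every unit ran, False if the run was abandoned early.
    Both arguments are hypothetical stand-ins, not alchemiscale objects.
    """
    cancelled = threading.Event()
    finished = threading.Event()

    def watch() -> None:
        # Poll the (hypothetical) server-side status until the work ends.
        while not finished.is_set():
            if not task_still_claimed():
                cancelled.set()
                return
            finished.wait(poll_interval)

    watcher = threading.Thread(target=watch, daemon=True)
    watcher.start()
    try:
        for unit in units:
            if cancelled.is_set():
                return False  # drop the claim instead of starting more work
            unit()  # a unit that is already running is never interrupted
        return True
    finally:
        finished.set()
        watcher.join()
```

Interrupting a unit mid-run would need a different design, such as running each unit in a child process that can be killed, which is closer to the "more types of compute services" direction mentioned above.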

    • compute resources status

      • DD – 1 task running now, so lots of compute available.

      • JS – I’m actually trying to stop that task.

    • current stack versions:

      • python 3.12

      • alchemiscale: 0.5.0

      • neo4j: 5.22

      • gufe: 1.0.0

        • IA – openfe / gufe releases will be after we fix the large results object issue. This may bring along the ability to do different box sizes too, if that's of interest to anyone.

      • openfe: 1.0.1

      • feflow: 0.1.0

        • DD – IP, you’re working on a 0.2 or 0.1.1?

        • IP – Yes, after this week we’ll probably have a release.

      • openmmforcefields: 0.14.1

      • openmm: 8.1.2

  • JS : feflow.NonEquilibriumCyclingProtocol - make it tolerant of `ProteinComponent`s with same topology, different coordinates between `ChemicalSystem` in `stateA` and `stateB`?

    • JS – So for any edge in the system, the protein could have a specific conf for that ligand/transformation. This would help account for variability in binding sites. DD had mentioned that there’s some checking to ensure that the protein is the same in each network/edge, but IA had said that it should be possible.

    • IA – I think I'm missing a bit of the user story. For noneqcyc, the cycling is directional, so you can’t switch the protein mid-cycle. So it seems like you’d need to do a cycle each way for the protein conformers. If that’s the case, I don’t understand… (something detailed, see recording ~26 mins)

      • JS – I think you’re getting this right, maybe I’m confounding things by using noneqcyc as the example. Could just as well use hrex

      • IP – Maybe this is enough that we just get noneqswitching implemented such that you can get from A to B with your A conf, and B to A with B conf, and infer something from there. I agree with IA that it’s a matter of doing cycles correctly…

      • JS – Two questions -

        • It sounds like we can do the base form already - we can make a protein conf associated with each ligand, and it’s an implementation detail in alchemiscale that blocks it.

          • DD – Kinda, there will be an assertion that fails

          • IA – When we talk about networks we should talk about transformations, not nodes. Technically with different protein confs this would cause disconnected networks.

          • DD – This problem could become bad - something about this picture:

          • DD – (There’s some trick? that might allow getting around this explosion problem)

          • IA – (I don’t think that trick will work)

          • IP – (this may not be scientifically sound. If the protein conf needs to change, it would change during the simulation)

          • IA – Vytautas showed in a paper that you’ll need to do forward and backward transformations, and the explosion still exists.

          • JS – But it’s the same protein, different confs

          • IA – Right, but you’d need to do each direction with each conf, you’d still have the explosion.

          • DD – But you don’t have an explosion of nodes

          • IA – Right, but you have an explosion of edges, and you’ll need to separate them.

    • DD – link to ProteinComponent checking:

      • Equality is checked, defined as having identical GUFETokenizable hash, and the hashed material includes coordinates. But scientifically I think it could make sense

    • DD – JS and I will work together to open an issue/user story on feflow about this.

      • IA – Might make sense to put the issue on the GUFE tracker rather than feflow

      • DD – I’ll start it on feflow, but we can migrate to gufe later.

      • IP – To add the reason why I added this check - When you create the network, each node is a chemical system that defines it. If the chemical system is different then it can’t be in the same node.

      • DD – Right, the ligands were already different, but now the protein would also be different.

      • IA – Part of this issue is “what a chemical system represents”. If you go from apo to holo there’s a free energy cost associated. You can pretend it’s 0 and just plow ahead, but that’s not sound.

      • … (see recording ~38 mins)

      • JS – IP, would this be hard to tack on to the switching protocol?

        • IP – Having a meeting with IA and HB this week about this, should work for your needs. I’ll update you on how this goes.

    • DD – If the protein coords vary slightly, such that their free energy is about the same, is there value to using the different coords?

      • JS – I think so, this is part of living networks. So we want to be able to feed info back in as more information is discovered during a campaign.

      • DD – Ok, I’ll take primary on writing up the user story and pass it by JS before posting.
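
To make the equality check discussed above concrete: if the hashed material includes coordinates, two conformations of the same protein produce different tokens and so count as different components. The toy class below is a minimal sketch of that behavior. It is not the gufe `ProteinComponent` or its tokenization scheme; `ToyProtein`, `token`, and `topology_token` are hypothetical names, and `topology_token` only illustrates what a looser, coordinate-free comparison might look like.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProtein:
    """Stand-in for a protein component; NOT the gufe class."""

    atoms: tuple[str, ...]
    coords: tuple[tuple[float, float, float], ...]

    def token(self) -> str:
        # Hash everything, coordinates included, as described in the notes.
        payload = json.dumps({"atoms": self.atoms, "coords": self.coords})
        return hashlib.sha256(payload.encode()).hexdigest()

    def topology_token(self) -> str:
        # Hypothetical looser comparison that ignores coordinates.
        payload = json.dumps({"atoms": self.atoms})
        return hashlib.sha256(payload.encode()).hexdigest()


# Same atoms, one coordinate nudged by 0.1 (made-up values).
conf_a = ToyProtein(("N", "CA", "C"), ((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)))
conf_b = ToyProtein(("N", "CA", "C"), ((0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.5, 0.0)))

print(conf_a.token() == conf_b.token())                    # False
print(conf_a.topology_token() == conf_b.topology_token())  # True
```

Whether a coordinate-free comparison is scientifically acceptable is exactly the open question raised by IA and IP above; the sketch only shows why the current check rejects differing conformations.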

  • IA: Status of zero value ddGs from JS

  • IA: Gathering performance data (low priority)

    • We’re getting reports from industry partners running benchmarks that they’re getting wildly different results/performance. Was wondering if there’s some way to gather this data via alchemiscale where we could see this in aggregate, especially (sim time per wall time).

    • DD – I think we may have this in provenance (records start time, end time, system resources, etc.). Triaged for next major release (0.6). Not sure that I’ll get to it soon though.

    • IA – Gotcha, not a huge priority right now. The problem we’re seeing is that I can run on my local GPU and get good ns/day, and a partner will run on an equivalent GPU and get bad ns/day.

    • JS – Did you check that they were using ELF10 charges?

      • IA – They’re pre-charged

    • IP – Did you check real-time yaml file that the sampler outputs?

      • IA – Yes, that’s what they’re reporting. I suspect we’re getting similar issues to what Ivy Zhang saw.

    • DD – Is this with hrex? Is there some stochasticity with these workflows?

      • IA – Maybe, need to dig into it. Could be that consumer cards vs enterprise ones are overclocked differently. Or could be file system slowness (since hrex does a lot of file I/O).

      • IP – I can help check whether it was the IZ issue, since I worked on that.

    • DD – IA, could you chime in on alchemiscale #106 with this info, and any specific info that seems important about the host?

      • IA – Yes, can do
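
As a rough illustration of the "sim time per wall time" metric discussed above, the sketch below converts a start time, end time, and simulated length into ns/day and takes a per-host median. The record layout (host label, simulated ns, start, end) is an assumption made for this example; it is not the alchemiscale provenance schema, and `ns_per_day` and `median_by_host` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def ns_per_day(simulated_ns: float, start: datetime, end: datetime) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    wall_seconds = (end - start).total_seconds()
    if wall_seconds <= 0:
        raise ValueError("end must be later than start")
    return simulated_ns * 86400.0 / wall_seconds


def median_by_host(records):
    """Median ns/day per host from (host, simulated_ns, start, end) tuples."""
    grouped = defaultdict(list)
    for host, simulated_ns, start, end in records:
        grouped[host].append(ns_per_day(simulated_ns, start, end))
    return {host: median(values) for host, values in grouped.items()}


# 5 ns simulated in 2 hours of wall time (made-up numbers).
start = datetime(2024, 9, 10, 8, 0)
end = datetime(2024, 9, 10, 10, 0)
print(ns_per_day(5.0, start, end))  # 60.0
```

Grouping by host is one way to surface the "same GPU, different ns/day" pattern IA describes; which host details are worth recording is left to the discussion on alchemiscale #106.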



Action items

@David Dotson will articulate approach for SynchronousComputeService to drop Task claim when Task no longer actioned on any AlchemicalNetwork
@David Dotson will create user story on feflow for how to handle ChemicalSystems with ProteinComponents with (slightly) differing conformations spanned by a Transformation, for either NonEquilibriumCycling or NonEquilibriumSwitching
@Irfan Alibay will add host information needs for performance troubleshooting to alchemiscale#106

Decisions