2024-02-13 alchemiscale : user group meeting notes

Participants

  • @Mike Henry

  • Meghan Osato

  • Ian Kenney

  • Jenke Scheen

  • Hannah Baumann

  • @Joshua Horton

  • @Jeffrey Wagner

  • @David Dotson

  • @Matt Thompson

Goals

  • alchemiscale.org

    • user questions / issues / feature requests

    • compute resources status

    • current stack versions:

      • alchemiscale: 0.3.0

      • gufe: 0.9.5

      • openfe: 0.14.0

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • MH – We’re releasing a 1.0rc for Gufe and OpenFE soon, likely this week. Want to coordinate the rollout.

  • MH – I’ve been spinning up a lot of jobs on IRIS. Hopefully this can help clear the backlog.

Discussion topics

Notes

  • alchemiscale.org

    • user questions / issues / feature requests

      • JS – We (ASAP alchemy) have been trying to build a few new features, in particular one where we STOP a network: if most of a network is done but some runs are still going, we want to be able to kill those. We can do that by setting their status to “deleted”, but then we can’t easily rerun them later if we change our minds. Alternatively we could set them to “error”; I tried that, and it seems like it should be possible, but I get a 505 error. I’ve opened an issue on the alchemiscale GitHub. So it may be helpful to have a new status category like “cancelled”.

        • DD – Re: “error” status, that’s reserved for real errors, so only a compute service should ever set it. For “cancelled”, we do have a concept for that, but it’s orthogonal to status: “cancelled” is the opposite of “actioned”, and it’s recorded in a field other than “status”. There’s a dedicated method for it.

          • JS – Will cancelling a task stop it if it’s actively running?

          • DD – Let me check… (looks at source code) Tasks that are cancelled with this method are simply no longer actioned. Since one task can be affiliated with several networks, it doesn’t make sense to cancel it without saying which network you mean, so cancelling a task is network-specific. Later, you can look at the network and use methods like get_actioned_tasks to piece together which tasks aren’t actioned, if you want to restart them (see the sketch below).
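
          • A rough sketch of this cancel / re-action flow with the alchemiscale Python client, using placeholder credentials and the method names mentioned here; treat the actioned-tasks accessor name and exact signatures as assumptions to check against the deployed release (0.3.0):

```python
# Sketch only: cancel the still-live Tasks on one network, then re-action them later.
from alchemiscale import AlchemiscaleClient, Scope

asc = AlchemiscaleClient("https://api.alchemiscale.org",      # placeholder URL
                         "my-user-id", "my-user-key")         # placeholder credentials

scope = Scope(org="my_org", campaign="my_campaign", project="my_project")  # placeholder scope
network_sk = asc.query_networks(scope=scope)[0]               # the network to stop

# Cancelling is network-specific: it un-actions these Tasks on this network's
# task hub; it does not change their status (unlike setting "deleted"/"invalid").
tasks = asc.get_network_tasks(network_sk)
asc.cancel_tasks(tasks, network_sk)

# Later, re-action whichever of those Tasks are no longer actioned on this
# network to run them again (accessor name assumed from "get_actioned_tasks").
actioned = asc.get_network_actioned_tasks(network_sk)
asc.action_tasks([t for t in tasks if t not in actioned], network_sk)
```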

        • (see recording @17 mins) JS – How do I check running tasks across all networks? How do I query for just the networks that have running/waiting tasks?

          • DD – One way to do this is to use the query_tasks method, filling in your scope and specifying the desired status. Then you could figure out the networks using get_task_networks on the tasks that come back (see the sketch at the end of this thread). But that involves a bit of back-and-forth with the server. What’s your end goal with this info?

          • JS – We want a way to prioritize newly submitted networks over the current ones. I don’t want to have to figure out where a network should sit relative to the other task weights; I just want to give it either the highest or the lowest weight. But I only care about the jobs that are actually running/waiting, since the ones that are complete/errored won’t be competing for compute.

          • DD – A few ways I could do this… Could you write this up as an issue, including what you want to achieve, and then we can iterate on the cleanest solution that will also support related use cases.

          • … (see recording ~23 mins)
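
          • A rough sketch of the query_tasks + get_task_networks approach DD describes, with placeholder credentials/scope; the keyword arguments and string status values are assumptions to verify against the client docs:

```python
# Sketch only: find the networks that still have live (running/waiting) Tasks.
from collections import defaultdict

from alchemiscale import AlchemiscaleClient, Scope

asc = AlchemiscaleClient("https://api.alchemiscale.org",      # placeholder URL
                         "my-user-id", "my-user-key")         # placeholder credentials
scope = Scope(org="my_org", campaign="my_campaign", project="my_project")  # placeholder scope

# Complete/errored Tasks are ignored since they no longer compete for compute.
live_tasks_by_network = defaultdict(list)
for status in ("running", "waiting"):
    for task_sk in asc.query_tasks(scope=scope, status=status):
        # one Task can be associated with several networks
        for network_sk in asc.get_task_networks(task_sk):
            live_tasks_by_network[str(network_sk)].append(task_sk)  # keyed by ScopedKey string

# `live_tasks_by_network` now maps each network with live Tasks to those Tasks:
# the set worth weighing a newly submitted network against when prioritizing.
```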

    • compute resources status

      • We were idling earlier today, but an OpenFE submission just landed. Also I saw some jobs going through on Lilac and IRIS.

    • current stack versions:

      • alchemiscale: 0.3.0

      • gufe: 0.9.5

      • openfe: 0.14.0

        • DD – I haven’t gotten around to testing 0.15, which I said I’d do.

        • MH – May be better to migrate straight to 1.0, so there’s only one round of breaking changes.

        • HB – Agree. And once 1.0 is out and it’s on alchemiscale, we’re planning to run a lot of jobs to benchmark.

          • DD – Has throughput been sufficient?

          • HB – Over the Christmas holidays throughput was slow, but I think this was because lots of folks were submitting long-running jobs for the holidays.

          • DD – Looks like we’ve completed 91k tasks since we started in April.

        • DD – Can I deprioritize 0.15 and just wait for 1.0?

          • MH – Yes.

          • HB – Yes, 1.0 is expected in the next two weeks.

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • MH – We’re releasing a 1.0rc for Gufe and OpenFE soon, likely this week. Want to coordinate the rollout.

    • (covered above)

  • MH – I’ve been spinning up a lot of jobs on IRIS. Hopefully this can help clear the backlog.

    • DD – Actually, I’m seeing that we’ve cleared the backlog.

    • MH – Just submitted more jobs.

    • JW – What is IRIS?

    • MH – New cluster at MSKCC. I seem to be the only user.



Action items

Decisions