2024-07-09 alchemiscale : dev group meeting notes

Participants

Goals

DD : alchemiscale roadmap
- ~~Q1 : complete “living networks” performance improvements~~
- Q1 : Folding@Home compute services deployed in production
  - ~~finish MVP, with integration test suite by 2024.03 2024.06~~
  - perform FAH tests with volunteers during ~~2024.04~~ ~~2024.06~~ 2024.07
    - public work server up by ~~2024.03.15~~ ~~2024.06.11~~ 2024.07.19
    - confidential work server up by ~~2024.04.01~~ ~~2024.07.01~~ ~~2024.07.31~~ 2024.08.16
- Q2 Q3 : develop Strategy structure, initial implementations
  - aiming to begin design ~~6/26~~ ~~7/10~~ 7/24 sprint, followed by MVP development during ~~July~~ August
- Q3 : enable automated Strategy execution by ~~end of Q3, 2024 (2024.10.01)~~ mid Q4, 2024 (2024.11.15)
IP: feflow needs
- DD: are there any real blockers for a 0.1.0 release? Can many issues in the milestone be resolved over future releases pre-1.0?
alchemiscale development : new sprint spanning 7/10 - 7/22
- aim is to complete 0.5.0, deploy to alchemiscale.org, including openfe + gufe 1.0:
- architecture overview : https://drive.google.com/file/d/1ZA-zuqrhKSlYBEiAIqxwNaHXvgJdlOkT/view?usp=share_link
- coordination board : alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors

Discussion topics

Notes

DD : alchemiscale roadmap
- ~~Q1 : complete “living networks” performance improvements~~
  - DD – Completed
- Q1 : Folding@Home compute services deployed in production
  - ~~finish MVP, with integration test suite by 2024.03 2024.06~~
  - perform FAH tests with volunteers during ~~2024.04~~ ~~2024.06~~ 2024.07
    - public work server up by ~~2024.03.15~~ ~~2024.06.11~~ 2024.07.19
    - confidential work server up by ~~2024.04.01~~ ~~2024.07.01~~ ~~2024.07.31~~ 2024.08.16
  - DD – Initial testing surfaced some issues I hadn’t anticipated: https://github.com/openforcefield/alchemiscale-fah/pull/7 . In interfacing with the F@H PRCG system, we decided to treat each Task as a clone, but it now looks like each ProtocolUnit should be a clone.
    - JC – Clarify? Each run/clone can have a different system and settings
    - DD – The idea is that a Transformation corresponds to a run. Any given Transformation would have an arbitrary number of Tasks, which map to clones. With our current ProtocolDAGs there’s a single simulation unit, which maps to clones. But in GUFE a ProtocolDAG can have multiple simulation units. So it’s not a fundamental problem, but it’s something I realized today needs to be changed.
    - JC – The concern is that if one Transformation maps to one run, we’ll go through those quickly across projects and run out of IDs, since we’re limited to 65k (16-bit int?). But if you use runs x clones you could have 65k ^ 2, which should be plenty. I believe F@H is limited to 65k runs (?) in its lifetime.
    - DD – Can we proceed on the current tack?
    - JC – Yes, this should be OK in the short term, but talk with Joseph about the longer term.
    - DD – Ok. For a given transformation, you might have…
    - JC – Yeah, I’m just saying in the long run you might be best with each sim looking up the next available ID to avoid running out of IDs.
    - DD – Ok, I’ll implement this the way I described for now, and I’ll watch for whether we start running into ID exhaustion.
    - JC – You should follow up with Joseph before it becomes a problem.
    - DD – Ok, will do.
    - JC – I believe project, clone, run, gen are each 16-bit ints. Hopefully unsigned!
    - …
    - DD – For testing both private and public work servers, I have an easy test case (maybe tyk2) and a harder one (I have some in mind; something to shake out issues with size).
    - JC – Would it be better to run with other items from the OpenFF benchmark set? They’re better behaved and smaller.
    - IP – I support running the OpenFE/FF PLB set. Lots of chatter around failure cases at conferences with these so there should be plenty. I’m working on documentation for how to prepare/submit these.
    - JC – We’ll need two types of pipeline - One feeding in PLB systems, and another coming in from alchemiscale.
    - DD – If it works using the bare OpenFE tools, then it is probably a sufficient test for proceeding to production. For the ASAP tooling, we’ll need to add in options for noneq cycling. So I understand what you mean JC, but it should be sufficient if the OpenFE tooling is working.
    - JC – Ok, good to remember that the PLB already has charges and edges defined, which other targets won’t. So I don’t know whose responsibility it is to ensure that less prepared stuff can be ingested.
    - DD – Ok, I’ll work with JS to figure this out. I’ll be a little throughput-constrained in testing since I’ll need to use volunteer compute from the F@H Slack.
    - JC – You can run F@H workers on lilac as well. There are multiple ways to spin this up. I have a script I’ll share with you that submits jobs to lilac where each job has an F@H ID.
    - DD – That’d be great, then I’d have visibility directly on both ends of the pipe. How’d you install the client?
    - JC – There’s a Linux client that’s been floating around for a few years. I think it’s a Debian installer/tarball that you can extract the executable from.
    - DD – This would be great, anything here would help a lot.
    - JC – Great, I’ll put this info up on Slack.
- Q2 Q3 : develop Strategy structure, initial implementations
  - aiming to begin design ~~6/26~~ ~~7/10~~ 7/24 sprint, followed by MVP development during ~~July~~ August
  - DD – Pushed this back since it’s a lower priority than F@H
  - JW – I’m glad that you’re prioritizing F@H. Our ad board has been seeing this slip, so it’ll be good to wrap it up.
  - JC – Kendall is working on things adjacent to this as part of his PhD. He’s looking at alternative strategies.
  - DD – Can I use Kendall as a resource to develop strategies?
  - JC – Yes, it’s a major component of his thesis. Jenke is helping guide him too.
  - (stuff about strategy implementation, see recording ~25-28 minutes)
- Q3 : enable automated Strategy execution by ~~end of Q3, 2024 (2024.10.01)~~ mid Q4, 2024 (2024.11.15)
IP: feflow needs
- IP – I saw your comments on some PRs - answered several and merged one. We met with OpenFE, now I’m working on feflow #49 on charge handling. I need to write a few test cases that handle all the partial charge methods and options. We’re doing it a little differently than the OpenFE protocol. But once that’s done we can merge this and it’ll close 3 or 4 issues on the milestone.
- IP – The CI is failing because we dropped support for GAFF due to an ambertools/openmm incompatibility. This is hard to separate cleanly since we inherited so much from perses.
  - JC – Which part specifically is causing the problem? Is it using GAFF through systemgenerator?
  - IP – I think so. There’s an explicit block in OMMFFs.
  - JC – Is there a fix in for this?
  - MH – It’s in the pipe. You can get things working by installing openmmforcefields 0.12. There’s an openmm release coming that will resolve this.
  - JW – My understanding is that the OpenMM 8.1.2 release will unblock this.
  - JC – It’s not a hard blocker, we’re making a workaround.
  - … (fast discussion, see recording 33-40 mins)…
  - DD – Is feflow good for release after this?
  - IP – I want IA’s signoff. I’m meeting with him Monday at 11 AM US Eastern.
  - DD – I’ll be there.
  - IP – In the meantime, you can use feflow main, right?
  - DD – Yes
- DD: are there any real blockers for a 0.1.0 release? Can many issues in the milestone be resolved over future releases pre-1.0?
alchemiscale development : new sprint spanning 7/10 - 7/22
- aim is to complete 0.5.0, deploy to alchemiscale.org, including openfe + gufe 1.0:
- architecture overview : https://drive.google.com/file/d/1ZA-zuqrhKSlYBEiAIqxwNaHXvgJdlOkT/view?usp=share_link
- coordination board : alchemiscale : Phase 3 - Folding@Home, new features, optimizations, targeted refactors
- (complex technical discussion about task extension, see recording ~42-55 mins)
- DD – Also in progress is alchemiscale 277 - user settable restart policies. IK is working on this.
- IP – Possibly related: I’m seeing a lot of tests failing for the ASAP workflow. I asked which tasks these were, since I’m concerned they’re an issue with hardware rather than data/software.
- DD – Currently HMO is handling project support for ASAP.

Action items

David Dotson will confirm with Joseph Coffland that the hard limits for PRCG are 16-bit integers; it may then make sense to increment up through RCs based on ProtocolUnits instead of using Transformations for RUNs
David Dotson will follow up with Hugo on what kinds of random errors he is seeing, and assess whether this is a cluster issue
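
A rough illustration of the ID arithmetic behind the PRCG action item, added for reference and not taken from the meeting itself. It assumes RUN and CLONE are each unsigned 16-bit integers, which is still to be confirmed with Joseph Coffland. Under that assumption, one Transformation per RUN allows 65,536 IDs, while using every (RUN, CLONE) pair allows 65,536 ^ 2. The Python sketch below shows that arithmetic and one hypothetical way to map a flat counter onto (RUN, CLONE) pairs. All names here are illustrative; none are part of alchemiscale-fah or the F@H work server.

```python
from typing import Tuple

# Assumption (unconfirmed): RUN and CLONE are each unsigned 16-bit integers.
ID_BITS = 16
IDS_PER_FIELD = 2 ** ID_BITS  # valid values are 0 .. 65535


def capacity_runs_only() -> int:
    """Number of IDs available if each Transformation consumes one RUN."""
    return IDS_PER_FIELD


def capacity_runs_times_clones() -> int:
    """Number of IDs available if every (RUN, CLONE) pair is usable."""
    return IDS_PER_FIELD * IDS_PER_FIELD


def slot_to_run_clone(slot: int) -> Tuple[int, int]:
    """Map a flat, zero-based counter to a (RUN, CLONE) pair.

    Hypothetical scheme: fill all CLONEs of RUN 0, then RUN 1, and so on.
    """
    if not 0 <= slot < capacity_runs_times_clones():
        raise ValueError("slot is outside the RUN x CLONE ID space")
    return divmod(slot, IDS_PER_FIELD)


def run_clone_to_slot(run: int, clone: int) -> int:
    """Inverse of slot_to_run_clone."""
    if not (0 <= run < IDS_PER_FIELD and 0 <= clone < IDS_PER_FIELD):
        raise ValueError("RUN and CLONE must each fit in the ID field")
    return run * IDS_PER_FIELD + clone


if __name__ == "__main__":
    print(capacity_runs_only())          # 65536
    print(capacity_runs_times_clones())  # 4294967296
    print(slot_to_run_clone(65536))      # (1, 0)
```

If the fields turn out to be signed, setting `ID_BITS = 15` models the non-negative half of each range.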
