2024-01-16 alchemiscale : user group meeting notes

Participants

  • @David Dotson

  • Ian Kenney

  • @Irfan Alibay

  • Jenke Scheen

  • @Joshua Horton

  • Meghan Osato

  • @Jeffrey Wagner

Recording: https://drive.google.com/file/d/1pMazARSRNkTbK4MuxsfG43Buaq-Rmgjc/view?usp=sharing

Goals

Discussion topics

Notes

  • alchemiscale.org

    • user questions / issues / feature requests

      • JS – What data can we gather / start gathering from simulations on alchemiscale? Just keeping deltaG values from these thousands of simulations seems to leave a lot out. But sharing data risks leaking structures.

      • DD – The ASAP data is restricted to ASAP personnel only, so there shouldn’t be a data leaking problem. In principle we could store simulation trajectories, but I’m not aware of people wanting to download trajectories (at least not enough to be worth the storage/bandwidth cost).

      • JS – Might you be interested in keeping data to look at things like variability in deltaG?

      • DD – So the thought is that these simulation trajectories/deeper details could become open data? I kinda have visibility into some of the details of calcs for debugging, but you’re thinking that there could be an opportunity to publicly share this data and do some sorts of research on it.

      • IA – This task is on the roadmap for OpenFE in the next year. So there may be a GUFE-able solution here.

      • DD – Putting this in the GUFE scope would make sense. Alchemiscale has no concept of many of the details that people would be interested in pulling out. So if this could be implemented in GUFE that would make sense and would keep the relevant details in the same place. JS, could you flesh out a few more use cases here?

      • JS – Imagine this is an ML thing - In a few years we may have a grad student who wants to do a research project - it’d be great to have this database available. So they could look at FE preds, or they could look at the variance between the replicates. Or looking at the feature space, like the chemistry of the ligand or the binding site.

      • DD – The good news is that all of that exists server-side. The starting confs + chemistry data is all there, so you can pull that data for any network you can see. As for whether we’re storing every conceivable feature, we’re not, so we can’t guarantee that every conceivable grad student project is possible now.

      • IK – You can store GUFETokenizables, so the details there would be available. So those have a pretty small footprint at the moment.

      • DD – So storing GUFE info would pave the way for a lot of possible future data mining. We could also make a “public” scope that you could tag your data into and that uncredentialed users could access.

      • JS – That could be a good idea. Would attract public interest.

      • IA – OpenFE+friends had been thinking about this. How do we keep from duplicating effort? This is a common discussion, since we might do studies that we want to share with OpenFF. So for us and OpenFF, I think we want all data to be public.

      • DD – Yeah, would be good to figure out how to extract data from alchemiscale and into something like Zenodo.

      • JW –

        • Interested in everything being public

          • IA – If I generate data and Meghan wants to access it, since we have separate OpenFF/FE scopes, it’s kinda painful right now. So public data would be great.

          • DD – With the right access, MO could copy an OpenFE network into an OpenFF scope, and the memory footprint wouldn’t be bad since it’s just a copy of pointers.

          • IA – A lot of what I run is basically stuff that gets sent over to OpenFF. So maybe we want an OMSF scope, since otherwise I might get billed for running stuff like NAGL benchmarks.

          • DD – Right now I think a lot of this comes down to favor trading, since OpenFE is running on what is traditionally OpenFF compute. It can be hard to balance, though.

          • IA – Longer term, instead of having separate scopes, it might be easier to have a single OMSF scope.

          • DD – This would be easy for me. I’ll give access to OMSF/*/* to all current OpenFE and OpenFF users

          • JW – That sounds good.

        • Bandwidth costs?

          • DD – One thing we could do there is rate-limit access.

          • JW – That has caused some frustration for BPritchard, re: rate-limiting people and then getting messaged on a Saturday.

          • DD – Maybe Zenodo-for-all, since they have an existing CDN. I’ll make some action items around this.

    • compute resources status

      • running with 250 - 350 compute services across Lilac and NRP

      • completed ~26,000 FEC Tasks since 2023.12.21 (less than 4 weeks ago)

      • completed 77,000 FEC Tasks in total since starting operations in April 2023

    • current stack versions:

      • alchemiscale: 0.2.1

      • gufe: 0.9.5

      • openfe: 0.14.0

        • DD – Holding off on 0.15 since it seemed like there were substantial changes/it might interfere with HBaumann’s work

        • IA – Important for alchemiscale: there was an array that we were storing that wasn’t necessary and may have been adding hundreds of MB; that should be fixed in 0.15. There is also some structure analysis stuff. The main thing would be charge correction: you can now do net charge transformations in networks. I’ll ask HBaumann and will let you know whether we want to update.

        • DD – Maybe I was confused with breaking changes in 0.14.

        • IA – That seems likely.

        • DD – … (33 minutes in recording)

        • IA – The protocolresult class has guesses for this.

        • DD – So, can get estimates+reverse energy analysis out of result objects

        • (DD does live demo at 35 mins)

        • IA –

          • https://github.com/OpenFreeEnergy/openfe/blob/main/openfe/analysis/plotting.py

          • The plotting methods live in the module linked above, so if you pass your selected results through to them you can get your plot.

          • For this protocol, we’d like to keep adding things and not removing them. However, at runtime there are cases where you may not be able to get everything. For example, in the back-and-forth analysis there’s sometimes trouble with pymbar if there aren’t enough data points. The other limit on data availability is size: some data is too large, so we’ve had to pull some functionality from working on remote data, but it would still work on local data/netcdfs.

          • DD – Something I’d like us to do alchemiscale-side is to store protocoldagresults as compressed artifacts at rest. Right now we’re not compressing them in the object store, only in transit. What we’ll instead do is compress them when they’re created, and they’ll stay compressed until a user pulls them to their own machine. But I’m not sure how well large numerical arrays compress.

        • JS – Will this data always be available?

          • DD – Some are protocol-specific. Some of these don’t make sense for other sorts of protocols.

          • JW – Another facet of JS’s question is data retention. Is there a plan to move old datasets off hot storage after some time?

          • DD – Data footprint is currently small so we don’t really have a need to move data to cold storage.

          • JW – Some day we may need to limit storage if we store trajectories or we lose JC’s credit card. Would be good to have a data retention policy in place in a year.

          • DD – I’ll add to my to-do list.

          • JS – Even though no one was hit by a bus during the last few years, so much FE data generated on FAH during Covid has just been lost in the chaos.

        • IA – One thing that OpenFE may do is, since we’re starting to put together experimental protocols for folks, we may have a repo with rapid deployments of new workflows that we want people to use. What could we do to make this compatible with alchemiscale?

          • DD – I’ve been looking more into release automation. Having a high-velocity release repo would be a good motivation to do this. So if you start this repo let me know and I’ll work with you on rapid deployments.

          • IA – I don’t see us doing this in the immediate future, but I’ll bring it back up once we start moving toward it.

      • perses: protocol-neqcyc

      • openmmforcefields: 0.12.0

  • DD : impending release 0.3.0 features:

    • the ability to set and get Task priority, set and get AlchemicalNetwork weight relative to others, and set and get actioned Task weights. These give users several levers for getting results of greater interest more quickly. Big thanks to Ian Kenney for working on these pieces in detail.

    • vast improvement to AlchemicalNetwork submission and to AlchemicalNetwork, Transformation, and ChemicalSystem retrieval through smarter serialization via use of keyed dicts, thanks to work by Ian Kenney and myself to solve alchemiscale#216 (`AlchemiscaleClient.create_network` scales poorly with increasing `AlchemicalNetwork` size)

      • users should see very fast submission times compared to previously, even on relatively slow internet connections

      • "large" networks (>1000 chemical systems) should be ingestible without issue
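
Editor's sketch (not from the meeting): the keyed-dict serialization mentioned above can be illustrated with a toy example. The idea, as far as these notes describe it, is that objects shared across a network (for example a ChemicalSystem used by several Transformations) are stored once and referenced by key rather than nested repeatedly. The helper names `content_key`, `to_keyed`, and `from_keyed`, the `":key:"` marker, and the toy network layout are all hypothetical; this is not the gufe or alchemiscale implementation.

```python
import hashlib
import json


def content_key(payload: dict) -> str:
    """Derive a stable key from a payload's JSON form (hypothetical scheme)."""
    blob = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]


def to_keyed(obj, registry: dict):
    """Replace every nested dict with a {":key:": ...} reference.

    Each distinct dict is stored exactly once in `registry`.
    """
    if isinstance(obj, dict):
        flat = {k: to_keyed(v, registry) for k, v in obj.items()}
        key = content_key(flat)
        registry[key] = flat
        return {":key:": key}
    if isinstance(obj, list):
        return [to_keyed(v, registry) for v in obj]
    return obj


def from_keyed(ref, registry: dict):
    """Rebuild the nested structure from references plus the registry."""
    if isinstance(ref, dict) and set(ref) == {":key:"}:
        stored = registry[ref[":key:"]]
        return {k: from_keyed(v, registry) for k, v in stored.items()}
    if isinstance(ref, list):
        return [from_keyed(v, registry) for v in ref]
    return ref


def make_network(n_edges: int) -> dict:
    """Toy linear network: neighbouring edges share one chemical system."""
    systems = [
        {"name": f"ligand_{i}", "atoms": list(range(200))}
        for i in range(n_edges + 1)
    ]
    edges = [
        {"stateA": systems[i], "stateB": systems[i + 1]} for i in range(n_edges)
    ]
    return {"name": "toy_network", "edges": edges}


network = make_network(50)
registry = {}
root = to_keyed(network, registry)

nested_size = len(json.dumps(network))   # shared systems are repeated
keyed_size = len(json.dumps(registry))   # each system is stored once
print(f"nested: {nested_size} chars, keyed: {keyed_size} chars")
```

In this toy case most systems appear in two edges, so the nested form carries them twice while the keyed form carries them once. Real networks with higher connectivity would presumably see a larger gap.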

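Editor's sketch (not from the meeting): one way the three levers listed under the 0.3.0 features (network weight, Task priority, actioned Task weight) could combine when a compute service asks for work. The selection order, the convention that a lower priority number is served first, and every name here (`Task`, `Network`, `claim_next`) are assumptions for illustration only; the notes do not describe alchemiscale's actual claiming logic.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    priority: int = 10   # assumption: lower number is served first
    weight: float = 0.5  # assumption: must be > 0; breaks ties within a tier


@dataclass
class Network:
    name: str
    weight: float        # relative to other networks
    tasks: list = field(default_factory=list)


def claim_next(networks, rng):
    """Pick a network by weight, then a task by priority, then by task weight."""
    candidates = [n for n in networks if n.tasks and n.weight > 0]
    if not candidates:
        return None
    # 1. weighted random choice between networks
    network = rng.choices(
        candidates, weights=[n.weight for n in candidates], k=1
    )[0]
    # 2. restrict to the best priority tier present in that network
    top = min(t.priority for t in network.tasks)
    tier = [t for t in network.tasks if t.priority == top]
    # 3. weighted random choice within the tier
    task = rng.choices(tier, weights=[t.weight for t in tier], k=1)[0]
    network.tasks.remove(task)
    return task


rng = random.Random(0)
urgent = Network(
    "urgent",
    weight=1.0,
    tasks=[Task("c", priority=5), Task("a", priority=1), Task("b", priority=3)],
)
paused = Network("paused", weight=0.0, tasks=[Task("z", priority=1)])

claimed = []
while True:
    task = claim_next([urgent, paused], rng)
    if task is None:
        break
    claimed.append(task.name)
print(claimed)
```

Under these assumptions, setting a network's weight to zero parks it without deleting anything, and raising one Task's priority pulls it ahead of everything else in the same network.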

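Editor's sketch (not from the meeting): a quick way to probe the open question raised in the compression discussion, namely how well large numerical arrays compress. It packs three synthetic float64 arrays and compares their zlib compression ratios. zlib from the standard library stands in for whatever codec alchemiscale ends up using, and the synthetic arrays are assumptions; real protocol results should be measured directly.

```python
import random
import struct
import zlib


def compression_ratio(values):
    """Return raw_size / compressed_size for a list of floats packed as float64."""
    raw = struct.pack(f"<{len(values)}d", *values)
    packed = zlib.compress(raw, 6)
    # sanity check: the round trip must be lossless
    assert zlib.decompress(packed) == raw
    return len(raw) / len(packed)


rng = random.Random(0)
n = 100_000

noisy = [rng.gauss(0.0, 1.0) for _ in range(n)]   # full-precision noise
rounded = [round(v, 2) for v in noisy]            # few distinct values
constant = [0.0] * n                              # best case

ratio_noisy = compression_ratio(noisy)
ratio_rounded = compression_ratio(rounded)
ratio_constant = compression_ratio(constant)

for label, ratio in [
    ("noisy", ratio_noisy),
    ("rounded", ratio_rounded),
    ("constant", ratio_constant),
]:
    print(f"{label:>8}: {ratio:.2f}x")
```

The general expectation is that full-precision floating-point noise barely compresses with a general-purpose codec, while arrays with repeated or low-precision values compress well, so the benefit for stored results will depend on what the arrays actually contain.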

Action items

@David Dotson articulate issue for producing archival extracts from AlchemicalNetworks and results
@David Dotson articulate issue for creating publicly-readable Scopes that don’t require authentication
@David Dotson add OpenFE + OpenFF users to omsf-*-*
@David Dotson will begin process of drafting a data retention policy for alchemiscale.org

Decisions