2024-08-28 Meeting notes

 Date

Aug 28, 2024

 Participants

  • @Jeffrey Wagner

  • @Brent Westbrook

  • @Lily Wang

  • @Alexandra McIsaac

 Discussion topics

General updates

  • MT – Met with CC earlier this week; he gave a tour of the proteinbenchmark repo on GitHub and walked me through a lot of the code. I initially approached it from the perspective of how I could help integrate it with YAMMBS, but it became evident that this wasn’t a good path forward. His code currently “just works”, and YAMMBS currently “just works” as well, and the two are so structurally different (e.g. protein benchmarking requires trajectory storage and heavier analysis) that it wouldn’t make sense to sync them together. I came out of the meeting thinking this isn’t worth pursuing. I understand this may make one-click benchmarking a bit more manual, but from my perspective it’s right to push this off to the future. Another incongruity: unless we want to go into hundreds of thousands of molecules, small-molecule benchmarking will be far lower-resource (a few CPU-days) than a protein benchmarking run (many GPU-days).

    • JW – Agree with not coupling them right now. The question of how to network/leverage lots of GPUs is way more complex than having everything runnable behind one wrapper.

BW yammbs-dataset-submission demo

  • Slides:

  • MT – Could you elaborate on what error cycling means here?

    • BW – Like, if a user has a typo in their input FF or dataset.

    • JW – Important to distinguish that QCA-dataset-submission error cycling has to do with handling our one-way interaction with an outside server, whereas this will all take place using GitHub Actions.

    • MT – That makes sense. Do note that there’s no expectation of caching/reuse of materials from a restart. I’m broadly curious about what sorts of failures we’ll encounter. (A sketch of the kind of pre-flight input check BW describes follows this item.)
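
A minimal sketch of that kind of pre-flight check, assuming a submission that names a force field and a local dataset file; the function and the input values below are hypothetical, not the repo’s actual schema:

```python
# Hypothetical pre-flight validation: catch a typo'd force field name or
# dataset path before any CI compute is spent. The input layout is assumed.
from pathlib import Path

from openff.toolkit import ForceField


def validate_inputs(force_field: str, dataset_path: str) -> None:
    """Fail fast if either input can't be resolved."""
    # Raises immediately on a typo like "openff-2.1.O.offxml"
    ForceField(force_field)

    if not Path(dataset_path).is_file():
        raise FileNotFoundError(f"No such dataset: {dataset_path}")


if __name__ == "__main__":
    validate_inputs("openff-2.1.0.offxml", "datasets/industry.json")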

  • LW – Why no live fetching of datasets / why the restriction to existing datasets?

    • BW – It saves time since we don’t have to fetch datasets repeatedly; better for both provenance and runtime.


  • JW – I can share the info@openff Zenodo credentials; that will help with review.

  • MT – What gets uploaded to Zenodo?

    • BW – CSVs go into the repo; the SQLite database (zipped) goes to Zenodo. Some other files as well. (See the upload sketch after this item.)

    • LW – Strongly in favor of uploading everything associated with the run to Zenodo, as separate files.

    • MT – The Python script that runs everything?

      • BW – Oh, right, we should upload that too, along with the repo version.

    • MT – It could be good to put the conda env everywhere we can as well.

      • BW – Good idea too.

    • LW – When we’re talking about provenance, are we talking about the Zenodo record or the repo itself? I think MT was saying that we should put the full conda env in BOTH Zenodo and GH.

    • JW – I think I see two axioms here:

      • Don’t assume that people who find the Zenodo record will have access to GH

      • Duplication as a feature: everything added to GH should go on Zenodo, but not everything on Zenodo gets added to GH (because of size)
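
Since Zenodo uploads come up several times above, here is a hedged sketch of what that step could look like against Zenodo’s REST API. The artifact file names and the ZENODO_TOKEN environment variable are placeholders, and the deposition is left as an unpublished draft:

```python
# Hedged sketch: zip the SQLite results and push artifacts to a new
# Zenodo deposition via the documented REST API. File names and the
# token variable are assumptions, not the repo's actual layout.
import os
import zipfile
from pathlib import Path

import requests

ZENODO = "https://zenodo.org/api"
token = os.environ["ZENODO_TOKEN"]  # e.g. the shared info@openff credentials

# Zip the results database before upload, as discussed
with zipfile.ZipFile("results.sqlite.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.sqlite")

# Create an empty draft deposition and grab its file bucket
resp = requests.post(
    f"{ZENODO}/deposit/depositions", params={"access_token": token}, json={}
)
resp.raise_for_status()
bucket = resp.json()["links"]["bucket"]

# Upload every artifact as a separate file, per LW's suggestion:
# the zipped database, the runner script, and the conda env export
for path in ["results.sqlite.zip", "main.py", "env.yaml"]:
    with open(path, "rb") as fp:
        requests.put(
            f"{bucket}/{Path(path).name}", data=fp, params={"access_token": token}
        ).raise_for_status()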

  • MT – Want to check where the runner script goes; I suggest copying the current version of it to Zenodo. Also, I’m concerned about enabling people to run different flavors of experiment, and the number of knobs to turn will only go up.

    • BW – Yes, could upload the current version of main.py and other important files.

  • MT – Want reviews on PR 2?

    • BW – That’d be great.

  • MT – Pricing for GH runners is 5–10x that of other providers, so if this gets expensive we could switch to one of those providers.

  • BW – LM, does this resolve your use case (where you need to access the SQLite database)?

    • LM – I think so; I’ll need to get my hands on it to test.

  • MT – Wondering what would get tricky about reproducibility here. Like, if I came back in the future to an old Zenodo record, what would prevent it from being reproduced?

    • BW – Yeah, we might need more columns in the table.

    • JW – Could go totally overboard and have single-file installers made, used, and stored for each run.

    • LW – Could version all the scripts, like main.py, and record the version in the table.

    • BW – Could have each row link to a git commit/state of the repo. (See the provenance sketch below.)

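A short sketch of the provenance capture discussed here, combining BW’s git-commit idea with MT’s conda-env suggestion; the output file names are placeholders:

```python
# Hedged sketch: record the exact repo state and the full environment
# for each run, so an old Zenodo record can be traced back to the code
# that produced it. Output file names are placeholders.
import subprocess
from pathlib import Path

# BW's idea: link each result row to a git commit/state of the repo
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
Path("commit.txt").write_text(commit + "\n")

# MT's suggestion: snapshot the full conda env alongside the results
env_yaml = subprocess.check_output(["conda", "env", "export"], text=True)
Path("env.yaml").write_text(env_yaml)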

  • MT – If we want to use more than just the one dataset over and over, the datasets could be hosted on Zenodo; each run would then pull the dataset down from Zenodo over the API. (A download sketch follows this item.)
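
A hedged sketch of that fetch step against the public Zenodo records API; the record ID and file name are placeholders:

```python
# Hedged sketch: pull a named dataset file down from a Zenodo record at
# the start of a run. Record ID and file name are placeholders.
import requests

RECORD_ID = "1234567"             # hypothetical Zenodo record
WANTED = "industry-dataset.json"  # hypothetical dataset file name

resp = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}")
resp.raise_for_status()

for entry in resp.json()["files"]:
    if entry["key"] == WANTED:
        data = requests.get(entry["links"]["self"])
        data.raise_for_status()
        with open(entry["key"], "wb") as out:
            out.write(data.content)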

  • LW – Re: the FF fitting meeting: is OE going to be a blocker to reproducibility?

    • BW – Maybe. The current dataset was filtered with the most recent OE.

    • LW – So datasets should be labeled with the versions that created them. (See the sidecar sketch below.)

    • JW – A later stage of this repo could do dataset filtering and record it in the same style on Zenodo.

    • LW – Could even fetch datasets from Zenodo.
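
A hedged sketch of the version labeling LW suggests, writing a JSON sidecar next to a filtered dataset; the distribution names (including the OpenEye one) are assumptions about the filtering environment:

```python
# Hedged sketch: write a sidecar recording the toolkit versions that
# produced a filtered dataset. Distribution names are assumptions.
import json
from importlib.metadata import PackageNotFoundError, version


def toolkit_versions(names=("openff-toolkit", "openeye-toolkits")) -> dict:
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # e.g. OpenEye absent without a license
    return found


with open("dataset.json.provenance.json", "w") as fp:
    json.dump(toolkit_versions(), fp, indent=2)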

  • BW –

    • Get OE license working

    • Do a test run

    • Update zenodo upload to include the provenance we talked about today


  • JW –

    • On LastPass I’ll share the Zenodo credentials

    • At the next one-on-one with BW I’ll give access to bigger runners


 

Trello

https://trello.com/b/dzvFZnv4/infrastructure

 Action items

 Decisions