General updates | MT – Met with CC earlier this week. He gave a tour of the proteinbenchmark repo on GitHub and walked me through a lot of the code. I initially approached it from the perspective of how I could help integrate the two together, but it became evident that this wasn’t a good path forward. His code currently “just works”, and YAMMBS currently “just works” as well, and so much is structurally different that it wouldn’t make sense to sync things together (e.g. protein benchmarking requires trajectory storage and heavier analysis). I came out of the meeting thinking that this isn’t worth pursuing. I understand that this may make one-click benchmarking a bit more manual, but from my perspective it’s right to push this off to the future. Another incongruity is that, unless we want to go into hundreds of thousands of molecules, small molecule benchmarking will be far lower-resource (a few CPU-days) than a protein benchmarking run (many GPU-days).
|
BW yammbs-dataset-submission demo | BW will post slides here
Slides:
MT – Could you elaborate on what error cycling means here?
BW – Like, if a user has a typo in their input FF or dataset.
JW – Important to distinguish that QCA-dataset-submission error cycling has to do with handling our one-way interaction with an outside server, whereas this will all take place using
MT – That makes sense. Do recall that there’s no expectation of caching/reuse of materials from a restart. I’m broadly curious about what sorts of failures we’ll encounter.
LW – Why no live fetching of datasets? Why the restriction to existing datasets?
JW – I can share info@openff Zenodo credentials - That will help with review
MT – What gets uploaded to Zenodo?
BW – CSVs go into the repo, SQLite (zipped) goes into Zenodo. Some other
LW – Strongly in favor of uploading everything associated with the run to Zenodo, as separate files.
MT – The Python script that runs everything?
MT – It could be good to put the conda env everywhere we can as well.
LW – When we’re talking about provenance, are we talking about the Zenodo record or the repo itself? I think MT was saying that we should put the full conda env in BOTH Zenodo and GH.
JW – I think I see two axioms here: (1) Don’t assume that people who find the Zenodo record will have access to GH. (2) Duplication as a feature - Everything added to GH should go on Zenodo, and not everything on Zenodo gets added to GH (because of size).
MT – Want to check where the runner script goes - I suggest copying the current version of it to Zenodo. Also, I’m concerned about enabling people to do different flavors of experiment. The number of knobs to turn will always go up.
MT – Want reviews on PR 2?
MT – Pricing for GH runners is 5-10x what it is on other providers. So if this is expensive we could switch it to one of those providers.
BW – LM, does this resolve your use case (where you need to access the SQLite database)?
MT – Wondering what would get tricky about reproducibility here. Like, if I came by in the future to an old Zenodo record, what would prevent that from getting reproduced?
BW – Yeah, might need more columns in the table.
JW – Could go totally overboard and have single-file installers made, used, and stored for each run.
LW – Could version all the scripts, like main.py, and record the version in the table.
BW – Could have each row have a link to a git commit/state of the repo.
MT – The datasets, if we want to use more than just the one dataset over and over again, could be hosted on Zenodo. Then pull down the dataset from Zenodo over the API to do each run.
LW – Re: FF fitting meeting - Is OE going to be a blocker to reproducibility?
BW – Maybe. The current dataset was filtered with the most recent OE.
LW – So datasets should be labeled with the versions that created them.
JW – A later stage of this repo could do dataset filtering and recording in the same style on Zenodo.
LW – Could even fetch datasets from Zenodo.
BW – JW –
|
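The provenance ideas raised in the demo discussion (a git commit per row, recorded package versions, labeled datasets) could look roughly like the sketch below. This is only an illustration and not the actual yammbs-dataset-submission code: the function names, the row keys, and the default distribution names ("openff-toolkit", "yammbs") are all assumptions made for the example.

```python
import hashlib
import platform
import subprocess
from importlib import metadata
from pathlib import Path


def git_commit(repo: Path = Path(".")) -> str:
    """Return the current commit hash, or "unknown" if git or the repo is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def sha256sum(path: Path) -> str:
    """Checksum a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_versions(names) -> dict:
    """Look up installed versions; missing packages are recorded, not fatal."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def provenance_row(dataset: Path, force_field: str,
                   packages=("openff-toolkit", "yammbs")) -> dict:
    """Collect one row of provenance for a run (hypothetical column names)."""
    return {
        "git_commit": git_commit(),
        "python": platform.python_version(),
        "force_field": force_field,
        "dataset_file": dataset.name,
        "dataset_sha256": sha256sum(dataset),
        "packages": package_versions(packages),
    }
```

A row like this could be written next to the CSVs in the repo and duplicated in the Zenodo record, so that someone who only finds one of the two can still tell which code state and dataset produced the results.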