2024-05-29 Meeting notes

 Date

May 29, 2024

 Participants

  • @Matt Thompson

  • @Alexandra McIsaac

  • @Lily Wang

  • @Brent Westbrook

  • @Jeffrey Wagner

 Discussion topics

Run provenance

  • https://openforcefieldgroup.slack.com/archives/C03T3LLVC1J/p1716943450143429

  • LW – Generally the things we want would be versions of packages and input scripts.

  • LW – Figshare says it tops out at ~20 GB; Zenodo has a limit as well.

  • JW – We have 3 areas of need in the org:

    • YAMMBS

    • Alchemiscale

    • QCArchive

      • MolSSI might delete stuff if a file size limit is reached?

    • Therefore NOW is a good time to think about storing/managing/accessing/working with this data in as similar a way as possible

  • What are the YAMMBS-related user stories?

    • A user story is usually written from the user's perspective and follows the format: “As [a user persona], I want [to perform this action] so that [I can accomplish this goal].”

    • A scientist needs to grab an (input) data set, which may be used several times in development iterations

    • A reviewer wants to re-run a benchmark as reported in a paper

    • MT - As a YAMMBS developer, I want to re-run an analysis on an existing data set and compare the results to previous results

    • MT - As a person curious about force fields, I want a web interface that compares physical properties of different force fields

    • LW – As someone researching FFs, I want to compare different force field fitting experiments that have been computed with the same versions and restrictions

    • LM – I benchmarked all versions of Sage and Parsley and now everyone in the future can use my results to compare to their new analyses without needing to generate a bunch of new data.

    • BW – I’d love “CI for forcefields” - like, when I get a FF out of ForceBalance, I’d like a job to be dispatched and to receive summary graphics about whether it’s good or not.

    • LM – I want to use my sqlite store in other analyses, like to drill down into problem molecules. So I could just upload my notebook with the assumption that other people could grab the weighty sqlite store

      • MT – Also the dimension of time here - Like in 2 years, people might try to rerun and fetch the old dataset.

    • LW – I want to easily add datasets to my benchmarks in comparisons, so if I initially benchmarked on the OpenFF industry dataset I could append a new dataset for more summary graphics like Brent mentioned

    • BW – Could be good to have serializable dataset models, where e.g. we could dump out a subset of just our molecules of interest to JSON/CSV (see the dataset-model sketch after this list)

    • Have standard formats/inputs where there can be a rolling database of known “bad” entries that are filtered out each time.

      • BW – This could be handled by maintaining a standardized input dataset.

      • LW – Kinda, but it’d be good to have the full dataset in there ahead of time so folks can see what we’re filtering out.

      • MT – I could see two approaches here: either have a filtering layer in YAMMBS (see the filtering sketch after this list) or have an evolving dataset outside YAMMBS.

    • JW – Compatibility of YAMMBS with datasets will change over time – the story is something like: I want to load an old dataset with a new version of YAMMBS and get either a clear error message or instructions on how to deal with it

    • LW – as a maintainer of YAMMBS, I could migrate databases to a new format and tag them on Zenodo so they’re loaded from compatible YAMMBS versions

    • MT – What needs to be our ability to scale to large datasets?

      • LW – Hard to answer, I figure we can eventually put things on the cloud if scale becomes overwhelming. We can discuss larger uploads with Zenodo/Figshare.

    • JW – Would it be reasonable to say "we should be able to handle 100x our largest benchmark, manipulate it on a MacBook, and fit within the Zenodo/Figshare limit"?

      • (General) – This is a reasonable scale. Currently the industry dataset sqlite db is 180 MB

    • LW – I’d like the ability to load and concatenate datasets from multiple zenodo uploads.

    • MT – what are use cases external users might come to us with? e.g. our collaborators, us in the future

    • JW – As an external user, I'd want to pull up a list of all benchmarking runs and search through them to find a benchmark that’s useful to me.

    • LW – As an external user, I want to run a benchmark like what OpenFF does (and possibly store it on OpenFF infrastructure)

    • JW – as an external user, I want to look at one observation but not pull down the entire dataset (I personally would reject this user story unless it’s trivial in our implementation, but I’m adding it here to record that we’re discussing it)

    • JW – Guessing the science team will want to upload on the scale of 10-100x/year. It would be nice to have the upload be as easy as possible (e.g. a CLI tool that automates the upload and fills in the description/Zenodo metadata – see the upload sketch after this list)

      • MT – as a FF developer, I want to upload the benchmarks I just ran to the storage destination, as easily as possible (preferably automated)

    • LW – Does Zenodo allow deletion?

      • JW – It’s an “email us” situation, so I think that’s roughly “no”

    • JW – We should be able to change the URL where data is fetched from (like, if we're fetching datasets by name and not URL) in case our provider (Figshare/Zenodo) goes out of business.

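BW's serializable dataset models above could look something like the following. This is only a rough sketch using plain dataclasses: the Record/Dataset names and fields are illustrative assumptions, not the actual YAMMBS schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Set


@dataclass
class Record:
    """One benchmark entry; these fields are placeholders, not the real YAMMBS schema."""
    mapped_smiles: str
    qcarchive_id: int
    final_energy: float


@dataclass
class Dataset:
    records: List[Record]

    def subset(self, smiles_of_interest: Set[str]) -> "Dataset":
        """Keep only the molecules of interest."""
        return Dataset([r for r in self.records if r.mapped_smiles in smiles_of_interest])

    def to_json(self) -> str:
        """Dump the (sub)set to JSON; a CSV export could follow the same pattern."""
        return json.dumps([asdict(r) for r in self.records], indent=2)
```

Something like `Dataset(records).subset({...}).to_json()` would then dump just the molecules of interest without shipping the whole sqlite store.
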
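The "filtering layer in YAMMBS" option MT mentions above might look roughly like this, assuming the rolling list of known-bad entries lives in a small versioned JSON file; the file name, the qcarchive_id key, and the example reasons are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List

# Hypothetical rolling list of known-bad entries, versioned alongside the benchmark
# inputs so anyone can see exactly what was filtered out and why, e.g.
# {"12345": "failed minimization", "67890": "inconsistent charge state"}
BAD_ENTRIES_FILE = Path("known_bad_entries.json")


def load_bad_entries() -> Dict[str, str]:
    """Read the rolling bad-entry list, returning an empty mapping if it is absent."""
    if BAD_ENTRIES_FILE.exists():
        return json.loads(BAD_ENTRIES_FILE.read_text())
    return {}


def filter_records(records: List[dict]) -> List[dict]:
    """Drop records whose ID appears in the rolling bad-entry list."""
    bad = load_bad_entries()
    kept = [r for r in records if str(r["qcarchive_id"]) not in bad]
    print(f"Filtered out {len(records) - len(kept)} known-bad entries of {len(records)}")
    return kept
```

Keeping the list in data rather than in code leaves the full dataset intact, which speaks to LW's point about being able to see what is filtered out.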

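The automated upload JW and MT describe above could be a thin wrapper around Zenodo's REST deposition API, roughly as sketched below; the function name, metadata fields, and creator entry are assumptions, and error handling plus the final "publish" action are omitted on purpose.

```python
import requests

ZENODO_API = "https://zenodo.org/api/deposit/depositions"


def upload_benchmark(sqlite_path: str, title: str, description: str, token: str) -> str:
    """Create a Zenodo deposition, attach the sqlite store, and fill in metadata."""
    params = {"access_token": token}

    # 1. Create an empty deposition.
    deposition = requests.post(ZENODO_API, params=params, json={}).json()

    # 2. Stream the sqlite file into the deposition's file bucket.
    bucket = deposition["links"]["bucket"]
    filename = sqlite_path.rsplit("/", 1)[-1]
    with open(sqlite_path, "rb") as handle:
        requests.put(f"{bucket}/{filename}", data=handle, params=params)

    # 3. Fill in the descriptive metadata that would otherwise be typed by hand.
    metadata = {
        "metadata": {
            "title": title,
            "upload_type": "dataset",
            "description": description,
            "creators": [{"name": "Open Force Field Initiative"}],
        }
    }
    requests.put(f"{ZENODO_API}/{deposition['id']}", params=params, json=metadata)

    # Publishing (POST to the deposition's "publish" action) is left as a deliberate
    # manual step, since published Zenodo records cannot easily be deleted.
    return deposition["links"]["html"]
```
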
General updates/discussion

  • numpy.core._exceptions._ArrayMemoryError: Unable to allocate 1.09 PiB for an array with shape (153822853563950,) and data type float64

    • MT – Good find

    • LM – I’ll look into this

    • LW – some kind of overflow…? (see the rough check below)

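A quick back-of-the-envelope check on the error above (a guess, not a confirmed diagnosis): the requested size is consistent with an accidental "all pairs" allocation over two roughly 12-million-element arrays, rather than with the data itself.

```python
import numpy as np

n = 153_822_853_563_950      # shape reported in the traceback above
print(n * 8 / 2**50)         # ≈ 1.09 PiB as float64, matching the error message
print(n ** 0.5)              # ≈ 1.24e7, i.e. roughly 12.4 million squared elements

# One hypothetical way to request an allocation of this magnitude: sizing a flat
# buffer for every pair of entries in two ~12-million-element arrays.
a = np.zeros(12_400_000)
b = np.zeros(12_400_000)
# np.empty(a.size * b.size)  # would raise a ~1.1 PiB _ArrayMemoryError
```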

Trello update

 

 Action items

 Decisions