2025-03-11 QCA dataset submission meeting

Participants

  • @Anika Friedman

  • @Jennifer Clark

  • @Lily Wang

  • @Jeffrey Wagner

 

Discussion topics

 

Item

Notes

 


Update Dataset Tracking

Project Board; Slides

  • Complete PR 427: “OpenFF Cresset Additional Coverage TorsionDrives v4.0”

  • Complete PR 428: “SPICE Dipeptides Partial Relaxation Dataset v4.0”

  • Started PR 432: “OpenFF Protein PDB 4-mers v1.0”

  • Close to submitting the PR for metals; I think we can use QDS.

  • LW: Would like to check in on the protein PR since it’s so close to completion! There are a couple of errors.

  • AF – Re the protein datasets, I was expecting a nonzero number of errors. We can take this into consideration for dataset expansion.

    • (General) – Agree

    • (General) – We’ll keep this running, but JC has the authority to unilaterally EOL these once progress stops.

  • JW – I propose end-of-lifing LipidMAPS

    • JC – No new completions for 7 days in any categories

    • (General) - JC will end-of-life LipidMAPS

  • Lipid benchmark

    • (General) – We’ll keep this running, but JC has the authority to unilaterally EOL these once progress stops.


QDS Issue

Retagging CI does not retag … when the “_mw” feature is not used?

  • The only difference is that ds.modify_records() without “mw_” uses the specification_names keyword, and with “mw_” it does not.
    I traced it back and didn’t find a difference in record retrieval.

  • Last time, with Cresset, we assumed it was a caching issue, which should only affect what I see locally… but something in the timeline of what I did didn’t seem right. Even here, I ran the submission CI, waited a bit, ran the reprioritize retag, and not only did I not see the changes locally, but when I submitted the job nothing was picked up with that tag.

  • LW: Odd, I think the PR I quickly put through was retagged through CI. Maybe it needs to be manually run?

  • LW - To reiterate - manual retagging sporadically works, CI retagging never works?

    • JC – When it’s a new dataset, retagging CI doesn’t work for me, but manual does.

  • JW – Maybe tasks are unable to be retagged if the server thinks that there’s a worker working on them (like, if a worker claims the task and then is shut down ungracefully, the server might wait hours to ensure the worker is gone)

    • LW – Will workers claim no-tag tasks?

    • JC –

  • JC – I observed a lot of the weird behavior last Friday

    • LW – I saw mine work last Weds

    • Maybe a recurring issue Fridays at MolSSI?

  • We will ask BP:

    • Whether there’s anything that could cause a Python API call to retag a submission to return successfully, but not make the tasks available to workers

    • If there’s any sort of backups/regular process that runs Fridays that might change server responsiveness.

  • We’ll icebox this ticket

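The one concrete difference found above is the presence or absence of the specification_names keyword in the ds.modify_records() call. A minimal sketch of the two call shapes, using a stub class in place of a real QCPortal dataset (class name, entry names, and tag here are all hypothetical) so the difference is visible without a server:

```python
class StubDataset:
    """Stand-in for a QCPortal dataset; records the kwargs passed to modify_records."""

    def __init__(self):
        self.calls = []

    def modify_records(self, entry_names=None, specification_names=None,
                       new_tag=None, new_priority=None):
        # A real dataset would issue the retag/reprioritize request here.
        self.calls.append({
            "entry_names": entry_names,
            "specification_names": specification_names,
            "new_tag": new_tag,
            "new_priority": new_priority,
        })

ds = StubDataset()

# Path WITHOUT the "mw_" feature: specification_names is passed explicitly.
ds.modify_records(entry_names=["mol-001"], specification_names=["default"],
                  new_tag="openff-retag")

# Path WITH the "mw_" feature: specification_names is omitted entirely,
# so the server would have to decide which specifications are affected.
ds.modify_records(entry_names=["mol-001"], new_tag="openff-retag")

print(ds.calls[0]["specification_names"])  # ['default']
print(ds.calls[1]["specification_names"])  # None
```

If the server treats an omitted specification_names differently from an explicit list (e.g. by skipping some records), that would be worth raising with BP alongside the Friday-timing question.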

Update GitHub Actions for QDS: Avoid qcsubmit in lifecycle

The TM complexes won’t run through QCSubmit because neither toolkit is reliable for them. We can bypass validation, and I think we can alter the GitHub Actions lifecycle to avoid QCSubmit easily.
If we import the dataset.json into a QCFractal dataset instead of QCSubmit, and add a bz2 deserializer to QDS, it should be straightforward to make this change.
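The bz2-deserializer piece of this plan is small. A minimal sketch, assuming QDS just needs a loader that transparently handles compressed dataset JSON (the function name and file layout here are hypothetical, not QDS's actual interface):

```python
import bz2
import json

def deserialize(path):
    """Load a dataset JSON, transparently handling bz2 compression.

    Hypothetical helper; QDS's actual deserializer registry may differ.
    """
    if str(path).endswith(".bz2"):
        with bz2.open(path, "rt", encoding="utf-8") as f:
            return json.load(f)
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Round-trip example: write a compressed dataset.json.bz2 and read it back.
dataset = {"name": "OpenFF Example TM Complexes v1.0", "records": []}
with bz2.open("dataset.json.bz2", "wt", encoding="utf-8") as f:
    json.dump(dataset, f)

print(deserialize("dataset.json.bz2")["name"])
```

The importing side (loading the parsed dict into a QCFractal dataset rather than a QCSubmit one) would sit on top of a loader like this.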

See notes

  • LW: Having had some time to think about this, I think I'm in favour of spending the time to make edits to QCA-DS CI to use its error cycling and retagging functionality with QCFractal datasets via an alternative sqlite pathway, and mostly ambivalent on whether or not to rewrite the validation function vs turn it off.
    It's worth thinking about why we might want to use QCA-DS in the first place and what we want to include in it; at the end of the day with Jen's machinery it's not essential to use its error cycling and retagging functionality, but instead the repo (from my POV) is:
    a) convenient
    b) allows external users to track what's going on
    c) holds a record of what was done and at least the input dataset.
    For this last reason I don't think it's very useful to have a stub JSON dataset that just contains the dataset name and type for error cycling purposes; dumping the entire QCFractal object seems more useful to me.
    Unless there's a good reason otherwise, I think treating QCFractal datasets as sqlite files and QCSubmit datasets as JSONs would be much clearer than having everything be a JSON. Alternatively, if people value how readable a JSON is, I'd be in favour of a clear file-pattern that makes it obvious when something is QCSubmit vs QCFractal, like qcf-dataset*.json* . If I understand the notes correctly, this also makes parsing the object slightly easier.
    Finally, some validation does seem useful to me, if only checking metadata fields, elements, specifications and so on, but it also sounds like it would be a lot more work/code to add this functionality in just for QCF dataset objects. Long-term, if this became a pathway for others to submit datasets bypassing QCSubmit, we'd want this though.

    After this option, IMO the next best would be to move everything off QCA-DS to save effort on the required code and maintenance. This seems like a suboptimal choice, though:
    a) we would like a record of this dataset somewhere eventually (although it doesn't have to be QCA-DS)
    b) eventually Jen will have additional datasets to work through, and the QCA-DS interface is a nice one that allows others to review, comment and participate
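LW's file-pattern suggestion could be enforced mechanically in the lifecycle CI. A minimal sketch of such a classifier, where the qcf-dataset*.json* pattern is the one proposed in the notes and everything else (function name, the sqlite branch) is illustrative:

```python
import fnmatch

def dataset_kind(filename):
    """Classify a dataset file by naming convention (hypothetical helper).

    'qcf-dataset*.json*' is LW's proposed pattern for QCFractal-object JSONs;
    plain .json/.json.bz2 files are assumed to be QCSubmit datasets, and
    .sqlite files QCFractal ones.
    """
    if fnmatch.fnmatch(filename, "qcf-dataset*.json*"):
        return "qcfractal"
    if filename.endswith(".sqlite"):
        return "qcfractal"
    if filename.endswith((".json", ".json.bz2")):
        return "qcsubmit"
    return "unknown"

print(dataset_kind("qcf-dataset-tm-complexes.json.bz2"))  # qcfractal
print(dataset_kind("dataset.json"))                       # qcsubmit
```

A check like this would let the CI pick the right parsing path without inspecting file contents.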

  • A lot to think about; we will follow up at a later date.
    Meeting adjourned

MolSSI Info / Align Priorities on MolSSI Asks

2025-03-04 QCA User Meeting

New from last QCAUM meeting:

  • Dataset entry/spec/record copying! It doesn’t actually duplicate records, just links to the existing ones in the new dataset. Also, records and specifications can’t already exist in the destination dataset (they can’t have the same name).

    • This should make compiling the Sage datasets easy; we haven’t tested it yet.

  • Cool QCBrowse demo!

 

Update on clean force field releases

The recent QCFractal update should be great.
Josh showed me the ropes with Docker images.
Should we have a Docker image in each Zenodo repo, or make a single Docker image record on Zenodo that is referenced and periodically updated?

 

Old Issue of the Week

One-click QCArchive data (8/2019)

  • A collaborator was overwhelmed by the number of datasets and their inability to search them easily. The consensus appears to have been that adding tags to differentiate OpenFF data from other data was the solution. The thread was then left hanging…

BONUS: Automating QCArchive dataset submission (9/2019)

  • John discusses what appears to be a predecessor to QCSubmit

BONUS: Add collection tags to lifecycle (8/2020)

  • David suggests that CI updates PR tags as datasets move through the lifecycle

 

Action items

Decisions

 

Related content