2025-03-25 QCA dataset submission meeting

Participants

  • @Jennifer Clark

  • @Jeffrey Wagner

  • @Lily Wang

 

Discussion topics

 

Item

Notes

 

Update Dataset Tracking

Project Board

  • Completed PR 432: “OpenFF Protein PDB 4-mers v1.0”

  • Started PR 434: TM PDB CCD

  • Trying to start PR 440: Chodera tmQM

    • Seems to be too big for MW partitioning

  • LW: Retag the Chodera tmQM dataset locally

  • Let the errors for PR434 go through error cycling and see what happens

QDS handling of non-QCSubmit datasets

New module in a QCFractal PR, qcportal.external.scaffold.py, that serializes a dataset to JSON and reconstructs a QCFractal dataset from that JSON.

Need tests in CI, and background_add_entries (see next topic) before merging.
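For reference, a minimal sketch of the round trip the scaffold module is meant to provide (dataset to JSON, JSON back to dataset). This is not the code in the PR: `ScaffoldDataset`, `to_scaffold_json`, and `from_scaffold_json` are hypothetical names using only the Python standard library, and the fields shown are assumptions about what a scaffold file might hold.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ScaffoldDataset:
    """Hypothetical stand-in for a dataset; NOT a QCPortal class."""
    name: str
    dataset_type: str
    entries: dict = field(default_factory=dict)         # entry name -> entry data
    specifications: dict = field(default_factory=dict)  # spec name -> spec data


def to_scaffold_json(ds: ScaffoldDataset) -> str:
    """Serialize the dataset to a JSON string with stable key order."""
    return json.dumps(asdict(ds), indent=2, sort_keys=True)


def from_scaffold_json(text: str) -> ScaffoldDataset:
    """Rebuild the dataset from the JSON produced by to_scaffold_json."""
    return ScaffoldDataset(**json.loads(text))
```

A CI test for the real module could take the same shape: build a small dataset, write it out, read it back, and assert the two are equal.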

QDS Lifecycle:

Validation

  • Current state: a GitHub label “validation-off” will skip the validation CI.

  • Future work: write validation for datasets created through the QCPortal scaffold module.

Queued Submission

  • Current state: CI will recognize scaffold.json files, but will not attempt to process or submit them.

  • Future work: adapt CI to process and submit scaffold.json files.

Error Cycling

  • Functions as expected with scaffold.json
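The current lifecycle behavior described above can be summarized as a small decision function. This is only a sketch of the rules as stated in these notes, not the actual CI code: `plan_ci` is a hypothetical name, and matching scaffold files by the suffix `scaffold.json` is an assumption.

```python
def plan_ci(labels, filenames):
    """Return which lifecycle steps would apply, per the current state in the notes.

    labels: GitHub labels on the PR.
    filenames: dataset files found in the PR.
    """
    # Assumption: scaffold files are identified by their filename suffix.
    scaffold = [f for f in filenames if f.endswith("scaffold.json")]
    regular = [f for f in filenames if not f.endswith("scaffold.json")]
    return {
        # The "validation-off" label skips the validation CI.
        "run_validation": "validation-off" not in labels,
        # Regular files go through queued submission as usual.
        "submit": regular,
        # Scaffold files are recognized but not processed or submitted yet.
        "recognized_not_submitted": scaffold,
    }
```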

See notes on thought process leading to this point.

 

MolSSI Info / Align Priorities on MolSSI Asks

https://docs.google.com/document/d/1JvU1s5I9_jvEi8mtg6PsEuUdy3TJRBtUKFZxzlye9Co/edit?tab=t.0#heading=h.aetub4b1g91g

New from last QCAUM meeting:

  • The “metadata” field is changing to “extras”; the meeting left it ambiguous what exactly is coming.

  • Projects are almost done.

    • These don’t directly align with our use of QDS but would allow us to use active learning (discussed at the onsite).

New developments:

  • TM calculations often use a different basis set for the metal than for the remaining elements. Ben says this is not supported. Luckily, Genentech’s favorite model chemistry does not follow this common DFT practice, but the limitation restricts what we can try, or requires more basis functions than if we could mix basis sets. Psi4 supports this; MolSSI does not.

    • Could go into additional_keywords, as constraints do.

  • The Chodera tmQM dataset (~675k structures) times out before all entries are added.

    • Ben is making an update to improve the efficiency of the code and add a background_add_entries method to resolve this.
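Until that update lands, one possible client-side workaround is to add entries in smaller batches so that no single call has to handle all ~675k structures. The sketch below is not Ben's fix and makes no claim about the QCPortal API: `add_batch` is a hypothetical callable supplied by the caller, and the batch size of 1000 is an arbitrary assumption.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def add_in_batches(entries, add_batch, batch_size=1000):
    """Call `add_batch` once per chunk and return the number of entries sent.

    `add_batch` is a hypothetical stand-in for whatever call adds entries
    to the dataset on the server.
    """
    total = 0
    for chunk in batched(entries, batch_size):
        add_batch(chunk)
        total += len(chunk)
    return total
```

Smaller batches trade more round trips for shorter individual requests, which is the opposite trade-off from a single bulk call.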

Requests:

  • It sounds like their migration to pydantic 2 is holding up our versioning? Should this be a request?

  • JAC: Likely a version change this week

  • JW: As far as I know, we aren’t blocked by QCA using pydantic 2; we need to check with Matt.

Update on clean force field releases

Recent QCFractal update should be great.
Josh showed me the ropes with docker images.

  • Should we have a Docker image in each Zenodo repo, or make one Docker image record in Zenodo that is referenced and periodically updated?

    • Depends on the size of the Docker image; if it is less than 2 GB, include it.

  • Are we holding off on force field archival until the “metadata” to “extras” change?

    • Yes

  • JW: We would strongly prefer that the Docker image be included.

Old Issue of the Week

One-click QCArchive data (8/2019)

  • A collaborator was overwhelmed by the number of datasets and unable to search them easily. The consensus appears to be that adding tags to differentiate OpenFF data from other data is the solution. The issue was then left hanging.

BONUS: Automating QCArchive dataset submission (9/2019)

  • John discusses what appears to be a predecessor to QCSubmit

BONUS: Add collection tags to lifecycle (8/2020)

  • David suggests that CI updates PR tags as datasets move through the lifecycle

  • First and second issues are closed.

Action items

Decisions