2025-07-15 QCA Dataset Submission Meeting

Participants

  • @Jennifer Clark

  • @Jeffrey Wagner

Discussion topics

 

Update Dataset Tracking

Project Board; Slides

  • Running PR 440: Chodera tmQM

    • Still moving

  • PR449: In scientific review

  • JW agrees with everything JC says

QDS handling of non-QCSubmit dataset.

Scaffold Submission PR is completed and approved

  • Reliant on PR to update QCFractal version on QDS, in current iteration

Dataset archival project

Complete!

MolSSI Info / Align Priorities on MolSSI Asks

Notes July 8th meeting

New from last QCAUM meeting:

  • New release QCF v6.2!

  • Ben offered VTech resources (96 cores + 768 GB). He said he has lots of CPU time in his allocation that he doesn't use.

  • Issue with large dataset views. Ben suggested saving without trajectories:

  • ij = ds.create_view(description=f"Full {ds_name} {ds_type} dataset", provenance={}, include=['**'], exclude=["wavefunction", "trajectory"], include_children=True)

However, include_children=True overrides this option, so it turns out that setting include_children=False is what greatly reduces the file size (217 GB to 900 MB for the Industry Benchmarking Dataset v1.2).

  • Built-in QCFractal error cycling is available. Ben has enabled it for us.
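
The view-size finding above can be captured in a small helper so the smaller setting becomes the default. This is only a sketch: build_view_kwargs is a hypothetical name (not part of QCPortal), the "optimization" dataset type is a placeholder, and it assumes create_view accepts exactly the keyword arguments quoted in the note. It only assembles the arguments and does not contact a server.

```python
def build_view_kwargs(ds_name, ds_type, include_children=False):
    """Assemble keyword arguments for the create_view call quoted in the notes.

    Hypothetical helper for illustration; not part of QCPortal.
    """
    return {
        "description": f"Full {ds_name} {ds_type} dataset",
        "provenance": {},
        "include": ["**"],
        # Per the notes, these exclusions only shrink the view when
        # include_children is False.
        "exclude": ["wavefunction", "trajectory"],
        "include_children": include_children,
    }


# Placeholder dataset type; substitute the real one for the dataset in question.
kwargs = build_view_kwargs("Industry Benchmarking Dataset v1.2", "optimization")

# With a real dataset object `ds`, the call from the notes would become:
# ij = ds.create_view(**kwargs)
```

Defaulting include_children to False means the smaller view is what you get unless someone opts in to the larger one.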

Requests:

  • Consider running pr449-1200 records on MolSSI resources?

  • Ben is making some changes to TorsionDrive records on a new branch. Can we test?

Old Issue of the Week

Conformer generation should fall back to RDKit ETKDG on Omega failures (Closed!)

  • John suggests that if Omega fails in generating initial conformers, RDKit should be the fallback.

    • JW: This complicates provenance

  • Should this be a QCSubmit ticket?

    • JW: Yes, migrating to QCSubmit, using examples from QDS PR #2 we found that RDKit is already used as a fallback.
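
The fallback behaviour discussed above, together with JW's provenance concern, can be sketched generically. This is not QCSubmit's actual implementation: generate_with_fallback, ConformerResult, and the two stand-in backends are hypothetical names used only for illustration. The idea is to try each generator in order and record which one produced the conformers and why the earlier ones failed.

```python
from dataclasses import dataclass, field


@dataclass
class ConformerResult:
    conformers: list
    generator: str  # name of the backend that actually produced the conformers
    failures: dict = field(default_factory=dict)  # backend name -> error text


def generate_with_fallback(smiles, generators):
    """Try each (name, callable) pair in order; return the first non-empty result."""
    failures = {}
    for name, generate in generators:
        try:
            conformers = generate(smiles)
        except Exception as exc:
            # Keep the error so provenance shows why the fallback was used.
            failures[name] = str(exc)
            continue
        if conformers:
            return ConformerResult(list(conformers), name, failures)
        failures[name] = "no conformers returned"
    raise RuntimeError(f"all generators failed for {smiles}: {failures}")


# Stand-in backends for illustration only; real ones would wrap Omega and RDKit ETKDG.
def fake_omega(smiles):
    raise ValueError("Omega failed")


def fake_rdkit(smiles):
    return [f"{smiles}-conf0"]


result = generate_with_fallback("CCO", [("omega", fake_omega), ("rdkit", fake_rdkit)])
print(result.generator, result.failures)
```

Storing the generator name and the failure messages alongside the conformers is one way to keep the provenance explicit when a fallback fires.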

Bonus: Missing chemistry to (potentially) cover post-release-1 (Not addressed this week)

  • [#8]~[#35]: O-Br single bonds are present in GAFF2 but not present in our current datasets. We could port in a placeholder value from GAFF2, but there are no molecules with this chemistry in our current datasets.

    • Still not addressed

  • [#7X3]~[#7X3](~[#8])~[#8]: Nitroamines

  • [#6:1]~[#6:2]=[#15:3]~[#6:4]: C=P double bond (potentially with adjacent singles)

    • Still not addressed

AI summary

QDS/QCSubmit Meeting Summary (July 15, 2025)

Current Projects and Progress

  • Jennifer is working on converting QCElemental molecules to RDKit molecules to enable sorting based on fingerprints and connectivity

  • This capability will help with better train-test splits of data and is of interest to Chris as well

  • For molecules failing to assess in tmQM, Jennifer plans to use MoleAssembler to build complexes of interest

Data Set Issues

  • Many calculations showed SCF convergence failures after 500 iterations

  • 14 out of 30 errored structures had incorrect charges reported in CCD files

  • Jennifer is developing a method to predict oxidation states, which she'll present to Richard on Thursday

  • For the remaining 16 structures with issues, Jennifer plans to implement a tiered optimization approach with loose tolerances initially

Completed and Ongoing Tasks

  • Scaffold submission PR is completed and approved

  • Jennifer needs to submit a PR with the new QCFractal version

  • Data set archival is complete but waiting for James and Lily to return before closing

  • Using include_children=False reduced the dataset view size from 217 GB to 900 MB

QC Fractal and Testing

  • Ben contacted Jennifer about testing OpenFF code against a development branch of QCPortal

  • Jeffrey demonstrated how to set up CI testing with a different version of QCFractal

  • For an old issue regarding conformer generation, they discovered RDKit is already implemented as a fallback for Omega failures

Action items

Decisions