2025-08-12 QCA Dataset Submission Meeting

Participants

  • @Jennifer Clark

  • @Jeffrey Wagner

  • @Lily Wang

Discussion topics

 

Update Dataset Tracking

Project Board; Slides

  • PR489: Lipid Torsiondrives

    • Previously canceled. Two TD records errored, but their “failed” angles remain in running status, and stdout seems to indicate success. Created a QCFractal issue.

    • Is Julianne happy with the dataset as is? Even the two that “errored” have 21 and 23 of 24 angles complete. I’m not sure this is worth pushing too hard right now, but could be important in the future

    • JW + LW – Largely ambivalent. Would be good to talk to JH to see if 100% completion is important.

    • JAC – Ok

  • PR449: TMBenchmark

    • In scientific review: Requires TMOS infrastructure to characterize, and new infrastructure to add solvent

    • This is at the top of the docket in conjunction with revised tmQM submission

    • LW – Why so many structures in the 15-17 electron bins for square planar? Shouldn’t we expect to see a lot of 14s where 2 solvents are expected?

    • JAC – The 18-electron rule is just a rule of thumb…

    • JW – Would 15-17 electrons be expected for various oxidation states?

    • JAC – Square planar is rare in real life. Solution for lower left quadrant is mostly to see if solvent should be added. Upper left quadrant still needs debugging.

    • LW – (Not urgent, typing so I don’t forget) follow-up Q: are these plots of the finished data or the to-compute data? The total counts seem a bit low either way. Is a significant % of the data unclassifiable?

    • LW – Why are there so many fewer counts in the table than in the dataset? I only see ~20k in the table.

      • JAC – There’s 10 QC records per mol here

    • LW – Have you classified …?

    • LW – Is plan still to sunset larger MW mols?

      • JAC – That was my understanding of the plan, because we want to preserve the porphyrins. Unfortunately those are big and square planar, so they are doubly problematic. I was going to update checkmol to count this…

      • LW – Checkmol only identifies specific groups, not porphyrins.

      • JAC – I’m leaving these to run until I’m able to identify which are porphyrins. Could go the other way and stop everything until the porphyrins are identified, but I’d prefer to keep them running.

      • LW – If we have the compute resources, I’m in favor of keeping them running. Could use SMARTS matching on the RDMols output by TMOS to find porphyrins.

  • Running PR 440: Chodera tmQM

    • Still moving; need to finish assessment and sort
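For the PR489 question of whether 100% completion matters, here is a minimal sketch of how per-record completion could be tabulated. The record labels are hypothetical placeholders; only the counts (21 and 23 of 24 angles) come from the notes, and nothing here queries QCFractal.

```python
TOTAL_ANGLES = 24  # grid points per torsiondrive, per the notes

# Hypothetical labels; the completed-angle counts are the two quoted above.
completed = {"td_record_a": 21, "td_record_b": 23}


def summarize(completed, total=TOTAL_ANGLES, threshold=1.0):
    """Return {label: (fraction_complete, meets_threshold)}."""
    return {
        label: (done / total, done / total >= threshold)
        for label, done in completed.items()
    }


summary = summarize(completed)
for label, (fraction, ok) in summary.items():
    print(f"{label}: {fraction:.1%} complete, meets threshold: {ok}")
```

With the default threshold of 1.0 neither record passes. A looser threshold such as 0.85 (an arbitrary illustrative value) would accept both, which is the trade-off to raise with JH.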

Dataset archival project

Need characterization of Industry Dataset to expand Blog Post

  • On pause until after ACS

MolSSI Info / Align Priorities on MolSSI Asks

No meeting last week

Requests:

  • Soft nudge on GitHub issue?

  • We've tested the qcfractal cache_serial branch on both openff-qcsubmit and openff-bespokefit. QCSubmit passed all tests using that branch. Bespokefit persistently hung on an integration test on macos-latest runners on GitHub when python=3.11, pydantic=1.10, and OpenEye are installed (but not when OpenEye is absent).

    • I unfortunately won't have time to dig into this for a while. I don't think that the bespokefit test timing out is a blocker to the QCF branch being merged. Given that the same conda env WITHOUT OpenEye succeeds, it seems likely that the result of debugging will turn out to be some problem with OpenEye or something else in the env. BP, if you want to hedge, you might fire off a testing matrix on macos-latest with python 3.11 and some major variants of your common deps (in particular, from a quick glance at env diffs, QCEngine 0.32 vs 0.33)
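The suggested testing matrix could be enumerated along these lines. This is only a sketch: the axis names are illustrative, the values are the ones mentioned in the notes (macos-latest, python 3.11, pydantic 1.10, QCEngine 0.32 vs 0.33, OpenEye present or absent), and turning each entry into an actual CI job is left out.

```python
from itertools import product

# Axis values are taken from the notes; the key names are illustrative.
AXES = {
    "os": ["macos-latest"],
    "python": ["3.11"],
    "pydantic": ["1.10"],
    "qcengine": ["0.32", "0.33"],
    "openeye": [True, False],
}


def build_matrix(axes):
    """Expand a dict of axis -> values into one dict per combination."""
    keys = list(axes)
    return [dict(zip(keys, combo)) for combo in product(*(axes[k] for k in keys))]


matrix = build_matrix(AXES)
for job in matrix:
    print(job)
```

Adding another dependency variant is a one-line change to AXES, at the cost of multiplying the number of jobs.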

     

Old Issue of the Week

Add additional CC-BY license

  • Easy Action: Close with comment: “Resolved with

    • Why don’t we use ODC-By 1.0 (recommended by John)?

  • Closed!

Bonus: “Add several high priority datasets for benchmarking”

  • DM suggests several datasets, JC volunteers to prep and submit them. The list in the issue shows that “it’s done”…

  • This dataset started with 646 molecules and was filtered down to 127 to ensure >3 rotators. Do we think this issue leaves a task on the table, or are we willing to forget the other molecules with <=3 rotators?

  • Action Option 1: Close with reference to merged PRs

  • Action Option 2: Open a new issue for Genentech <= 3 rotors and label it as a suggested dataset, then close with reference to merged PRs and the new issue.

  • Closed with Option 1!

 

Action items

Decisions