2025-07-01 QCA Dataset Submission Meeting

Participants

@Jennifer Clark
@Jeffrey Wagner
@Lily Wang

Discussion topics

	Item
Update Dataset Tracking	Project Board; Slides
	Running
		PR 440: Chodera tmQM and PR 453: Hessians
			Still moving
			JW – The tmQM dataset at its current rate has an ETA of mid-2026. Are we gaining value from continuing to compute it?
			JC – Unclear, some questions about quality of confs, and CI is observing some high energies in output. Considering making a dataset of minimal mols.
			LW – Was also surprised how long hessian calcs took for my datasets, for already-optimized geometries.
			JC – The TMQM workers had kinda locked up and I only caught it late last week, so the delta of progress since last meeting might not be representative. I should look into how porphyrins are doing - those are the most important thing. Could cancel things in larger bins.
			LW – Do recently-error-cycled tasks go to the front of the queue?
				Unsure, we should ask BP
			JW – Does babysitting jobs take much of your attention, JC?
			JC – Some, just need to remember to run the fetching code and come back 30 minutes later to error cycle afterwards.
			JC – I’m in favor of continuing to run this while I don’t have higher-priority datasets.
			LW – Agree, seems to be making forward progress
			JW – Sounds good.
			JC – I’ll ramp up workers for this.
		PR453
			Seems to be going super slowly
			LW – These aren’t too important for me, we won’t come close to using all of this in a force field fit. So if we want to end this for now and consider coming back to make this a more targeted dataset later that would work.
			JW + JC – Agree, JC will stop computing this and can resubmit mols of interest later if we’re more focused.
		PR449
			LW – Some idea what level of resources you’ll need if resubmitted?
			JC – I expect far less. Though there was some chatter online that SOS-MP2 might be implemented incorrectly.
			(General) – JC will do resubmissions of PR449 varying in several ways to debug resource usage
QDS handling of non-QCSubmit dataset.	Scaffold Submission PR is completed and ready for review
Clean benchmark releases	PR475 Industry Benchmark is almost ready to go but not all molecules are transferring to a new dataset…
		Make sure the original entry names are used with the company name and index
	MLPepper Dataset is in process. Closes #465 #466 #473
		Has Josh H’s affiliation changed at any point that should be reflected in the Zenodo record?
			LW – Not sure when JH joined OpenFE but I’d put his affiliation as Newcastle/OpenFF since that’s most relevant to the work
			JW – Agree, don’t spend too long on this.
	Zenodo record for `OpenFF ESP Fragment Conformers v1.0` is in progress
		Singlepoint datasets aren’t covered in the existing ipynb for docker, and some datasets don’t follow standards with attributes. We have our notebook labeled as “v1.0”.
			Should I make a new version with:
				molecule
				final_molecule
				final_molecules[0]
				`getattr(record, mol_attribute[ds_type]).extras["canonical…"]` and try:except with: `getattr(record, mol_attribute[ds_type]).attributes["canonical…"]`
			Or should I make a “data_handling_singlepoint.ipynb”?
			LW – Would one or the other be simpler for you?
			JC – The 1.0/1.1 thing would make the most sense for FF release work. But maybe I should make different notebooks for singlepoints, opts, or TDs.
			LW – Having a notebook for each dataset type sounds good to me.
		There are no molecular statistics, do I need to generate those?
			LW – If it’s not any trouble, it would be good to include them
			JW – Might be good to set a threshold for like FTE-days (0.1?)
			LW – I’d just say go for it if it’s not much trouble/mostly just waiting.
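A minimal sketch of the `extras`/`attributes` lookup discussed under the Zenodo record item, in case it helps when drafting the notebook. Several things here are assumptions and not from the meeting: which dataset type pairs with which of `molecule`, `final_molecule`, and `final_molecules[0]`; the function name `lookup_molecule_field`; and the use of callables instead of `getattr`, chosen only because `final_molecules[0]` is not a plain attribute name. The truncated `"canonical…"` key is left as a parameter. The stand-in objects are `SimpleNamespace` placeholders, not real QCPortal records.

```python
from types import SimpleNamespace

# ASSUMPTION: this pairing of dataset type to molecule accessor is illustrative;
# the notes list the three accessors but not which dataset type uses which.
MOL_ATTRIBUTE = {
    "singlepoint": lambda record: record.molecule,
    "optimization": lambda record: record.final_molecule,
    "torsiondrive": lambda record: record.final_molecules[0],
}


def lookup_molecule_field(record, ds_type, key):
    """Read `key` from the molecule's extras, falling back to its attributes."""
    mol = MOL_ATTRIBUTE[ds_type](record)
    try:
        return mol.extras[key]
    except (AttributeError, KeyError, TypeError):
        # extras is missing, is None, or lacks the key: try attributes instead.
        return mol.attributes[key]


# Stand-in record (hypothetical data) to exercise the fallback path.
stub_mol = SimpleNamespace(extras={}, attributes={"example_key": "CCO"})
stub_record = SimpleNamespace(final_molecule=stub_mol)
print(lookup_molecule_field(stub_record, "optimization", "example_key"))  # CCO
```

If neither container holds the key, the `KeyError` from the second lookup propagates, which seems preferable to silently returning nothing for a dataset that doesn’t follow the standards.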
MolSSI Info / Align Priorities on MolSSI Asks	No notes from June 24th meeting
	New from last QCAUM meeting:
		Communicated that adding the ability to copy records from a database on one server to a database on another server is not a priority for us, but Ben says it’s on his plan of work time horizon.
		We communicated that we would prefer to get the 5 TB copy, but not right now. Ben offered that if we want to coordinate a backup that we have access to, that’s on the table.
	Requests:
		Ben has acknowledged receiving my benchmarking notebook but hasn’t gotten to looking into how to speed up record access.
			I ran a notebook to benchmark the disparity between iterating over entries and records to make a case to Ben that the latter is prohibitive.
			It takes an order of magnitude longer to iterate through records than entries, even though I’ve fetched both to cache ahead of time.
			With this in mind, it takes ~90 min to fetch the entries for my large dataset, which would correspond to 15 hours for records
		Ask if “error cycled” records go to the back or the front of the queue
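For reference, the 15-hour figure follows from the ~90 min entry time and the order-of-magnitude slowdown. A small sketch of that arithmetic, plus a generic timing helper of the kind such a benchmark could use. The helper names are hypothetical, “an order of magnitude” is taken as exactly 10x (an assumption), and nothing here calls QCPortal.

```python
import time


def time_iteration(iterable):
    """Return (item_count, elapsed_seconds) for one full pass over `iterable`."""
    start = time.perf_counter()
    count = sum(1 for _ in iterable)
    return count, time.perf_counter() - start


def extrapolate_minutes(baseline_minutes, slowdown_factor):
    """Scale a measured baseline time by an observed slowdown factor."""
    return baseline_minutes * slowdown_factor


entries_minutes = 90  # from the notes: ~90 min for entries on the large dataset
slowdown = 10         # ASSUMPTION: "an order of magnitude" read as exactly 10x
records_hours = extrapolate_minutes(entries_minutes, slowdown) / 60
print(records_hours)  # 15.0
```

Passing the entry iterator and the record iterator to `time_iteration` in turn would give the measured ratio to use in place of the assumed 10x.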
Old Issue of the Week	Conformer generation should fall back to RDKit ETKDG on Omega failures
		John suggests that if Omega fails in generating initial conformers, RDKit should be the fallback. Should this be a QCSubmit ticket?
	Bonus: Missing chemistry to (potentially) cover post-release-1
		`[#8]~[#35]`: O-Br single bonds are present in GAFF2 but not present in our current datasets. We could port in a placeholder value from GAFF2, but there are no molecules with this chemistry in our current datasets.
			Still not addressed
		`[#7X3]~[#7X3]~([#8])~[#8]`: Nitroamines
			Addressed with smirks="[#6,#7,#8:1]-[#7X3:2](~[#8X1])~[#8X1:3]"
		`[#6:1]~[#6:2]=[#15:3]~[#6:4]`: C=P double bond (potentially with adjacent singles)
			Still not addressed
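One possible shape for the Omega-to-RDKit fallback, sketched with plain callables so it makes no claims about the actual QCSubmit or toolkit APIs. The function name `generate_with_fallback` and the two stub generators are hypothetical; real generators would wrap whatever Omega and RDKit ETKDG calls are in use.

```python
def generate_with_fallback(molecule, generators):
    """Try each (name, generator) in order; return (name, conformers) from the first success."""
    errors = []
    for name, generate in generators:
        try:
            conformers = generate(molecule)
        except Exception as exc:  # any backend failure triggers the next fallback
            errors.append(f"{name}: {exc}")
            continue
        if conformers:
            return name, conformers
        errors.append(f"{name}: no conformers returned")
    raise RuntimeError("all conformer generators failed: " + "; ".join(errors))


# Hypothetical stand-ins for the two backends.
def stub_omega(molecule):
    raise ValueError("Omega failed to generate conformers")


def stub_rdkit_etkdg(molecule):
    return [f"{molecule}-conformer-0"]


backend, confs = generate_with_fallback(
    "CCO", [("omega", stub_omega), ("rdkit_etkdg", stub_rdkit_etkdg)]
)
print(backend)  # rdkit_etkdg
```

Returning the backend name alongside the conformers would let a submission record which toolkit actually produced the starting geometries.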

Action items

Decisions