Issues with loading old BasicResultCollection datasets
@Alexandra McIsaac
AMI: Recaps the FF fitting pipeline. Hessians are used for the initial guess with the Modified Seminario Method, but not in the fitting afterwards.
Our pre-QCPortal 0.50 training datasets don’t have CMILES associated with the Hessian calculations, making it impossible to load them directly as a BasicResultCollection. They can, however, be accessed via OptimizationResultCollection.to_basic_result_collection from the corresponding Optimization datasets.
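A minimal sketch of that workaround, assuming the openff-qcsubmit/qcportal APIs; the dataset name below is a placeholder:

```python
from qcportal import PortalClient
from openff.qcsubmit.results import OptimizationResultCollection

client = PortalClient("https://api.qcarchive.molssi.org:443")

# Placeholder dataset name: the optimization dataset that the legacy
# Hessian (singlepoint) dataset was spawned from.
opt_results = OptimizationResultCollection.from_server(
    client=client,
    datasets=["OpenFF Example Optimization Dataset v1.0"],
    spec_name="default",
)

# Build a BasicResultCollection of Hessian records; the CMILES comes from
# the optimization entries rather than from the singlepoint records.
hessian_results = opt_results.to_basic_result_collection(driver="hessian")
```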
Is it possible to add the CMILES now, without re-computing the dataset?
LW: Do we need permission for this? Can anyone edit our molecules?
JW: Yes, there are credentials we have for accessing our own datasets; one QCA credential is only for submission, and another allows arbitrary re-writes
LW: In favor of editing the old molecules to add CMILES, if possible
JW: We can run the idea by Ben at next week’s QCA meeting; I’ll add this to the agenda
If it is not possible, do we want to re-compute the datasets?
AMI: IMO not a big enough problem to re-compute
If we don’t want to re-compute the datasets and can’t add CMILES, what do we want to happen if someone tries to load these datasets? What error message and/or workaround should we implement?
AMI – Could have a message print saying that, for old datasets, you can try loading the data as an optimization dataset instead.
BW – I added a warning for each entry missing CMILES. So this could be an option for a message that prints if all entries are missing CMILES. Or we could even search for Opt datasets that contain this molecule as a final mol, and print out their names.
LW – Could print out a summary like “how many mols were missing CMILES out of how many mols in the dataset”. The printouts for each bad mol were quite overwhelming
AMI – Agree
JW – So how about we target the following (rough logging sketch after this list):
An always-shown summary statement of how many entries were missing CMILES.
A by-default hidden warning for EACH entry missing CMILES (but one that can be enabled via warning/logging level controls).
If some or all CMILES are missing, always print a final warning that some legacy datasets are only loadable from an optimization dataset (usually with the same name as the singlepoint dataset).
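A rough sketch of what that behavior could look like, assuming a plain logging-based implementation (the helper name and entry attribute are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def warn_missing_cmiles(entries):
    """Hypothetical helper illustrating the proposed warning behavior."""
    missing = [e for e in entries if not getattr(e, "cmiles", None)]

    # Per-entry messages are hidden by default but visible at DEBUG level.
    for entry in missing:
        logger.debug("Entry %s is missing CMILES and will be skipped", entry)

    if missing:
        # Always-shown summary, plus a pointer to the optimization-dataset workaround.
        logger.warning(
            "%d/%d entries are missing CMILES. Legacy datasets may only be loadable "
            "via OptimizationResultCollection.to_basic_result_collection on the "
            "optimization dataset of the same name.",
            len(missing),
            len(entries),
        )
```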
Issue with coordinate precision in OptimizationResultCollection.to_basic_result_collection
In our new S dataset, 600/900 conformers have geometries that differ from the Optimization dataset’s final geometries by >1e-9 Å (but <1e-8 Å), leading to them not being recognized as the same molecule by OptimizationResultCollection.to_basic_result_collection (toy illustration below)
LW: this doesn’t appear to be an issue with previous datasets as they were constructed from QCA objects and directly linked molecule IDs.
Link:
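A toy illustration of the tolerance problem described above; this is not the actual matching code in openff-qcsubmit:

```python
import numpy as np

# Hypothetical (n_atoms, 3) coordinates in Angstrom for the same conformer,
# as stored on the optimization record vs. the singlepoint record.
opt_geometry = np.array([[0.0, 0.0, 0.0], [0.97, 0.0, 0.0]])
sp_geometry = opt_geometry + 5e-9  # off by more than 1e-9 but less than 1e-8

# rtol is zeroed so this is a pure absolute comparison.
# A 1e-9 tolerance treats these as different molecules...
print(np.allclose(opt_geometry, sp_geometry, rtol=0.0, atol=1e-9))  # False
# ...while loosening to 1e-8 (or matching on molecule IDs) would pair them up.
print(np.allclose(opt_geometry, sp_geometry, rtol=0.0, atol=1e-8))  # True
```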
JW: Can we directly attach CMILES to these single points?
AMI: I think we already do; we use qcsubmit, and each record has a CMILES in the new datasets
BW: Agree, I think the CMILES are fine in the new datasets
How to proceed?
Short term: How do we get current datasets to be usable?
LM – In the very short term, the current patchwork solution of using different methods for old and new datasets is working. It will get annoying long-term, but nothing is currently blocked.
JW – Could also resubmit; we have no lack of compute
LW – Any resubmission would have the same problem
BW: Could separate out the MSM step, downloading just the Hessians for it rather than converting
LM: A bit expensive to re-filter
LM: Also, still has to be patchwork since the older datasets aren’t downloadable directly. But this is less patchwork than my existing solution
LW: The charge check is the most expensive part, and it isn’t actually necessary here, since we don’t assign charges
Medium term: How do we change our processes/infrastructure to prevent this from surprising us in the future?
LM – Could look at qcsubmit to determine where precision loss might be happening
BW – Possibly create_basic_dataset could just be updated not to round-trip through our infra
LW – Or just directly use molecule IDs instead of doing anything with coordinates? (rough sketch below)
BW and LM will give this a shot
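A minimal sketch of the molecule-ID approach, assuming the post-0.50 qcportal client API; the record IDs are placeholders:

```python
from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")

# Placeholder record IDs; in practice these come from the optimization and
# singlepoint (Hessian) datasets being matched up.
opt_record = client.get_optimizations(12345678)
hessian_record = client.get_singlepoints(87654321)

# Comparing QCArchive molecule IDs sidesteps the coordinate round-trip
# and the associated precision loss entirely.
same_molecule = opt_record.final_molecule.id == hessian_record.molecule.id
print(same_molecule)
```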
BW+JW – Could pull down data for training using SinglepointDataset instead of OptimizationDataset, since modern datasets have CMILES? (sketch after this thread)
There may be technical issues with the current target-setup pipeline, possibly involving BespokeFit
LW: I’m not a massive fan of this, as it would rely on every optimization dataset having a corresponding single-point dataset
JW will ensure that bespokefit is on track to be maintained in a way that lets it continue being used for production fitting.
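For newer datasets that do carry CMILES, the direct singlepoint route might look roughly like this, assuming the openff-qcsubmit API; the dataset name is a placeholder:

```python
from qcportal import PortalClient
from openff.qcsubmit.results import BasicResultCollection

client = PortalClient("https://api.qcarchive.molssi.org:443")

# Placeholder dataset name: this only works for post-0.50 singlepoint
# datasets where every entry carries a CMILES attribute.
hessian_results = BasicResultCollection.from_server(
    client=client,
    datasets=["OpenFF Example Hessian Singlepoint Dataset v1.0"],
    spec_name="default",
)
```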
Long term:
Very long term:
JW – Could refactor QCSubmit, since the current QCA doesn’t have the limitations on storing CMILES that initially inspired a lot of QCSubmit’s mission