Issues with loading old BasicResultCollection datasets
@Alexandra McIsaac
AMI: Recaps the FF fitting pipeline. Hessians are used for the initial guess with the Modified Seminario Method, but not in the fitting afterwards.
Our pre-QCPortal 0.50 training datasets don’t have CMILES associated with the Hessian calculations, making it impossible to load them directly as a BasicResultCollection. They can, however, be accessed via OptimizationResultCollection.to_basic_result_collection from the corresponding Optimization datasets.
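A minimal sketch of that workaround, assuming the openff-qcsubmit/qcportal APIs; the dataset name below is a placeholder:

```python
from qcportal import PortalClient
from openff.qcsubmit.results import OptimizationResultCollection

client = PortalClient("https://api.qcarchive.molssi.org:443")

# Placeholder dataset name: the optimization dataset that the legacy
# Hessian (singlepoint) dataset was spawned from.
opt_results = OptimizationResultCollection.from_server(
    client=client,
    datasets=["OpenFF Example Optimization Dataset v1.0"],
    spec_name="default",
)

# Build a BasicResultCollection of Hessian records; the CMILES comes from
# the optimization entries rather than from the singlepoint records.
hessian_results = opt_results.to_basic_result_collection(driver="hessian")
```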
Is it possible to add the CMILES now, without re-computing the dataset?
LW: Do we need permission for this? Can anyone edit our molecules?
JW: Yes, there are credentials we have for accessing our own datasets; one QCA credential is only for submission, and another allows arbitrary re-writes
LW: In favor of editing the old molecules to add CMILES, if possible
JW: We can run the idea by Ben at next week’s QCA meeting; I’ll add this to the agenda
If it is not possible, do we want to re-compute the datasets?
AMI: IMO not a big enough problem to re-compute
If we don’t want to re-compute the datasets and can’t add CMILES, what do we want to happen if someone tries to load these datasets? What error message and/or workaround should we implement?
AMI – Could have a message print saying that, for old datasets, you can try loading the data as an optimization dataset instead.
BW – I added a warning for each entry missing CMILES. So this could be an option for a message that prints if all entries are missing CMILES. Or we could even search for Opt datasets that contain this molecule as a final mol, and print out their names.
LW – Could print out a summary like “how many mols were missing CMILES out of how many mols in the dataset”. The printouts for each bad mol were quite overwhelming
AMI – Agree
JW – So how about we target the following (rough logging sketch after this list):
An always-shown summary statement of how many entries were missing CMILES.
A by-default hidden warning for EACH entry missing CMILES (but one that can be enabled via warning/logging level controls).
If some or all CMILES are missing, always print a final warning that some legacy datasets are only loadable from an optimization dataset (usually with the same name as the singlepoint dataset).
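A rough sketch of what that behavior could look like, assuming a plain logging-based implementation (the helper name and entry attribute are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)


def warn_missing_cmiles(entries):
    """Hypothetical helper illustrating the proposed warning behavior."""
    missing = [e for e in entries if not getattr(e, "cmiles", None)]

    # Per-entry messages are hidden by default but visible at DEBUG level.
    for entry in missing:
        logger.debug("Entry %s is missing CMILES and will be skipped", entry)

    if missing:
        # Always-shown summary, plus a pointer to the optimization-dataset workaround.
        logger.warning(
            "%d/%d entries are missing CMILES. Legacy datasets may only be loadable "
            "via OptimizationResultCollection.to_basic_result_collection on the "
            "optimization dataset of the same name.",
            len(missing),
            len(entries),
        )
```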
Issue with coordinate precision in OptimizationResultCollection.to_basic_result_collection
In our new S dataset, 600/900 conformers have geometries that differ from the Optimization dataset’s final geometries by >1e-9 Å (but <1e-8 Å), leading to them not being recognized as the same molecule by OptimizationResultCollection.to_basic_result_collection (toy illustration below)
LW: this doesn’t appear to be an issue with previous datasets as they were constructed from QCA objects and directly linked molecule IDs.
Link:
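A toy illustration of the tolerance problem described above; this is not the actual matching code in openff-qcsubmit:

```python
import numpy as np

# Hypothetical (n_atoms, 3) coordinates in Angstrom for the same conformer,
# as stored on the optimization record vs. the singlepoint record.
opt_geometry = np.array([[0.0, 0.0, 0.0], [0.97, 0.0, 0.0]])
sp_geometry = opt_geometry + 5e-9  # off by more than 1e-9 but less than 1e-8

# rtol is zeroed so this is a pure absolute comparison.
# A 1e-9 tolerance treats these as different molecules...
print(np.allclose(opt_geometry, sp_geometry, rtol=0.0, atol=1e-9))  # False
# ...while loosening to 1e-8 (or matching on molecule IDs) would pair them up.
print(np.allclose(opt_geometry, sp_geometry, rtol=0.0, atol=1e-8))  # True
```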
JW: Can we directly attach CMILES to these single points?
AMI: I think we already do; we use qcsubmit, and each record has a CMILES in the new datasets
BW: Agree, I think the CMILES are fine in the new datasets
How to proceed?
Short term: How do we get current datasets to be usable?
LM – In the very short term, the current patchwork solution of using different methods for old and new datasets is working. It will get annoying long-term, but nothing is currently blocked.
JW – Could also resubmit; we have no lack of compute
LW – Any resubmission would have the same problem
BW: Could separate out the MSM step, downloading just the Hessians for it rather than converting
LM: A bit expensive to re-filter
LM: Also, still has to be patchwork since the older datasets aren’t downloadable directly. But this is less patchwork than my existing solution
LW: The charge check is the most expensive part, and it isn’t actually necessary here, since we don’t assign charges
Medium term: How do we change our processes/infrastructure to prevent this from surprising us in the future?
LM – Could look at qcsubmit to determine where precision loss might be happening
BW – Possibly create_basic_dataset could just be updated not to round-trip through our infra
LW – Or just directly use molecule IDs instead of doing anything with coordinates? (rough sketch below)
BW and LM will give this a shot
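A minimal sketch of the molecule-ID approach, assuming the post-0.50 qcportal client API; the record IDs are placeholders:

```python
from qcportal import PortalClient

client = PortalClient("https://api.qcarchive.molssi.org:443")

# Placeholder record IDs; in practice these come from the optimization and
# singlepoint (Hessian) datasets being matched up.
opt_record = client.get_optimizations(12345678)
hessian_record = client.get_singlepoints(87654321)

# Comparing QCArchive molecule IDs sidesteps the coordinate round-trip
# and the associated precision loss entirely.
same_molecule = opt_record.final_molecule.id == hessian_record.molecule.id
print(same_molecule)
```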
BW+JW – Could pull down data for training using SinglepointDataset instead of OptimizationDataset, since modern datasets have CMILES? (sketch after this thread)
There may be technical issues with the current target-setup pipeline, possibly involving BespokeFit
LW: I’m not a massive fan of this, as it would rely on every optimization dataset having a corresponding single-point dataset
JW will ensure that bespokefit is on track to be maintained in a way that lets it continue being used for production fitting.
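For newer datasets that do carry CMILES, the direct singlepoint route might look roughly like this, assuming the openff-qcsubmit API; the dataset name is a placeholder:

```python
from qcportal import PortalClient
from openff.qcsubmit.results import BasicResultCollection

client = PortalClient("https://api.qcarchive.molssi.org:443")

# Placeholder dataset name: this only works for post-0.50 singlepoint
# datasets where every entry carries a CMILES attribute.
hessian_results = BasicResultCollection.from_server(
    client=client,
    datasets=["OpenFF Example Hessian Singlepoint Dataset v1.0"],
    spec_name="default",
)
```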
Long term:
Very long term:
JW – Could refactor QCSubmit, since the current QCA doesn’t have the limitations on storing CMILES that initially inspired a lot of QCSubmit’s mission