2024-10-08 QCA dataset submission meeting

Participants

  • @Alexandra McIsaac

  • @Jeffrey Wagner

  • @Brent Westbrook (Unlicensed)

  • @Lily Wang

Discussion topics


Issues with loading old BasicResultCollection datasets (@Alexandra McIsaac)

  • AMI: recaps the FF fitting pipeline. Hessians are used for the initial guess via the modified Seminario method, but not in the fitting afterwards.

  • Our pre-QCPortal-0.50 training datasets don’t have CMILES associated with their Hessian calculations, making it impossible to load them directly as a BasicResultCollection. They can, however, be accessed from the corresponding optimization datasets via OptimizationResultCollection.to_basic_result_collection (see the sketch below).
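
    A minimal sketch of this workaround (the dataset name below is a placeholder, not one of the affected datasets):

      # Pull the optimization dataset, whose records do carry CMILES, then
      # convert to a collection of Hessian single points matched to each
      # optimization's final geometry.
      from qcportal import PortalClient
      from openff.qcsubmit.results import OptimizationResultCollection

      client = PortalClient("https://api.qcarchive.molssi.org:443")

      opt = OptimizationResultCollection.from_server(
          client=client,
          datasets=["OpenFF Example Optimization Dataset v1.0"],  # placeholder
          spec_name="default",
      )

      hessians = opt.to_basic_result_collection(driver="hessian")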

  • Is it possible to add the CMILES now, without re-computing the dataset?

  • If it is not possible, do we want to re-compute the datasets?

    • AMI: IMO not a big enough problem to re-compute

  • If we don’t want to re-compute the datasets and can’t add CMILES, what do we want to happen if someone tries to load these datasets? What error message and/or workaround should we implement?

    • AMI – Could print a message telling users that, for old datasets, they can try loading the data as an optimization dataset instead.

    • BW – I added a warning for each entry missing CMILES. That could become a single message that prints if all entries are missing CMILES, or we could even search for optimization datasets that contain the molecule as a final molecule and print out their names.

    • LW – could print out a summary like “how many mols were missing CMILES out of how many mols in the dataset”. The per-molecule printouts were quite overwhelming.

      • AMI – Agree

    • JW – So how about we target the following (a sketch follows this list):

      • An always-shown summary statement of how many entries were missing CMILES.

      • A warning for EACH entry missing CMILES, hidden by default but available via warning/logging-level controls.

      • If some or all CMILES are missing, always print a final warning that some legacy datasets are only loadable from an optimization dataset (usually with the same name as the singlepoint dataset).
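
      A rough sketch of that scheme (the entry structure and function name are hypothetical, not qcsubmit’s actual implementation):

        import logging

        logger = logging.getLogger(__name__)

        CMILES_KEY = "canonical_isomeric_explicit_hydrogen_mapped_smiles"

        def warn_about_missing_cmiles(entries):
            missing = [
                e for e in entries if CMILES_KEY not in (e.attributes or {})
            ]

            # Per-entry detail: hidden by default, visible at DEBUG level.
            for entry in missing:
                logger.debug("Entry %s is missing CMILES.", entry.name)

            if missing:
                # Always-shown summary of how many entries were affected.
                logger.warning(
                    "%d of %d entries are missing CMILES.",
                    len(missing), len(entries),
                )
                # Final pointer to the legacy-dataset workaround.
                logger.warning(
                    "Some legacy datasets are only loadable from the "
                    "optimization dataset (usually the same name as the "
                    "singlepoint dataset) via "
                    "OptimizationResultCollection.to_basic_result_collection()."
                )
            return missing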



Issue with coordinate precision in OptimizationResultCollection.to_basic_result_collection

  • In our new S dataset, 600 of the 900 conformers have geometries that differ from the optimization dataset’s final geometries by more than 1e-9 Å (but less than 1e-8 Å), so they are not recognized as the same molecule by OptimizationResultCollection.to_basic_result_collection (illustrated after this list)

    • LW: this doesn’t appear to be an issue with previous datasets as they were constructed from QCA objects and directly linked molecule IDs.

    • Link:

    • JW: can we directly attach CMILES to these single points?

      • AMI: I think we are; we use qcsubmit, and each record has a CMILES in the new datasets

      • BW: agree, I think the CMILES are fine in the new datasets
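
    A toy illustration of the tolerance problem, with geometries as NumPy arrays:

      import numpy as np

      # Final geometry from the optimization record (2-atom toy molecule).
      opt_geometry = np.array([[0.0, 0.0, 0.0],
                               [0.0, 0.0, 1.1]])

      # The same geometry after a round trip through dataset creation,
      # offset by ~5e-9 A.
      sp_geometry = opt_geometry + 5.0e-9

      # A 1e-9 A tolerance treats these as different molecules...
      print(np.allclose(opt_geometry, sp_geometry, rtol=0.0, atol=1.0e-9))  # False
      # ...while 1e-8 A matches them.
      print(np.allclose(opt_geometry, sp_geometry, rtol=0.0, atol=1.0e-8))  # True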

  • How to proceed?

    • Short term: How do we get current datasets to be usable?

      • AMI – in the very short term, the current patchwork of different methods to handle old and new datasets is working. It will get annoying long-term, but nothing is currently blocked.

      • JW – Could also resubmit; we have no shortage of compute

        • LW – Any resubmission would have the same problem

      • BW: Could separate out the MSM step, downloading just the Hessians for it rather than converting (see the sketch below)

        • AMI: A bit expensive to re-filter

        • AMI: It would still have to be patchwork, since the older datasets aren’t downloadable directly, but this is less patchwork than my existing solution

        • LW: The charge check is the most expensive step, and it isn’t actually necessary here since we don’t assign charges

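        A sketch of this option (the dataset name is a placeholder; this only works where the Hessian records carry CMILES, i.e. modern datasets):

          from qcportal import PortalClient
          from openff.qcsubmit.results import BasicResultCollection

          client = PortalClient("https://api.qcarchive.molssi.org:443")

          # Download the Hessians directly for the MSM step, skipping the
          # optimization -> singlepoint conversion entirely.
          hessians = BasicResultCollection.from_server(
              client=client,
              datasets=["OpenFF Example Hessian Dataset v1.0"],  # placeholder
              spec_name="default",
          )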

    • Medium term: How do we change our processes/infrastructure to prevent this from surprising us in the future?


      • AMI – could look at qcsubmit to determine where the precision loss might be happening

      • BW – possibly create_basic_dataset could just be updated to not round-trip through our infrastructure

        • LW – Or just directly use molecule IDs instead of doing anything with coordinates? (sketched below)

        • BW and AMI will give this a shot
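
        A sketch of the ID-based matching (field names follow recent qcportal record models; the inputs are placeholders):

          def match_by_molecule_id(opt_records, cmiles_by_record_id, sp_records):
              """Pair single points with optimizations via shared molecule IDs."""
              # Map each optimization's final molecule ID to its CMILES.
              cmiles_by_mol_id = {
                  record.final_molecule_id: cmiles_by_record_id[record.id]
                  for record in opt_records
              }
              # A single point run on the exact same stored molecule shares
              # its molecule ID, so no coordinate comparison is needed.
              return [
                  (record, cmiles_by_mol_id[record.molecule_id])
                  for record in sp_records
                  if record.molecule_id in cmiles_by_mol_id
              ]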

      • BW+JW – Could pull down data for training using SinglepointDataset instead of OptimizationDataset, since modern datasets have CMILES?

        • There may be technical issues with the current target setup pipeline, possibly involving BespokeFit

        • LW: I’m not a massive fan of this, as it would rely on every optimization dataset having a corresponding single-point dataset

      • JW will ensure that BespokeFit is on track to be maintained in a way that lets it continue being used for production fitting.


    • Long term:

    • Very long term:

      • JW – Could refactor QCSubmit, since the current QCA no longer has the limitations on storing CMILES that initially inspired a lot of QCSubmit’s mission


Action items

Decisions