Participants

...

Discussion topics

Item: Issues with loading old BasicResultCollection datasets
Presenter: Alexandra McIsaac
Notes:

  • AMI: recaps the FF fitting pipeline. Hessians are used in the initial guess via the Modified Seminario Method, but not in the fitting afterwards.

  • https://github.com/openforcefield/openff-qcsubmit/issues/299

  • Our pre-QCPortal 0.50 training datasets don’t have CMILES associated with their Hessian calculations, so they can’t be loaded directly as a BasicResultCollection. They can still be accessed via OptimizationResultCollection.to_basic_result_collection from the corresponding Optimization datasets (see the example at the end of this item).

  • Is it possible to add the CMILES now, without re-computing the dataset?

  • If it is not possible, do we want to re-compute the datasets?

    • AMI: IMO not a big enough problem to re-compute

  • If we don’t want to re-compute the datasets and can’t add CMILES, what do we want to happen if someone tries to load these datasets? What error message and/or workaround should we implement?

    • AMI – Could print a message suggesting that, for old datasets, users try loading them as an Optimization dataset.

    • BW – I added a warning for each entry missing CMILES. So this could be an option for a message that prints if all entries are missing CMILES. Or we could even search for Opt datasets that contain this molecule as a final mol, and print out their names.

    • LW – could print a summary like “how many mols were missing CMILES out of how many mols in the dataset”. The printouts for each bad mol were quite overwhelming.

      • AMI – Agree

    • JW – So how about we target:

      • An always-shown summary statement of how many entries were missing CMILES.

      • A by-default hidden warning for EACH entry missing CMILES (but one that can be enabled with warning/logging level controls).

      • If some/all CMILES are missing, always print a final warning that some legacy datasets are only loadable from an Optimization dataset (usually with the same name as the singlepoint dataset). See the sketches below.
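
A minimal sketch of what this proposal could look like (hypothetical helper and entry layout, not current QCSubmit code; the per-entry detail goes through logging so that it is hidden by default but can be enabled):

    import logging
    import warnings

    logger = logging.getLogger(__name__)

    def validate_cmiles(entries):
        """Hypothetical sketch of the proposal for reporting missing CMILES."""
        missing = [entry for entry in entries if not entry.get("cmiles")]

        # Per-entry warning: debug level, silent by default, visible with
        # e.g. logging.basicConfig(level=logging.DEBUG)
        for entry in missing:
            logger.debug("Entry %s is missing CMILES and will be skipped.", entry["id"])

        if missing:
            # Always-shown summary, with a pointer to the legacy workaround
            warnings.warn(
                f"{len(missing)}/{len(entries)} entries are missing CMILES and were "
                "skipped. Legacy datasets may only be loadable from the corresponding "
                "optimization dataset (usually with the same name as the singlepoint "
                "dataset).",
                UserWarning,
                stacklevel=2,
            )

        return [entry for entry in entries if entry.get("cmiles")]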
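
For reference, the Optimization-dataset workaround mentioned above looks roughly like this (the dataset name is a placeholder; from_server and to_basic_result_collection are the QCSubmit methods named in this discussion, but treat the exact arguments as a sketch):

    from openff.qcsubmit.results import OptimizationResultCollection
    from qcportal import PortalClient

    client = PortalClient("https://api.qcarchive.molssi.org:443")

    # Placeholder name: substitute the legacy Optimization dataset of interest
    opt_collection = OptimizationResultCollection.from_server(
        client=client,
        datasets=["OpenFF Optimization Set 1"],
        spec_name="default",
    )

    # Pull out the Hessian single points computed at the optimized geometries
    hessian_collection = opt_collection.to_basic_result_collection(driver="hessian")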


Item: Issue with coordinate precision in OptimizationResultCollection.to_basic_result_collection
Notes:

  • https://github.com/openforcefield/openff-qcsubmit/issues/297

  • In our new S dataset, 600/900 conformers have geometries that differ from the Optimization dataset’s final geometries by >1e-9 Å (but <1e-8 Å), so they are not recognized as the same molecule by OptimizationResultCollection.to_basic_result_collection

    • LW: this doesn’t appear to be an issue with previous datasets, as they were constructed from QCA objects with directly linked molecule IDs.

    • Link: https://github.com/openforcefield/qca-dataset-submission/blob/f0e663bbc7f5457c1884ab0148532e12e996069f/submissions/2019-07-09-OpenFF-Optimization-Set/04_create_hessian_dataset.py#L46-L54

    • JW: can we directly attach CMILES to these single points?

      • AMI: I think we are; we use QCSubmit, and each record has a CMILES in the new datasets

      • BW: agree, I think the CMILES are fine in the new datasets

  • How to proceed?

    • Short term: How do we get current datasets to be usable?

      • LM – in the very short term, the current patchwork solution of different methods for old and new datasets is working. It will get annoying long-term, but nothing is currently blocked.

      • JW – Could also resubmit, we have no lack of compute

        • LW – Any resubmission would have the same problem

      • BW: Could separate out the MSM step: just download the Hessians for it and skip the conversion

        • LM: A bit expensive to re-filter

        • LM: Also, it still has to be patchwork, since the older datasets aren’t downloadable directly. But this is less patchwork than my existing solution

        • LW: The charge check is the most expensive part, and it isn’t actually necessary here, since we don’t assign charges

    • Medium term: How do we change our processes/infrastructure to prevent this from surprising us in the future?

      • LM – could look at QCSubmit to determine where the precision loss might be happening

      • BW – possibly create_basic_dataset could just be updated not to round-trip through our infrastructure

        • LW – Or just directly use molecule IDs instead of doing anything with coordinates? (See the sketch after this list.)

        • BW and LM will give this a shot

      • BW+JW – Could pull down data for training using SinglepointDataset instead of OptimizationDataset, since modern datasets have CMILES?

        • There may be technical issues with the current target-setup pipeline, possibly involving BespokeFit

        • LW: I’m not a massive fan of this, as it would rely on every optimization dataset having a single point dataset

      • JW will ensure that BespokeFit is on track to be maintained in a way that lets it continue being used for production fitting.

    • Long term:

    • Very long term:

      • JW – Could refactor QCSubmit, since the current QCA doesn’t have the limitations on storing CMILES that initially inspired a lot of QCSubmit’s mission
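
A rough sketch of the molecule-ID idea above (illustrative only: the query_singlepoints filters and record attributes shown are assumptions about the QCPortal API, not the current create_basic_dataset implementation):

    from qcportal import PortalClient

    client = PortalClient("https://api.qcarchive.molssi.org:443")

    def match_hessians_by_molecule_id(optimization_records):
        """Pair each optimization with its Hessian single point via molecule IDs.

        Matching on the final molecule's database ID avoids comparing
        coordinates, so sub-1e-8 Å round-tripping differences can no longer
        break the link.
        """
        matches = {}
        for opt in optimization_records:
            final_id = opt.final_molecule.id  # exact identity, no geometry check
            # Assumed QCPortal query: filter single points by molecule and driver
            hessians = client.query_singlepoints(
                molecule_id=[final_id], driver="hessian"
            )
            matches[opt.id] = list(hessians)
        return matches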

Action items

  •  

Decisions