2021-05-07 QCA Submission meeting notes

Participants

  • @Trevor Gokey

  • @Pavan Behara

  • @Hyesu Jang

  • @David Dotson

Goals

  • User questions/issues, new submissions

  • Infrastructure needs / advances

Discussion topics

Item

Presenter

Notes

User issues

Hyesu

  • Problems loading existing dataset to combine with new molecules

    • when I tried to load dataset.json.bz2, I got a validation error saying that all dataset entries are missing the fixed hydrogen inchi

    • JH: recently added inchikey validation, but it means old datasets don’t have it

      • TG: could use older version?

      • [conclusion] Hyesu needs atom map functionality, so it is not possible to use the old dataset.json directly

  • HJ: is it expected to have tautomers enumerated in optimization datasets, or can we skip them?

    • TG: it’s up to you, if you want to have them or not

    • HJ: Do we still have problems with order dependence between enumerating protomers and tautomers?

    • JH: only an issue for torsiondrive datasets, if you tag dihedrals first or not

    • JH: for the optimization dataset, can you send me the SMILES list you’re using?

      • HJ: yes, no problem

  • TG: if we add and remove things from the data structure, we’ll have issues like this one

    • JH: will always be able to go back to previous version of qcsubmit; otherwise really hard to support full backwards compatibility

      • for automation, can get away with pulling dataset name and other things out of the data structure without passing it through e.g. TorsiondriveDataset.__init__

    • DD: what if we take the approach of treating QCSubmit as a strict structure that you must abide by if you want to use its workflow components; can always pull apart old dataset.jsons directly as pure Python objects?

      • TG: now have the problem of requiring expert knowledge of data structures for users

      • JH: would prefer to make the new fields optional so they don’t trigger pydantic validation; once it’s existed for some time and is very stable, can make it required

      • DD: a bit like a reverse deprecation; at some future release the validation will be required

      • JH: yes, I like that

  • TG: another hot take: possible to make the code use inchikey if there, not use it if it’s not?

    • JH: another good idea; if we overwrite the init, see if fixed hydrogen inchi is there. If not, generate and add it

  • [decision] make inchi keys optional from pydantic’s perspective

    • JH: need to consider how we phase in new functionality like this in the future; optional → required
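
A minimal sketch of two ideas from the discussion above: reading an old dataset.json.bz2 as plain Python objects without model validation, and treating the fixed hydrogen inchi as optional with a backfill when it is missing. This is not the openff-qcsubmit implementation. It uses a stdlib dataclass as a stand-in for the pydantic model, and the names `read_raw_dataset`, `Entry`, `backfill_inchi`, the field name `fixed_hydrogen_inchi`, and the `dataset_name` key are assumptions made for illustration. The InChI generator is passed in as a callable because the real one would come from a cheminformatics toolkit.

```python
import bz2
import json
import warnings
from dataclasses import dataclass
from typing import Callable, Optional


def read_raw_dataset(path: str) -> dict:
    """Load a bz2-compressed dataset JSON as a plain dict, with no validation."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


@dataclass
class Entry:
    """Hypothetical stand-in for a dataset entry (not the qcsubmit model)."""

    index: str
    smiles: str
    # Optional for now, so entries written by older versions still load.
    fixed_hydrogen_inchi: Optional[str] = None

    def backfill_inchi(self, generate: Callable[[str], str]) -> None:
        """Generate the missing value and warn that it may become required."""
        if self.fixed_hydrogen_inchi is None:
            warnings.warn(
                f"entry {self.index!r} has no fixed hydrogen inchi; generating it. "
                "The field may become required in a future release.",
                FutureWarning,
            )
            self.fixed_hydrogen_inchi = generate(self.smiles)
```

With this shape, automation that only needs metadata could call `read_raw_dataset(path)` and look up the key it needs, while code that needs full entries would construct `Entry` objects and call `backfill_inchi` with whatever generator is available.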

Enamine real

Trevor

  • Avenues to do submission outside of qca-dataset-submission?

    • TG: please put together a procedure for how best to put together a submission that is not submitted via GHA

Action items

  • @Hyesu Jang will send @Joshua Horton the SMILES list she is using for assembling the OptimizationDataset, in which tautomer and protomer enumeration is very order dependent
  • @Joshua Horton will make inchi keys optional in openff-qcsubmit dataset entries for now, then phase in required usage and decide whether they are generated when not present on a newly-loaded set
  • @David Dotson will write up an approach to submitting a dataset outside of qca-dataset-submission automation; needed for datasets that will almost certainly take longer than GHA allows for submission
  • @David Dotson will modify the lifecycle automation to ensure it doesn’t choke on old datasets missing inchikeys [confirmed that current implementation is unaffected; only uses pydantic model for submission]

Decisions
