2024-09-25 Meeting notes

 Date

Sep 11, 2024

 Participants

  • @Matt Thompson

  • @Alexandra McIsaac

  • @Brent Westbrook

  • @Jeffrey Wagner

 Discussion topics

Item

Presenter

Notes

New input/output models

MT

Input models (“YAMMBS inputs”) are JSON on disk

  • Tagged with a name (for the dataset) and version (of the model)

  • Fairly large on disk mostly due to storing coordinates

  • Can be derived from QCArchive (or similar source) but recommended to use these as starting points for benchmarking runs

Output models are also JSON on disk (WIP)
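As a rough illustration of the input model described above (JSON on disk, tagged with a dataset name and a model version, with most of the size coming from coordinates), a minimal sketch might look like the following. The class and field names (`DatasetInput`, `MoleculeInput`, `mapped_smiles`, `model_version`) are hypothetical placeholders, not the actual YAMMBS schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class MoleculeInput:
    """One molecule entry (hypothetical layout, not the real YAMMBS model)."""

    mapped_smiles: str
    coordinates: list[float]  # flattened x, y, z values; dominates file size


@dataclass
class DatasetInput:
    """Input model tagged with a dataset name and a model version."""

    name: str
    model_version: int = 1
    molecules: list[MoleculeInput] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested MoleculeInput dataclasses
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "DatasetInput":
        data = json.loads(text)
        molecules = [MoleculeInput(**entry) for entry in data.pop("molecules")]
        return cls(molecules=molecules, **data)


# Round trip through a JSON string, as would happen when writing to disk
example = DatasetInput(
    name="example-dataset",
    molecules=[MoleculeInput("[H:1][H:2]", [0.0, 0.0, 0.0, 0.0, 0.0, 0.74])],
)
restored = DatasetInput.from_json(example.to_json())
```

Keeping the version as an explicit field would let a reader of the file decide how to parse it before touching the molecule entries.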

Ingesting non-QCArchive dataset

MT

  • Use case: load a small molecule dataset (SDF or similar file(s)) from somewhere other than QCArchive

  • Is this still an important use case?

    • LM – Not important to me. I’d check with LW though since she’s working on all sorts of things.

    • BW – Same

  • Recommended data source to use in testing?

File size in new JSON models

MT

  • Can’t cleanly compress list[float] in JSON

  • Could round to ~10 decimal places, but deferring this unless and until file size or JSON read/write times become a significant issue
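The rounding idea above can be sketched as follows. This is only an illustration using the standard library; the helper name `round_floats` and the randomly generated coordinates are assumptions made for the example, not part of the actual models.

```python
import json
import random


def round_floats(values: list[float], ndigits: int = 10) -> list[float]:
    """Round each value so its JSON text representation is shorter."""
    return [round(value, ndigits) for value in values]


# Fake "coordinates" standing in for real data (assumption for illustration)
random.seed(0)
coords = [random.uniform(-10.0, 10.0) for _ in range(300)]

raw_text = json.dumps(coords)
rounded_text = json.dumps(round_floats(coords))

# Compare serialized sizes and the largest error introduced by rounding
print(len(raw_text), len(rounded_text))
max_error = max(abs(a - b) for a, b in zip(coords, json.loads(rounded_text)))
```

The size saving comes purely from fewer digits in the text; the values remain ordinary JSON numbers, so readers of the file need no changes.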

MT goes through diagram of refactor

  • Other QCArchive datasets to worry about?

    • JW – The org only uses singlepoint, optimization, and torsiondrive datasets. Other types exist (hessians? ESPs?) but are not used right now

    • LM – ESP stuff would save wavefunction to recompute ESP

    • MT – Would be good to hold the door open for ESPs in some way if it’s not a big deviation from current plans.

    • LM – ESP stuff would be single points with extra data, which you could add to input mols when needed

    • MT – That makes sense, though we’d need to change the structure of the database, and that would be quite complicated.

    • JW – Each molecule could have a large arbitrary string for things we didn’t think of (e.g. torsiondrive atoms and angles could be stored that way for mols in a torsiondrive dataset)

    • MT – Ok to generally partition data and operations?

      • (General) Yes

      • MT – Different datasets will require different input models, different tables in the database, and different operations on the dataset

    • BW – This morning JH was looking at barrier heights and minima details. Also, torsiondrives are composed of optimizations. The only extra info you need to put in opt records is the torsion atom indices and the constraint angle value. Ultimately you also need a way to map the opts back to the parent torsiondrive.

    • MT – Thanks for the input, need to think more.

  • Datasets other than QCArchive?

    • In season 1, industry partners had big SDF files. (No energies attached to public molecules.)

    • Unaware of other use cases; starting from there should cover most of them.

    • JW – Everyone started from SDF files with no energies, then ran a multi-step workflow including QCEngine on their workstations and ended up with a bunch of SDFs with energies.

    • Decision point – assume QM SDFs have final energies. Work around it later if this becomes an issue
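To make BW's torsiondrive point above concrete (optimization records that carry torsion atom indices and a constraint angle, plus a way to map them back to the parent torsiondrive), here is one possible sketch. All names (`OptimizationRecord`, `torsiondrive_id`, `group_by_torsiondrive`) are hypothetical and do not reflect an agreed design; the proposal is still an open action item.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OptimizationRecord:
    """An optimization result; the torsion fields are unset for plain opts."""

    record_id: int
    final_energy: float
    torsiondrive_id: Optional[int] = None  # parent torsiondrive, if any
    dihedral_indices: Optional[tuple[int, int, int, int]] = None
    constraint_angle: Optional[float] = None  # degrees


def group_by_torsiondrive(
    records: list[OptimizationRecord],
) -> dict[int, list[OptimizationRecord]]:
    """Map each parent torsiondrive ID to its child opts, sorted by angle."""
    grouped: dict[int, list[OptimizationRecord]] = defaultdict(list)
    for record in records:
        if record.torsiondrive_id is not None:
            grouped[record.torsiondrive_id].append(record)
    for children in grouped.values():
        children.sort(key=lambda child: child.constraint_angle)
    return dict(grouped)


# Two constrained opts from one torsiondrive and one unrelated plain opt
records = [
    OptimizationRecord(1, -10.2, 7, (0, 1, 2, 3), 30.0),
    OptimizationRecord(2, -10.5, 7, (0, 1, 2, 3), -30.0),
    OptimizationRecord(3, -8.1),
]
profiles = group_by_torsiondrive(records)
```

Storing the parent ID on each optimization keeps the optimization table usable on its own while still allowing the torsion profile to be rebuilt.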

 Action items

@Matt Thompson ask Lily about non-QCA datasets
@Matt Thompson continue iterating on input/output models
@Matt Thompson make better proposal about processing torsiondrive data

 Decisions