2024-09-25 Meeting notes

Date

Sep 11, 2024

Participants

@Matt Thompson
@Alexandra McIsaac
@Brent Westbrook (Unlicensed)
@Jeffrey Wagner

Discussion topics

Item	Presenter	Notes
New input/output models	MT	Input models (“YAMMBS inputs”) are JSON on disk.
		Tagged with a name (for the dataset) and a version (of the model).
		Fairly large on disk, mostly due to storing coordinates.
		Can be derived from QCArchive (or a similar source), but recommended to use these as starting points for benchmarking runs.
		Output models are also JSON on disk (WIP).
		https://docs.google.com/drawings/d/1JaxY_gUXmMviKDxDQZZaGU8SS0Jc24Jg6tELY4A4iLQ/edit
Ingesting non-QCArchive dataset	MT	Use case: load a small molecule dataset (SDF or similar file(s)) from somewhere other than QCArchive.
		Is this still an important use case?
		LM – Not important to me. I’d check with LW though, since she’s working on all sorts of things.
		BW – Same.
		Recommended data source to use in testing?
File size in new JSON models	MT	Can’t cleanly compress `list[float]` in JSON.
		Could round to ~10 decimals, but deferring unless/until file size and/or JSON read/write times become a significant issue.
		MT goes through diagram of refactor.
		Other QCArchive datasets to worry about?
		JW – Org only uses singlepoint, optimization and torsiondrive datasets. There are others (Hessians? ESPs?) out there, but they are not used right now.
		LM – ESP stuff would save the wavefunction to recompute the ESP.
		MT – Would be good to hold the door open for ESPs in some way if it’s not a big deviation from current plans.
		LM – ESP stuff would be single points with extra data, which you could add to input mols when needed.
		MT – That makes sense, though we’d need to change the structure of the database, and that would be quite complicated.
		JW – Could have each molecule carry a large arbitrary string for things we didn’t think of (e.g. torsiondrive atoms and angles in arbitrary torsiondrive dataset mols).
		MT – OK to generally partition data and operations?
		(General) Yes.
		MT – Different datasets will require different input models, different tables in the database, and different operations on the dataset.
		BW – This morning JH was looking at barrier heights and minima details. Also, torsiondrives are composed of optimizations. The only extra info you need to put in opt records is the torsion atom indices and the constraint angle value. Ultimately you also need a way to map the opts back to the parent torsiondrive.
		MT – Thanks for the input, need to think more.
		Other datasets than QCArchive?
		In season 1, industry partners had big SDF files. (No energies attached to public molecules.)
		Unaware of other use cases; starting from there should cover most of them.
		JW – Everyone started from SDF files with no energies, then ran a multi-step workflow including QCEngine on their workstations and ended up with a bunch of SDFs with energies.
		Decision point – assume QM SDFs would have final energies. Work around later if this is an issue.
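
To make the input-model discussion concrete, here is a minimal sketch of a JSON-on-disk model tagged with a dataset name and a model version, plus a free-form slot per molecule for torsiondrive details. This is not the actual YAMMBS schema: every class and field name (`InputDataset`, `InputMolecule`, `extras`, `parent_torsiondrive_id`, and so on) is hypothetical, and a dict is used for the "arbitrary" slot where the notes suggest a string.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class InputMolecule:
    """One molecule record. All field names are hypothetical."""

    mapped_smiles: str
    coordinates: list[float]  # flattened x, y, z values
    # Free-form slot for things the schema did not anticipate, e.g. the
    # torsion atom indices, constraint angle, and parent torsiondrive.
    extras: dict = field(default_factory=dict)


@dataclass
class InputDataset:
    name: str  # tag for the dataset
    model_version: int  # version of the model, not of the data
    molecules: list[InputMolecule] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested InputMolecule dataclasses.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "InputDataset":
        raw = json.loads(text)
        molecules = [InputMolecule(**m) for m in raw.pop("molecules")]
        return cls(molecules=molecules, **raw)


dataset = InputDataset(
    name="example-torsiondrive",
    model_version=1,
    molecules=[
        InputMolecule(
            mapped_smiles="[H:1][O:2][O:3][H:4]",
            coordinates=[0.0] * 12,  # placeholder geometry, 4 atoms
            extras={
                "parent_torsiondrive_id": 1,
                "torsion_atoms": [0, 1, 2, 3],
                "constraint_angle": -165.0,
            },
        )
    ],
)

# Write to a string and read back; the round trip should be lossless.
round_tripped = InputDataset.from_json(dataset.to_json())
```

Keeping the torsion metadata in a free-form slot avoids a schema change for each new dataset type, at the cost of no validation on those fields.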
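
On the deferred file-size question, a rough way to gauge what rounding to ~10 decimals would buy is sketched below. The coordinates are random placeholders rather than real molecule data, and `round_floats` is a hypothetical helper, so the sizes it reports are only indicative.

```python
import json
import random


def round_floats(values: list[float], decimals: int = 10) -> list[float]:
    """Round every value to a fixed number of decimal places."""
    return [round(value, decimals) for value in values]


# Placeholder data: 1000 atoms' worth of x, y, z values.
random.seed(0)
coordinates = [random.uniform(-10.0, 10.0) for _ in range(3000)]

full = json.dumps(coordinates)
rounded = json.dumps(round_floats(coordinates))

# Largest deviation introduced by rounding, measured after a JSON round trip.
max_error = max(
    abs(original - restored)
    for original, restored in zip(coordinates, json.loads(rounded))
)

print(f"unrounded: {len(full)} characters")
print(f"rounded:   {len(rounded)} characters")
print(f"max error: {max_error:.3e}")
```

Running this on a real input file, rather than random numbers, would show whether the saving is large enough to justify the loss of precision.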

Action items

@Matt Thompson ask Lily about non-QCA datasets

@Matt Thompson continue iterating on input/output models

@Matt Thompson make better proposal about processing torsiondrive data
