2025-01-06 Science team meeting

Participants

  • @Lily Wang

  • @Jennifer A Clark

  • @Jeffrey Wagner

Goals

  • Dataset replication issue

  • Discuss Transition Metal Forcefield (TM-FF) project plan

  • Discuss data longevity project (name may change), including communication with Ben Pritchard and Chris Iacovella

Discussion topics

Item | Presenter | Notes

Project updates

JCl

Past week:

  • TM-FF

    • Created draft of plan and ran it by Chris Iacovella

    • Expect a meeting with Genentech when Richard is back at the end of the month. Meetings will be monthly with Genentech and biweekly on the off weeks without them.

    • Worked with Chris to separate properties of primary and secondary importance. (Now in compiled document)

      • Those of primary importance should come from single point calculations, but not all seem accessible from QCA

      • Those of secondary importance require a frequency calculation and are of scientific, not FF-development, interest, so they are not a priority

    • Chris would be happy with GFN-xTB or Brent’s existing dataset

    • Work through the errors, maybe work with Jeff while onboarding NRP

    • LW: Let me know if you want to set up a meeting

  • Replicating calculation issue:

    • First Sage 2.0 Opt dataset merged and created new records; this needs to be resolved before merging other datasets

    • Create a test for whether new records will be made, maybe via a flag (e.g. generate_new_records=False)?

    • Sage 2.0 TD ready and waiting

    • Sage 2.1 combined Opt and TD (ensure this works properly before preparing Sage 2.2)
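The proposed guard above could look something like the following minimal sketch. Everything here is hypothetical: `generate_new_records` is the flag floated in the meeting, not an existing qcportal option, and the `record_exists` callable stands in for whatever record-lookup query we end up using.

```python
# Hypothetical sketch of the proposed generate_new_records guard: before
# submitting a dataset, look up which entries already have records and
# refuse to proceed if any would be newly created. None of these names
# are real qcportal API.

def check_submission(entry_names, record_exists, generate_new_records=False):
    """Return the entries that would create new records; raise if any exist
    while generate_new_records is False."""
    new_entries = [name for name in entry_names if not record_exists(name)]
    if new_entries and not generate_new_records:
        raise RuntimeError(
            f"Submission would create {len(new_entries)} new record(s): "
            f"{new_entries!r}. Pass generate_new_records=True to allow this."
        )
    return new_entries
```

A test built around this guard would have caught the Sage 2.0 Opt merge creating new records before the dataset was pushed.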

  • Dataset Longevity

    • Ben is working on making “views” of these files (HDF5) available server side to download and wonders if we are interested. We could request the server create these files from a dataset and upload them to an S3 bucket (or some other S3 compatible storage). These would be then attached to the dataset and downloadable with QCPortal

      • Chris says that this was standard functionality before the major upgrade and is very useful, though he is concerned that QCA is supposed to be a living database and these views are static snapshots.

    • Ben also recommended a downloader script that Peter Eastman created.

        • Chris has worked with this downloader and parallelized it. The script as-is takes 19 hours to download SPICE2, but with Chris’ version it takes 1.5 hours.

        • JW: 19 hrs isn’t a dealbreaker for a process that happens rarely

      • Chris said that Ben already added something to download files with SQL and they are planning to use that

      • Chris has a repo that he is refining to allow them to only pull the information they need for fitting. Given the overlap in interest, I expressed that we are interested in collaborating on this and will likely reach out to see what he has.

      • JW: number 1 concern is possible need to map QCFractal versions to records

        • LW: does hdf5 or sql solve these concerns?

        • JW: no preference between hdf5 vs sql, the question is how parseable they are by tools

      • JC: not knowledgeable about workflow of QCArchive → OpenFF fits. Does this require QCFractal or is it straightforward?

      • LW – Depends on the meaning of “straightforward” - Different fitting pathways use different representations/converters. It would be great for the OpenFF workflow to remain compatible with native QCFractal objects, but if we have to go through an intermediate representation it’s not a dealbreaker.

      • LW – Somewhat opposing design goals here - If we use a really general file format, then we’ll need to do a lot of work to get it into our pipelines. But if we use a really specific file format, then other people will need to do a lot of work to get it into their pipeline

      • JC – Based on my experience submitting QC datasets, we provide conda deps and Python scripts and everything, so we probably want to do something similar for Zenodo. So maybe we can use our specific file formats as long as we provide instructions for how to make an env to open them.

      • LW – two ways to go about this:

        • minimalistic way: dataset only, which is convenient for people who just want to download the dataset

        • full-provenance way, including all scripts used to get to the output dataset

          • QCA-DS philosophy:

            • Starting point: QCArchive instance

            • End point: output format file

      • JW: all our datasets are on a continuum. This spans from hopelessly general (e.g. xyz files) to very specific. If people are trying to reproduce our work, they would use the specific way. If they are trying to just use our data, they’d prefer the general way. Since the computer is doing all the work, we could commit to doing both. One question is, how do we format all the data?

      • LW – There’s a distinction between data and workflow. The maximalist approach is reminiscent of having a reproducible workflow; the datasets underlying the workflow can be more general.

      • JC – Sounds preferable to put up HDF5s on Zenodo to have a quick solution. But thinking about future generations, it’s hard to foresee their needs. If MolSSI goes down in the future, it’ll be hard to construct an env to process the data into a pipeline. But we’re not sure whether this problem will exist, or what the details of it will be.

      • LW – Agree that we’re having to speculate a lot here.

      • JW: Files “expire” if the program that reads/writes them stops being maintained. But if the files are written to an open specification then they’re immortal. There are lots of good molecule specifications, but some properties are just “whatever Psi4 writes” - e.g., is there an open specification for wavefunctions?

      • JW – hdf5 could contain different information, e.g. QCSchema mols with psi4 wavefunctions vs different components. The contents of the hdf5 file should conform to an open specification; if they’re a QCSchema psi4 wavefunction and we’ve lost the software to reconstitute them, then that’s not helpful.

      • Formats:

        • QCSchema

          • downsides – does QCSchema store wavefunctions, or is that just an output from Psi4?

        • Our own schema

        • What SPICE did (but isn’t this just the same as the above?)

        • A big pile of SDFs (“truly immortal”)

          • Can’t store wavefunctions

          • Will require some organization/hierarchy, which is de facto us creating a schema

      • JC – Wavefunctions aren’t really needed/can quickly be recomputed

      • (Basically anything we do here other than using an HDF5/something provided by MolSSI is us creating a new schema)

      • (decision) So we’ll plan on going with something like the HDF5 file exporter. We’ll ask BP about timelines; if it’ll take a while, we’ll start exporting using CIacovella’s exporter.

      • LW –

        • positions

        • hessians

        • energies

      • (looked at transition metal plans)

      • JC – Some of the nice-to-have calcs are really storage-heavy.

      • JC – BW found that QCPortal has a way to get spin densities and orbital energies….

      • JW – maybe the exporter could export all available properties - kinda rely on “the submitter requested these props, so they’re probably important to the science”

      • LW – When requesting props from QCA (ex hessians), you have to request them using the “driver” keyword…

      • JC – It’s sometimes hard to know which drivers a mol has been submitted with. That is, the results dict produced is dictated by the driver, and I don’t see a spec to know which output fields to expect for a given driver.

      • LW – Could try explicitly testing these, or asking BP if he has documentation of the models anywhere.
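The driver question above (which result fields to expect for a given driver) has a partial answer in the QCSchema convention, where `return_result` is a scalar for the energy driver, an N×3 array for gradient, and a 3N×3N matrix for hessian, while the properties driver returns a keyword-dependent dict. The lookup helper below is our own sketch of how that convention could be made explicitly testable; it is not a qcportal or QCSchema API.

```python
# Sketch: expected shape of `return_result` per QCSchema driver, as a
# checkable lookup. Atom-count-dependent shapes are expressed as functions.
# The mapping mirrors the QCSchema convention (energy -> scalar,
# gradient -> N x 3, hessian -> 3N x 3N); the helper itself is hypothetical.

EXPECTED_SHAPE = {
    "energy": lambda n_atoms: (),                          # a single scalar
    "gradient": lambda n_atoms: (n_atoms, 3),
    "hessian": lambda n_atoms: (3 * n_atoms, 3 * n_atoms),
}


def expected_result_shape(driver, n_atoms):
    """Shape we expect for return_result, or None for the 'properties'
    driver, whose result is a dict rather than an array."""
    if driver == "properties":
        return None
    return EXPECTED_SHAPE[driver](n_atoms)
```

Explicit checks like this could double as the “explicitly testing these” LW suggests, pending whatever documentation BP can point us to.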

Next week:

  • TM-FF

    • Debug failing opt for Brent’s TM database

    • Work on exposing needed single point properties

    • Finalize project plan, in ZenHub?

  • Replicating calculation issue:

    • create test that new records will be made, maybe a flag, generate_new_records=False?

  • Dataset Longevity

    • Discuss questions with Ben at QCA User meeting

      • We want HDF5 files with QCSchema molecules

        • Are wavefunctions included in these? How?

        • Is there a standard file format for wave functions that we could apply?

        • Are all attributes recorded for these? class -> dict

        • What is the time horizon for obtaining HDF5 views?

      • Is there a way to ping qcportal to see if a record exists or not before pushing the dataset?

      • Is there a way to know what keys are in the results dictionary for each driver?

    • Set up meeting with Chris Iacovella to assess his downloader.

    • Maybe put together project plan on confluence
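The fields LW listed in the discussion (positions, hessians, energies) suggest a per-record layout for the exported files. The sketch below is purely illustrative: the group and field names are our assumptions rather than an agreed schema, and a real export would write this same hierarchy into HDF5 groups/datasets (e.g. via h5py) rather than an in-memory dict.

```python
# Illustrative per-record layout for an exported archive: one group per
# record holding positions, an energy, and (optionally) a Hessian. Field
# names and units are placeholders, not an agreed schema; a real export
# would mirror this hierarchy in HDF5 groups.

def build_record_group(name, positions, energy, hessian=None):
    n_atoms = len(positions)
    group = {
        "name": name,
        "positions": positions,  # n_atoms x 3 Cartesian coordinates
        "energy": energy,        # scalar total energy
    }
    if hessian is not None:
        # A Cartesian Hessian must have 3 * n_atoms rows.
        assert len(hessian) == 3 * n_atoms
        group["hessian"] = hessian
    return group


archive = {
    "records": [
        build_record_group(
            "water-0",
            positions=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.96], [0.93, 0.0, -0.24]],
            energy=-76.026,
        ),
    ]
}
```

Pinning down a layout like this early would make the Zenodo-upload and “env to open them” questions from the discussion concrete.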

Project updates

LW

  • Past week:

    • Setting up EquilibrationLayer and PreequilibratedSimulationLayer technical workflow in Evaluator

  • This/next week:

    • Figuring out sensible default kwargs for Evaluator (primarily around equilibration times and conditions)

    • Trialling several use cases

      • fitting a FF (projects: NAGL, DCole)

      • benchmarking across water models (stakeholder: Shirts group)

    • Working out hand-off to infrastructure team ( ? ) and what to do about current gaping holes in code

    • Debugging protein benchmarks

Action items

  • Jen: Reach out to Ben: Is there a file format specification for WFs? HDF5 of QCSchema molecules is the goal, if we can capture more complex attributes

  • Ask Ben about checking if records exist before submitting, in PR stages?

Decisions

  • We’ll plan on going with something like the HDF5 file exporter for QCSchema molecules. We’ll ask BP about timelines, and if he’ll take a while, we’ll start exporting using CIacovella’s exporter.