Chris has worked with this downloader and parallelized it. The script as-is takes 19 hr to download SPICE2, but Chris’ version takes 1.5 hr.
Chris said that Ben already added something to download files with SQL, and they are planning to use that.
Chris has a repo that he is refining to allow them to only pull the information they need for fitting. Given the overlap in interest, I expressed that we are interested in collaborating on this and will likely reach out to see what he has.
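For reference, a minimal sketch of the kind of parallelization described above, assuming the download is dominated by per-record network latency. This is not Chris’ actual implementation; `fetch_record` is a hypothetical stand-in for whatever call retrieves one record, and `max_workers=8` is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_record(record_id):
    """Hypothetical stand-in for one network request to the archive."""
    time.sleep(0.01)  # simulate I/O latency
    return {"id": record_id, "energy": -1.0 * record_id}


def download_all(record_ids, max_workers=8):
    """Fetch records concurrently; results keep the order of record_ids."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(fetch_record, record_ids))


records = download_all(range(20))
```

Threads are used here because the work is I/O-bound; a CPU-bound post-processing step would need a different approach.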
JW: number 1 concern is the possible need to map QCFractal versions to records
LW: does hdf5 or sql solve these concerns?
JW: no preference between hdf5 vs sql, the question is how parseable they are by tools
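One way to picture the version-mapping concern in an SQL setting is sketched below, assuming each exported record is stored next to the QCFractal version that produced it. The table layout, column names, and version string are all hypothetical and are not taken from Ben’s or Chris’ code; only the Python standard library is used.

```python
import json
import sqlite3

# In-memory database for illustration; a real export would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    " record_id INTEGER PRIMARY KEY,"
    " qcfractal_version TEXT NOT NULL,"  # hypothetical column name
    " payload TEXT NOT NULL)"            # record serialized as JSON
)


def store(record_id, version, record):
    """Save one record together with the version string that wrote it."""
    conn.execute(
        "INSERT INTO records VALUES (?, ?, ?)",
        (record_id, version, json.dumps(record)),
    )


def load(record_id):
    """Return (version, record) so a reader can pick the right parser."""
    row = conn.execute(
        "SELECT qcfractal_version, payload FROM records WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    version, payload = row
    return version, json.loads(payload)


store(1, "0.0.0-example", {"energy": -76.4})  # placeholder version string
version, record = load(1)
```

The same idea would carry over to hdf5 by writing the version as an attribute on each group or on the file.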
JC: not knowledgeable about workflow of QCArchive → OpenFF fits. Does this require QCFractal or is it straightforward?
LW – Depends on the meaning of “straightforward” - Different fitting pathways use different representations/converters. It would be great for the OpenFF workflow to remain compatible with native QCFractal objects, but if we have to go through an intermediate representation it’s not a dealbreaker.
LW – Somewhat opposing design goals here - If we use a really general file format, then we’ll need to do a lot of work to get it into our pipelines. But if we use a really specific file format, then other people will need to do a lot of work to get it into their pipelines.
JC – Based on my experience submitting QC datasets, we provide conda deps and Python scripts and everything, so we probably want to do something similar for Zenodo. So maybe we can use our specific file formats as long as we provide instructions for how to make an env to open them.
LW – two ways to go about this:
minimalistic way: dataset only, which is convenient for people who just want to download the dataset
full-provenance way, including all scripts used to get to the output dataset
JW: all our datasets are on a continuum. This spans from hopelessly general (e.g. xyz files) to very specific. If people are trying to reproduce our work, they would use the specific work. If they are trying to just use our data, they’d prefer the general way. Since the computer is doing all the work, we could commit to doing both. One question is, how do we format all the data?
LW – There’s a distinction between data and workflow. The maximalist approach is reminiscent of having a reproducible workflow. The datasets underlying the workflow can be more general.
JC – Sounds preferable to put up hdf5s on Zenodo to have a quick solution. But thinking about future generations, it’s hard to foresee their needs. So if MolSSI goes down in the future, it’ll be hard to construct an env to process the data into a pipeline. But we’re not sure whether this problem will exist, or what the details of it will be.
LW – Agree that we’re having to speculate a lot here.
JW: Files “expire” if the program that reads/writes them stops being maintained. But if the files are written to an open specification then they’re immortal. There are lots of good molecule specifications. But some properties are just “whatever psi4 writes” - e.g. is there an open specification for WFs?
LW –
JCl –
JW – hdf5 could contain different information, e.g. QCSchema mols with psi4 wavefunctions vs different components. The contents of the hdf5 file should conform to an open specification; if they’re a QCSchema psi4 wavefunction and we’ve lost the software to reconstitute them, then that’s not helpful.
Formats:
JC – Wavefunctions aren’t really needed/can quickly be recomputed
(basically anything we do here other than using an hdf5/something provided by MolSSI is us creating a new schema)
(decision) So we’ll plan on going with something like the HDF5 file exporter. We can either use CIacovella’s exporter or ask BP about timelines. We’ll ask BP about timelines, and if he’ll take a while, we’ll start exporting using CI’s exporter.
LW –
positions
hessians
energies
(looked at transition metal plans)
JC – Some of the nice-to-have calcs are really storage-heavy.
JC – BW found that QCPortal has a way to get spin densities and orbital energies….
JW – maybe the exporter could export all available properties - kinda rely on “the submitter requested these props, so they’re probably important to the science”
LW – When requesting props from QCA (ex hessians), you have to request them using the “driver” keyword…
JC – It’s sometimes hard to know which drivers a mol has been submitted with. That is, the results dict produced is dictated by the driver, and I don’t see a spec to know which output fields to expect for a given driver.
LW – Could try explicitly testing these, or asking BP if he has documentation of the models anywhere.
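A minimal sketch of what “explicitly testing” could look like, assuming results can be obtained as plain dicts that carry a `driver` key. The sample dicts and their field names are made up for illustration and are not claimed to match what QCFractal actually returns; `fields_by_driver` is a hypothetical helper.

```python
def fields_by_driver(results):
    """Map each driver to the set of keys present in every result with that driver."""
    common = {}
    for result in results:
        driver = result["driver"]
        keys = set(result) - {"driver"}
        if driver in common:
            # Keep only the keys seen in all results for this driver so far.
            common[driver] &= keys
        else:
            common[driver] = keys
    return common


# Made-up sample results; real ones would be pulled from the archive.
sample = [
    {"driver": "energy", "return_result": -1.0, "properties": {}},
    {"driver": "hessian", "return_result": [[0.0]], "properties": {}, "extras": {}},
    {"driver": "hessian", "return_result": [[0.1]], "properties": {}},
]
observed = fields_by_driver(sample)
```

Running this over a sample of records per dataset would give an empirical list of fields that can be relied on for each driver, which could then be compared against whatever documentation BP has.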