PE and Marcus will be setting up some datasets
Then we need to figure out how to do the QCSubmit part
JR – wavefunctions or densities? It could be impractical to store all wavefunctions. We could come up with a strategy to choose some to save, e.g. minima or high-fidelity calculations
PE – we were initially going to store wavefunctions in SPICE 1, but it would have filled up QCA in 3 days. Storage is definitely the issue here
BP – we could store everything but do we want to? How big do you anticipate it being?
PE – v1 has ~1.1 million conformations. v2 will add roughly the same number of additional conformations with 40-50 atoms, so it will be about the same size as v1.
BP – we estimated 5-6 TB for wavefunctions for v1.
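(For scale: 5-6 TB spread over the ~1.1 million conformations mentioned above works out to roughly 5 MB per stored wavefunction.)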
JC – could we retrieve parts of the dataset instead of everything? Downloading 5-6 TB could be a lot. Could we split up the dataset?
BP – we can certainly store 6 TB.
JR – for xtb calculations we can throw the wavefunctions away. For some DFT, and for anything higher-level, we should store the density and/or wavefunction.
JR – for the CBS limit, we could run two levels with different basis sets and extrapolate
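(For reference, a common two-point form for extrapolating the correlation energy, assuming the two basis sets have cardinal numbers $X$ and $Y$:

$$E^{\mathrm{corr}}_{\mathrm{CBS}} \approx \frac{X^3 E^{\mathrm{corr}}_X - Y^3 E^{\mathrm{corr}}_Y}{X^3 - Y^3}$$

This applies to energies; as noted below, there is no comparably established scheme for densities or wavefunctions.)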
JR – not aware of good work on CBS extrapolation of densities or wavefunctions
JC – we will have to do multiple calculations on subsets anyway, so might as well?
JR – agree
JR – the best place to start is to store some subset of DFT wavefunctions, picking the last and most-optimised structure, or, for a series of conformers, saving only the lowest-energy one. I can volunteer to sketch out these heuristics
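A minimal sketch of what such a heuristic could look like, assuming final optimisation results are available as (molecule_id, conformer_id, final_energy) tuples; the record layout and function name are hypothetical, not QCArchive's actual API:

```python
def select_wavefunctions_to_keep(records):
    """Pick, per molecule, the single conformer whose wavefunction to retain.

    records: iterable of (molecule_id, conformer_id, final_energy) tuples,
    one per final (most-optimised) geometry. Returns {molecule_id: conformer_id}
    mapping each molecule to its lowest-energy conformer.
    """
    best = {}
    for mol_id, conf_id, energy in records:
        if mol_id not in best or energy < best[mol_id][1]:
            best[mol_id] = (conf_id, energy)
    return {mol_id: pair[0] for mol_id, pair in best.items()}


# Example: mol-1 keeps conformer c1 (lower energy); mol-2 keeps its only conformer.
records = [("mol-1", "c0", -76.42), ("mol-1", "c1", -76.45), ("mol-2", "c0", -40.51)]
print(select_wavefunctions_to_keep(records))  # {'mol-1': 'c1', 'mol-2': 'c0'}
```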
JC – would we want to figure out how to do this on SPICE 1.0 since we already have that?
JR – sure
JC – are ESPs of interest as well?
JR – we should only save the wavefunction and re-compute the ESP from it. It should be relatively easy to package code that does this for users
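A rough sketch of what such user-facing code might look like in Psi4, assuming the wavefunction was serialised with `wfn.to_file()`; the file names are placeholders, and exact call behaviour (including the units expected in grid.dat) should be checked against the Psi4 documentation for the installed version:

```python
import numpy as np
import psi4

# Load a previously stored wavefunction (assumed to have been written with wfn.to_file()).
wfn = psi4.core.Wavefunction.from_file("conformer_wfn.npy")

# Psi4's GRID_ESP property reads evaluation points from grid.dat
# (one "x y z" per line; verify the expected units in the Psi4 docs)...
grid_points = np.array([[0.0, 0.0, 1.0],
                        [0.0, 0.0, 2.0],
                        [0.0, 0.0, 3.0]])
np.savetxt("grid.dat", grid_points)

# ...and writes the electrostatic potential at those points to grid_esp.dat.
psi4.oeprop(wfn, "GRID_ESP")
esp = np.loadtxt("grid_esp.dat")
```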
BP – I can sit down and give you a storage quota. The new server has ~140 TB of space. We can also do an attached storage box, probably for an additional ~200 TB. It would be spinning disk.