Participants

Goals

Discussion topics

Item

Presenter

Notes

Updates on server migration

BP

  • BP – Started splitting out the ML side from the OpenFF side

  • Copied the whole database and am now splitting it out

SPICE 2

PE, JC

  • (Brief introduction of SPICE: public dataset of QM data for training machine learning potentials)

  • SPICE 1.0 was an MVP; version 2 is what we add next to make it as useful as possible

  • https://www.nature.com/articles/s41597-022-01882-6

  • Open questions:

    • Level of theory – SPICE 1 uses a fairly high level of DFT

    • Do we want to compute some portion of SPICE at a higher level of theory for more accuracy, or at a lower level that can be computed very quickly?

  • https://docs.google.com/document/d/1ktgqDaBAZS5q5kV4UgmHPbixtmKPvgcLPYpKese_cYI/edit

  • JC – possibility of local QCFractal/QCArchive?

    • BP – Traditionally you have a central server that must be able to talk to outside servers, but I can see how pharma might be unhappy with that

    • BP – I have infrastructure to add completed single-point calculations that were computed somewhere else, and I can expose that

    • JC – if MM and QM geometries are very different, we might do a bad job of representing the surface. We will try using GFN-xTB to ameliorate the problem

    • BP – don’t foresee major problems with that in the future

    • JC – how easy is it to set up our own server? docs?

    • BP – I can provide docker images if that’s helpful. Docs are ok

    • JR – that would be perfect. BSwope, who has been consulting for us, has a bespokefit instance running, and as I understand it it’s the same infrastructure. But Docker would be even easier

    • BP – I’ll have to write the export part to get the hdf5 file, but don’t foresee huge issues

    • JR – will have to check with IP people that there’s no problem with us computing some of the molecules

    • PE – hopefully not a problem, trying to ensure that SPICE molecules are public domain

    • PE – we will only use enamine molecules if we can put them into the public domain

    • JR – worried that compositions of molecules might be patented

    • JC – we are avoiding it specifically to avoid copyright issues

  • BP – SPICE v2 next steps?

    • PE and Marcus will be setting up some datasets

    • Then we need to figure out how to do QCSubmit part

    • JR – wavefunctions or densities? Could be impractical to store all wavefunctions. We could come up with a strategy to choose some to save, e.g. minima, or high fidelity calculations

    • PE – we were initially going to store wavefunctions in SPICE 1, but would have filled up QCA in 3 days. Storage is definitely the issue here

    • BP – we could store everything but do we want to? How big do you anticipate it being?

    • PE – v1 has ~1.1 million conformations. v2 will add roughly the same number of conformations, with 40-50 atoms each, so it will be about the same size as v1.

    • BP – we estimated 5-6 TB for wavefunctions for v1.

    • JC – could we retrieve parts of the dataset instead of everything? Downloading 5-6 TB could be a lot. Could we split up the dataset?

    • BP – we can certainly store 6 TB.

    • JR – for xtb calculations we can throw the wavefunctions away. For some DFT calculations, and for anything higher, we should store the density and/or wavefunction.

    • JR – for the CBS (complete basis set) limit, we could run two levels with different basis sets for extrapolation

    • JR – not aware of good work on CBS extrapolation of densities and wavefunctions

    • JC – will have to do multiple calculations of subsets anyway, might as well?

    • JR – agree

    • JR – the best place to start is to store some subset of DFT wavefunctions, and pick the last and most-optimised structure. Or for a series of conformers, save it for the lowest energy one. I can volunteer to sketch out these heuristics

    • JC – would we want to figure out how to do this on SPICE 1.0 since we already have that?

    • JR – sure

    • JC – are ESPs of interest as well?

    • JR – we should only save the wavefunction and recompute the ESP from it. It should be relatively easy to package code that does this for users

    • BP – I can sit down and give you a storage quota. New server has ~140 TB space. We can also do an attached storage box, probably for an additional ~200TB. It would be spinning disk.
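
The storage figures and the "save only the lowest-energy conformer" idea from the discussion above can be turned into a rough back-of-the-envelope sketch. The totals (~1.1 million conformations, 5-6 TB of wavefunctions) come from the notes; everything else is an assumption made for illustration: decimal terabytes, the record layout `(molecule_id, conformer_id, energy)`, and the function names are all hypothetical and not part of QCFractal or SPICE.

```python
# Figures quoted in the notes for SPICE v1.
N_CONFORMATIONS_V1 = 1_100_000
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def bytes_per_conformation(total_tb, n_conformations=N_CONFORMATIONS_V1):
    """Average wavefunction size per conformation for a given total in TB."""
    return total_tb * BYTES_PER_TB / n_conformations


def lowest_energy_per_molecule(records):
    """Pick one conformer per molecule: the one with the lowest energy.

    `records` is an iterable of (molecule_id, conformer_id, energy) tuples;
    this layout is hypothetical and only used for the sketch.
    """
    best = {}
    for molecule_id, conformer_id, energy in records:
        if molecule_id not in best or energy < best[molecule_id][1]:
            best[molecule_id] = (conformer_id, energy)
    return {mol: conf for mol, (conf, _) in best.items()}


if __name__ == "__main__":
    # 5-6 TB over ~1.1 million conformations is roughly 4.5-5.5 MB each.
    for total_tb in (5.0, 6.0):
        mb = bytes_per_conformation(total_tb) / 10**6
        print(f"{total_tb} TB total: about {mb:.1f} MB per conformation")

    # Toy records with made-up energies, purely to show the selection.
    toy = [("mol_a", "c0", -1.0), ("mol_a", "c1", -1.5), ("mol_b", "c0", -2.0)]
    print(lowest_energy_per_molecule(toy))
```

Multiplying the number of selected conformers by the per-conformation estimate gives a first guess at the storage needed for the subset; the real heuristics that JR offered to sketch out may well differ.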

  • BP – storing and archiving data permanently is an open question wrt QCFractal

  • JC – general storage and distribution solutions for OpenMM, OpenFF, MolSSI?

    • https://qcarchive.molssi.org/apps/ml_datasets/

    • LW – OpenFF doesn’t have a formal solution written down yet

    • BP – a good option is to move to a static website generator from JSON blobs on a repository

    • BP – PubChemQC is hosted on a personal sharepoint. We’re still interested in re-formatting these datasets to make them even easier to use, but we’ll leave that to the ML guy at MolSSI. We’re quite interested in PubChemQC
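
As a minimal sketch of the "static site from JSON blobs" option BP mentions, the snippet below renders a plain HTML list of datasets from JSON records using only the standard library. The record fields (`name`, `url`, `description`), the function name, and the example URL are hypothetical placeholders; no format was agreed in the meeting.

```python
import html
import json


def render_index(json_blobs):
    """Render an HTML list of datasets from an iterable of JSON strings.

    Each blob is assumed (hypothetically) to hold `name`, `url` and
    `description` keys describing one downloadable dataset.
    """
    items = []
    for blob in json_blobs:
        record = json.loads(blob)
        name = html.escape(record["name"])
        url = html.escape(record["url"])
        description = html.escape(record["description"])
        items.append(f'<li><a href="{url}">{name}</a>: {description}</li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


if __name__ == "__main__":
    # Placeholder record; the URL is not a real download location.
    blobs = [
        json.dumps({
            "name": "SPICE",
            "url": "https://example.org/spice.hdf5",
            "description": "QM data for training ML potentials",
        })
    ]
    print(render_index(blobs))
```

Because the page is generated from files in a repository, adding a dataset would only require committing a new JSON record and regenerating the page, with no server-side code to maintain.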

  • JC – online datasets

  • BP – other datasets

    • BP – want to compute across the periodic table to make a basis set recommender. We can’t necessarily use density fitting since that ruins the dataset

    • Doing some work with the tmQM dataset, but the molecules are quite big

    • Also looking at the MOPAC reference dataset

  • PE – would be good to standardise on some levels of theory

    • JC – have you ever looked at UFF? This is a whole periodic table potential function

    • PE – GFNFF is sort of designed to be a UFF successor. It’s a lot more accurate too

  • BP – compute starved at the moment

    • JC – DD can spin up workers on lilac. We should use every bit of time available. Ask Mobley about UCI too

Action items

  • Add Bill Swope to training

Decisions
