2023-07-11 QC meeting notes

Participants

Goals

Discussion topics

Item	Presenter	Notes
Updates on server migration	BP	BP – Started splitting out the ML side from the OpenFF side. Copied the whole database and is now splitting it out.
SPICE 2	PE, JC	(Brief introduction of SPICE: a public dataset of QM data for training machine learning potentials.) SPICE 1.0 was the MVP; version 2 is what we add next to make it as useful as possible. https://www.nature.com/articles/s41597-022-01882-6
		Open questions:
		Level of theory – SPICE 1 uses a fairly high level of DFT. Do we want to compute some portion of SPICE at a higher level of theory for more accuracy, or at a lower level that can be done very quickly?
		https://docs.google.com/document/d/1ktgqDaBAZS5q5kV4UgmHPbixtmKPvgcLPYpKese_cYI/edit
		JC – possibility of a local QCFractal/QCArchive?
		BP – Traditionally you have a central server that is able to contact an outside server, but I can see how pharma might be unhappy with that.
		BP – I have infrastructure to add a completed single-point calculation that was computed somewhere else, and I can expose that.
		JC – if MM and QM geometries are very different, we might do a bad job of representing the surface. We will try using GFN-xTB to ameliorate the problem.
		BP – don’t foresee major problems with that in the future.
		JC – how easy is it to set up our own server? Docs?
		BP – I can provide Docker images if that’s helpful. Docs are OK.
		JR – that would be perfect. BSwope, who has been consulting for us, has a BespokeFit instance running, and as I understand it, it’s the same infrastructure. But Docker would be even easier.
		BP – I’ll have to write the export part to get the HDF5 file, but don’t foresee huge issues.
		JR – will have to check with IP people that there’s no problem with us computing some of the molecules.
		PE – hopefully not a problem; we are trying to ensure that SPICE molecules are public domain.
		PE – we will only use Enamine molecules if we can put them into the public domain.
		JR – worried that compositions of molecules might be patented.
		JC – we are avoiding it specifically to avoid copyright issues.
		BP – SPICE v2 next steps?
		PE and Marcus will be setting up some datasets. Then we need to figure out how to do the QCSubmit part.
		JR – wavefunctions or densities? It could be impractical to store all wavefunctions. We could come up with a strategy to choose some to save, e.g. minima, or high-fidelity calculations.
		PE – we were initially going to store wavefunctions in SPICE 1, but that would have filled up QCA in 3 days. Storage is definitely the issue here.
		BP – we could store everything, but do we want to? How big do you anticipate it being?
		PE – v1 has ~1.1 million conformations. v2 will have roughly the same number of additional confs with 40-50 atoms, so about the same size as v1.
		BP – we estimated 5-6 TB for wavefunctions for v1.
		JC – could we retrieve parts of the dataset instead of everything? Downloading 5-6 TB could be a lot. Could we split up the dataset?
		BP – we can certainly store 6 TB.
		JR – for xtb calculations, we can throw them away. For some DFT, and for anything higher, we should store the density and/or wavefunction.
		JR – CBS limit: we could do two levels with different basis sets for extrapolation.
		JR – not aware of good work on CBS for densities and wavefunctions.
		JC – will have to do multiple calculations of subsets anyway, might as well?
		JR – agree.
		JR – the best place to start is to store some subset of DFT wavefunctions, and pick the last and most-optimised structure. Or, for a series of conformers, save it for the lowest-energy one. I can volunteer to sketch out these heuristics.
		JC – would we want to figure out how to do this on SPICE 1.0, since we already have that?
		JR – sure.
		JC – are ESPs of interest as well?
		JR – we should only save the wfn and re-compute the ESP from it. It should be relatively easy to package code that does this for users.
		BP – I can sit down and give you a storage quota. The new server has ~140 TB of space. We can also do an attached storage box, probably for an additional ~200 TB. It would be spinning disk.
		BP – storing and archiving data permanently is an open question wrt QCFractal.
		JC – general storage and distribution solutions for OpenMM, OpenFF, MolSSI? https://qcarchive.molssi.org/apps/ml_datasets/
		LW – OpenFF doesn’t have a formal solution written down yet.
		BP – a good option is to move to a static website generator working from JSON blobs in a repository.
		BP – PubChemQC is hosted on a personal SharePoint. We’re still interested in re-formatting these datasets to make them even easier to use, but we’ll leave that to the ML guy at MolSSI. We’re quite interested in PubChemQC.
		JC – online datasets:
		https://tdcommons.ai/single_pred_tasks/qm/
		http://quantum-machine.org/datasets/
		http://pccdb.org/
		https://qcarchive.molssi.org/apps/ml_datasets/
		BP – other datasets:
		BP – want to compute across the periodic table to make a basis set recommender. We can’t necessarily use density fitting since that ruins the dataset.
		Doing some work with the tmQM dataset, but the molecules are quite big. https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041
		Also looking at the MOPAC reference dataset.
		PE – would be good to standardise on some levels of theory.
		JC – have you ever looked at UFF? It is a whole-periodic-table potential function.
		PE – GFN-FF is sort of designed to be a UFF successor. It’s a lot more accurate too.
		BP – compute-starved at the moment.
		JC – DD can spin up workers on lilac. We should use every bit of time available. Ask Mobley about UCI too.
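
Rough arithmetic for the wavefunction storage figures quoted in the discussion, as a minimal sketch. The conformation count, the 5-6 TB estimate, and the ~140 TB server size come from the notes. The assumption that v2 wavefunctions take about as much space as v1 is ours (PE only said v2 would be about the same size as v1), and the function names are illustrative.

```python
# Figures quoted in the notes.
N_CONFS_V1 = 1_100_000   # ~1.1 million conformations in SPICE v1
WFN_TB_V1 = (5.0, 6.0)   # BP's estimate for v1 wavefunction storage, in TB
SERVER_TB = 140.0        # approximate space on the new server, in TB

# Assumption (not stated in the notes): v2 adds about as many conformations
# as v1 and each wavefunction is about as large, so v1 + v2 is roughly 2x v1.
V2_TO_V1_RATIO = 1.0


def mb_per_conformation(total_tb: float, n_confs: int) -> float:
    """Average wavefunction size per conformation in MB (1 TB = 1e6 MB)."""
    return total_tb * 1e6 / n_confs


def combined_tb(v1_tb: float, v2_ratio: float = V2_TO_V1_RATIO) -> float:
    """Projected v1 + v2 wavefunction storage in TB."""
    return v1_tb * (1.0 + v2_ratio)


if __name__ == "__main__":
    for v1_tb in WFN_TB_V1:
        total = combined_tb(v1_tb)
        print(
            f"v1 = {v1_tb:.0f} TB: "
            f"{mb_per_conformation(v1_tb, N_CONFS_V1):.1f} MB per conformation, "
            f"v1 + v2 = {total:.0f} TB "
            f"({100 * total / SERVER_TB:.0f}% of {SERVER_TB:.0f} TB)"
        )
```

Under these assumptions, this works out to roughly 4.5 to 5.5 MB per conformation and 10 to 12 TB for v1 and v2 combined, which is under 10% of the ~140 TB on the new server.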

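One of the heuristics JR mentioned (for a series of conformers, save the wavefunction only for the lowest-energy one) could be sketched as below. This is only an illustration: `ConformerRecord` and `lowest_energy_conformers` are hypothetical names, not part of QCFractal, QCSubmit, or SPICE, and the energies in the example are made-up values.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ConformerRecord:
    """Hypothetical record; not a QCArchive or SPICE data model."""

    molecule: str   # molecule identifier, e.g. a SMILES string
    conformer: int  # conformer index within the molecule
    energy: float   # final energy in hartree


def lowest_energy_conformers(
    records: Iterable[ConformerRecord],
) -> Dict[str, ConformerRecord]:
    """Return, for each molecule, the record with the lowest energy."""
    best: Dict[str, ConformerRecord] = {}
    for rec in records:
        current = best.get(rec.molecule)
        if current is None or rec.energy < current.energy:
            best[rec.molecule] = rec
    return best


if __name__ == "__main__":
    # Illustrative values only, not real calculation results.
    records = [
        ConformerRecord("CCO", 0, -154.10),
        ConformerRecord("CCO", 1, -154.12),
        ConformerRecord("CCN", 0, -134.50),
    ]
    for molecule, rec in lowest_energy_conformers(records).items():
        print(f"{molecule}: keep wavefunction for conformer {rec.conformer}")
```

The same loop could be adapted to keep the last structure of an optimisation instead, by comparing a step index rather than the energy.
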
Action items

Add Bill Swope to training
