2023-07-11 QC meeting notes

Participants

Goals

Discussion topics

Item	Presenter	Notes
Updates on server migration	BP	BP – Started splitting out ML side from OpenFF side Copied whole database and splitting out
SPICE 2	PE, JC	(Brief introduction of SPICE: public dataset of QM data for training machine learning potentials) SPICE 1.0 was MVP, version 2 is what we add next to make it as useful as possible https://www.nature.com/articles/s41597-022-01882-6 Open questions: Level of theory – SPICE 1 has a pretty high level DFT Do we want to compute some portion of SPICE at a higher level of theory for more accuracy, lower level that can be done very quickly? https://docs.google.com/document/d/1ktgqDaBAZS5q5kV4UgmHPbixtmKPvgcLPYpKese_cYI/edit
		JC – possibility of local QCFractal/QCArchive? BP – Traditionally you have central server that can be able to contact outside server, but I can see how pharma might be unhappy with that BP – I have infrastructure to add completed single point calculation that was computed somewhere else, and I can expose that JC – if MM vs QM geometries are very different, we might do a bad job of representing surface. We will try using gfn-xtb to ameliorate problem BP – don’t foresee major problems with that in the future JC – how easy is it to set up our own server? docs? BP – I can provide docker images if that’s helpful. Docs are ok JR – that would be perfect. BSwope, who has been consulting for us, has a bespokefit instance running, and as I understand it it’s the same infrastructure. But Docker would be even easier BP – I’ll have to write the export part to get the hdf5 file, but don’t foresee huge issues JR – will have to check with IP people that there’s no problem with us computing some of the molecules PE – hopefully not a problem, trying to ensure that SPICE molecules are public domain PE – we will only use enamine molecules if we can put them into the public domain JR – worried that compositions of molecules might be patented JC – we are avoiding it specifically to avoid copyright issues
		BP – SPICE v2 next steps? PE and Marcus will be setting up some datasets Then we need to figure out how to do QCSubmit part JR – wavefunctions or densities? Could be impractical to store all wavefunctions. We could come up with a strategy to choose some to save, e.g. minima, or high fidelity calculations PE – we were initially going to store wavefunctions in SPICE 1, but would have filled up QCA in 3 days. Storage is definitely the issue here BP – we could store everything but do we want to? How big do you anticipate it being? PE – v1 has ~1.1 million conformations. v2 will have roughly same number of additional confs with 40-50 atoms, about the same size as v1. BP – we estimated 5-6 TB for wavefunctions for v1. JC – could we retrieve parts of the dataset instead of everything? Downloading 5-6 TB could be a lot. Could we split up the dataset? BP – we can certainly store 6 TB. JR – for xtb calculations we can throw away. For some DFT, for anything higher, we should store the density and/or wavefunction. JR – CBS limit, we could do two levels with different basis sets for extrapolation JR – not aware of good work of CBS of densities and wavefunctions JC – will have to do multiple calculations of subsets anyway, might as well? JR – agree JR – the best place to start is to store some subset of DFT wavefunctions, and pick the last and most-optimised structure. Or for a series of conformers, save it for the lowest energy one. I can volunteer to sketch out these heuristics JC – would we want to figure out how to do this on SPICE 1.0 since we already have that? JR – sure JC – are ESPs of interest as well? JR – we should only save wfn and re-compute it. Should be relatively easy to package code that does it easily for users BP – I can sit down and give you a storage quota. New server has ~140 TB space. We can also do an attached storage box, probably for an additional ~200TB. It would be spinning disk. BP – storing and archiving data permanently is an open question wrt QCFractal JC – general storage and distribution solutions for OpenMM, OpenFF, MolSSI? https://qcarchive.molssi.org/apps/ml_datasets/ LW – OpenFF doesn’t have a formal solution written down yet BP – a good option is to move to a static website generator from JSON blobs on a repository BP – PubChemQC is hosted on a personal sharepoint. We’re still interested in re-formatting these datasets to make them even easier to use, but we’ll leave that to the ML guy at MolSSI. We’re quite interested in PubChemQC JC – online datasets https://tdcommons.ai/single_pred_tasks/qm/ http://quantum-machine.org/datasets/ http://pccdb.org/ https://qcarchive.molssi.org/apps/ml_datasets/ BP – other datasets BP – want to compute across the periodic table to make a basis set recommender. We can’t necessarily use density fitting since that ruins the dataset Doing some work with tmQM dataset but molecules are quite big https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041 Also looking at MOPAC reference dataset PE – would be good to standardise on some levels of theory JC – have you ever looked at UFF? This is a whole periodic table potential function PE – GFNFF is sort of designed to be a UFF successor. It’s a lot more accurate too BP – compute starved at the moment JC – DD can spin up workers on lilac. We should use every bit of time available. Ask Mobley about UCI too

Action items

Add Bill Swope to training

Participants

Goals

Discussion topics

Action items

Decisions

0 Comments