2021-12-03 QCA Submission meeting notes

Participants

Ben Pritchard
@Joshua Horton
@Jeffrey Wagner
@Chapin Cavender
@Pavan Behara
@David Dotson

Goals

Updates from MolSSI
- deploying psi4 1.5, incompatibilities of qcelemental 0.24, qcengine 0.21 with qcfractal 0.15.7
Compute
- QM workers on Lilac
- QM workers on Newcastle
- QM workers on TSCC
- QM, ANI, XTB workers on PRP
  - need to expand XTB, increase memory of workers?
New submissions
- submission issues
- dipeptide dataset
  - brand_raw errors on PRP
- ML datasets for OpenMM
  - multi-molecule issues with QCElemental - psi4 bug
User questions/issues
Science support needs
- new openff-qcsubmit release
Infrastructure needs / advances
- psi4 on conda-forge

Discussion topics

Item	Presenter	Notes
Updates from MolSSI	Ben	BP: A little while ago we got hammered by requests for entire datasets (about 100 per second). I temporarily blocked the IP and worked with DD to find the person in OpenFF and help them access the data in a healthier way.
BP: Storage utilization is creeping up, likely due to more wavefunction storage. We do need to plan a next-gen solution.
JW: What kind of proactive steps can we take for the next 6 months to a year?
BP: This factors into our options at ARC (the VT supercomputer center). Our networking solution is somewhat better than the previous one, but we don’t have full control of the host.
BP: I have some stopgap measures on storage I can take, but they are painful.
DD: Can some of the NIH supplement be used for capital expense?
BP: Pretty sure no.
JW: There may be in-kind support OpenFF can provide re: hardware.
JW: Where are the wavefunctions coming from?
DD+JH: They are coming from the PubChem single-point sets (OpenMM sets), with more coming.
JW: We need to track down whether Peter+John have funding for their PubChem set, which will drive storage needs.
JH: We’re looking at ~1M calculations with wavefunctions.
JW: If this is just a matter of ordering 20TB of SSD storage, then we can go straight ahead with this.
DD: It’s not just a short-term question. 20TB may get us a solution in the short term, but we’ll want a more sustainable solution in the long term. We could look at options for continued supercomputer hosting, bare-metal hosting in the MolSSI office, or cloud hosting.
BP: If each wavefunction is ~1MB and we do a million of them, then that’s a terabyte. If there’s 1000 basis functions, then that’s a bit high…
Compute		DD – QM workers on Lilac weren’t given time to clean up, which led to weird job statuses. I’ve opened a PR on QCFractal to mitigate this.
JH – We still have QM workers running on Newcastle. They timed out today, so I’ve resubmitted them.
DD – Great. We may want to switch to XTB, but let’s discuss that later.
CC – TSCC is running right now: one job with 8 workers. I can spin up more if needed.
DD – Feel free to spin up more. We’re making forward progress, but more resources would be great.
DD – We have QM, ANI, and XTB workers on PRP.
DD – With XTB, we have two datasets that are error cycling and seem to have memory issues?
PB – I’m not sure whether it’s a memory issue. The error messages aren’t clear.
DD – Memory issues are my first guess. I wonder if they’re getting killed by the queue scheduler for using too much memory. My PRP workers have 32GB of memory.
PB – 32GB should be fine.
DD – JH, do I recall that Newcastle was having memory issues with XTB workers?
JH – For us it was ANI workers having memory problems. I’ll switch these over to XTB.
DD – Thanks. I’ll tag them as `openff-xtb`. Should be updated in a few hours.
JW – Is it possible that XTB is just ignoring our memory limits?
DD + JH – We’re not sure whether XTB is passed the memory limit from QCEngine.
DD – #223 had error cycling turned off for a few days to see if the same jobs were killing the workers repeatedly. I’ll turn error cycling back on.
New datasets		JH will take over on the Folmsbee Hutchison test set.
Dipeptide 2-D TorsionDrives:
- Large number of errors (>4000) with return message `None`
- Errors with brand field from PRP manager
- Workers on TSCC have low error rate (<5%)
- CC will deploy additional managers on TSCC
- DD will debug errors for `openff-tscc` compute tag on PRP
JH resolved compute issues with the OpenMM solvated amino acid dataset.
Psi4 update		DD – Problem with basis sets; deploying psi4 1.5, incompatibilities of qcelemental 0.24, qcengine 0.21 with qcfractal 0.15.7.
DD – So, the new psi4 needs the new QCEl and QCEngine, but production QCFractal needs the old QCEl and QCEngine.
DD – Can we confirm that the second point is true?
BP – The intercompatibility isn’t too bad; it may just work.
JH + PB – We could run it with the old versions of everything; we just need to set `wcombine=False`.
JH – The keyword probably isn’t harmful, so it’d be safest to BOTH update the workers and ALSO submit a dataset with the new keyword. But in a pinch, just resubmitting with the new keyword is a good solution.
PB – Agree.
PB will modify the prepared PRs (like pubchem set 2) to have `wcombine=False`, and then submit them to make sure that they don’t have the problem. If that works, we’ll make a new submission for the dipeptides which also has the updated keywords.
PB – We’ll want to be careful with this. This is 100k records, so it’ll be a bit wasteful if it is still broken.
DD – Is there any other reason that we should update to Psi4 1.5?
(General) – There’s no big motivating need for this.
DD – Do we know if there’s a fundamental incompatibility between Psi4 1.5 and the QC stack?
BP – I don’t expect that there’d be an issue, but I need to test. The risk is that QCEngine may send back a schema that QCF doesn’t understand.
DD will test the new versions against each other.
User questions?
Science support?		JH: new qcsubmit release out (0.3.0); solvated amino acids issue addressed
Infrastructure support		JW: Matt is making forward progress on some upstream items that mostly just require technical solutions
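
The storage estimate from the MolSSI update (about 1MB per wavefunction, about a million records) can be checked with quick arithmetic. The sketch below also shows how the per-record size would scale with basis-set size under one simplifying assumption that is not from the notes: a record stores dense square matrices of 8-byte floats with no compression. The function names are illustrative only, and the real on-disk size depends on what is stored and how it is encoded.

```python
# Back-of-the-envelope storage estimate for stored wavefunctions.
# Assumption (not from the meeting): each stored quantity is a dense
# n_basis x n_basis matrix of 8-byte floats, uncompressed.

BYTES_PER_FLOAT64 = 8


def matrix_bytes(n_basis: int, n_matrices: int = 1) -> int:
    """Bytes needed for n_matrices dense n_basis x n_basis float64 matrices."""
    return n_matrices * n_basis * n_basis * BYTES_PER_FLOAT64


def total_terabytes(bytes_per_record: int, n_records: int) -> float:
    """Total size in decimal terabytes (1 TB = 1e12 bytes)."""
    return bytes_per_record * n_records / 1e12


n_records = 1_000_000  # "~1M calculations with wavefunctions"

# Figure quoted in the discussion: ~1 MB per wavefunction.
print(total_terabytes(1_000_000, n_records))  # 1.0

# One dense matrix at 1000 basis functions is 8 MB, i.e. 8 TB per million records.
print(matrix_bytes(1000))  # 8000000
print(total_terabytes(matrix_bytes(1000), n_records))  # 8.0
```

Under this assumption, a 1MB record corresponds to a single matrix of roughly 350 basis functions, so the total is sensitive to molecule size and to how many matrices each record keeps.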

Action items

@David Dotson will prepare PR with latest QCEngine, QCElemental, Psi4 on QCFractal

@David Dotson will start up local manager for dipeptide error observation

@Joshua Horton will swap out QM workers on Newcastle with XTB workers; try for high memory per task if possible

@David Dotson will double memory request of XTB workers on PRP, target openff-xtb

@Chapin Cavender will deploy additional managers on TSCC resources for dipeptide dataset
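
When testing the new Psi4, QCElemental, and QCEngine versions against QCFractal, it can help to record exactly which versions are present in each environment. The following is a minimal sketch using only the Python standard library. The distribution names in `PACKAGES` and the pins in the usage note are assumptions for illustration; a conda-installed psi4 may not expose metadata under that name, in which case it is reported as not installed.

```python
from importlib import metadata

# Distribution names are assumptions about how the packages were installed.
PACKAGES = ("psi4", "qcelemental", "qcengine", "qcfractal")


def installed_versions(packages=PACKAGES, lookup=metadata.version):
    """Return {name: version string, or None if no metadata is found}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = lookup(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(found, expected):
    """Return {name: (found, expected)} for each package that differs from its pin."""
    return {
        name: (found.get(name), pin)
        for name, pin in expected.items()
        if found.get(name) != pin
    }


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

For example, `mismatches(installed_versions(), {"qcfractal": "0.15.7"})` returns an empty dict when the environment matches that pin. The comparison is an exact string match, so it is a crude check rather than a real compatibility test.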
