...

Item

Presenter

Notes

Updates from MolSSI

Ben

BP: We got hammered by a burst of requests a little while ago (around 100/second) for entire datasets. I blocked the IP temporarily and worked with DD to find the person in OpenFF and help them access the data in a healthier way.
BP: storage utilization is creeping up, likely due to more wavefunction storage
- do need to plan a next-gen solution
- JW: what kind of proactive steps can we take for 6 months to a year
- BP: this factors into our options at ARC (VT supercomputer center)
  - our networking solution is somewhat better than previous, but we don’t have full control of the host
- BP: have some stopgap measures on storage I can take, but they are painful
  - DD: can some of the NIH supplement be used for capital expense?
    - BP: pretty sure no
  - JW: there may be in-kind support OpenFF can provide re: hardware
- JW: where are the wavefunctions coming from?
  - DD+JH: coming from pubchem single-points sets (OpenMM sets), more coming
- JW: need to track down if Peter+John have funding for their pubchem set that will drive storage needs
  - JH: looking at ~1M calculations with wavefunctions
- JW: If this is just a matter of ordering 20TB of SSD storage then we can just go straight ahead with this.
- DD: It’s not just a short-term question - 20TB may get us a solution in the short term, but we’ll want to do a more sustainable solution in the long term. Could look at options for continued supercomputer hosting, bare metal hosting in MolSSI office, or cloud hosting.
- BP – If each wavefunction is ~1MB, and we do a million of them, then that’s a terabyte. If there’s 1000 basis functions, then that’s a bit high…
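BP's back-of-envelope estimate can be checked with a quick calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes) and, for the basis-function case, a single dense N x N matrix of 8-byte floats per wavefunction. Real stored wavefunctions may hold several such matrices or be compressed, so these are rough figures only; the function names are illustrative.

```python
# Rough storage estimate for stored wavefunctions (illustrative assumptions only).

BYTES_PER_MB = 10**6
BYTES_PER_TB = 10**12


def total_storage_tb(n_records, bytes_per_record):
    """Total storage in TB for n_records of a given size in bytes."""
    return n_records * bytes_per_record / BYTES_PER_TB


def dense_matrix_bytes(n_basis, bytes_per_value=8):
    """Size of one dense n_basis x n_basis matrix of floats (assumed layout)."""
    return n_basis * n_basis * bytes_per_value


if __name__ == "__main__":
    n_calcs = 1_000_000
    # Case 1: each wavefunction is ~1 MB.
    print(total_storage_tb(n_calcs, 1 * BYTES_PER_MB))
    # Case 2: one dense matrix at 1000 basis functions, in MB and then in total TB.
    print(dense_matrix_bytes(1000) / BYTES_PER_MB)
    print(total_storage_tb(n_calcs, dense_matrix_bytes(1000)))
```

Under these assumptions, a million ~1 MB wavefunctions come to about 1 TB, while a million single dense matrices at 1000 basis functions would come to about 8 TB.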

Compute

DD – QM workers on lilac weren’t given time to clean up, which led to weird job statuses. I’ve opened a PR on QCFractal to mitigate this.
JH – We still have QM workers running on newcastle. They timed out today so I’ve resubmitted them
DD – Great. We may want to switch to XTB, but let’s discuss that later.
CC – TSCC is running right now - One job with 8 workers. I can spin up more if needed.
- DD – Feel free to spin up more. We’re making forward progress, but more resources would be great.
DD – We have QM, ANI, and XTB workers on PRP.
DD – With XTB, we have two datasets that are error cycling, and seem to have memory issues?
- PB – I’m not sure whether it’s a memory issue. The error messages aren’t clear.
- DD – Memory issues are my first guess; I wonder if they’re getting killed by the queue scheduler for using too much memory. My PRP workers have 32GB of memory.
- PB – 32GB should be fine.
- DD – JH, do I recall that newcastle was having memory issues with XTB workers?
- JH – For us it was ANI workers having memory problems. I’ll switch these over to XTB
- DD – Thanks. I’ll tag them as openff-xtb. Should be updated in a few hours.
JW – Is it possible that xtb is just ignoring our memory limits?
- DD + JH – We’re not sure whether XTB is passed the memory limit from QCEngine.
DD – #223 had error cycling turned off for a few days to see if the same jobs were killing the workers repeatedly. I’ll turn error cycling back on.
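One way to probe the scheduler-kill hypothesis is to look at how a task's process exits. On POSIX systems, a child process terminated by a signal (for example SIGKILL from an out-of-memory kill) reports a negative return code to a Python parent. The sketch below is generic and not tied to QCFractal or QCEngine; `describe_exit`, `run_and_describe`, and the stand-in command are hypothetical.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Classify a subprocess return code (POSIX semantics)."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # A negative value means the child was terminated by signal -returncode.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(cmd):
    """Run a command (list of arguments) and report how it ended."""
    proc = subprocess.run(cmd, capture_output=True)
    return describe_exit(proc.returncode)


if __name__ == "__main__":
    # Stand-in for a worker task: a child process that kills itself with SIGKILL.
    child = [
        sys.executable,
        "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGKILL)",
    ]
    print(run_and_describe(child))
```

A batch scheduler may kill the whole job rather than a single child process, in which case nothing inside the job sees the exit code and the scheduler's own accounting logs would be the place to look.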

New datasets

JH will take over on the Folmsbee-Hutchison test set
https://github.com/openforcefield/qca-dataset-submission/pull/255
Dipeptide 2-D TorsionDrives
- Large number of errors (>4000) with return message None
- Errors with brand field from PRP manager
- Workers on TSCC have low error rate (<5%)
- CC will deploy additional managers on TSCC
- DD will debug errors for openff-tscc compute tag on PRP
JH resolved compute issues with OpenMM solvated amino acid dataset

Psi4 update

DD – Problem with basis sets when deploying Psi4 1.5: incompatibilities between qcelemental 0.24 / qcengine 0.21 and qcfractal 0.15.7
JH – I think we
DD – So,
- new psi4 needs new QCEl and QCEngine
- but production QCFractal needs old QCEl and QCEngine
DD – Can we confirm that the second point is true?
BP – The intercompatibility isn’t too bad, it may just work.
JH + PB – We could run it with the old versions of everything, just need to set wcombine=False
- JH – The keyword probably isn’t harmful, so it’d be safest to BOTH update the workers AND submit a dataset with the new keyword. But in a pinch, just resubmitting with the new keyword is a good solution.
- PB – Agree.
PB will modify the prepared PRs (like pubchem set 2) to have wcombine=False, and then submit them to make sure that they don’t have the problem. If that works, we’ll make a new submission for the dipeptides which also has the updated keywords.
PB – We’ll want to be careful with this: it’s 100k records, so it’ll be a bit wasteful if it is still broken.
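The workaround discussed here amounts to adding one entry to the keywords of each calculation specification before resubmitting. A minimal sketch using plain dictionaries follows; the `spec` layout, the method, basis, and `maxiter` values, and the `with_keyword` helper are illustrative assumptions, not the QCSubmit or QCFractal API.

```python
import copy


def with_keyword(spec, key, value):
    """Return a copy of spec with one keyword added or overridden."""
    updated = copy.deepcopy(spec)
    updated.setdefault("keywords", {})[key] = value
    return updated


if __name__ == "__main__":
    # Hypothetical specification; method, basis, and maxiter are placeholders.
    spec = {
        "program": "psi4",
        "method": "b3lyp-d3bj",
        "basis": "dzvp",
        "keywords": {"maxiter": 200},
    }
    patched = with_keyword(spec, "wcombine", False)
    print(patched["keywords"])
    print(spec["keywords"])  # the original specification is left untouched
```

Working on a copy keeps the already-prepared specification intact, so a small trial submission can be compared against it before committing the full 100k records.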
DD – Is there any other reason that we should update to Psi4 1.5?
- (General) – There’s no big motivating need for this.
DD – Do we know if there’s a fundamental incompatibility between Psi4 1.5 and the QC stack?
- BP – I don’t expect that there’d be an issue but I need to test. The risk is that QCEngine may send back a schema that QCF doesn’t understand.
DD will test the new versions against each other
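Before testing the new versions against each other, it may help to record exactly which versions are installed in a given worker or server environment. A small sketch using only the standard library follows; the distribution names are assumptions about how the packages are registered in the environment, and a package installed without Python package metadata would show up as missing.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust to match the environment.
    stack = ["qcelemental", "qcengine", "qcfractal", "psi4"]
    for name, version in installed_versions(stack).items():
        print(f"{name}: {version or 'not installed'}")
```

Running this on both a worker and the server gives a quick side-by-side of the stack before any schema-level testing.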

User questions?

Science support?

JH: new qcsubmit release out (0.3.0); solvated amino acids issue addressed

Infrastructure support

JW: Matt is making forward progress on some upstream items that mostly just require technical solutions

...

David Dotson will prepare PR with latest QCEngine, QCElemental, Psi4 on QCFractal
David Dotson will start up local manager for dipeptide error observation
Joshua Horton will swap out QM workers on NewCastle with XTB workers; try for high memory per task if possible
David Dotson will double memory request of XTB workers on PRP, target openff-xtb
Chapin Cavender will deploy additional managers on TSCC resources for dipeptide dataset


Decisions