2021-12-03 QCA Submission meeting notes

Participants

  • Ben Pritchard

  • Joshua Horton

  • Jeffrey Wagner

  • Chapin Cavender

  • Pavan Behara

Goals

  • Updates from MolSSI

    • deploying psi4 1.5; incompatibilities of qcelemental 0.24 and qcengine 0.21 with qcfractal 0.15.7

  • Compute

    • QM workers on Lilac

    • QM workers on Newcastle

    • QM workers on TSCC

    • QM, ANI, XTB workers on PRP

      • need to expand XTB capacity; increase memory of workers?

  • New submissions

    • submission issues

    • dipeptide dataset

      • brand_raw errors on PRP

    • ML datasets for OpenMM

      • multi-molecule issues with QCElemental - psi4 bug

  • User questions/issues

  • Science support needs

    • new openff-qcsubmit release

  • Infrastructure needs / advances

    • psi4 on conda-forge

Discussion topics

Updates from MolSSI (Ben)

  • BP: We got hammered by a burst of requests a little while ago (~100/second) for entire datasets. I blocked the IP temporarily and worked with DD to find the person in OpenFF and help them access the data in a healthier way.

  • BP: storage utilization creeping up, likely due to more wavefunction storage

    • do need to plan a next-gen solution

    • JW: what proactive steps can we take over the next 6 months to a year?

    • BP: this factors into our options at ARC (VT supercomputer center)

      • our networking solution is somewhat better than before, but we don’t have full control of the host

    • BP: have some stopgap measures on storage I can take, but they are painful

      • DD: can some of the NIH supplement be used for capital expense?

        • BP: pretty sure no

      • JW: there may be in-kind support OpenFF can provide re: hardware

    • JW: where are the wavefunctions coming from?

      • DD+JH: coming from the PubChem single-point sets (OpenMM sets); more are coming

    • JW: need to track down whether Peter+John have funding for their PubChem set, since that will drive storage needs

      • JH: looking at ~1M calculations with wavefunctions

    • JW: If this is just a matter of ordering 20TB of SSD storage then we can just go straight ahead with this.

    • DD: It’s not just a short-term question: 20TB may get us through the short term, but we’ll want a more sustainable solution in the long term. Could look at options for continued supercomputer hosting, bare-metal hosting in the MolSSI office, or cloud hosting.

    • BP – If each wavefunction is ~1MB and we do a million of them, then that’s a terabyte. If there are 1000 basis functions, then that’s a bit high… (see the back-of-envelope check below)

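As a sanity check on the numbers above, a quick back-of-envelope estimate in Python (a minimal sketch; the record count and per-molecule basis size are the assumed figures from the discussion, not measured QCArchive values):

    # Back-of-envelope wavefunction storage estimate. All inputs are
    # assumptions taken from the discussion above.
    n_records = 1_000_000  # JH's estimate: ~1M calculations with wavefunctions
    n_basis = 1_000        # BP's assumed basis functions per molecule

    # One square orbital-coefficient matrix of float64 values:
    bytes_per_wfn = n_basis * n_basis * 8

    total_tb = n_records * bytes_per_wfn / 1e12
    print(f"{bytes_per_wfn / 1e6:.0f} MB per wavefunction")  # ~8 MB
    print(f"{total_tb:.0f} TB for {n_records:,} records")    # ~8 TB

At ~8 MB per stored coefficient matrix, a million records lands closer to 8 TB than 1 TB, so the ~1MB-per-wavefunction figure is likely a floor rather than a ceiling.
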
Compute

  • DD – QM workers on Lilac weren’t given time to clean up, which led to weird job statuses. I’ve opened a PR on QCFractal to mitigate this.

  • JH – We still have QM workers running on Newcastle. They timed out today, so I’ve resubmitted them.

  • DD – Great. We may want to switch to XTB, but let’s discuss that later.

  • CC – TSCC is running right now: one job with 8 workers. I can spin up more if needed.

    • DD – Feel free to spin up more. We’re making forward progress, but more resources would be great.

  • DD – We have QM, ANI, and XTB workers on PRP.

  • DD – With XTB, we have two datasets that are error cycling, and seem to have memory issues?

    • PB – I’m not sure whether it’s a memory issue. The error messages aren’t clear.

    • DD – Memory issues are my first guess; I wonder if they’re getting killed by the queue scheduler for using too much memory. My PRP workers have 32GB of memory.

    • PB – 32GB should be fine.

    • DD – JH, do I recall correctly that Newcastle was having memory issues with XTB workers?

    • JH – For us it was the ANI workers having memory problems. I’ll switch these over to XTB.

    • DD – Thanks. I’ll tag them as openff-xtb. Should be updated in a few hours.

  • JW – Is it possible that xtb is just ignoring our memory limits?

    • DD + JH – We’re not sure whether XTB is passed the memory limit from QCEngine (see the sketch after this list).

  • DD – #223 had error cycling turned off for a few days to see if the same jobs were repeatedly killing the workers. I’ll turn error cycling back on

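For reference, a minimal sketch of running a single XTB task through QCEngine with an explicit memory limit (the local_options usage reflects the 2021-era QCEngine API; whether the xtb harness actually enforces the memory value is exactly the open question above):

    import qcelemental as qcel
    import qcengine as qcng

    # Small test molecule (water), psi4-style geometry string.
    mol = qcel.models.Molecule.from_data("""
    0 1
    O  0.000  0.000  0.000
    H  0.000  0.000  0.960
    H  0.930  0.000 -0.240
    """)

    inp = qcel.models.AtomicInput(
        molecule=mol,
        driver="gradient",
        model={"method": "GFN2-xTB"},
    )

    # local_options sets the per-task resource envelope (memory in GiB);
    # the question is whether the xtb harness honors it or ignores it.
    result = qcng.compute(inp, "xtb", local_options={"memory": 32, "ncores": 8})
    print(result.success)
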
New datasets

  • JH will take over the Folmsbee-Hutchison test set

  • Dipeptide 2-D TorsionDrives

    • Large number of errors (>4000) with return message None

    • Errors with brand_raw field from PRP manager

    • Workers on TSCC have low error rate (<5%)

    • CC will deploy additional managers on TSCC

    • DD will debug errors for the openff-tscc compute tag on PRP (see the sketch after this list)

  • JH resolved compute issues with OpenMM solvated amino acid dataset

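To make the tag-level debugging concrete, a hedged sketch of pulling errored tasks for a given compute tag with the qcfractal 0.15-era qcportal client (the tag value and limit are illustrative, and exact query parameters may differ by client version):

    from qcportal import FractalClient

    # Public QCArchive instance; reads do not require authentication.
    client = FractalClient()

    # Fetch a batch of errored tasks carrying the compute tag in question.
    errored = client.query_tasks(status="ERROR", tag="openff-tscc", limit=100)
    for task in errored:
        print(task.base_result, task.manager)
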
Psi4 update

  • DD – Problem with basis sets when deploying psi4 1.5; incompatibilities of qcelemental 0.24 and qcengine 0.21 with qcfractal 0.15.7

  • DD – So,

    • new psi4 needs new QCEl and QCEngine

    • but production QCFractal needs old QCEl and QCEngine

  • DD – Can we confirm that the second point is true?

  • BP – The intercompatibility isn’t too bad; it may just work.

  • JH + PB – We could run it with the old versions of everything; we just need to set wcombine=False

    • JH – The keyword probably isn’t harmful, so it’d be safest to both update the workers and submit a dataset with the new keyword. But in a pinch, just resubmitting with the new keyword is a good solution.

    • PB – Agree.

  • PB will modify the prepared PRs (like pubchem set 2) to have wcombine=False, then submit them to make sure they don’t have the problem. If that works, we’ll make a new submission for the dipeptides with the updated keywords (a sketch of setting the keyword follows at the end of this section).

  • PB – We’ll want to be careful with this; it’s 100k records, so it’ll be a bit wasteful if it’s still broken

  • DD – Is there any other reason that we should update to Psi4 1.5?

    • (General) – There’s no big motivating need for this.

  • DD – Do we know if there’s a fundamental incompatibility between Psi4 1.5 and the QC stack?

    • BP – I don’t expect that there’d be an issue but I need to test. The risk is that QCEngine may send back a schema that QCF doesn’t understand.

  • DD will test the new versions against each other

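For the wcombine workaround above, a minimal sketch of registering the keyword through the qcportal client (the KeywordSet route is one plausible way to attach it; the prepared qcsubmit PRs may wire the keyword in differently):

    from qcportal import FractalClient
    from qcportal.models import KeywordSet

    # Server address and credentials read from the local client config.
    client = FractalClient.from_file()

    # psi4 keyword set carrying the wcombine=False workaround discussed above.
    kw = KeywordSet(values={"wcombine": False})
    kw_id = client.add_keywords([kw])[0]
    print(f"keyword set id: {kw_id}")
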
User questions?

Science support?

  • JH: new qcsubmit release out (0.3.0); solvated amino acids issue addressed

Infrastructure support

  • JW: Matt is making forward progress on some upstream items that mostly just require technical solutions

Action items

David Dotson will prepare a PR with the latest QCEngine, QCElemental, and Psi4 on QCFractal
David Dotson will start up a local manager for dipeptide error observation
Joshua Horton will swap out QM workers on Newcastle for XTB workers; try for high memory per task if possible
David Dotson will double the memory request of XTB workers on PRP, targeting openff-xtb
Chapin Cavender will deploy additional managers on TSCC resources for the dipeptide dataset

Decisions
