2022-02-01 QC Meeting notes

 Date

Feb 1, 2022

 Participants

  • @Pavan Behara

  • @Jeffrey Wagner

  • @David Dotson

  • @Chapin Cavender

  • Benjamin Pritchard

  • @Joshua Horton

 Goals

  • Updates from MolSSI

  • New submissions

    • OpenFF ESP Industry Benchmark Set v1.0 - 56K molecules with wfns, 25 or more heavy atoms

    • Modified submissions: SPICE sets v1.2 (Huge thanks to David!!)

      • Pubchem sets are still too large to submit within GitHub Actions' 6-hour job execution limit

  • Throughput status

    • OpenFF dipeptides torsiondrives v1.1: 5/5 TD COMPLETE!!

    • OpenFF dipeptides torsiondrives v2.0: 19/26 TD complete, up from 1/26 last week

    • SPICE DES370K Single Points Dataset v1.0: 234K+ records, ~67% complete (small molecules)

    • SPICE Dipeptides Single Points Dataset v1.2: started yesterday but high error rate (1,400 complete, 12K errors); a mix of known PRP issues and "qcengine: unknown error"

      • are we running PRP workers now?

      • may need a full-node/high-memory worker config, e.g. 40 cores/180 GB

  • User questions/issues

    • End of life for defunct label?

    • Scientific review for https://github.com/openforcefield/qca-dataset-submission/pull/223 ?

      • Charged mols issue with GFN-FF0, 411 complete/260 incomplete, looks resolved but no release on their end

      • ANI-2x errors - 0 TDs complete, expected?

    • Reducing repo size

      • Migrating files to LFS may still leave the history larger in size

      • In future submissions, if dataset.* files are pushed to LFS, do we need any modifications to our submission/validation scripts? Or can we do a post-submission repo cleanup once in a while?

  • Science support needs

  • Infrastructure needs/advances

 Discussion topics

Updates from MolSSI

  • BP – Growth in server usage has stopped. We’re at 88% utilization. I could move some things to the spinning disk storage, but I can’t move the wavefunction table.

    • kvstore (outputs) is a good candidate for this

  • BP – I’m looking at solutions here: speccing out a machine/storage setup that will handle this better. The base results table is about 1 TB, which seems large.

  • BP – I’m running through a stress test/demo with the new release, which is going well.

    • experimental deployment of new QCFractal server

    • Tested out deleting calculations - Seems to be working

      • DD: this is fantastic; will really help with submission anxiety knowing we can hit undo!

  • JW – are you viewing the storage issue as something separate from the next-generation machine?

    • BP – No, I’m looking at procurement of a big machine for our purposes/storage. There’s a funding opportunity for compute that we may jump on.

  • BP – I’m curious about the number of calculations you’re planning to run. I’m seeing about 3 million wavefunctions requested currently, and I think you’re planning on running pubchem, which is another million. And I see 30 million optimization procedures. Also the idea of external records, like records from Gaussian, will use a lot of space.

    • native_files usage could cause a ballooning of space too; these are package-specific files that don’t get extracted into a QCSchema form; just a bytestream packed into the DB

  • JW – from fitting people: do you foresee lots of wavefunction demand?

  • PB – For charge model work, we may see an uptick in wavefunctions requested from Gilson group

    • CC – Agree, Willa will need those for polarizability work.

    • PB – But those won’t be as big as the pubchem sets.

  • DD – On Friday we did a hard stop of SPICE submissions, and resubmitted them to not have wavefunctions attached.

  • BP – Current total QCA capacity is 5.2 TB SSD and 2ish TB spinning disk.

  • DD – Could provision a lot of SSDs in the next generation. Have a storage rack in the server, basically a box of SSDs.

  • BP – VT has a transportation institute, VTTI, and they’re paying $100k to provision a database, so we could see how they do it. But I’m not familiar with server construction/management, so I’ll either need to set aside time to learn, or work with a vendor/center to get it set up.

  • PB – Can we delete the old pubchem dataset with the new QCF release?

    • BP – Yes, that should be possible. So we can do that after the next release.

New submissions

  • OpenFF ESP Industry Benchmark Set v1.0 - 56K molecules with wfns, 25 or more heavy atoms

    • DD – The PR in question requests wavefunctions on an optimization, which will get ignored. But I know that he’ll run a single point dataset at the end.

    • BP – My rough estimate is that this would be about 300GB of wavefunctions. That’s about half of our remaining space.

    • DD – We could put the optimization set through, which would buy us some time before calculating wavefunctions. Is there anything I can do to help get the new QCF out so that we can do deletion?

    • BP – I don’t think there’s an easy place where you could jump in. The scary thing will be the data migration, and I don’t know that that will work on the first try. I’ve collected representatives of old data formats and tricky things to test the migration code, but I suspect things like in-progress torsiondrives will be messy.

  • Modified submissions: SPICE sets v1.2 (Huge thanks to David!!)

    • Pubchem sets are still too large to submit within GitHub Actions' 6-hour job execution limit

    • DD – I created a PR on QCF that is a small optimization that should enable this submission, and I’ll follow that up with a PR to QCSubmit. That will make submission a lot more efficient.

    • JW – For a self-hosted GitHub Actions runner on AWS, if we can keep costs below $5k for a year, that’s an easy ask

      • DD – Understood; will target < $5k annually
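
The runner question above is mostly a workflow-configuration change once a machine is registered: GitHub-hosted runners cap jobs at 6 hours, while self-hosted runners do not. A minimal sketch of how the submission job might target such a runner (the job name, labels, and timeout below are illustrative assumptions, not the repo's actual workflow):

```yaml
# Illustrative sketch: point the long-running submission job at a
# self-hosted AWS runner instead of a GitHub-hosted one.
jobs:
  submit-dataset:                        # hypothetical job name
    runs-on: [self-hosted, linux, x64]   # labels assigned when registering the runner
    timeout-minutes: 1440                # e.g. 24 h; GitHub-hosted jobs cap at 360
    steps:
      - uses: actions/checkout@v2
      # ... existing submission steps unchanged ...
```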

Throughput status

  • OpenFF dipeptides torsiondrives v1.1: 5/5 TD COMPLETE!!

  • OpenFF dipeptides torsiondrives v2.0: 19/26 TD complete, up from 1/26 last week

  • SPICE DES370K Single Points Dataset v1.0: 234K+ records, ~67% complete (small molecules)

  • SPICE Dipeptides Single Points Dataset v1.2: started yesterday but high error rate (1,400 complete, 12K errors); a mix of known PRP issues and "qcengine: unknown error"

    • are we running PRP workers now? Seeing pycpuinfo errors

      • DD – I’ve made a release of QCEngine that should fix this, but it’s dependent on a QCF release that I hope we can put out in the next few days.

      • PB – Should we stop those workers until we update the production env?

      • DD – I don’t think the pycpuinfo issues are a complete blocker - I think the majority of jobs are still completing successfully.

    • may need a full-node/high-memory worker config, e.g. 40 cores/180 GB

      • PB + DD – Not necessary to change resource requests for these.

User questions/issues

  • End of life for defunct label?

    • DD – Yes, I’d move those to the end-of-life column, and archive the cards.

  • Scientific review for https://github.com/openforcefield/qca-dataset-submission/pull/223 ?

    • Charged mols issue with GFN-FF0, 411 complete/260 incomplete, looks resolved but no release on their end

    • ANI-2x errors - 0 TDs complete, expected?

    • JH – That one’s almost done; I just need the remaining calculations, namely ANI and XTB, to finish.

      • PB – Should we be looking into the errors/reruns with ANI and XTB? I assume it’s the normal geomeTRIC issue.

      • JH – I’ll keep running managers at Newcastle.

  • Reducing repo size

    • Migrating files to LFS may still leave the history larger in size

    • In future submissions, if dataset.* files are pushed to LFS, do we need any modifications to our submission/validation scripts? Or can we do a post-submission repo cleanup once in a while?

    • JW – It’s fine to delete history, especially given the structure of how we do QC dataset submission in separate directories.

    • DD – Yes, we should be able to delete this history after we move everything to LFS.

    • PB – Do we do that as quarterly maintenance, or have users interact directly with git LFS?

    • DD – That may be better as an administrative task that we don’t burden users with. So we as maintainers should do this, and the user submission path shouldn’t change. I’ll be looking into the details of git LFS to see how we should use it.
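
The maintainer-run cleanup discussed here could look something like the following command sketch (illustrative only: the include pattern and exact flags are assumptions, and `git lfs migrate import` rewrites history, so it would need coordination and a force-push):

```shell
# Illustrative maintainer-side cleanup sketch; this rewrites history,
# so all collaborators would need to re-clone afterwards.

# Rewrite history so dataset.* artifacts become LFS pointers everywhere:
git lfs migrate import --include="dataset.*" --everything

# Drop the now-unreferenced large blobs from the local clone:
git reflog expire --expire-unreachable=now --all
git gc --prune=now

# Publish the rewritten history (requires force-push rights):
# git push --force --all && git push --force --tags
```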

Science support needs

  • CC – “Dipeptide” means something different in different contexts. Should we rename the SPICE resubmission to call them “capped 1-mer”s?

    • PB – We could update that in the README.

    • DD – Agree, I’d just update that in the dataset description.

Infrastructure needs/advances

 Action items

@David Dotson will perform a research cycle on using git-lfs for periodic archival of old artifacts in qca-dataset-submission
@David Dotson will deploy new QCFractal to compute environments following release

 Decisions