2022-01-07 QC meeting notes

Participants

  • Ben Pritchard

  • @Pavan Behara

  • @Jeffrey Wagner

  • @David Dotson

  • @Chapin Cavender

  • @Simon Boothroyd

Goals

  • Updates from MolSSI

    • how fast is storage filling from wavefunction-storing single-point sets?

  • Compute

    • high memory/storage requirements of pubchem sets

    • QM workers on Lilac

    • XTB workers on Newcastle

    • QM workers on TSCC

    • QM workers on UCI

    • QM workers on PRP

  • New submissions

    • 40k molecules, 5 conformers each, optimization with XTB

    • dipeptide dataset v2.0

  • User questions/issues

  • Science support needs

  • Infrastructure needs / advances

    • psi4 on conda-forge

Discussion topics

General updates

  • DD – Moving forward, Pavan will be running this call and the QC submission call. I’ll be in attendance, but he’ll be leading it.

  • PB – I’d like to change the meeting time to the same time on Tuesday

    • (General) – That works

    • PB – I’ll update the future events

Updates from MolSSI

  • Major milestones in the refactoring: changing how things are packaged will make things a lot easier

    • compute managers will be split out into their own package

    • managed to run full end-to-end calculation with the new system

  • BP – Planning an informal workshop for after the refactor is done. Maybe February? I’ll keep you informed.

    • JW – This sounds great, I’d be happy to advertise this to our team.

  • BP – Now that I have the manager stuff working and can run calcs on the new version, I’d like to work with you to try running ~100k calculations and seeing how things scale.

    • DD – I’d love that and can help.

    • BP – I’ll share how to do this. One big new change is that everything must have tags: previously a blank tag would pull down all jobs, but now it must explicitly be set to * to pull down any tag (see the illustration at the end of this section).

  • DD – Does the permission system (in terms of resources we can access) work with namespaces? So, like, if I have a user account specifically for running managers, I’d like it to pull down tasks only from my tags. I’m concerned about a malicious actor running a weird manager and returning results for my tagged calcs.

    • BP – That won’t be done in the next release, but we can add that to the roadmap after.
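
  • A conceptual illustration of the tag-matching change BP describes (plain Python, not QCFractal’s actual implementation; the function and names below are made up):

    # Conceptual sketch only, not QCFractal code: under the new behavior a
    # manager must explicitly list "*" to claim tasks with any tag; an empty
    # tag list no longer matches everything.
    def manager_claims(manager_tags, task_tag):
        if "*" in manager_tags:
            return True  # the wildcard must now be explicit
        return task_tag in manager_tags  # otherwise only exact tag matches

    print(manager_claims([], "openff"))         # False: blank no longer pulls all jobs
    print(manager_claims(["*"], "openff"))      # True
    print(manager_claims(["openff"], "spice"))  # False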

Compute

  • high memory/storage requirements of pubchem sets

    • DD – These jobs are fairly challenging and require a lot of memory. For example, my PRP workers have 220 GB of RAM and 300+ GB of disk, and they’re using a lot of it.

    • PB – Are jobs just working on a single task?

    • DD – Yes

    • PB – That set has molecules with ~36 heavy atoms; typically see about 50 GiB usage

    • JW – Could this be a bug in psi4 or qcengine? It could be that it’s ignoring resource limits. Are all the wcombine=False sets showing this phenomenon?

    • DD – This is a new method/basis combo for us, so this might really be a scientific requirement. Local testing could help us understand this

      • DD – I made a comment with the failing jobs:

    • PB – I can run one of the failing jobs locally on the UCI cluster and see if we can make a reproducing case.

    • PB – Is it possible that it’s a node issue and the calculations were actually segfaulting?

    • DD – I’m running managers to try and reproduce

    • SB – Were you able to isolate one of these failing jobs? (DD: yes, see link above) You could run one in a Docker container or a cgroup, where you can put a hard limit on the memory (see the sketch at the end of this item).

    • DD – Tried running some on my local box here, and observed high disk usage (60-70GB files in scratch), but nothing that was approaching 200GB (like we see on managers).

    • SB – Were these large files log files, or something else?

    • PB – I think they’re psi4 binary files, like 2-electron integrals.

    • DD – (Runs a job locally, sees local scratch file balloon in just a few minutes)

    • SB – Is this file supposed to exist in normal circumstances? Or only when it runs out of memory?

      • PB – I think this would be normal operation (something about the SCF writing two-electron integrals to scratch…)

      • SB – Maybe worth pinging on psi4 slack?

      • PB – I’ll ping on the forum.

    • SB – It may be good to estimate how much we’re spending on this (both in terms of time and compute resources) and determine whether this is worth the cost.

      • JW – It seems like the majority of the time spent on these sets is human time - Basically, 90% of the work is “doing it the first time”, and then after that we have an automated path and we can shove a ton more stuff through. But with the PubChem sets it may be reasonable to set a deadline for when we stop putting human time towards it, since this has started to cost a lot.

      • DD – It’s also worth considering the other resource usage, like the disk space usage of the wavefunctions.

      • BP – Pubchem set 1 is looking like ~500 GB, almost entirely because of wavefunctions. We have about 1.2 TB of space available so that should fit, but it’s starting to get crowded.

      • JW – I’ll be happy to advocate to help solve the storage problem from within OpenFF – Just let me know when you’ve picked a path BP (local server vs. amazon).
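
    • A minimal local-reproduction sketch (assumptions: qcengine and psi4 are installed locally; the molecule and the method/basis below are placeholders, not the actual failing pubchem record):

      # Minimal sketch: run a single record locally with an explicit memory cap
      # and a dedicated scratch directory so memory/scratch growth can be watched.
      # The molecule and the model spec are placeholders for a real failing job.
      import pathlib

      import qcelemental as qcel
      import qcengine

      scratch = pathlib.Path("/tmp/psi4_scratch")
      scratch.mkdir(parents=True, exist_ok=True)

      molecule = qcel.models.Molecule.from_data(
          "O 0.000  0.000 0.000\n"
          "H 0.000  0.757 0.587\n"
          "H 0.000 -0.757 0.587"
      )

      atomic_input = qcel.models.AtomicInput(
          molecule=molecule,
          driver="energy",
          model={"method": "wb97m-d3bj", "basis": "def2-tzvppd"},  # placeholder spec
          protocols={"wavefunction": "orbitals_and_eigenvalues"},
      )

      result = qcengine.compute(
          atomic_input,
          "psi4",
          raise_error=False,
          # Cap memory (GiB) and cores, and point scratch somewhere watchable;
          # a Docker or cgroup limit would enforce the memory cap more strictly.
          local_options={"memory": 50, "ncores": 8, "scratch_directory": str(scratch)},
      )
      print(result.success)

    • Watching the scratch directory grow during the run should show whether the two-electron-integral files alone account for the ~200 GB seen on the managers.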

  • QM workers on Lilac

  • XTB workers on Newcastle

  • QM workers on TSCC

  • QM workers on UCI

  • QM workers on PRP

New submissions

  • submission issues with OpenMM datasets - Gateway Timeouts

    • update on behavior and workaround

    • DD – Pavan, I’ve just submitted your protomer set

      • PB – Thanks

    • DD – Are the Eastman sets ready for submission? I’ve removed the spec for orbitals and eigenvalues

      • PB – Yes, they should be good to go

  • dipeptide dataset

    • v1.1 - Good progress, 3 of 5 TorsionDrives complete

    • v2.0 - Expanded number of amino acids from 3 to 26, PR here:

    • v2.0 dataset validation fails due to file rename. This seems to be a known bug in trilom/file-changes-action v1.2.3, fixed in v1.2.4

      • DD – Nice debugging, could you update this in the submission PR?

      • CC – Yes

      • DD – PB, can you review this, or should I?

      • PB – I’ll review it, or tap another person if I don’t have time.

New submissions

  • SB – There’s one that I’m considering, but I’m not sure how to move forward. I want to generate a lot of wavefunctions: basically take 40,000 molecules with up to 12 heavy atoms, make 5 conformers of each, optimize them (possibly two-stage, e.g. XTB then HF-6-31G*), then get wavefunctions of the final results.

    • DD – This could be a challenge; it would need to be a multi-operation dataset. It’d need to start with an XTB optimization dataset, then the results of that would go to an HF-6-31G* optimization set, then a single-point wavefunction set.

    • SB – QCSubmit has some new API points that should help start optimizations from the completed results of a previous set (see the sketch after this list). Though I’m wondering whether this will conform with the provenance requirements for our datasets/general QCFractal use.

    • SB – So, in terms of operations, do we anticipate size issues for submission? Storage? And what would it look like timeline-wise?

      • Compute time

        • DD – Timeline-wise, with 6-31G*, I think Hyesu may have done some… Could you try running the optimizations on Lilac, with the goal of preparing the inputs for wavefunction calcs?

      • Storage space

        • 200,000 wavefunctions × 5 MB ≈ 1 TB

        • BP – 6-31G* would be smaller than def2-TZVP, and 12 heavy atoms is also smaller than the pubchem molecules. So we may not hit any giant storage space issues here.

      • Dataset size

        • SB – It wouldn’t be any trouble to break this into smaller datasets.
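
  • A rough sketch of the QCSubmit step SB mentions, assuming openff-qcsubmit’s results API (OptimizationResultCollection.from_server / create_basic_dataset); dataset and spec names are placeholders, and only the final optimization-to-single-point step is shown:

    # Rough sketch; dataset/spec names below are placeholders, and only the last
    # step (building a single-point set from completed optimizations) is shown.
    from qcportal import FractalClient
    from openff.qcsubmit.results import OptimizationResultCollection

    client = FractalClient()  # public QCArchive instance

    # Pull the completed optimizations (placeholder dataset name and spec).
    optimizations = OptimizationResultCollection.from_server(
        client=client,
        datasets=["OpenFF Example Optimization Set v1.0"],
        spec_name="default",
    )

    # Build a single-point dataset on the final conformers; wavefunction storage
    # would be requested via the QC specification added before submission.
    single_points = optimizations.create_basic_dataset(
        dataset_name="OpenFF Example Wavefunction Set v1.0",
        description="Single points on optimized conformers (placeholder).",
        tagline="Wavefunction single points (placeholder).",
        driver="energy",
    )
    print(single_points.dataset_name)

  • The provenance question SB raises (whether starting from a previous set’s results is acceptable) would still need to be settled during submission review.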

Action items

@Pavan Behara will reschedule QCArchive submission and user group calls for Tuesdays, 8am PT
@David Dotson will work with Ben Pritchard to do burn tests with new QCFractal instance, calculations mimicking production
@Pavan Behara will attempt to reproduce high memory requirements of certain SPICE records, e.g. #257; raise with psi4 devs
@David Dotson will keep high-memory nodes up for SPICE and monitor usage; if it looks like we are past calculations with high requirements, will re-deploy many smaller workers
@David Dotson will deploy workers with priority for new dipeptide submission
@David Dotson will push remaining pubchem set submissions through local infrastructure
@Simon Boothroyd will prepare HF-6-31G* set on Lilac, prepare single point dataset from final conformers; may need to be split into multiple datasets at ~200,000 calculations

Decisions