2022-03-01 QC Meeting notes

Date

Mar 1, 2022

Participants

@Pavan Behara
@Chapin Cavender
@Trevor Gokey
Ben Pritchard
@Jeffrey Wagner
@Joshua Horton

Goals

Updates from MolSSI
Infrastructure needs/advances
- David Dotson moved files larger than 10MB to git-lfs and the repo size dropped to 108MB
  - errorcycling and the other GitHub Actions have also been updated to match the change and work as expected
  - re-clone the repo, and open new PRs to replace any relevant ones that were closed
- New optimization schema with scf properties evaluated at optimized geometry
  - one extra step of doing single point at the final geometry, acceptable cost, any thoughts?
Throughput status
- OpenFF ESP Industry Benchmark Set v1.0: 52534, up from 16170 (~95%, up from 28% last week)
- OpenFF dipeptides torsiondrives v2.1: 24/26 TD complete
  - slow moving: 577 opts since last week, around 2230+ new opts for the last two modified submissions, looks like it is going in the right direction
  - CC: Of the remaining two, one may error out, so if we reach 25/26 we can move this to end of life
- OpenFF Protein Capped 1-mer Sidechains v1.0: 1/46 TD
  - 70793 optimizations, up from 63554 last week (an increase of 7239)
  - around 1500+ opts per torsionscan done, nearly 3 opts per grid point (575 grid points IIRC)
- SPICE PubChem Set 1 Single Points Dataset v1.2: 8.5% from last week
  - Lilac compute fully dedicated to spice sets now
- SPICE Dipeptides Single Points Dataset v1.2: COMPLETE from 99%
New submissions
- Pubchem set1 submitted (Thanks to David!)
- Modified submissions: SPICE sets v1.2 (other pubchem sets in queue)
- SPICE DES370K Single Points Dataset supplement v1.0 (submitted)
  - modified spec from spice_default to spice_default_no_mbis, but submission and errorcycling show spice_default(?)
User questions/issues
- Can we map a task ID to its dataset name? Or is adding the --verbose flag the only way to see which jobs are being executed on the queue?
Science support needs

Discussion topics

Notes

BP: Running out of space and thinking of pruning the Mayer and Wiberg bond indices, each of size NxN (N = # of atoms)
BP – These get stored for every gradient calculation. Deleting the indices from the ESP dataset will free up ~100GB. For some reason, it looks like they’re sometimes duplicated.
- JW – Is there a way to only store this for the final conformation?
- BP – No
BP – For other datasets, I could delete all the bond indices except for the final step?
- PB – So future jobs would test whether they’re at the final step and only save the Wiberg+Mayer info then?
  - BP – No, I’m not changing how they’re stored, I’m just going through completed datasets and clearing them out
- TG – Could we also keep the Mayer+Wiberg info at the first step?
  - BP – Yes
- BP – From torsiondrives and optimizations, could I delete Mayer+Wiberg except for first and last conf for each?
  - TG – Yes, any time we’re looking at trajectories or torsiondrives, we don’t need intermediate bond info except at beginning and end.
  - PB – I think it’s fine to delete Mayer+Wiberg info from those places
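A rough sketch of the storage arithmetic behind this discussion, assuming two NxN matrices (Mayer and Wiberg) of 8-byte doubles saved at every gradient step. The function name, the 50-atom molecule, and the 100-step trajectory are illustrative assumptions, not figures from the meeting.

```python
def bond_index_bytes(n_atoms, n_steps, n_matrices=2, bytes_per_value=8):
    """Estimate bytes used by NxN bond-index matrices stored at each step.

    Assumes dense storage of n_matrices matrices (e.g. Mayer and Wiberg)
    per gradient step, with no compression or deduplication.
    """
    return n_atoms * n_atoms * bytes_per_value * n_matrices * n_steps


# Illustrative numbers only: a 50-atom molecule, 100-step optimization.
n_atoms = 50
n_steps = 100

full = bond_index_bytes(n_atoms, n_steps)          # every step kept
kept = bond_index_bytes(n_atoms, min(n_steps, 2))  # first and last only
fraction_freed = 1 - kept / full

print(f"all steps:      {full:,} bytes")
print(f"first and last: {kept:,} bytes")
print(f"fraction freed: {fraction_freed:.0%}")
```

Under these assumptions, keeping only the first and last steps removes all but 2/n_steps of the bond-index storage for each trajectory, so the saving grows with trajectory length.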
Wavefunctions
- JW: I recall wfns taking up too much space, is that still an issue?
- BP: Yeah, no datasets are evaluating wfns.
- https://github.com/openmm/spice-dataset/issues/11#issuecomment-998254619
BP – It is possible to run an optimization and not keep all gradients. Could I delete intermediate gradient info from opt trajectories?
- TG – We’d looked at using this info, didn’t bear much fruit, but we’re still looking at how we can subsample
- JH – I think espaloma needs these intermediates.
- BP – In the future, we might think about more targeted datasets, and having explicit options to store gradients/bond indices only for the first/last steps.
BP – So, I’m going to
- delete bond indices from first SPICE dataset,
- will write a script to delete all bond indices except from first and last optimization steps (and for torsiondrives)
- Try deleting wavefunctions from first SPICE dataset
- PB – I’ll confirm with espaloma team that the above is OK, then will notify BP to go ahead with the above deletion
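A minimal sketch of the pruning logic described above, assuming each optimization trajectory is available as a list of per-step records. The record layout and the key names (mayer_indices, wiberg_indices) are hypothetical and chosen for illustration; this is not the actual QCArchive storage schema or BP's script.

```python
from copy import deepcopy

# Hypothetical key names for the stored bond-index matrices.
BOND_INDEX_KEYS = ("mayer_indices", "wiberg_indices")


def prune_bond_indices(trajectory):
    """Return a copy of the trajectory with bond indices kept only on
    the first and last steps; all other fields are left untouched."""
    pruned = deepcopy(trajectory)
    last = len(pruned) - 1
    for i, step in enumerate(pruned):
        if i in (0, last):
            continue  # keep bond indices at the first and last steps
        for key in BOND_INDEX_KEYS:
            step.pop(key, None)
    return pruned


# Toy 5-step trajectory for a two-atom system.
trajectory = [
    {
        "step": i,
        "energy": -1.0 - 0.1 * i,
        "mayer_indices": [[0.0, 1.0], [1.0, 0.0]],
        "wiberg_indices": [[0.0, 0.9], [0.9, 0.0]],
    }
    for i in range(5)
]

pruned = prune_bond_indices(trajectory)
kept = [i for i, step in enumerate(pruned) if "mayer_indices" in step]
print(kept)
```

Gradients and energies are deliberately left in place here, since the notes leave open whether intermediate gradients are still needed.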
BP – For the migration, I have optimization, basic datasets pretty close to ready. Reaction datasets are going to be complex.
- JW – Want to let us know when you’ll do another demo and we can bring in science folks?
- BP – Yeah, let’s do next week
BP – I talked to our hardware vendor about buying a new server, waiting to get quotes.
- JW – I’d be interested in hearing what they say, let me know about quotes.
JW – Would we want to come up with a step for dataset submission where
- BP – Wavefunctions: scale as N^2 in the number of basis functions
- BP – Bond indices: N^2 (N = number of atoms) times number of gradient steps…
- BP – Output: ???
- DD – I think increasing storage is the better option
- BP – Also, I think a policy of “only computing what you actually need” is a good idea. Not “well, I might be able to use this in 5 years”
- DD – What do the defaults look like?
  - PB – We’re explicitly feeding in
- BP – In an OptimizationProtocol, the valid keywords are something like “none”, “all”, “first”, “last”, and “first and last”. So those could be used immediately if we want.
TG – If the bond indices are doubles, I’d be OK to demote them to float. We don’t use that many significant figures on those.
- PB – Yeah, we’re only really interested in the first ~two decimal places.
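A small sketch of what demoting to single precision would cost and save, assuming the indices are currently stored as 8-byte doubles (not confirmed in the meeting). The bond-order values are made up for illustration, and the standard-library array module stands in for whatever storage format is actually used.

```python
from array import array

# Illustrative bond-order values; not taken from any dataset.
bond_orders = [0.0, 0.97, 1.52, 2.03, 2.98]

as_double = array("d", bond_orders)  # 8 bytes per value
as_float = array("f", bond_orders)   # 4 bytes per value

bytes_double = as_double.itemsize * len(as_double)
bytes_float = as_float.itemsize * len(as_float)

# Largest absolute error introduced by the float32 round trip.
max_error = max(abs(a - b) for a, b in zip(as_double, as_float))

# Check that the first two decimal places are unchanged.
same_at_two_decimals = all(
    round(a, 2) == round(b, 2) for a, b in zip(as_double, as_float)
)

print(bytes_double, bytes_float)
print(max_error)
print(same_at_two_decimals)
```

For values of this magnitude the round-trip error is far below the two decimal places of interest, while the storage for the indices is halved.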

DD – An update for everyone here: users of qca-dataset-submission will need to re-clone the repo now that we’re using git-lfs. The updated instructions are in the README. You also need to install git-lfs (Git Large File Storage) on your machine to make proper use of it.
