2020-11-20 QCA Submission Meeting notes

Date

Nov 20, 2020

Participants

@David Dotson
@Simon Boothroyd
Ben Pritchard
@Trevor Gokey
@Pavan Behara
@Joshua Horton

Goals

New advancements
- PCM-based implicit solvent pathway
New submissions
- SB: OpenFF BCC Refit Study COH (submitted!)
- PB: Genentech OptimizationDataset (in preparation)
- JH: (ANI, ANI1Cxx) (ready to submit when desired)
- DD: PEPCONF OptimizationDataset (submitted!)
  - Can we identify why it appears to be progressing slowly?
- DD: Jessica Maat’s Phenyl resubmission (submitted!)
- PB/JH: Protonation/tautomer state enumeration dataset (ready for review; needs name update)
- DD: Run MM on the sandbox dataset (submitted!)
Compute bottlenecks
- Do we need to pursue more compute?
- Open discussion on strategies for meeting user timelines, managing expectations.
Upcoming infrastructure improvements
- STANDARDS-based versioning #137
- Dataset index on qca-dataset submission #147
- Local Optimization executor
- Do we need to change the index system we use for molecule submissions?
- More targeted error cycling; what else do we need in the report for decision-making?
Upcoming science support
- Enforced c1 symmetry in psi4 is almost ready
Larger advances
- Automated FF coverage gap identification, torsion prioritization, submission generation
- Benchmarking (dashboard, etc.)

Discussion topics

Item	Presenter	Notes
PCM-based implicit solvent	Simon
- PCM appears to be working on the COH submission.
- JH: This is also the first dataset storing wavefunctions/eigenvalues, so another first.
- SB: Storage and retrieval are working just fine!
- DD: Would be worth showing this off at the next show-and-tell; I’ll find out from Jeff.
Submissions
- SB: COH is about 50% complete.
  - We don’t have error cycling in place for basic DataSets yet; will get to this today.
- PB: Genentech optimization; working on the first submission.
  - Only 20% of the dataset would be submitted in this first run; is this acceptable?
  - Yes, we’ll proceed with the smaller, 127-molecule (20%) subset for the first submission.
  - DD: Feel free to reach out to me when desired; we’ll re-roll the PR off of master (DD messed up long-lived branches with squash merges).
- PB: Protomers/tautomers
  - JH: There are fewer tasks than there are conformers, because the QCF index is not case-sensitive and some of the SMILES clash when reduced to lowercase.
  - Do we have another solution? Do we drop the use of SMILES for the index?
  - TG: For torsiondrives, this is still useful.
  - JH: We still want to be able to group molecules that are just peer conformers.
  - JH: Change how we index molecules: just use `molecule-0`, `conformer-0`; basically avoid SMILES for OptimizationDatasets and Basic DataSets, and keep SMILES as the index on `TorsionDrive`s.
  - TG: May still run into issues on this with `TorsionDrive`s, but like this because we tag the driven torsion.
  - DD+BP: Could also remove the lowercase-casting on indices; this would be an almost trivial change, and non-destructive for database access (we’ll pursue this).
  - Issue raised:
- DD: PEPCONF
  - We’re getting some user pressure; why is it proceeding slowly?
  - Decide on a rebalancing of priorities for datasets: reduce priority to low for some optimization sets.
  - TG: Many of these molecules will take a lot of memory (> 50 GiB).
  - DD: Perhaps it is time to scale up all our nodes to a minimum amount of memory for QM jobs.
  - Do we know if there are ways to reduce the memory usage of Optimizations?
  - BP: Psi4 can write to disk if needed when memory gets constrained.
  - DD: I will reduce the memory offered to the manager to below the constraints given to each worker; this may trigger writing to local storage.
    - Also increase the total memory of each replica to 64 GiB.
    - Could also scale the CPUs to 32, perhaps even 64.
  - We’ll increase the priority of PEPCONF to high.
  - TG: Will reduce the number of workers deployed, to see if this reduces pre-emption frequency.
- Phenyl Dataset: will start to starve others.
  - DD: I’ll touch base with Jessica to find out timeline needs for the Phenyl set.
Strategies for user timelines, expectations	David
- JH: I think we can be faster in merging datasets now, especially with STANDARDS coming into place.
- DD: We’re already defaulting to ‘high’ priority for fitting datasets; it is more discretionary for others.
- JH: Some of the datasets came from PI pressure to get things running; these could be re-tagged to ‘low’ priority.
- DD: Compute tags are an avenue for controlling flows, but dangerous if we park tasks in a compute tag for which we have no managers.
Dataset index	Josh
- Probably good to merge; can’t find the script used to generate it.
- DD: We can merge and manually curate for now, and add automation later.
Error Cycling	David
- TG: Restarts after SCF convergence and optimization convergence errors appear to clear them often enough; we probably don’t want to exclude these.
- High memory use in Psi4 can be dealt with through better configuration of workers (setting the memory available to less than the memory allocated on the node).
- DD: We’ll close for now; can chew on more ways to utilize compute tags for routing, and how we want to filter error cycling.
Enforced C1 symmetry	Josh
- C1 symmetry is coming in Psi4; old datasets where we didn’t do this will still work.
- If a method requires a specific symmetry, Psi4 will set it itself.
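
A minimal sketch of the index clash raised under Submissions, assuming the index is simply the SMILES string cast to lowercase. The helper names and the `molecule-N-conformer-M` label format are hypothetical (the notes only mention `molecule-0`, `conformer-0`); this is not the QCFractal or qca-dataset-submission implementation. Cyclohexane and benzene serve as an example of two distinct SMILES that differ only by case:

```python
from collections import defaultdict


def lowercase_collisions(smiles_list):
    """Return groups of distinct SMILES that share one lowercased key."""
    groups = defaultdict(list)
    for smiles in smiles_list:
        key = smiles.lower()
        if smiles not in groups[key]:
            groups[key].append(smiles)
    # Keep only keys that more than one distinct SMILES maps onto.
    return {key: members for key, members in groups.items() if len(members) > 1}


def sequential_index(conformer_counts):
    """Build SMILES-free labels from the number of conformers per molecule.

    The label format is a hypothetical reading of the `molecule-0`,
    `conformer-0` proposal from the discussion.
    """
    return [
        f"molecule-{mol}-conformer-{conf}"
        for mol, count in enumerate(conformer_counts)
        for conf in range(count)
    ]


# Cyclohexane (aliphatic) and benzene (aromatic) differ only by case.
entries = ["C1CCCCC1", "c1ccccc1", "CCO"]
collisions = lowercase_collisions(entries)
labels = sequential_index([2, 1])
print(collisions)
print(labels)
```

Labels of this kind stay unique under lowercasing but no longer encode chemistry in the index; removing the lowercase cast, the option DD+BP chose to pursue, would keep SMILES usable as-is.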

Action items

@David Dotson will get next show-and-tell date from @Jeffrey Wagner, relay to group for PCM, wavefunction demonstration

@David Dotson will add in error cycling for basic DataSets to lifecycle

@Pavan Behara will proceed with Genentech dataset, with initial submission only including smaller molecules (~20% of the full dataset); reach out to @David Dotson for help fixing the branch/PR when ready

@Trevor Gokey will experiment with reducing the number of workers deployed on pre-emptible queues, see if this positively impacts pre-empt frequency; potentially reach out to admins for assistance

@David Dotson will re-work PRP deployment of QM workers with manager limits below those given to the container; use fewer CPUs, more memory per replica, more replicas
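
A rough sketch of the arithmetic behind this deployment change, assuming a fixed headroom between the container limit and what the manager advertises to tasks. The function and parameter names are illustrative only and are not QCFractal manager settings; 64 GiB is the replica size from the discussion, while the 8 GiB headroom is an arbitrary placeholder:

```python
def manager_memory_gib(container_limit_gib, headroom_gib, tasks_per_worker=1):
    """Memory each task may claim, kept below the container limit.

    All names here are hypothetical; only the idea (advertise less memory
    than the container actually has) comes from the meeting notes.
    """
    if headroom_gib <= 0 or headroom_gib >= container_limit_gib:
        raise ValueError("headroom must be positive and smaller than the container limit")
    if tasks_per_worker < 1:
        raise ValueError("tasks_per_worker must be at least 1")
    return (container_limit_gib - headroom_gib) / tasks_per_worker


# One task per 64 GiB replica, leaving 8 GiB of headroom.
per_task = manager_memory_gib(64, 8)
print(per_task)
```

The intent, per the discussion, is that Psi4 runs out of its advertised memory and falls back to local storage before the container limit itself is reached.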

@David Dotson will touch base with Jessica Maat on timeline needs for Phenyl set; assess priority of other sets relative to it

@David Dotson will review and merge the index on qca-dataset-submission; create issue for automated curation

@David Dotson will tag the PEPCONF dataset with priority “high”

@David Dotson will add compute tagging on the basis of submission priority-* GH tag to lifecycle error cycling
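
One possible shape for this mapping, as a sketch only: derive a compute tag from a PR's `priority-*` label, and fall back to a default tag when no manager serves the derived one (the risk DD raised about parking tasks in an unserved tag). The function name, the `openff` default, and the `openff-<priority>` tag format are all hypothetical and not taken from the lifecycle code:

```python
def compute_tag_for(labels, served_tags, default_tag="openff"):
    """Pick a compute tag from a PR's `priority-*` label.

    Tag names are hypothetical. Returns `default_tag` when no priority
    label is present or when no manager serves the derived tag.
    """
    priorities = [
        label.split("-", 1)[1]
        for label in labels
        if label.startswith("priority-")
    ]
    if not priorities:
        return default_tag
    tag = f"{default_tag}-{priorities[0]}"
    return tag if tag in served_tags else default_tag


# Tags that currently have at least one manager pulling from them.
served = {"openff", "openff-high"}
high = compute_tag_for(["priority-high", "tracking"], served)
low = compute_tag_for(["priority-low"], served)
print(high, low)
```

Here the `priority-low` submission stays on the default tag because nothing serves `openff-low`, so its tasks are never stranded.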