2021-12-17 QCA Submission meeting notes

Participants

  • @Pavan Behara

  • @Chapin Cavender

  • @Joshua Horton

  • Ben Pritchard

  • @David Dotson

Goals

  • Updates from MolSSI

    • how fast is storage filling from wavefunction-storing single-point sets?

  • Compute

    • QM workers on Lilac

    • XTB workers on Newcastle

    • QM workers on TSCC

    • QM workers on UCI

    • QM, ANI, XTB workers on PRP

  • New submissions

    • submission issues with OpenMM datasets - Gateway Timeouts

      • update on behavior and workaround

    • dipeptide dataset

      • v1.1

  • User questions/issues

  • Science support needs

  • Infrastructure needs / advances

    • psi4 on conda-forge

Discussion topics

Item

Presenter

Notes

Updates from MolSSI

Ben

  • BP – if you submit an optimization, the protocols part of a QC spec is ignored

    • there is utility for it, e.g. STDOUT flag

    • planning to put that into QCElemental/QCSchema

  • BP – do we want torsiondrive protocols?

    • DD – presumably yes, can think of e.g. wanting wavefunction for final gradient of each optimization

      • if we can structure protocols to accommodate future user requests to modify behavior of torsiondrives, optimizations, that would be ideal; not sure how easy that is though

    • JH – if we’re calculating the wavefunction for every gradient and only saving the last one, that would allow for using the wavefunction of the last gradient as the SCF guess for the next gradient

    • BP – also can accommodate torsiondrives as a procedure now that it’s in QCEngine

  • BP – working to make the database more constrained, more normal-form

    • have several torsiondrives that are missing optimizations (about 2 dozen); don’t have a good solution and can’t just drop nulls in

    • have a table that links torsiondrive to its optimizations

      • also have a table that stores index of minimum

    • DD – I think it’s fine if we drop the tables that store the minimum and instead determine the minimum on demand, either client-side or server-side through a REST endpoint; not much argument for storing this

  • BP – also have new statuses

    • waiting

    • running

    • errored

    • cancelled

    • complete

    • deleted

    • do we want e.g. an invalid status? Something to indicate a known broken calculation?

      • a complete record could be marked as invalid if it’s found to be straight up wrong, for example

      • also gives us a way out for the torsiondrives with missing optimizations, for example

      • BP – could also create a one-to-many table that can take multiple comments on a record

  • PB – can we make collections invalid, or just records?

    • BP – that can be done! Haven’t gotten into collections yet.

      • will be hitting that in January

      • versioning, snapshotting, etc.

      • DD – I think supporting versioning and status as separate fields would be a boon; will work with you on this in January.

  • BP – thinking of holding a virtual workshop to show off how the new QCArchive works, tentatively in February

  • DD – how quickly is storage filling up due to wavefunction storage? Have pubchem set going and want to ensure we won’t overwhelm capacity over the break.

    • BP – about 5MB per calculation

    • only have about 1.5TB of storage on the SSD

  • DD – I’ll notify John that we will trim off orbitals and eigenvalues from these submissions, then begin submitting them
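
The storage figures quoted above (about 5MB per calculation, about 1.5TB on the SSD) give a rough ceiling on how many wavefunction-storing records fit. The sketch below works that out; it assumes decimal units (1TB = 1,000,000MB) and ignores everything else already stored in the database, so the real headroom is smaller. The function name is illustrative only.

```python
# Figures quoted in the meeting; both are approximate.
MB_PER_CALCULATION = 5
SSD_CAPACITY_TB = 1.5


def max_wavefunction_records(capacity_tb: float, mb_per_calc: float) -> int:
    """Upper bound on records that fit, ignoring all other database contents.

    Assumes decimal units: 1 TB = 1,000,000 MB.
    """
    capacity_mb = capacity_tb * 1_000_000
    return int(capacity_mb // mb_per_calc)


if __name__ == "__main__":
    # 1,500,000 MB / 5 MB per calculation
    print(max_wavefunction_records(SSD_CAPACITY_TB, MB_PER_CALCULATION))  # 300000
```

Under these assumptions, an empty 1.5TB disk would hold roughly 300,000 such calculations.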

Compute



  • PB – if folks want the orbitals, we may get a request later for orbitals.

    • would be wise to include Peter in the discussion

    • DD – will put together a message to Peter and John asking them to begin using the pubchem set 1 dataset and determine if they need wavefunction data for their use case. This will inform whether we proceed with wavefunctions on the remaining 5 datasets. Will include storage projections to explain why this is important to answer.

Task submission slowness

 

  • DD – will make a PR to QCFractal for submission optimizations; corresponding PR to QCSubmit to take advantage

    • can grin and bear it with current submissions for now

  • BP – restarting database with postgres logging enabled; will see if we’re missing a key index on a table; if so, the task creation slowness may be an easy fix

  • DD – running submission with tasks now

  • BP – in the new version we will query for the spec only once per set of tasks submitted; right now a query on the spec hits 5 indexes for each and every task

    • right now the time-consuming part is _create_task; considering moving this to when a manager requests a task, rather than on client submission

    • DD – that would actually be a pretty great optimization, with no downside as far as I can tell; a manager can afford to wait on its first call, and in steady-state operation it requests X tasks at a time, so it wouldn’t really see a slowdown; would save the client an immense amount of time submitting.
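
The "query for the spec only once per set of tasks" change can be illustrated with a toy model. This is a minimal sketch, not QCFractal code: `SpecStore`, `submit_per_task`, and `submit_batched` are hypothetical names, and the in-memory dictionary stands in for the indexed database lookup.

```python
from typing import Dict, Hashable, List, Tuple


class SpecStore:
    """Hypothetical stand-in for the server's spec table; counts lookups."""

    def __init__(self) -> None:
        self._ids: Dict[Hashable, int] = {}
        self.lookups = 0

    def get_or_add(self, spec: Hashable) -> int:
        # Each call models one (expensive) indexed query against the database.
        self.lookups += 1
        return self._ids.setdefault(spec, len(self._ids) + 1)


def submit_per_task(store: SpecStore, spec: Hashable, molecules: List[str]) -> List[Tuple[str, int]]:
    """Current behavior as described: the spec is looked up for every task."""
    return [(mol, store.get_or_add(spec)) for mol in molecules]


def submit_batched(store: SpecStore, spec: Hashable, molecules: List[str]) -> List[Tuple[str, int]]:
    """Planned behavior as described: one spec lookup per set of tasks."""
    spec_id = store.get_or_add(spec)
    return [(mol, spec_id) for mol in molecules]


if __name__ == "__main__":
    spec = ("psi4", "b3lyp-d3bj", "dzvp")  # example spec, for illustration only
    mols = ["mol-1", "mol-2", "mol-3"]

    per_task, batched = SpecStore(), SpecStore()
    submit_per_task(per_task, spec, mols)
    submit_batched(batched, spec, mols)
    print(per_task.lookups, batched.lookups)  # 3 1
```

Both functions produce the same task list; only the number of spec lookups differs, growing with the number of tasks in the first case and staying constant in the second.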

Walkthrough of current work on QCFractal

Ben

  • Organized by entity, not by system component.

    • e.g. molecules have everything in same directory, including models, REST route, postgres storage schema, etc.

  • molecules have a mutable identifier field that can be queried

    • useful for adding e.g. CMILES directly to a molecule

  • client has separation between get_* and query_* methods

    • getters allow you to get things by id only, in order

    • queries allow lookup by field; result order is not guaranteed, the number of results is capped, and results can be paginated

  • tasks hidden from users, attached to records due to 1:1 relationship

  • records keep their series of errors, allowing for in-server error cycling

  • user management can be done from client; using RBAC (role-based access control) as the basis

    • admin, read, monitor, compute, and submit roles

  • switched from tornado to flask, using JSON Web Tokens (JWT) instead of sending credentials with every request

  • JH – can you query by molecule identifiers?

    • BP – currently can do by id

    • if I can query by e.g. InChI, then get, say, all optimizations that use that molecule

    • DD – perhaps we can stack some methods on Molecule itself that let you do this; might require either subclassing or monkey-patching QCElemental’s Molecule object.

  • BP – deduplication is tricky, in particular for the trajectory of an optimization

    • for optimizations, we now create entirely new gradients

    • avoids issue of not being able to resubmit calculations that have the same set of hashed attributes, but e.g. a different psi4 version that fixes a bug
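
The get_* versus query_* split described in the walkthrough can be sketched with a toy in-memory client. Everything here is an assumption for illustration: `ToyClient`, the method signatures, the `skip` parameter, the page size, and returning `None` for a missing id are not taken from the new QCFractal client.

```python
from typing import Dict, List, Optional


class ToyClient:
    """Hypothetical in-memory client illustrating get_* vs query_* semantics."""

    def __init__(self, molecules: Dict[int, dict], page_size: int = 2) -> None:
        self._molecules = molecules
        self._page_size = page_size  # cap on results returned per query call

    def get_molecules(self, ids: List[int]) -> List[Optional[dict]]:
        """Lookup by id only; results come back in the same order as `ids`.

        Returning None for an unknown id is an assumption of this sketch.
        """
        return [self._molecules.get(i) for i in ids]

    def query_molecules(self, key: str, value: str, skip: int = 0) -> List[dict]:
        """Lookup by identifier field; at most `page_size` results per call.

        Callers should not rely on result order; use `skip` to page through.
        """
        matches = [
            mol for mol in self._molecules.values()
            if mol["identifiers"].get(key) == value
        ]
        return matches[skip:skip + self._page_size]


if __name__ == "__main__":
    client = ToyClient({
        1: {"name": "ethane-a", "identifiers": {"smiles": "CC"}},
        2: {"name": "ethane-b", "identifiers": {"smiles": "CC"}},
        3: {"name": "ethane-c", "identifiers": {"smiles": "CC"}},
    })
    print([m["name"] if m else None for m in client.get_molecules([3, 1, 99])])
    first_page = client.query_molecules("smiles", "CC")
    second_page = client.query_molecules("smiles", "CC", skip=len(first_page))
    print(len(first_page), len(second_page))
```

The getter preserves the caller's ordering so results can be zipped back against the requested ids, while the query returns a bounded page that the caller advances through.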

Action items

@David Dotson will engage John and Peter regarding need for wavefunctions for pubchem sets; ask them to begin using pubchem set 1 for downstream work if it helps answer this question
@David Dotson will make PRs against QCFractal and QCSubmit implementing submission optimizations, aiming for merge and release in early January
@David Dotson will finish manager signal handling PR for QCFractal for January release
@David Dotson will work with Ben on collection status, version field in January

Decisions
