
Participants

Goals

  • PortalClient Training

    • BP will link slides here

    • PB – Are we still splitting dispersion calculation from functional?

      • BP – Good question. It’s all together in the new version. So if you’re running a lot of different functionals with dispersion correction, it may make sense to add a new specification for the dispersion correction.

    • PB – Can we access the new datasets in the new server?

      • BP – Yes, the new hardware is running a copy of the old server and the new server, and you can use either the old or new client to access it. We also have an ML instance – we're splitting out the old OpenFF instance, so there are now separate OpenFF, MolSSI, and ML instances.

    • DD – And to emphasize - We should only be submitting using the NEW client to the NEW server, right?

      • BP – yes

  • Tutorial

    • For copy and paste: mamba create -n qcportal-tutorial -c qcarchive/label/next -c conda-forge qcportal nglview jupyter

    • DD – Are records always integers?

      • BP – Yes, now they’re integers. Previously they were strings that had to contain integers.

    • BP – Oh, I should mention that statuses are improved – there's completed, running, error, waiting. There are a small number of others, but I won't mention those today.

    • DD - If we have to resubmit the same calc because of dependency/infra issues?

      • BP – I hope to eventually implement support for duplicate submissions, but it's not there yet. For now you could delete the original entry or resubmit with slightly different input.

    • PB - regarding error messages, we don't have to do client.query_kvstore() anymore, right?
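
      • (A sketch of the new-style alternative – assuming error info is exposed directly on records, per the training discussion; not verified in these notes:)

        from qcportal import PortalClient

        client = PortalClient("https://example.qcarchive.server")  # placeholder address
        rec = client.get_records([123])[0]   # 123 is a placeholder record id
        if rec.status == "error":
            print(rec.error)   # error info hangs off the record; no query_kvstore needed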

    • PE – Can we further elaborate on program in specification, to specify a specific version of a program?

      • BP – I’ve thought about this a bit, but nothing implemented yet. For now I’d recommend users handle this with tags.

    • PB – Is record fetching still done in batches?

      • BP – It should be… I might need to double check

    • PB – Is pandas dataframe compilation done on the server and then sent to client?

      • BP – No, it’s done locally on the client. There’s an internal cache that I may need to look at again, and I’d like to add a feature where we save the cache for later.

    • DD – So no more ds.save?

      • BP – Correct

    • JW – Is it possible for datasets to exist with some spec/entry permutations with no records?

      • BP – Yes, the records aren’t created until you do ds.submit, and you can have a “sparse” dataset.

    • PB – After submission, it checks for existing calcs, and if there’s a match it’s returned? What if they’re flagged invalid?

      • BP – Yes. If they’re invalid you’ll get the invalid record returned. I might add some kwargs to the submit method to force duplicates or something.

    • DD – Possible to rename specifications on existing datasets?

      • BP – Yes, but I won’t cover that today

    • LW – Is delete access scoped?

      • BP – Not right now, though there is some notion of ownership and limited permissions, and we’re running separate servers for OpenFF, ML, and MolSSI.

    • … (discussion about datasets/server management, see around 1:45 in recording)…

    • DD – Can we pull qc_vars?

    • JW – Is there a way to go from record to entry? That would be hard, right? Because a record may not be associated with a dataset, or it may be associated with multiple.

      • BP – client.query_dataset_records(rec_id) will return a dict that identifies the dataset and the entry id inside the dataset

      • DD – Is there a more direct way you’d recommend we use?

      • BP – This is roughly the most direct way.

    • BP – (there’s effectively a way to go from any record to find the parent or dataset it belongs to)
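
      • (A sketch of that lookup. Since a record may belong to several datasets, a list of matches is assumed here, and the dict keys shown are assumptions, not confirmed in these notes:)

        # map a record id back to the dataset(s)/entries that reference it
        for info in client.query_dataset_records(rec_id):
            print(info["dataset_name"], info["entry_name"])   # key names assumed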

  • MolSSI QCArchive user group

    • user questions / issues / feature requests

    • server instance statuses

      • QCArchive Legacy

      • QCArchive OpenFF

        • currently retains everything from Legacy

      • QCArchive ML

      • QCArchive Validation

    • compute resources statuses

    • call for new users

    • trainings

      • upcoming PortalClient trainings

      • upcoming compute manager trainings

    • deployed stack versions:

      • QCArchive Legacy

        • 0.15.8.1

      • QCArchive OpenFF

        • 0.50.0b11

      • QCArchive ML

        • 0.50.0b12

      • QCArchive Validation

        • 0.50.0b12

  • New datasets

    • SPICE 2.0

    • OpenFF Optimization Diverse Fragments with Iodine (w/ ESPs)

    • OpenFF Optimization Hypervalent Sulfurs (w/ ESPs)

    • OpenFF DNA

  • Updates from stakeholders

    • OpenFF

    • Genentech

    • MolSSI

  • QCFractal development : sprint begins …

    • QCFractal v0.50.0 - imminent

    • v0.70.0 milestone:

  • Additional business

    • MolSSI QCArchive Working Group start date: 8/29

Discussion topics


PortalClient Training : slides

  • Next major version is v0.50

  • Hardware backing servers is now sufficiently provisioned to handle load from users, compute, storage needs

  • Software stack changes:

    • web api: flask + gunicorn

    • auth: password + jwt

    • compute manager: just parsl

  • now features consistent terminology

    • database more relational than previous; db-constraint-based consistency

    • more done server-side

      • dataset submission, status checking; improves performance

    • emphasis on iterative fetching: can iterate over records instead of pulling everything down at once as a big blob

    • web API is more accessible; technically doesn’t need QCPortal client

    • ownership tracking: records belong to identities and groups; not yet used to enforce permission boundaries

  • Raw web API

    • browser can access records in a far more standard way
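
    • (illustrative only – the endpoint path below is an assumption, not something covered in the session:)

      import requests

      # fetch a record as plain JSON, with no QCPortal client at all
      r = requests.get("https://example.qcarchive.server/api/v1/records/123")
      print(r.json())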

  • All calculations are records

  • Some records are services

    • e.g. TorsionDrive, GridOptimization, etc.

  • One dataset type per record type; all behave similarly

  • May bring back idea of collection as some kind of heterogeneous dataset, but not present as a concept right now

  • entry - typically the molecule input : the rows in a dataset

  • specification - method/basis/keywords : the columns in a dataset

  • in a dataset, the combination of an entry (row) and specification (column) refer to a record

  • primary object you work with is records; you don't directly work with tasks and services (though they are accessible as properties on records)

  • more usage of properties in records with automatic fetching (e.g. wavefunctions)
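
    • (for example, on a completed singlepoint record – a sketch assuming wavefunction protocols were enabled at submission:)

      wfn = rec.wavefunction   # fetched from the server on first access, then cached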

  • getting and querying

    • get_records(ids) returns list of records in same order as ids

    • query_records(...) returns an iterator, no guaranteed order of results
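
    • (a minimal sketch contrasting the two, assuming an existing PortalClient; the filter kwarg is illustrative:)

      recs = client.get_records([17, 5, 42])   # list, same order as the ids
      for rec in client.query_records(record_type="singlepoint"):  # iterator, unordered
          print(rec.id, rec.status)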

  • submit with add_*

  • only metadata, specifications and names of entries downloaded at first

    • full records only downloaded as requested

  • BP : have an old instance of QCFractal server running that serves requests from the old FractalClient; have an equivalent new QCFractal server running that serves requests from the new PortalClient

PortalClient Training : interactive

  • mamba create -n qcportal_tutorial -c qcarchive/label/next qcportal nglview jupyter

  • the full history of executions for a given record is preserved; can introspect repeat failures

  • error cycling server-side is configured server-wide; not yet dataset or record-specific

  • submitting individual calculations (e.g. add_singlepoints) returns InsertMetadata and record ids for the records submitted

    • InsertMetadata will tell you if the record existed already or a new record was created
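
    • (a sketch of a small submission, assuming psi4 and a water molecule built with qcelemental; the InsertMetadata field names are assumptions:)

      from qcelemental.models import Molecule

      mol = Molecule.from_data("O 0 0 0\nH 0 0 1\nH 0 1 0")
      meta, ids = client.add_singlepoints(
          [mol], program="psi4", driver="energy", method="b3lyp", basis="def2-svp"
      )
      print(meta.n_inserted, meta.n_existing)   # new vs. already-existing records
      print(ids)                                # one record id per molecule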

  • query_* methods return an iterator over records

    • will return actual record objects as you iterate through; handle with care

  • datasets

    • ds.specifications gives dict-like access to the dataset column values, which are the specifications themselves

    • ds.get_entry can be used to retrieve the entries by name; ds.entry_names will give all entry names

    • for getting and iterating over records, the main method to use is ds.iterate_records, which will yield an iterator with optional filtering based on e.g. entry, specification, status, etc.

    • there is a convenience method for creating pandas dataframes for properties of interest with ds.compile_values

    • datasets do feature an internal cache, reducing the need to request the same entities multiple times
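
    • (pulling the above together – a sketch assuming an existing singlepoint dataset; the property key passed to compile_values is an assumption:)

      ds = client.get_dataset("singlepoint", "My Dataset")
      print(list(ds.specifications))   # column (specification) names
      print(ds.entry_names[:5])        # row (entry) names
      for entry_name, spec_name, rec in ds.iterate_records(status="complete"):
          print(entry_name, spec_name, rec.status)
      df = ds.compile_values(lambda r: r.properties["return_energy"], "energy")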

  • dataset submission

    • client.add_dataset method can be used to create a new dataset of any type

    • ds.add_entry immediately adds the entry to the server

      • can be batched with ds.add_entries

      • no more ds.save

    • ds.add_specification immediately adds the specification to the server

    • ds.submit is used to actually submit calculations; with no args it will create records for each entry/specification combo in the dataset

      • you can choose which entries/specifications to submit calculations for if you’d prefer to target only specific combinations in a dataset

    • can use client.query_records(child_id=<id>) to go up from e.g. an optimization record into its torsiondrive(s)
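
    • (an end-to-end sketch of the submission flow above, assuming a singlepoint dataset type and a psi4 specification:)

      from qcelemental.models import Molecule
      from qcportal.singlepoint import QCSpecification

      ds = client.add_dataset("singlepoint", "Tutorial Dataset")   # created immediately
      ds.add_entry("water", Molecule.from_data("O 0 0 0\nH 0 0 1\nH 0 1 0"))
      ds.add_specification(
          "b3lyp/def2-svp",
          QCSpecification(program="psi4", driver="energy", method="b3lyp", basis="def2-svp"),
      )
      ds.submit()   # no args: a record for every entry x specification combination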

Action items

  •  

Decisions
