2023-07-25 OpenFF QCA Working Group Meeting Notes

Participants

  • @David Dotson

  • @Alexandra McIsaac

  • Ben Pritchard

  • @Brent Westbrook (Unlicensed)

  • @Lily Wang

  • @Pavan Behara

  • Peter Eastman

  • @Jeffrey Wagner

 

Meeting recording: https://us02web.zoom.us/rec/share/Nhv6CRrhAW1fsYzScco7jSn-0auO6EhyqDGe_uTjR1csbFBvZMmg4dh7Tuoamprz.eYQBjxBg8oAq1w7U?startTime=1690311876000

Goals

  • PortalClient Training

    • BP will link slides here

    • PB – Are we still splitting dispersion calculation from functional?

      • BP – Good question. It’s all together in the new version. So if you’re running a lot of different functionals with dispersion correction, it may make sense to add a new specification for the dispersion correction.

    • PB – Can we access the new datasets in the new server?

      • BP – Yes, the new hardware is running a copy of the old server and the new server, and you can use either the old or new client to access it. The old OpenFF instance has also been split up: there is now an OpenFF instance, a MolSSI instance, and an ML instance.

    • DD – And to emphasize - We should only be submitting using the NEW client to the NEW server, right?

      • BP – yes

  • Tutorial

    • For copy and paste: mamba create -n qcportal-tutorial -c qcarchive/label/next -c conda-forge qcportal nglview jupyter

    • DD – Are records always integers?

      • BP – Yes, now they’re integers. Previously they were strings that had to contain integers.

    • BP – Oh, I should have mentioned that statuses are improved: there’s completed, running, error, and waiting. There are a small number of others, but I won’t cover those today.

    • DD – What if we have to resubmit the same calc because of dependency/infra issues?

      • BP – I hope to eventually implement support for duplicate submissions, but it’s not there yet. For now you could delete the original entry or resubmit with slightly different input.

    • PB – Regarding error messages, we don’t have to do client.query_kvstore() anymore, right?

    • PE – Can we further elaborate on program in specification, to specify a specific version of a program?

      • BP – I’ve thought about this a bit, but nothing implemented yet. For now I’d recommend users handle this with tags.

    • PB – Is record fetching still done in batches?

      • BP – It should be… I might need to double check

    • PB – Is pandas dataframe compilation done on the server and then sent to client?

      • BP – No, it’s done locally on the client. There’s an internal cache that I may need to look at again, and I’d like to add a feature where we save the cache for later.

    • DD – So no more ds.save?

      • BP – Correct

    • JW – Is it possible for datasets to exist with some spec/entry permutations with no records?

      • BP – Yes, the records aren’t created until you do ds.submit, and you can have a “sparse” dataset.

    • PB – After submission, it checks for existing calcs, and if there’s a match it’s returned? What if they’re flagged invalid?

      • BP – Yes. If they’re invalid you’ll get the invalid record returned. I might add some kwargs to the submit method to force duplicates or something.

    • DD – Possible to rename specifications on existing datasets?

      • BP – Yes, but I won’t cover that today

    • LW – Is delete access scoped?

      • BP – Not right now, though there is some notion of ownership and limited permissions, and we’re running separate servers for OpenFF, ML, and MolSSI.

    • … (discussion about datasets/server management, see around 1:45 in recording)…

    • DD – Can we pull qc_vars?

    • JW – Is there a way to go from record to entry? It would be hard, right, because a record may not be associated with a dataset, or it may be associated with multiple.

      • BP – client.query_dataset_records(rec_id) will return a dict that identifies the dataset and the entry id inside the dataset

      • DD – Is there a more direct way you’d recommend we use?

      • BP – This is roughly the most direct way.

    • BP – (there’s effectively a way to go from any record to find the parent or dataset it belongs to)

  • MolSSI QCArchive user group

    • user questions / issues / feature requests

    • server instance statuses

      • QCArchive Legacy

      • QCArchive OpenFF

        • currently retains everything from Legacy

      • QCArchive ML

      • QCArchive Validation

    • compute resources statuses

    • call for new users

    • trainings

      • upcoming PortalClient trainings

      • upcoming compute manager trainings

    • deployed stack versions:

      • QCArchive Legacy

        • 0.15.8.1

      • QCArchive OpenFF

        • 0.50.0b11

      • QCArchive ML

        • 0.50.0b12

      • QCArchive Validation

        • 0.50.0b12

  • New datasets

    • SPICE 2.0

    • OpenFF Optimization Diverse Fragments with Iodine (w/ ESPs)

    • OpenFF Optimization Hypervalent Sulfurs (w/ ESPs)

    • OpenFF DNA

  • Updates from stakeholders

    • OpenFF

    • Genentech

    • MolSSI

  • QCFractal development : sprint begins …

    • QCFractal v0.50.0 - imminent

    • v0.70.0 milestone:

  • Additional business

    • MolSSI QCArchive Working Group start date: 8/29

Discussion topics

Item

Notes

PortalClient Training : slides

  • Next major version is v0.50

  • Hardware backing the servers is now sufficiently provisioned to handle user load, compute, and storage needs

  • Software stack changes:

    • web api: flask + gunicorn

    • auth: password + jwt

    • compute manager: just parsl

  • now features consistent terminology

    • database more relational than previous; db-constraint-based consistency

    • more done server-side

      • dataset submission, status checking; improves performance

    • emphasis on iterative fetching: can iterate over records instead of pulling everything down at once as a big blob

    • web API is more accessible; technically doesn’t need QCPortal client

    • ownership tracking: records belong to identities and groups; not yet used to enforce permission boundaries

  • Raw web API

    • browser can access records in a far more standard way

  • All calculations are records

  • Some records are services

    • e.g. TorsionDrive, GridOptimization, etc.

  • One dataset type per record type; all behave similarly

  • May bring back idea of collection as some kind of heterogeneous dataset, but not present as a concept right now

  • entry - typically the molecule input : the rows in a dataset

  • specification - method/basis/keywords : the columns in a dataset

  • in a dataset, the combination of an entry (row) and specification (column) refers to a record

  • primary object you work with is records; you don’t directly work with tasks and services (though they are accessible as properties on records)

  • more usage of properties in records with automatic fetching (e.g. wavefunctions)

  • getting and querying

    • get_records(ids) returns list of records in same order as ids

    • query_records(...) returns an iterator, no guaranteed order of results

  • submit with add_*

  • only metadata, specifications and names of entries downloaded at first

    • full records only downloaded as requested

  • BP : have an old instance of QCFractal server running that serves requests from the old FractalClient; have an equivalent new QCFractal server running that serves requests from the new PortalClient
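
The row/column picture above (entries as rows, specifications as columns, one record per cell) and the difference between ordered get and unordered query can be sketched with a small in-memory toy. This is not the QCPortal API: `ToyDataset`, `ToyRecord`, and all of their methods and fields are hypothetical stand-ins written only to illustrate the concepts from the slides, including a "sparse" dataset where only some cells have records.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class ToyRecord:
    """Hypothetical stand-in for a record; ids are plain integers."""
    id: int
    entry: str
    specification: str
    status: str = "waiting"


@dataclass
class ToyDataset:
    """Hypothetical model: entries are rows, specifications are columns."""
    entries: List[str] = field(default_factory=list)
    specifications: List[str] = field(default_factory=list)
    records: Dict[Tuple[str, str], ToyRecord] = field(default_factory=dict)
    _ids: Iterator[int] = field(default_factory=lambda: count(1), repr=False)

    def submit(self, entries=None, specifications=None) -> None:
        # Records only come into existence on submit; restricting the
        # entries or specifications leaves the other cells empty (sparse).
        for e in entries or self.entries:
            for s in specifications or self.specifications:
                if (e, s) not in self.records:
                    self.records[(e, s)] = ToyRecord(next(self._ids), e, s)

    def get_records(self, ids: List[int]) -> List[Optional[ToyRecord]]:
        # "get" semantics: a list in the same order as the requested ids.
        by_id = {r.id: r for r in self.records.values()}
        return [by_id.get(i) for i in ids]

    def query_records(self, status: Optional[str] = None) -> Iterator[ToyRecord]:
        # "query" semantics: a lazy iterator; callers should not rely on order.
        for r in self.records.values():
            if status is None or r.status == status:
                yield r


ds = ToyDataset(entries=["ethane", "propane"], specifications=["spec-a", "spec-b"])
ds.submit(specifications=["spec-a"])  # only one column is submitted
```

After this runs, the toy dataset holds two records (one per entry for `spec-a`) and nothing for `spec-b`; `ds.get_records([2, 1])` returns the records in the requested order, while `ds.query_records(status="waiting")` yields them lazily.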

PortalClient Training : interactive

  • mamba create -n qcportal_tutorial -c qcarchive/label/next qcportal nglview jupyter

  • the full history of executions for a given record is preserved; can introspect repeat failures

  • error cycling server-side is configured server-wide; not yet dataset or record-specific

  • submitting individual calculations (e.g. add_singlepoints) returns InsertMetadata and record ids for the records submitted

    • InsertMetadata will tell you if the record existed already or a new record was created

  • query_* methods return an iterator over records

    • will return actual record objects as you iterate through; handle with care

  • datasets

    • ds.specifications gives dict-like access to the dataset column values, which are the specifications themselves

    • ds.get_entry can be used to retrieve the entries by name; ds.entry_names will give all entry names

    • for getting and iterating over records, the main method to use is ds.iterate_records, which will yield an iterator with optional filtering based on e.g. entry, specification, status, etc.

    • there is a convenience method for creating pandas dataframes for properties of interest with ds.compile_values

    • datasets do feature an internal cache, reducing the need to request the same entities multiple times

  • dataset submission

    • client.add_dataset method can be used to create a new dataset of any type

    • ds.add_entry immediately adds the entry to the server

      • can be batched with ds.add_entries

      • no more ds.save

    • ds.add_specification immediately adds the specifications to the server

    • ds.submit is used to actually submit calculations; with no args it will create records for each entry/specification combo in the dataset

      • you can choose which entries/specifications to submit calculations for if you’d prefer to target only specific combinations in a dataset

    • can use client.query_records(child_id=<id>) to go up from e.g. an optimization record into its torsiondrive(s)
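
The deduplicating submission behaviour described above (submitting returns metadata plus record ids, and the metadata says whether each record already existed or was newly created) can be illustrated with a toy server. This is a minimal sketch, not QCPortal itself: `ToyServer`, `ToyInsertMetadata`, and the field names `inserted_idx` / `existing_idx` are assumptions made for illustration, and the real `InsertMetadata` object may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyInsertMetadata:
    """Hypothetical metadata: which input positions were new vs. already known."""
    inserted_idx: List[int] = field(default_factory=list)
    existing_idx: List[int] = field(default_factory=list)


class ToyServer:
    """Hypothetical server that deduplicates on (molecule, specification)."""

    def __init__(self) -> None:
        self._ids_by_input: Dict[Tuple[str, str], int] = {}

    def add_singlepoints(
        self, molecules: List[str], specification: str
    ) -> Tuple[ToyInsertMetadata, List[int]]:
        meta = ToyInsertMetadata()
        ids: List[int] = []
        for idx, mol in enumerate(molecules):
            key = (mol, specification)
            if key in self._ids_by_input:
                # Matching calculation already exists: reuse its record id.
                meta.existing_idx.append(idx)
            else:
                # New calculation: assign the next integer record id.
                self._ids_by_input[key] = len(self._ids_by_input) + 1
                meta.inserted_idx.append(idx)
            ids.append(self._ids_by_input[key])
        return meta, ids


server = ToyServer()
first_meta, first_ids = server.add_singlepoints(["water", "methane"], "spec-a")
second_meta, second_ids = server.add_singlepoints(["methane", "ammonia"], "spec-a")
```

In the second call, `methane` maps back to the record created by the first call, so only `ammonia` produces a new record. The returned ids stay aligned with the order of the submitted molecules in both cases.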

Action items

Decisions