DD – We will avoid new submissions during the upgrade window.
Ben Pritchard will send an @channel message on the QCFractal channel on the OpenFF Slack when the QCF upgrade begins and ends.
Queue/Manager status
DD – Queue is empty; I’m preparing a protein fragment torsiondrive dataset.
DD – I’m also preparing a submission for JM.
DD – TG is also preparing some datasets.
DD – JH’s protein fragment dataset had an off-by-one index problem, so we’ll want to fix and resubmit it.
TG – Let’s make sure to mark these as defunct.
DD – We will indicate this by incrementing the version number on the dataset.
Joshua Horton will update the protein optimization dataset and resubmit.
User questions
JH – Given the off-by-one problem with the protein dataset, can we get rid of the affected datasets?
BP – In a future release, we’ll add a mechanism to mark these as defunct/invalid. This would make them impossible to pull down, and/or effectively hide them from queries.
JW – Could change dataset name?
BP – Not possible yet, but that would be a nice feature.
TG – In my dataset standards, I’m going to require a changelog in the metadata, which could store important context/information like this.
(General) – Could either continually increment the version number under a single dataset name, OR keep each version as a separate dataset and have one “meta” dataset for each series that points to the latest version (sketch below).
DD – Both of these seem possible; it’d be nice if QCA became opinionated about which way to do it.
BP – I’m open to considering support for dataset versions in QCF.
Decision: Our dataset naming scheme should not assume any changes from QCF
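A minimal sketch of the two conventions discussed above, in plain Python; the dataset names, version strings, and changelog layout are illustrative, not an existing QCSubmit/QCFractal API:

```python
# Option 1: encode the version directly in the dataset name.
def versioned_name(base_name: str, version: str) -> str:
    return f"{base_name} v{version}"

# Option 2: keep each version as its own dataset and maintain one small
# "meta" record per series pointing at the latest version, with a changelog
# in the metadata as TG proposes. All names here are hypothetical.
protein_fragment_series = {
    "series": "OpenFF Protein Fragments TorsionDrives",
    "latest": versioned_name("OpenFF Protein Fragments TorsionDrives", "1.1"),
    "changelog": {
        "1.0": "Initial submission.",
        "1.1": "Resubmission fixing the off-by-one index problem.",
    },
}
```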
JM – DD, which of my datasets are you preparing for submission?
DD – MM energies
DD – Based on feedback from JM and Dominic Rufa, we’ve gotten a lot of user feedback that we’ll be incorporating into the infrastructure.
TG – I’d like to have managers report which specific task ID they are pushing back to the server on completion/failure.
BP – I’m making a number of other changes in a current PR. Will see if I can squeeze this in.
DD/JH – From the client, you can delete a collection.
JH – Not sure if this is at the database level or the client view level?
BP – It’ll be the client view level.
JH – Is there an SQL command I can use to delete the dataset?
BP – I doubt it.
BP – Something to possibly discuss at Friday’s QC submission meeting – the optimization objects have a way to delete part of their output.
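For reference, a minimal sketch of client-side deletion as DD/JH describe it. This assumes qcportal’s FractalClient exposes a delete_collection method (the exact name/signature may differ between QCFractal versions); the server address, credentials, and dataset name are placeholders:

```python
import qcportal as ptl

# Placeholder address; authentication is omitted for brevity.
client = ptl.FractalClient("localhost:7777", verify=False)

# Assumed client-side call; per BP this acts at the client view level,
# not as a raw SQL delete against the database.
client.delete_collection("OptimizationDataset", "Protein Fragments (off-by-one)")
```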
JH – Do we still need to store BOTH WBOs and Mayer indices?
DD – JM recently asked for both, and other users are likely using it too.
TG – IIRC, the compute and storage requirements to store both are small.
Decision: We will continue storing both.
PB – Do all datasets have WBOs?
(General) – Everything submitted through QCSubmit does. Older sets (like those used in the Parsley/openff-1.0.0 fit) don’t, or are hit-or-miss.
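For context on the two properties, a hedged sketch of how a QCSubmit dataset factory can request both bond-order sets via its SCF properties; the import path and attribute names assume a recent openff-qcsubmit release and may differ between versions:

```python
from openff.qcsubmit.factories import OptimizationDatasetFactory

factory = OptimizationDatasetFactory()

# Request both Wiberg-Löwdin bond orders (WBOs) and Mayer indices alongside
# the usual dipole/quadrupole properties; storing both is cheap (per TG).
factory.scf_properties = [
    "dipole",
    "quadrupole",
    "wiberg_lowdin_indices",
    "mayer_indices",
]
```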
JH – Galileo hosts a QCA instance. How can we preserve the database when the hosting Docker container gets shut down?
JW – We could either pull the database file out before the container shuts down, or we could save the state of the container. IIRC, we can’t export the running image, but we may have a way to store artifacts.
DD – Do they have other protocols that require persistence? Or something that dumps data out to Amazon S3?
JH – If we know where the database file goes, we could specifically export the database.
BP – There’s a database dump option that will produce a file.
DD – Could make the final step in the workflow trigger this file output, but you’d need to be able to initialize a new container with that file.
TG – You could, at the end of the docker run, export the database, store it somewhere, and inject it at the beginning of the subsequent run.
(General) – We want a way to have a persistent database for testing bespoke workflow/qcsubmit on Galileo
Option 1: Could make a pre-populated image and host it on the OpenFF Docker Hub, but this will start to take up a lot of space.
Option 2: Could make Galileo images dump out the database file on shutdown (SIGKILL/SIGTERM), and find a way to inject it into containers when they start.
JW – Could include the database file in the folder that gets submitted.
DD – Would need a conditional at startup to search for a database dump and load it if present (sketch below).
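A rough sketch of Option 2, assuming the QCFractal server keeps its data in PostgreSQL: restore a dump at container startup if one was injected, and write a fresh dump before shutdown so it can be exported as an artifact. Paths and the database name are placeholders:

```python
import os
import subprocess

DUMP_PATH = "/artifacts/qcfractal.dump"  # placeholder artifact location
DB_NAME = "qcfractal_default"            # placeholder database name

def restore_if_present() -> None:
    """At container startup, load a previously injected dump if it exists."""
    if os.path.exists(DUMP_PATH):
        subprocess.run(
            ["pg_restore", "--clean", "--if-exists", "-d", DB_NAME, DUMP_PATH],
            check=True,
        )

def dump_for_export() -> None:
    """Before the container exits, dump the database to the artifact path."""
    subprocess.run(["pg_dump", "-Fc", "-f", DUMP_PATH, DB_NAME], check=True)
```

The dump could then be stored as a run artifact and dropped into the folder submitted for the next run, as JW and TG suggest.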
(General) – Could Galileo host compute for industry benchmarking project?
DD – Would pharma companies allow in-house molecules to go to an external compute provider like Galileo?
JW – Galileo was also asking about industry security requirements.
Next Galileo meeting in 9 days (next Thursday)
DD – I can talk to Roche and Janssen about this, since we’re already in contact.
(General) – Industry doesn’t know about Galileo yet.
David Dotson will contact David Hahn about whether industry would be willing to consider using Galileo for QCF calculations.
Jeffrey Wagner will warn Galileo that we’re looking at connecting industry partners to their compute infrastructure, and could have a meeting to establish data security requirements.