2020-12-02 Benchmarking for Industry Partners - Development Meeting notes

Date

Dec 2, 2020

Participants

@David Dotson
@David Hahn
@Jeffrey Wagner
@Joshua Horton

Goals

Updates from project team members
Identify and address development issues encountered
Aim for merge of existing PRs; demonstrate and coordinate connective tissue between workflow components
Identify and address project risks

Discussion topics

Item	Notes

Updates from team	JW: Conformers prototyped; can give a demo using the dataset from Thomas Fox as a test case. It does not go in smoothly: of 600 molecules, 8 have stereochemistry issues; RDKit is a bit sloppy with stereochemistry.
	DD: Gave an update on one-shot optimization usage for some, perhaps many, partners.
	JW: Should state the units for energies in the SDF file.
	JH: Deployment procedure document; gathering information from PRs as they develop, trying to use each component to give feedback.
	DH: Pushed analysis components; would like consolidation on read/write functions, seasons.
	For each method, a separate SDF with exactly the same molecules; have to be able to relate molecules of the same ID across results.
	Use `pandas` DataFrames primarily.
	For the CLI, separate analysis from plots/report generation; the analysis artifact produced must be shareable.
	Could still output SDFs with metrics relative to QM included?
	JW: May not want to do this, since it depends on another SDF somewhere in the results.
	DD: Let’s give it a play and see if there is a way to make this less confusing in the output.
	Still need unit conversion; dependent on units being present in the input SDFs to this component.
	DD: I’ll review today, push to merge.
Dataset object?	Do we want to consolidate on a Dataset object (a bundle of OpenFF mols pulled from SDFs, which also exports SDFs)? Could do slicing based on ID components; would make analysis easier at the end; may add value at other components.
	For now, we’ll proceed with merging each PR, then search for places where we can consolidate read/write and handling of OFF mols.
How to handle errored cases?	JW: 2 kinds: partway through, can throw a message failure with no output.
	Could add to e.g. the validate command: `@click.option('--error-directory', default='1-errors')` and `@click.option('--error-out', default='1-errors.out')`
	DH: What do we do with “undefined stereochemistry”?
	JW: These are opinionated parts of the toolkit, which may change/improve, but those improvements are tied to the release cycle of the toolkit, so a bit slower; will just have to use this experience to spin out issues for improvement.
	[decision] Since warnings are loud, we’ll squelch warnings; errors still get raised.
	Can make clear that in the validation step, some percentage are expected to be excluded (>1%).
	The minimization step may also have some percentage of failures (>5%).
	Could also say that overall, up to 10% of your dataset may not make it end-to-end.
	[decision] We’ll make a Slack channel for support; allows us to operate with low friction and loop in folks as needed for help understanding weird cases.
	Need to make clear that the channel is public, and that error messages should be posted with care.
Structures from the PDB	Public submission of 6000 conformers.
	DH: Will share on GDrive as a tarball for consumption by Jeff, DD.
Basis set choice	DD: What are the goalposts for choice of basis?
	JH: Fast and accurate.
	DH: Lim paper uses the `default` (DZVP) basis.
	[decision] DZVP will be our basis for this season.
	JH: Using DZVP will produce good results for OpenFF since it’s fit to this; GAFF wasn’t, so technically not as “fair”; we do want to start from this point for evaluating OpenFF though, since that’s the goal.
	DD: Could make it fairly easy to inject compute specs for the curious.
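
As an illustration of the analysis points above (relating molecules of the same ID across per-method results, and requiring units before conversion), here is a minimal sketch. It is not the project's implementation: the function names, the ID strings, the method labels, and the nested-dict layout are all hypothetical, and the energies are assumed to already be relative conformer energies. The notes favor `pandas` DataFrames; plain dictionaries are used here only to keep the sketch dependency-free.

```python
# Hypothetical sketch: names, IDs, and data layout are illustrative assumptions.
HARTREE_TO_KCAL_PER_MOL = 627.5094740631  # standard conversion factor

UNIT_FACTORS = {"hartree": HARTREE_TO_KCAL_PER_MOL, "kcal/mol": 1.0}


def to_kcal_per_mol(value, unit):
    """Convert an energy to kcal/mol; fail loudly if the unit is missing or unknown."""
    if unit not in UNIT_FACTORS:
        raise ValueError(f"energy unit {unit!r} is missing or unsupported")
    return value * UNIT_FACTORS[unit]


def relative_to_reference(results, reference="qm"):
    """Compare each method to the reference, matching molecules by ID.

    results maps method -> {molecule ID -> (energy, unit)}.
    Returns method -> {molecule ID -> energy minus reference energy, in kcal/mol},
    keeping only IDs that are present in both the method and the reference.
    """
    ref = {mol_id: to_kcal_per_mol(*e) for mol_id, e in results[reference].items()}
    compared = {}
    for method, energies in results.items():
        if method == reference:
            continue
        compared[method] = {
            mol_id: to_kcal_per_mol(*e) - ref[mol_id]
            for mol_id, e in energies.items()
            if mol_id in ref  # IDs without a reference entry cannot be compared
        }
    return compared
```

Keeping the comparison in a separate table keyed by ID, rather than writing it back into each method's SDF, avoids the concern JW raised about an output file that silently depends on another SDF elsewhere in the results.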

Action items

@David Dotson will add units to energies in SDF files exported from compute

@David Hahn will experiment with minimally-ambiguous ways to output SDFs with metrics relative to QM results

@Jeffrey Wagner will add error handling destination options to the validate command and to subsequent pre-compute workflow commands

@Jeffrey Wagner will squelch loud warnings from the toolkit and RDKit, while still ensuring errors are raised and reported in the error log

@David Dotson will create a Slack channel for support, e.g. benchmarks-support, with a clear indication that all information shared is public and that error messages should be posted with care

@David Hahn will share PDB submission of 6000 conformers via GDrive as tarball for consumption by Jeff, David, Josh

@David Dotson will set DZVP as the QM basis for Season 1

@David Dotson will ensure it is possible to compute other QM specs relatively easily from CLI

@David Dotson will review David Hahn’s analysis PR; clear for merge at David H’s discretion
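
The "squelch warnings, still raise errors" decision could look roughly like the sketch below. This is an assumption-laden illustration, not the agreed implementation: `run_quietly` and its arguments are hypothetical names, and it only silences warnings emitted through Python's standard `warnings` module. Libraries that report through their own logging channels would need separate handling.

```python
import logging
import warnings


def run_quietly(step, item, error_log):
    """Run step(item) with Python warnings suppressed.

    Exceptions are not swallowed: they are written to error_log
    (with traceback) and then re-raised to the caller.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")  # silence warnings inside this block only
        try:
            return step(item)
        except Exception:
            error_log.exception("failed on %r", item)
            raise
```

A caller could pass a `logging.Logger` with a file handler pointed at an error file such as the `1-errors.out` default mentioned in the discussion, so that failures are collected in one place while the console stays quiet.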
