2020-12-09 Benchmarking for Industry Partners - Development Meeting notes

Date

Dec 9, 2020

Participants

  • @David Dotson

  • @Joshua Horton

  • @David Hahn

  • @Jeffrey Wagner

Goals

  • Updates from project team members

  • Identify and address development issues encountered

  • Identify and address project risks

Discussion topics

Item

Presenter

Notes

Updates

  • JH: Tried using CLI; broken at the moment

    • Tried to build the conda env; it doesn’t work on macOS

    • Should make an analysis environment that works on macOS

    • Tests of CLI: can directly test CLI components as library functions

      • JH: will put together test functions for the CLI

  • JW: merged conformer generation on Monday

    • The code we use in the toolkit to read SDFs doesn’t split on 3D vs. 2D inputs

    • Now doing round-tripping to ensure we catch errors in validation

    • This increases the observed error rate to about 5% on test datasets; these are errors that would otherwise have happened further down the workflow

    • This week: if user provides no geometry, just graph, job is straightforward for conformer generation

      • if they do provide geometry, then we have to do post-trimming, guaranteeing keeping the ones they provided

      • doing a greedy search: generate conformers (up to 10); start with the first user conformer, compare it to the others, and drop those within 2 Å; repeat for the other user conformers

      • DD: may want to start with e.g. 30 generated conformers, since if only 10 are generated, culling against the user conformers may leave far fewer than 10 total

      • DH: one approach would be to generate conformers with a cutoff criterion of 0 Å (to get a guaranteed number of conformers), then filter them as desired afterwards

        • downside, may take longer all told depending on the number of conformers desired

      • JW: if there is an energy-ordering of conformers from e.g. RDKit, then we have to be careful that we don’t systematically drop certain conformers

      • DH: If it’s not possible to meet the 10 conformer threshold, that’s okay, but nice to hit 10, even if they are similar

        • JW: would a crude minimum be two conformers?

        • If we have a molecule that has no flexibility, should it be removed?

          • because part of our analysis is geometry comparison, we would want to know about cases that give clearly different geometries for QM vs. MM

      • JW: Will pursue an iterative-ratcheting approach for the RMS cutoff filter, starting at 5 Å

      • What about cases where a user puts in two conformers that are very similar? Should we throw any out?

        • DH: there could be cases where a hydroxy group is deliberately placed in two different orientations, but this would show up as only a low RMSD difference

    • (General) – Should we do any MM minimization of QM geometries? What if conformers “wander off” from QM minimum during MM optimization?

      • DD – Could either use QM minimum only, or record two separate energies for each QM geometry (MM energy at QM minimum geometry, and MM energy after brief minimization)

      • JW – Maybe we could make it constrained MM minimization, so that MM minimization can clean up things like simple bond stretches, and improve the signal/noise ratio for “useful” conformer energetics

      • DD + JH – Could it actually be useful to have some geometries minimize to the same geometry? Or end up with closely-spaced output geometries?

      • (General) – Different strategies here would shift the focus of this benchmarking between “getting energies right” and “getting geometries right”. No study to date does a good job of measuring both, and there are some major inherent differences in workflows for each.

      • Decision – There are a bunch of ways we could go on this, but let’s stick closely to the Lim paper for initial work.

  • DH: playing with the drug-like molecules

    • integrating analysis into CLI

      • now the CLI commands work: openff-benchmark report compare-force-fields

      • QM reference comparison to MM

      • RMSD, torsion fingerprints

      • Plotting also produced: 2D scatter plots, correlation plots

    • DD: will add units to all quantities exported

    • DH: specify ref_method

    • DD: ref_method can refer to spec keys; I’ll need to add into the SDFs

    • DH: report given as a set of CSVs; these are the inputs for any visualization that follows

  • DD:

    • Trying to test out components and standardize how the CLI looks and how tests are run.

    • Working on ingesting Fox and Swope sets

    • Some issues with modifications to initial steps – Will have working session with JW.

  • DD – Preparing to submit public industry datasets to QCA. Will need CLI components to prepare these molecules.

    • JH – Sounds good

    • DH – Hopefully a dataset coming from Janssen for this soon.
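
The greedy conformer filter and the ratcheting RMS cutoff discussed under JW's update above can be sketched roughly as follows. This is a minimal illustration, not the openff-benchmark implementation. The function names (`rmsd`, `greedy_filter`, `ratcheted_filter`), the 0.5 Å step, and the 0 Å floor are all assumptions made for the example. The notes only say "iterative-ratcheting, starting at 5 Å", so stepping the cutoff downward until the target count is reached is one possible reading. The RMSD here is a plain coordinate RMSD with no alignment or symmetry handling; a real workflow would likely use an aligned, symmetry-aware RMSD from a cheminformatics toolkit. This variant also compares generated conformers against each other, not only against the user conformers.

```python
import math


def rmsd(a, b):
    """Plain coordinate RMSD between two conformers given as lists of (x, y, z).

    Assumption: same atom order, no alignment, no symmetry correction.
    """
    sq = sum((p - q) ** 2 for atom_a, atom_b in zip(a, b) for p, q in zip(atom_a, atom_b))
    return math.sqrt(sq / len(a))


def greedy_filter(user_confs, generated_confs, cutoff, max_confs=10):
    """Keep every user conformer, then greedily add generated conformers.

    A generated conformer is kept only if it is at least `cutoff` away
    from every conformer kept so far.
    """
    kept = list(user_confs)  # user-provided geometries are never dropped
    for conf in generated_confs:
        if len(kept) >= max_confs:
            break
        if all(rmsd(conf, k) >= cutoff for k in kept):
            kept.append(conf)
    return kept


def ratcheted_filter(user_confs, generated_confs, start=5.0, step=0.5,
                     floor=0.0, max_confs=10):
    """Lower the cutoff stepwise from `start` until `max_confs` conformers
    are kept or the cutoff reaches `floor` (hypothetical parameters)."""
    cutoff = start
    while True:
        kept = greedy_filter(user_confs, generated_confs, cutoff, max_confs)
        if len(kept) >= max_confs or cutoff <= floor:
            return kept, cutoff
        cutoff = max(floor, cutoff - step)
```

In line with DD's suggestion, `generated_confs` would be a pool larger than the target (e.g. 30 candidates for a target of 10), so that culling against the user conformers still leaves enough to choose from. Because the pool is scanned in order, an energy-ordered pool would bias which conformers survive, which is the concern JW raised.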


Action items

Decisions

  • There are a bunch of ways we could go on this, but let’s stick closely to the Lim paper for initial work.