2021-11-24 Industry benchmarks meeting notes

Participants

@Lorenzo D'Amore
@Jeffrey Wagner
@David Dotson

Goals

Benchmark manuscript
RMSD calculation with RDKit
Feedback from partner retrospective survey

Discussion topics

Item	Presenter	Notes

Benchmark manuscript

Lorenzo

LD: working on manuscript with Gary on Overleaf
- starting with structure from Lim paper, updating with the study content from the benchmark
- started with some descriptive stats of dataset
- realized we need a coverage report for the public dataset
  - just want to illustrate how we go from the initial number of molecules to the number that make it all the way through the pipeline
  - the coverage report is the first filter applied to input molecules (after validation)
  - will create an SI table listing how many molecules made it through each of the optimizations that follow
- JW: one representation of this process is one I’ve seen for clinical trials
  - [shows flow diagram giving schematic]
LD: for the public dataset, we have a good record of how many molecules remain at each step after QC, but we don’t know how many molecules there were before the QM step
- DD: I’ve recorded how many molecules there were in the submission README:
- LD: Perfect. This is just what I needed.
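A minimal sketch of the per-stage bookkeeping discussed above, assuming each pipeline stage can be reduced to a set of molecule IDs. The function name `coverage_report`, the stage names, and the `MOL-…` IDs are hypothetical placeholders for illustration, not the real dataset or the openff-benchmark API.

```python
def coverage_report(stages):
    """Return (stage name, molecules remaining, molecules lost since previous stage).

    `stages` is an ordered list of (name, iterable of molecule IDs) pairs.
    """
    rows = []
    previous = None
    for name, ids in stages:
        current = set(ids)
        # Molecules present in the previous stage but missing from this one.
        lost = 0 if previous is None else len(previous - current)
        rows.append((name, len(current), lost))
        previous = current
    return rows


# Hypothetical IDs and stage names, for illustration only.
stages = [
    ("input", {"MOL-00001", "MOL-00002", "MOL-00003", "MOL-00004"}),
    ("validation", {"MOL-00001", "MOL-00002", "MOL-00003"}),
    ("coverage report", {"MOL-00001", "MOL-00003"}),
]

for name, remaining, lost in coverage_report(stages):
    print(f"{name:<16} remaining={remaining} lost={lost}")
```

The rows could feed either an SI table or a clinical-trial-style flow diagram, since both only need the count remaining and the count dropped at each step.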
LD – In the Lim paper, they used OPLS3 to get bespoke parameters. But here we used a command line tool. So I’m going to ask DHahn how to reconcile this, though I suspect that the process is identical.
- JW – That’s a good detail to note. But if the different methods (OPLS3 parameters from the Maestro GUI vs. the CLI) produce different results, that’s not “our” problem. We just want to make sure that we record what was done for THIS study.
JW – The draft’s statement that we used OEAM1BCCELF10 is incorrect. The MM workers for our dataset didn’t have access to OpenEye (OE), so they would have used single-conformer AM1BCC.
LD – DD, at our next working session, let’s handle the issue where the histograms have the legend items in different orders/with different colors.
LD – This past week, Mobley showed us results from an analysis of OpenFF parameters indicating that some torsions have wrong multiplicities/periodicities.
- t166 in Sage, t157 in Parsley
- t74 in Sage, t69 in Parsley

RMSD calculation with RDKit

LD – In the previous meeting, we noticed a problem where our analysis was coded to compare molecules with the same identifier, but didn’t consider the possibility of having different company codes. DHahn opened openff-benchmark Issue #101 and PR #102 to document and fix this. When I run this locally it works better, but some RDKit comparisons still raise unhandled errors.
- JW – have we found a molecule we can reproduce the RMSD issue with?
- LD – yes, we have one from the logs, e.g. GNT-00927-00
- JW – nitrogen 20 could be weird: it looks a bit like a peptide bond and is connected to an aromatic system, so there could be chemical perception issues there
  - also nitrogen 1, which is pyramidal; if it flips, RDKit might consider it a different molecule (actually, it’s OpenEye that has issues with pyramidal nitrogens)
- JW – …
(general) The problem with GNT-00927 is that the sulfonyl(?)/sulfonamide group is represented as S(=O)(=O) in the MM molecule, but as [S+2]([O-])([O-]) in the QM-exported molecule. The QCArchive record for this molecule has the [S+2] form in its CMILES, so the problem happened before submission.
Solution:

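# Import the submodule explicitly so it is always loaded
from rdkit.Chem.MolStandardize import rdMolStandardize
# Normalize functional groups (e.g. charge-separated sulfonyl to neutral)
rdmol = rdMolStandardize.Normalize(rdmol)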

LD – This is interesting. The comparison failed because the OPLS results normalized the form of the sulfonyls (so the S+2 and O- atoms were always converted to neutral), while the other pathways strictly preserved the input form.
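To make the mismatch concrete, here is a small sketch that compares the two sulfonyl representations before and after normalization. The methanesulfonamide SMILES are stand-ins chosen for illustration (not GNT-00927 itself), and `normalized_smiles` is a hypothetical helper; only `Chem.MolFromSmiles`, `Chem.MolToSmiles`, `Chem.CanonSmiles`, and `rdMolStandardize.Normalize` are RDKit calls.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def normalized_smiles(smiles: str) -> str:
    """Parse a SMILES, apply RDKit's default normalization, return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Normalize applies RDKit's default functional-group transforms,
    # which include rewriting [S+2]([O-])([O-]) as S(=O)(=O).
    mol = rdMolStandardize.Normalize(mol)
    return Chem.MolToSmiles(mol)


# Stand-in molecule (methanesulfonamide) in the two forms seen in the notes.
neutral = "CS(=O)(=O)N"
charge_separated = "C[S+2]([O-])([O-])N"

# Without normalization, the canonical SMILES of the two forms differ.
raw_match = Chem.CanonSmiles(neutral) == Chem.CanonSmiles(charge_separated)

# After normalization, both forms should map to the same canonical SMILES.
normalized_match = normalized_smiles(neutral) == normalized_smiles(charge_separated)

print("match without normalization:", raw_match)
print("match after normalization:", normalized_match)
```

Normalizing both molecules before the comparison, rather than only the one in the charge-separated form, keeps the check symmetric regardless of which pathway produced which representation.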

Survey feedback

(DD shared the results link; it will not be posted here since these meeting notes are public)
