Benchmark manuscript | Lorenzo | LD: Working on the manuscript with Gary on Overleaf, starting with the structure from the Lim paper and updating it with the study content from the benchmark. Started with some descriptive stats of the dataset, then realized we need a coverage report for the public dataset. We just want to illustrate how we go from the initial number of molecules to the number of molecules that make it all the way through the pipeline. The coverage report is the first filter of input molecules (after validation). Will create an SI table that includes how many molecules made it through each optimization that follows.
JW: One representation of this process is the one I’ve seen used for clinical trials.
LD: On the public dataset, we have a good record of how many molecules we have at each step after QC. But in the public dataset we don’t know how many molecules there were before the QM
LD – In the Lim paper, they used OPLS3 to get bespoke parameters. But here we used a command line tool. So I’m going to ask DHahn how to reconcile this, though I suspect that the process is identical.
JW – The place where the draft said that we used OEAM1BCCELF10 is incorrect. The MM workers for our dataset didn’t have access to OE, so they would have used single-conformer AM1BCC.
LD – DD, at our next working session, let’s handle the issue where the histograms have the legend items in different orders/with different colors.
LD – Mobley showed us this past week results of an OpenFF params analysis that showed some torsions contained wrong multiplicities/periodicities.
|
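A minimal sketch of how the per-stage counts for the SI table could be tallied, assuming each pipeline stage can be reduced to the set of molecule identifiers that survived it. The molecule IDs (`mol-001` etc.) and the `coverage_table` helper are hypothetical and only for illustration; the real counts would come from the benchmark outputs.

```python
def coverage_table(stages):
    """Return (stage name, molecules remaining, molecules lost) per stage.

    `stages` maps stage names, in pipeline order, to the set of molecule
    identifiers that survived that stage.
    """
    rows = []
    previous = None
    for name, ids in stages.items():
        # Molecules lost are those present after the previous stage but not this one.
        lost = 0 if previous is None else len(previous - ids)
        rows.append((name, len(ids), lost))
        previous = ids
    return rows


# Hypothetical toy data, not taken from the benchmark.
stages = {
    "input": {"mol-001", "mol-002", "mol-003", "mol-004", "mol-005"},
    "validation": {"mol-001", "mol-002", "mol-003", "mol-004"},
    "coverage report": {"mol-001", "mol-002", "mol-003"},
    "QM optimization": {"mol-001", "mol-002", "mol-003"},
    "MM optimization": {"mol-001", "mol-002"},
}

rows = coverage_table(stages)
for name, remaining, lost in rows:
    print(f"{name:<16} remaining={remaining} lost={lost}")
```

Keeping sets of identifiers rather than bare counts makes it possible to report which molecules dropped out at each step, not only how many.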
RMSD calculation with RDKit |
| LD – In previous meeting, we noticed a problem where our analysis was coded to compare molecules with the same identifier, but didn’t consider the possibility of having different company codes. DHahn opened openff-benchmark Issue #101 and PR #102 to document and fix this. When I run this locally it works better, but some RDKit comparisons still raise unhandled errors.
* JW – Have we found a molecule we can reproduce the RMSD issue with?
* LD – Yes, we have from the logs, e.g. GNT-00927-00.
* JW – Nitrogen 20 could be weird: it looks a bit like a peptide bond, connected to an aromatic system. There could be chemical perception issues there (general).
* The problem with GNT-00927 is that the sulfonyl(?)/sulfonamide group is represented as S(=O)(=O) in the MM molecule, but as [S+2]([O-])([O-]) in the QM-exported molecule. The QCArchive record for this molecule has the [S+2] form in its CMILES, so the problem happened before submission. Solution:
# Normalize functional groups (e.g. charge-separated sulfone -> S(=O)(=O)) before comparing
from rdkit.Chem.MolStandardize import rdMolStandardize
rdmol = rdMolStandardize.Normalize(rdmol) |
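A small sketch of the normalization fix applied to both representations, to check that they collapse to the same molecule. The SMILES below describe a toy sulfonamide, not GNT-00927, and the `normalized_smiles` helper is hypothetical; this assumes an RDKit version that ships `rdMolStandardize`.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def normalized_smiles(smiles: str) -> str:
    """Parse a SMILES string, normalize functional groups, return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    # Normalize applies RDKit's default transforms, which include rewriting
    # the charge-separated [S+2]([O-])([O-]) form as S(=O)(=O).
    mol = rdMolStandardize.Normalize(mol)
    return Chem.MolToSmiles(mol)


# Toy sulfonamide in the two representations described above (illustrative only).
mm_smiles = "CS(=O)(=O)Nc1ccccc1"          # form seen in the MM molecule
qm_smiles = "C[S+2]([O-])([O-])Nc1ccccc1"  # form seen in the QM-exported molecule

# Without normalization, the canonical SMILES differ.
raw_mm = Chem.MolToSmiles(Chem.MolFromSmiles(mm_smiles))
raw_qm = Chem.MolToSmiles(Chem.MolFromSmiles(qm_smiles))
print(raw_mm == raw_qm)

# After normalization, both forms should give the same canonical SMILES.
print(normalized_smiles(mm_smiles) == normalized_smiles(qm_smiles))
```

Normalizing both the MM and the QM molecule, rather than only the QM one, keeps the comparison symmetric if other charge-separated groups turn up later.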