2021-06-03 Dalke/Wagner check in

Participants

  • @Andrew Dalke

  • @Jeffrey Wagner

Discussion topics

Item

Notes

Refine goals for work

  • How to get a new molecule test set

    • JW – Should be a set of molecule inputs that hit every edge case handled in our code and raise every possible warning and error. It should also have 20-3 “totally normal” molecules that are processed successfully.

    • AD – Should it be a minimal set, or a minimal distinct set? If one mol can trigger 4 errors, should I include just that one, or 4 mols (one for each error)?

    • JW – I think one for each error, so that we can ensure that each one hits the appropriate code

  • AD – When reading molecules using RDKit and then calling from_rdkit, I get an RDKit error (specifically, this mol has a germanide, which raises a valence error).

    • JW – I’d love to be able to say “if the following things are true about an input mol, then it’s appropriate as input for OpenFF” in the docs.

    • AD – It’s hard to catch RDKit errors/warnings, since they’re frequently spurious. If I’m catching errors/warnings to check for real problems, how do we handle all the cases (like “title line greater than 80 characters”)?

    • AD – In many of these cases, RDKit is giving a warning that it may be mangling a molecule, in situations where OE is loading it just fine.

  • JW – We could put a lot of trust in the cheminf toolkit sanitization/validation, but we’d still want to guarantee 99%+ identical molecules being loaded from reasonable druglike input.

  • AD – I’d like to be able to present a list of mismatches, but the way stereo and implicit hydrogens are handled makes that extremely complex.

    • JW – Maybe we should reduce the problem space of this work – let’s exclude metals for now.

  • AD – I’ve been looking into stereochemistry (issue 146) and am looking for examples that reproduce the different cases described there.

  • AD – Working on automating creation of a test set that exercises all code paths in the current toolkit, and on producing a minimal set of molecules that exercises those paths.

  • AD – Stereo coverage in test set?

    • Long chains of directional double bonds?

      • JW – Yes, but I don’t think this requires a special test, since I trust the cheminf toolkits to handle it well.

    • Should we have separate test cases where we have stereo in rings vs not in rings?

      • JW – Yes. Stereo in rings can be very confusing.

    • Do we want to record undefined stereocenters?

      • JW – Yes, eventually, but I think this might be a big lift, so let’s not make it an initial goal. It’s fine if we include some cases of this in the test set for future use, though.

      • AD – Stereo from 3D vs connection table? You had said that we should use connection table over 3D, but OpenEye uses 3D by default.

      • JW – I may have been wrong about our existing implementation of this. My priority is to ensure that both toolkit wrappers get the same molecule out of the same file, so if an SDF has 3D stereo that contradicts the connection table stereo, the same molecule should still come out of both.

        • AD – Cases like this should cause OE + RDKit to raise a warning. They’ll print a warning like “correcting stereo atom X”. My framework is catching these sorts of warnings.

    • Cases where an atom COULD be chiral, but the molecule is actually symmetrical, so it's not.

  • AD – Having an “OpenFF stereo model”?
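
The test-set selection discussed above (one molecule per error, preferring molecules that hit only that error) could be sketched roughly as follows. This is a minimal illustration, not the framework AD is building: the `triggers` mapping, the molecule IDs, and the label names are all hypothetical, and in practice the labels would come from whatever warnings and errors are captured while loading each molecule.

```python
from typing import Dict, List, Set


def pick_one_molecule_per_label(triggers: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each warning/error label to one representative molecule.

    Molecules that trigger fewer labels are visited first, so a molecule that
    hits only one error is preferred over one that hits several at once.
    """
    chosen: Dict[str, str] = {}
    # Sort by (number of labels, molecule id) so ties are broken deterministically.
    for mol_id in sorted(triggers, key=lambda m: (len(triggers[m]), m)):
        for label in triggers[mol_id]:
            # Keep the first (cleanest) molecule seen for each label.
            chosen.setdefault(label, mol_id)
    return chosen


def normal_molecules(triggers: Dict[str, Set[str]]) -> List[str]:
    """Return the molecules that triggered no warnings or errors at all."""
    return sorted(m for m, labels in triggers.items() if not labels)


# Hypothetical data: molecule id -> labels observed while loading it.
triggers = {
    "mol_a": {"valence_error", "undefined_stereo", "long_title"},
    "mol_b": {"valence_error"},
    "mol_c": {"undefined_stereo", "long_title"},
    "mol_d": {"long_title"},
    "mol_e": set(),  # a "totally normal" molecule
}

selection = pick_one_molecule_per_label(triggers)
normal = normal_molecules(triggers)
```

With the hypothetical data above, `selection` maps `valence_error` to `mol_b`, `long_title` to `mol_d`, and `undefined_stereo` to `mol_c`. `mol_a`, which triggers all three, is never chosen because a cleaner molecule exists for each label, and `normal` contains only `mol_e`.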



Overall plan

  • Make a test that fully exercises current OFF toolkit behavior so that we can detect whether changes are backward-compatible.

  • Define what the behavior SHOULD be when loading mols from different sources

    • Finding current discrepancies between toolkits

    • Coming up with tricky molecules and feeding them in

    • Defining the desired behavior ahead of time, and comparing that to reality

  • Make “molecule set 2” that exercises the aspirational behaviors that we want

  • Implement “molecule set 2” in our tests, marking many of them as being “the behavior that we have, but not that we want”

  • Use tests to develop the “behavior that we want”
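
One possible way to record “the behavior that we have, but not that we want” is an expected-failure marker. The sketch below uses the standard library's `unittest.expectedFailure` (pytest's `xfail` marker plays a similar role). `load_current`, `RECORD`, and the test names are hypothetical stand-ins and are not part of the OpenFF toolkit.

```python
import io
import unittest


def load_current(record: dict) -> dict:
    """Hypothetical stand-in for current loading behavior.

    It keeps the name but silently drops undefined stereocenters.
    """
    return {"name": record["name"], "undefined_stereocenters": []}


# Hypothetical input record with one undefined stereocenter at atom index 3.
RECORD = {"name": "example", "undefined_stereocenters": [3]}


class MoleculeSet2Tests(unittest.TestCase):
    def test_name_is_preserved(self):
        # Behavior that we have AND want: should pass today.
        self.assertEqual(load_current(RECORD)["name"], "example")

    @unittest.expectedFailure
    def test_undefined_stereocenters_are_recorded(self):
        # Behavior that we want but do not have yet: fails today, as expected.
        self.assertEqual(load_current(RECORD)["undefined_stereocenters"], [3])


# Run the suite quietly and keep the result object for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MoleculeSet2Tests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Here the run still counts as successful, with one test reported as an expected failure. Once the desired behavior is implemented, the marked test shows up as an unexpected success, which makes the run fail and is the cue to remove the marker.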

Review decision tree for loading mols

  • 2021-05-31 Dalke/Wagner Check in

  • It would be easiest to just let OE and RDK “make the decisions”, but how can we do that given all of the customization we already implement? Are there principles we can follow to make this happen?

  • Can we document what we expect from Molecule.from_file/from_rdkit/from_openeye? That way users could do all sorts of workflows and know how to prepare mols for their return.

 

  • AD – Where to park the MiniDrugBank → ChEMBL study?

    • JW – Probably best to put it on the MiniDrugBank repo:

  • AD – What are rules around licensing for tools that we are able to use?

    • JW – Anything that goes into our repo (including test mols) must be compatible with our MIT license

    • AD – CACTVS database? Wikipedia database?

      • JW – These seem to be OK to use. I’d trust your judgement on this without going fully into the text of the licenses.

Next to-dos

  • AD – I’ll start putting together a maximal-coverage test set using MiniDrugBank

    • JW – I think that’s a good idea for catching edge behaviors, but I suspect that MiniDrugBank has systematic issues that prevent it from having some examples of real chemistry, so we’ll want to supplement this set with examples of “good” inputs.

    • AD – I’ll also plan for this set to have “normal” entries from other databases.

  • AD – Next meeting on Monday

Action items

Decisions