2021-06-03 Dalke/Wagner check in

Participants

  • @Andrew Dalke

  • @Jeffrey Wagner

Discussion topics

Item

Notes

Refine goals for work

  • How to get a new molecule test set

    • JW – Should be a set of molecule inputs that hit every edge case handled in our code and raise every possible warning and error. It should also have 20-3 “totally normal” molecules that are processed successfully.

    • AD – Should it be a minimal set, or a minimal distinct set? If one mol can trigger 4 errors, should I include just that one, or 4 mols (one for each error)?

    • JW – I think one for each error, so that we can ensure that each one hits the appropriate code

  • AD – When reading molecules with RDKit and then calling from_rdkit, I get an RDKit error (specifically, this mol contains a germanide, which raises a valence error).

    • JW – I’d love to be able to say “if the following things are true about an input mol, then it’s appropriate as input for OpenFF” in the docs.

    • AD – It’s hard to catch RDKit errors/warnings, since they’re frequently spurious. If I’m catching errors/warnings to check for real problems, how do we handle all the cases (like “title line greater than 80 characters”)?

    • AD – In many of these cases, RDKit gives a warning that it may be mangling a molecule, in situations where OE loads the molecule just fine.

  • JW – We could put a lot of trust in the cheminf toolkit sanitization/validation, but we’d still want to guarantee 99%+ identical molecules being loaded from reasonable druglike input.

  • AD – I’d like to be able to present a list of mismatches, but the way stereo and implicit hydrogens are handled makes that extremely difficult.

    • JW – Maybe we should reduce the problem space of this work. Let’s exclude metals for now.

  • AD – I’ve been looking into stereochemistry (issue 146) and am looking for examples that reproduce the different cases described there.

  • AD – Working on automating creation of a test set that exercises all code paths in current toolkit, and makes a minimal set of molecules that exercise those paths.

  • AD – Stereo coverage in test set?

    • Long chains of directional double bonds?

      • JW – Yes, but I don’t think this requires a special test, since I trust the cheminf toolkits to handle it well.

    • Should we have separate test cases where we have stereo in rings vs not in rings?

      • JW – Yes. Stereo in rings can be very confusing.

    • Do we want to record undefined stereocenters?

      • JW – Yes, eventually, but I think this might be a big lift, so let’s not make it an initial goal. It’s fine if we include some cases of this in the test set for future use, though.

      • AD – Stereo from 3D vs connection table? You had said that we should use connection table over 3D, but OpenEye uses 3D by default.

      • JW – I may have been wrong about our existing implementation of this. My priority is to ensure that both toolkit wrappers get the same molecule out of the same file, so if an SDF has 3D stereo that contradicts the connection table stereo, we should ensure that the same molecule comes out.

        • AD – Cases like this should cause OE + RDKit to raise a warning. They’ll print warnings like “correcting stereo atom X”. My framework is catching these sorts of warnings.

    • Cases where an atom COULD be chiral, but the molecule is actually symmetrical, so it's not.

  • AD – Having an “OpenFF stereo model”?
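
A minimal sketch of the “one molecule per error” selection that JW suggested above, assuming we already have a table recording which errors each candidate molecule triggers. The molecule names, the error labels, and the function pick_one_per_error are hypothetical and only illustrate the idea. For each error, the sketch prefers the molecule that triggers the fewest errors overall, so that each chosen molecule isolates its error as cleanly as the data allows.

```python
from typing import Dict, Set


def pick_one_per_error(triggers: Dict[str, Set[str]]) -> Dict[str, str]:
    """Map each error label to one molecule that triggers it.

    For every error, choose the candidate that triggers the fewest errors
    in total. Ties are broken alphabetically by molecule name so the
    selection is reproducible.
    """
    all_errors = sorted(set().union(*triggers.values()))
    chosen = {}
    for error in all_errors:
        candidates = [name for name, errs in triggers.items() if error in errs]
        chosen[error] = min(candidates, key=lambda name: (len(triggers[name]), name))
    return chosen


# Hypothetical data: which errors each candidate molecule triggers.
triggers = {
    "mol_a": {"valence", "undefined_stereo", "radical", "charge"},
    "mol_b": {"valence"},
    "mol_c": {"undefined_stereo", "radical"},
    "mol_d": {"charge"},
    "mol_e": {"radical"},
}

selection = pick_one_per_error(triggers)
for error, molecule in selection.items():
    print(f"{error}: {molecule}")
```

Here mol_a triggers all four errors and would be enough for a strictly minimal set, but the selection never picks it, because each error has a candidate that triggers fewer errors overall.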



Overall plan

  • Make a test that fully exercises current OFF toolkit behavior so that we can detect whether changes are backward-compatible.

  • Define what the behavior SHOULD be when loading mols from different sources

    • Finding current discrepancies between toolkits

    • Coming up with tricky molecules and feeding them in

    • Defining the desired behavior ahead of time, and comparing that to reality

  • Make “molecule set 2” that exercises the aspirational behaviors that we want

  • Implement “molecule set 2” in our tests, marking many of them as being “the behavior that we have, but not that we want”

  • Use tests to develop the “behavior that we want”
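
One way to mark tests as “the behavior that we have, but not that we want” is an expected-failure marker. The sketch below uses the standard library's unittest.expectedFailure. The two label_from_toolkit_* functions and the record layout are hypothetical stand-ins, not real toolkit calls: one reads stereo from the connection table and the other from 3D coordinates, mimicking the disagreement discussed earlier.

```python
import io
import unittest


def label_from_toolkit_a(record: dict) -> str:
    """Hypothetical stand-in: takes the stereo label from the connection table."""
    return record["connection_table_stereo"]


def label_from_toolkit_b(record: dict) -> str:
    """Hypothetical stand-in: takes the stereo label from the 3D coordinates."""
    return record["stereo_from_3d"]


class MoleculeSet2(unittest.TestCase):
    def test_consistent_record_agrees(self):
        # Behavior we have AND want: both stand-ins agree on a consistent record.
        record = {"connection_table_stereo": "R", "stereo_from_3d": "R"}
        self.assertEqual(label_from_toolkit_a(record), label_from_toolkit_b(record))

    @unittest.expectedFailure
    def test_contradictory_record_agrees(self):
        # Behavior we want but do not have yet: the same molecule should come
        # out even when 3D stereo contradicts the connection table.
        record = {"connection_table_stereo": "R", "stereo_from_3d": "S"}
        self.assertEqual(label_from_toolkit_a(record), label_from_toolkit_b(record))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(MoleculeSet2)
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(f"run={result.testsRun} expected_failures={len(result.expectedFailures)}")
```

Once the desired behavior is implemented, the marked test starts passing and is reported as an unexpected success, which is the signal to remove the marker.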

Review decision tree for loading mols

 

  • AD – Where to park the MiniDrugBank → ChEMBL study?

    • JW – Probably best to put it on the MiniDrugBank repo.

  • AD – What are rules around licensing for tools that we are able to use?

    • JW – Anything that goes into our repo (including test mols) must be compatible with our MIT license

    • AD – CACTVS database? Wikipedia database?

      • JW – These seem to be OK to use. I’d trust your judgement on this without going fully into the text of the licenses.

Next to-dos

  • AD – I’ll start putting together a maximal-coverage test set using MiniDrugBank

    • JW – I think that’s a good idea for catching edge behaviors, but I suspect that MiniDrugBank has systematic issues that prevent it from having some examples of real chemistry, so we’ll want to supplement this set with examples of “good” inputs.

    • AD – I’ll also plan for this set to have “normal” entries from other databases.

  • AD – Next meeting on Monday

Action items

Decisions