2022-05-26 Davel/Wagner Meeting notes

 Date

May 26, 2022

 Participants

  • @Jeffrey Wagner

  • @Connor Davel

 Discussion topics

Item

Notes

General updates

  • JW – Annual meeting is in almost exactly a month (June 28); JW will be slightly less available until it is over

  • JW – New science lead coming on board next week (Lily Wang)

Connor progress

  • Jupyter widget

    • CD – Can now select monomers, assign chemical information, and save them with a residue name.

    • CD – No longer communicates through HTML, but straight through javascript. I tried the file storage route but the browser kept treating it as a file download and bringing up a prompt. I could also use “session storage” in the browser, but then I’d need python to talk to the browser.

    • To dos from last time:

      • Get the notebook able to work on a polymer - Should be able to go from loading PDB → OpenMM system in 10 minutes.

        • Implement formal charge and single/triple bond assigner

          • Done

        • It’s OK if we need to copy and paste SMARTS after clicking on the structure

          • CD figured out how to get javascript to talk directly to the jupyter kernel

        • Try having components talk using file system instead of page html

          • N/A - CD found a better solution (see above)

        • Implement calculate_charges_for_substructures(substructure_list) (could use AM1BCC or gasteiger, we just need to have something)

          • Not yet started

        • Make (at least) two classes, and have all the javascript be in one, and all the rdkit be in another.

  • Non-capturing atoms in SMARTS? Monomer information API/storage

    • JW – We’ve committed to NOT having a public API for monomer information format in the 0.11.0 release, but we may add it for a later release. So this file WILL exist, but we WON’T guarantee a stable format. So we’re free to experiment for the coming months and we won’t be burdened by API/format stability restrictions

    • Architectural constraints

      • FORMAT needs to be ultimately processable by RDKit/OpenEye

      • Information content MAY need to be understandable in networkx/other graph libraries (if RDKit/OE are too slow for realistic use cases)

      • In the monomer library, there can NOT be wildcards/ANDs/ORs.

        • (just for

        • CD – Any possible need for ANDs

          • example: [C;H2;D3]

          • [C:1](-[H:2])(-[H:3])-[C]-[CH3]-[N+1]

          • example: [C;H2D3 (OR) H1D4]

          • ^^could this be done using a node match function?

      • Any AND or OR that would go in a noncapturing atom could be

      • Information content:

        • Capturing atoms

          • MUST have unconditional formal charge value and element

          • MAY have conditional H, D, X, symbols

          • MAY define local stereo

          • MUST NOT define aromaticity or global stereo

        • Capturing bonds

          • MUST have unconditional bond order

          • MAY define local stereo

          • MUST NOT define aromaticity or global stereo

        • Noncapturing/context atoms

          • MAY have conditional everything

          • MUST NOT define aromaticity

        • Noncapturing/context bonds

          • MAY have conditional everything

          • MUST NOT define aromaticity

          • If we do capturing/noncapturing atoms, how do we know whether a bond is capturing/noncapturing?

        • Inter-monomer bonds

          • [S:-1]-[S:-1]-[C]

      • Monomers can’t have aromaticity info
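The MUST / MUST NOT rules listed under "Information content" could be checked mechanically. The following is only a sketch: `AtomSpec`, `BondSpec`, and the two `*_violations` helpers are hypothetical names invented for illustration and are not part of RDKit, OpenEye, or the toolkit. Stereo is left out because the notes do not yet say how local vs. global stereo would be stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AtomSpec:
    """Hypothetical description of one atom in a monomer substructure."""
    capturing: bool
    element: Optional[str] = None          # None means "not specified"
    formal_charge: Optional[int] = None    # None means "not specified"
    aromatic: Optional[bool] = None        # None means "aromaticity not defined"
    conditions: Dict[str, int] = field(default_factory=dict)  # e.g. {"H": 2, "D": 3}


@dataclass
class BondSpec:
    """Hypothetical description of one bond in a monomer substructure."""
    capturing: bool
    order: Optional[int] = None            # None means "not specified"
    aromatic: Optional[bool] = None        # None means "aromaticity not defined"


def atom_violations(atom: AtomSpec) -> List[str]:
    """Return the rules from the notes that this atom breaks (empty if none)."""
    problems = []
    if atom.aromatic is not None:
        problems.append("aromaticity must not be defined")
    if atom.capturing and atom.element is None:
        problems.append("capturing atom needs an element")
    if atom.capturing and atom.formal_charge is None:
        problems.append("capturing atom needs a formal charge")
    return problems


def bond_violations(bond: BondSpec) -> List[str]:
    """Return the rules from the notes that this bond breaks (empty if none)."""
    problems = []
    if bond.aromatic is not None:
        problems.append("aromaticity must not be defined")
    if bond.capturing and bond.order is None:
        problems.append("capturing bond needs a bond order")
    return problems
```

Noncapturing atoms and bonds pass with almost anything, which mirrors the "MAY have conditional everything" rule; only aromaticity is rejected everywhere.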

    • User stories

      • Avoid race conditions/order in JSON should NOT matter

        • “We can’t have any graphs that are a subset of other graphs”

        • If we allow noncapturing atoms/bonds, do we EVER allow overlaps?

      • Inter-monomer bonds

        • Can either

          • allow capturing bonds that aren’t between two capturing atoms

            • Fails on CYX? No way to distinguish which one owns the S-S bond

          • Can allow two-pass monomer info assignment (first the monomers, then things that can overwrite monomers)

          • Can allow a special kind of substructures that overlaps with already-assigned substructures

      • User wants to load a natural rubber molecule that includes some polyethylene motifs. The polyethylene substructures erroneously match some non-polyethylene substructures in rubber and should be overwritten

        • This violates the “monomers can never be subsets of other monomers” statement above. So either we need to walk that back or redefine “monomer” so this can’t happen
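The "two-pass" option from the user stories can be sketched as follows. This is an illustration only, assuming matches have already been found and are given as (name, atom indices) pairs; `assign_two_pass` and `_apply_pass` are hypothetical names. Overlap inside a single pass is treated as an error, so the order of entries within a pass cannot change the result; only the second pass is allowed to overwrite the first.

```python
from typing import Iterable, List, Optional, Sequence, Tuple

Match = Tuple[str, Sequence[int]]  # (substructure name, matched atom indices)


def _apply_pass(owner: List[Optional[str]], matches: Iterable[Match]) -> None:
    """Write one pass of matches into owner, rejecting overlap within the pass."""
    claimed = set()
    for name, atoms in matches:
        overlap = claimed.intersection(atoms)
        if overlap:
            raise ValueError(
                f"{name} overlaps another match in the same pass at atoms {sorted(overlap)}"
            )
        claimed.update(atoms)
        for i in atoms:
            owner[i] = name


def assign_two_pass(
    n_atoms: int,
    monomer_matches: Iterable[Match],
    override_matches: Iterable[Match],
) -> List[Optional[str]]:
    """First assign monomers, then let override substructures overwrite them."""
    owner: List[Optional[str]] = [None] * n_atoms
    _apply_pass(owner, monomer_matches)   # pass 1: ordinary monomers
    _apply_pass(owner, override_matches)  # pass 2: may overwrite pass 1
    return owner
```

For the rubber story, the polyethylene matches would go in the first pass and the rubber repeat unit in the second, so a spurious polyethylene match on atoms 2 and 3 is overwritten: `assign_two_pass(6, [("polyethylene", [0, 1]), ("polyethylene", [2, 3])], [("rubber_unit", [2, 3, 4, 5])])`.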

    • Fundamental questions

      • Is there anything that could be represented as SMARTS that couldn’t be represented as networkx? Or vice versa?

      • For every molecule, is there always a way to break it into more than one monomer that can be used to assign parameters? Are there rules for what we refuse to handle?

      • Is there a strict tradeoff between “no order dependence”, “no substructure overlap”, “capturing bonds”, and “noncapturing atoms/bonds”?

        • Maybe “noncapturing atoms/bonds” is independent?

        • To handle inter-monomer bonds, we NEED either “capturing bonds” or “substructure overlap”

        • If we can make a nonoverlapping substructure library to cover a use case, then order dependence is a non-issue since there will never be conflicts. But if we DO have substructure overlap, then we MAY need order dependence.

      • If we allow “size-order dependence” of matches, are there real cases where two equally-sized substructures could try to assign conflicting info? Would “strict order dependence” resolve these cases without introducing other pitfalls?
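One way to explore the "size-order dependence" question is to list the cases that size ordering alone cannot resolve: pairs of matches that cover the same number of atoms and share at least one atom. The sketch below assumes matches are given as (name, set of atom indices); `unresolved_ties` is a hypothetical helper. It is deliberately conservative, since a larger match covering both members of a pair might make the tie irrelevant in practice.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple


def unresolved_ties(matches: Sequence[Tuple[str, Set[int]]]) -> List[Tuple[str, str]]:
    """Return name pairs for equally sized matches that overlap on some atom."""
    ties = []
    for (name_a, atoms_a), (name_b, atoms_b) in combinations(matches, 2):
        same_size = len(atoms_a) == len(atoms_b)
        if same_size and atoms_a & atoms_b:
            # "Largest match wins" cannot pick a winner between these two.
            ties.append((name_a, name_b))
    return ties
```

An empty result for a given library and molecule would suggest size ordering is enough for that case; any reported pair is a candidate for needing strict ordering or a redesigned substructure.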


 

  • To dos

    • Immediate

      • Try to make all substructures have capturing atoms ONLY described by element and formal charge, and capturing bonds ONLY described by bond order, with no ANDs or ORs. These substructures may now include noncapturing atoms and bonds, with the [C:-1] style system for noncapturing atoms and bonds. You should modify your local from_pdb to recognize negative map indices.

      • Graph theory question: Is there always a way to find two or more nonoverlapping substructures to tile over a molecule? What’s the maximum number of nonoverlapping substructures for a given molecule? What methods are available to determine that?

      • Graph theory question (harder?): Is it possible to avoid order dependence?

      • Get the notebook able to work on a polymer - Should be able to go from loading PDB → OpenMM system in 10 minutes.

        • Implement calculate_charges_for_substructures(substructure_list) (could use AM1BCC or gasteiger, we just need to have something)

        • Make (at least) two classes, and have all the javascript be in one, and all the rdkit be in another.
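For the first immediate to-do, recognizing negative map indices could start with a small pre-processing step like the one below. This is a sketch under assumptions: it uses a plain regular expression rather than a cheminformatics toolkit (a toolkit's SMARTS parser may not accept negative atom maps directly), it only handles simple bracket atoms, and `classify_atoms` is a hypothetical helper, not part of `from_pdb`. Map index 0 is treated as unmapped, which is an assumption.

```python
import re
from typing import List, Tuple

# Bracket atom with an optional, possibly negative, map index: [S:-1], [C:1], [N+1]
_BRACKET_ATOM = re.compile(r"\[([^\[\]:]+)(?::(-?\d+))?\]")


def classify_atoms(pattern: str) -> List[Tuple[str, str]]:
    """Label each bracket atom as capturing, noncapturing, or unmapped."""
    labelled = []
    for match in _BRACKET_ATOM.finditer(pattern):
        body, index = match.group(1), match.group(2)
        if index is None or int(index) == 0:
            kind = "unmapped"
        elif int(index) < 0:
            kind = "noncapturing"  # the [C:-1] style convention from these notes
        else:
            kind = "capturing"
        labelled.append((body, kind))
    return labelled
```

Calling `classify_atoms("[S:-1]-[S:-1]-[C]")` on the inter-monomer example from the notes labels both sulfurs as noncapturing and the carbon as unmapped.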

    • More distant

      • Same workflow but starting from SDF

      • Defining modular components and their interfaces/APIs

      • Trying to have a program identify monomers

      • Better charge assignment

      • 3D visualization that shows which parts of the structure are covered by the already-defined substructures
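As a starting point for the tiling question in the immediate to-dos, a brute-force search can at least answer it for one molecule and one list of candidate matches. The sketch below assumes each candidate is a set of atom indices in the range 0 to n_atoms - 1; `find_tiling` is a hypothetical name. It is plain backtracking with exponential worst-case cost, so it is meant for small experiments rather than realistic polymers.

```python
from typing import FrozenSet, List, Optional, Sequence, Set


def find_tiling(n_atoms: int, matches: Sequence[Set[int]]) -> Optional[List[int]]:
    """Return indices of nonoverlapping matches covering every atom, or None."""

    def search(covered: FrozenSet[int], chosen: List[int]) -> Optional[List[int]]:
        if len(covered) == n_atoms:
            return chosen
        # Always extend the tiling at the lowest-numbered uncovered atom.
        target = min(i for i in range(n_atoms) if i not in covered)
        for idx, atoms in enumerate(matches):
            if target in atoms and not (covered & atoms):
                result = search(covered | atoms, chosen + [idx])
                if result is not None:
                    return result
        return None

    return search(frozenset(), [])
```

A `None` result means the candidate list cannot tile the molecule without overlap, which is the situation where either substructure overlap or a larger library would be needed.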

 Action items

 Decisions