2022-06-09 Substructure loading Meeting notes

 Date

Jun 9, 2022

 Participants

  • @Jeffrey Wagner

  • @Michael Shirts

  • @Connor Davel

  • @Owen Madin

 Discussion topics

Item

Notes

Big picture planning

  • MS – Big picture is loading and parameterizing substructures. So “doing an equivalent thing to proteins, but more automated”.

    • CD – This is regarding adding chemical info, not charges

    • MS – Right, graph charges will handle charges, and a short-term kludge may exist in the meantime but is out of scope.

    • CD – Do we want to make a workflow where we help people make substructures, or just assume that we have the substructures and help people use them?

    • MS – Two parts - One is to assume the substructures exist and help people to use them. The other is to develop techniques to make substructures.

    • CD – I have a relatively fast way of writing them - The jupyter notebook I have is a graphical way to do this manually. So the automation of this process is what’s next.

    • MS – Big picture is to mix with biopolymers, like handling modified AAs. Still putting off charge stuff until later.

    • CD – That’s a good plan. The big question now is how to determine how to do things “correctly”. How do we teach the user what they need to do?

    • MS – It's sufficient to have well-documented failures.

    • CD – I think we can do that if we remove order dependence.

    • MS – JW, does this agree with your plans? Will the existing tech prevent this?

      • JW – Agree that charge derivation is out of scope

      • JW – Substructure stuff is good …

      • MS – Goal is to use this summer to 1) have a good experience for CD and 2) have a foundation for a larger proposal that we could get funded for further work in this area.

      • JW – Good idea. Do we have some details about this possible proposal?

      • MS – Karai Kolina is interested to learn what OFF can do with conjugated polymers, is looking for examples of what we can do. Sent some PDBs that we can test against to see if we can handle loading. So if we can handle that using this project that would be a great proof-of-concept and the basis for a pitch.

        • MS – PDBs are similar homopolymers, except one that’s a heteropolymer that has 3 connection points instead of 2.

      • JW –

        • In a release after 0.11.0, the substructure dictionary for loading PDBs will be exposed in the public API, meaning that there will be a public way to update the substructures that can be handled during PDB loading.

        • First-class support will always be prioritized for substructures in components.cif/aa_variants.cif (chemical components dictionary)

        • Tagged SMARTS isn’t guaranteed to be the format moving forward, but something with a similar information content will continue to be used.

  • End goals (in order of priority):

    • The fundamental goal of this project is to convert “monomer information” into “substructure information”, at least in a way that makes it easy for user input to guide the process, but ideally in a way where some polymers can be handled automatically. Concretely: load homopolymer PDBs + substructure mol2s from collaborators (ideally automatically, but under 10 minutes with manual intervention would be good). This should output an OFFMol or SDF of the loaded PDB, with all bond orders + formal charges defined.

      • “homopolymer PDB” = elements + connectivity of a lot of atoms

      • “monomer information” = elements + formal charges + bonds with known orders. The structure DOES NOT contain info identifying which atoms are in caps/neighbors.

        • “Monomer information type 1”: Structure contains caps or atoms in neighboring monomers. Structure is a chemically valid monomer. For example, (a capped monomer with no information about which atoms are capping)

        • “Monomer information type 2”: Structure has dangling bonds or is clearly not a chemically valid molecule

      • “substructure information” = for example, (a capped monomer containing information about which atoms are capping) or (a monomer with dangling bonds)

      • Initially a GUI will be used, but a stretch goal would be for automation to succeed in the process by itself.

      • If some PDBs are impossible to convert into substructures, it would be good to say why + how to recognize those

      • As a stretch goal, handle more than 2 connection points

      • As a stretch goal, the substructure mol2s won’t need to know which atoms are “excluded”

    • Same as above but with heteropolymers

    • (Very challenging objective) - Load capped versions of all 26 amino acids as 26 SDF files, alongside any canonical protein as PDB, and end up with the current Toolkit substructure dictionary. (even more challenging, load a linear sequence of all 26 amino acids as SDF and do the same, maybe 27/nonlinear for CYX)

  • CD – What is concrete about this project?

    • MS+JW – The inputs (PDB and template structures) and the outputs (OFFMols/SDFs). The only thing that the graph matcher should know about each atom during matching is (element, connectivity, map_index)

    • CD – Most of the work is in template → PDB chemical info assignment.

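A minimal sketch of the constraint above, that the graph matcher sees only (element, connectivity, map_index) for each atom. This is not the Toolkit's implementation: find_matches, the tuple-based atom and bond representation, and the plain backtracking search are all assumptions made for illustration. It finds every placement of a template inside a larger structure by comparing elements and bonds only (no bond orders, formal charges, or atom names), and reports each match keyed by template map index.

```python
from typing import Dict, Iterator, List, Set, Tuple

TemplateAtom = Tuple[str, int]  # (element, map_index); hypothetical representation
Bond = Tuple[int, int]          # pair of atom positions; deliberately no bond order


def _adjacency(n_atoms: int, bonds: List[Bond]) -> Dict[int, Set[int]]:
    """Build a neighbor lookup from an undirected bond list."""
    adj: Dict[int, Set[int]] = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def find_matches(
    template_atoms: List[TemplateAtom],
    template_bonds: List[Bond],
    mol_elements: List[str],
    mol_bonds: List[Bond],
) -> Iterator[Dict[int, int]]:
    """Yield {map_index: molecule atom position} for each template placement.

    Only element and connectivity are compared. Extra bonds on the matched
    molecule atoms are allowed, so this is a subgraph (not exact) match.
    """
    t_adj = _adjacency(len(template_atoms), template_bonds)
    m_adj = _adjacency(len(mol_elements), mol_bonds)

    def extend(mapping: Dict[int, int]) -> Iterator[Dict[int, int]]:
        if len(mapping) == len(template_atoms):
            # Report by map index rather than by template position.
            yield {template_atoms[t][1]: m for t, m in mapping.items()}
            return
        t = len(mapping)  # template atoms are placed in list order
        for m, element in enumerate(mol_elements):
            if m in mapping.values() or element != template_atoms[t][0]:
                continue
            # Every bond to an already-placed template atom must exist in the molecule.
            if all(mapping[n] in m_adj[m] for n in t_adj[t] if n in mapping):
                mapping[t] = m
                yield from extend(mapping)
                del mapping[t]

    yield from extend({})


# Heavy atoms of ethanol: C0-C1-O2. Template: a C-O fragment with map indices 1 and 2.
ethanol_elements = ["C", "C", "O"]
ethanol_bonds = [(0, 1), (1, 2)]
template_atoms = [("C", 1), ("O", 2)]
template_bonds = [(0, 1)]

matches = list(find_matches(template_atoms, template_bonds, ethanol_elements, ethanol_bonds))
print(matches)  # [{1: 1, 2: 2}]
```

In this framing, the chemical info carried by the template (bond orders, formal charges) would be transferred onto the matched PDB atoms in a separate step after matching; the matcher itself never reads it.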

Followups from previous to-dos

  • (CD presentation)

    • Upload slides here

    • OM – Can you have two identical substructures that only differ by capturing atoms?

      • CD – Good question. It may be independent of capturing atoms.

      • CD – I’ll say that “an entire

      • JW – We should define a specification for what the graph matching is “allowed to know” - Like, we should only tell our graph matcher about element, connectivity, and map indices (and maybe a few other things, can discuss later)

      • (General) – We’ll keep discussing storage formats and their information content.

    • MS –



To dos

  • MS will send workflow from collaborator group that takes a mol2 and capping atom indices as inputs, and produces a polymer.

  • CD will draft a project page with goals in order and specified milestones

  • CD will experiment with an automated process that handles “monomer information type 1”. The ultimate method signature will be make_substructures([monomer_info_sources], [pdbs_to_load]) --> substructure_information. The output format isn’t fully specified yet, but it should have equivalent information content to CD’s current substructure dict format (with noncapturing atoms allowed).

 Action items

 Decisions