2020-09-03 Wagner Thompson Cheminformatics Meeting notes

Date

Sep 3, 2020

Participants

  • @Jeffrey Wagner

  • @Matt Thompson

Discussion topics

Item

Notes

Can we do bottom-to-top matching?

  • JW – Two categories – Depends on whether we know how many “slots” final topology will have

    • Yes, for cases with slots that can be determined entirely from Topology

      • bonds,

      • angles,

      • propers,

      • vdW,

      • GBSA (probably)

    • No for

      • impropers (unknown n_slots),

      • maybe library charges (unknown overlaps),

      • chargeincrements (unknown overlaps, one atom can be part of multiple chargeincrements),

      • vsites (unknown n_slots, unknown overlaps, double assignments to same slot if different names)
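A minimal sketch of the "yes" case above, assuming the slots (e.g. all bonds) can be listed from the Topology up front and that the last matching parameter in the list wins. The names `assign_bottom_to_top`, `find_matches` and the toy match table are hypothetical and stand in for the real SMIRKS matching machinery. It also assumes each slot is written in one canonical atom order.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Slot = Tuple[int, ...]  # atom indices of one bond, angle, proper, ...


def assign_bottom_to_top(
    parameters: List[str],
    all_slots: Iterable[Slot],
    find_matches: Callable[[str], Iterable[Slot]],
) -> Dict[Slot, str]:
    """Scan parameters last-to-first; the first hit on a slot is its final assignment."""
    unfilled = set(all_slots)
    assigned: Dict[Slot, str] = {}
    for smirks in reversed(parameters):
        if not unfilled:
            break  # every slot is filled, so earlier parameters can never win
        for slot in find_matches(smirks):
            if slot in unfilled:
                assigned[slot] = smirks
                unfilled.discard(slot)
    return assigned


# Toy data (hypothetical): precomputed matches instead of a real SMIRKS search.
TOY_MATCHES: Dict[str, List[Slot]] = {
    "[#8:1]-[#1:2]": [],
    "[*:1]-[*:2]": [(0, 1), (1, 2)],
    "[#6:1]-[#1:2]": [(1, 2)],
}
searched: List[str] = []


def toy_find_matches(smirks: str) -> List[Slot]:
    searched.append(smirks)  # record which searches were actually run
    return TOY_MATCHES[smirks]


parameters = ["[#8:1]-[#1:2]", "[*:1]-[*:2]", "[#6:1]-[#1:2]"]
assigned = assign_bottom_to_top(parameters, [(0, 1), (1, 2)], toy_find_matches)
print(assigned)
print(searched)  # the first parameter in the list is never searched
```

The early `break` is where the speedup would come from. It only works when the full set of slots is known before matching, which is why impropers, charge increments and vsites fall in the "no" category.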

Possible performance gains

  • Currently (very approximate) – 5 AAs + 115 waters = 45 seconds

  • Bottom-to-top matching (2x)

  • Remove redundant/symmetric parameters (2x)

  • Multithreaded SMIRKS matching (10x)

  • Calling OE+RDKit and asking them to make it faster (10x)

  • Topology recognition of polymer subunits (100x) – Lots of effort

SMIRKS equivalence checking

Example of two equivalent SMIRKS:

<Bond smirks="[H][C@@]([C]=O)([C:1]([H:2])([H])[S])[N][H]" length="1.09 * angstrom" k="680.0 * angstrom**-2 * mole**-1 * kilocalorie" id="A14SB-MainChain_CYX-2C_H1"></Bond>
<Bond smirks="[H][C@@]([C]=O)([C:1]([H])([H:2])[S])[N][H]" length="1.09 * angstrom" k="680.0 * angstrom**-2 * mole**-1 * kilocalorie" id="A14SB-MainChain_CYX-2C_H1"></Bond>
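To make the symmetric-H case concrete, here is a rough sketch that treats only the `[C:1]([H:2])([H])[S]` fragment of the two SMIRKS as a hand-built graph and brute-forces a relabeling. The `Pattern` encoding and `patterns_equivalent` are hypothetical helpers, not part of any toolkit. The sketch deliberately ignores bond order, stereochemistry and aromaticity, so a real check would need to refuse (report "not equivalent") whenever those are present.

```python
from itertools import permutations
from typing import List, Set, Tuple

# A pattern is (atoms, bonds). Each atom is (element, tag), with tag 0 meaning
# untagged. Each bond is a frozenset of the two atom indices it connects.
Pattern = Tuple[List[Tuple[str, int]], Set[frozenset]]


def patterns_equivalent(a: Pattern, b: Pattern) -> bool:
    """True only if some atom relabeling maps a onto b, keeping element, tag and bonds."""
    atoms_a, bonds_a = a
    atoms_b, bonds_b = b
    if len(atoms_a) != len(atoms_b) or len(bonds_a) != len(bonds_b):
        return False
    if sorted(atoms_a) != sorted(atoms_b):
        return False  # different multiset of (element, tag) labels
    n = len(atoms_a)
    for perm in permutations(range(n)):  # factorial cost: toy-sized patterns only
        if any(atoms_a[i] != atoms_b[perm[i]] for i in range(n)):
            continue
        mapped = {frozenset(perm[i] for i in bond) for bond in bonds_a}
        if mapped == bonds_b:
            return True
    return False


# Central carbon (index 0) bonded to three neighbours (indices 1, 2, 3).
BONDS = {frozenset({0, 1}), frozenset({0, 2}), frozenset({0, 3})}
first = ([("C", 1), ("H", 2), ("H", 0), ("S", 0)], BONDS)     # [C:1]([H:2])([H])[S]
second = ([("C", 1), ("H", 0), ("H", 2), ("S", 0)], BONDS)    # [C:1]([H])([H:2])[S]
tag_on_s = ([("C", 1), ("H", 0), ("H", 0), ("S", 2)], BONDS)  # [C:1]([H])([H])[S:2]

print(patterns_equivalent(first, second))    # tag moved between the two H's
print(patterns_equivalent(first, tag_on_s))  # tag moved to a different element
```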

Concerns about aromaticity in SMIRKS (more of a vague concern that an easy solution would overlook something important wrt aromaticity)

 

Closest example of a concrete problem with aromaticity in protein SMARTS:

The problem is that the parameter below didn’t match a structure of ARG drawn in a different resonance form.


<Proper smirks="[H][C@@]([C]=O)([C:1]([H])([H])[C:2]([H])([H])[C:3]([H:4])([H])[N+](=C(N([H])[H])N([H])[H])[H])[N][H]" periodicity1="3" phase1="0.0 * degree" id="A14SB-MainChain-ARG-C8_C8_C8_H1" k1="0.1556 * mole**-1 * kilocalorie" idivf1="1.0"></Proper>

 

I’m not sure whether the problem above is directly related to a problem we’d encounter in SMARTS deduplication. Really it’s a question of what we expect from different representations at different steps in the protein FF porting/parameter application pipeline.

“Guanidinium” – https://en.wikipedia.org/wiki/Guanidine

 

Guanidinium fragment of the SMIRKS above: [N+](=C(N([H])[H])N([H])[H])[H]

from openforcefield.topology import Molecule
Molecule.from_smiles("[N+](=C(N([H])[H])N([H])[H])[H]")
mol = Molecule.from_smiles("[N+]([H])([H])(=C(N([H])[H])N([H])[H])")
mol.to_smiles()

Difference between interpreting the above in SMILES vs. tagged SMARTS:

  • If it’s a SMILES, then writing it in “upper case” creates a representation that can be read by another tool and safely interpreted as aromatic (because a SMILES describes an entire molecule)

  • If it’s a SMIRKS/tagged SMARTS however, it may represent part of a molecule with additional bonds/atoms that would make it NON-aromatic, so the SMIRKS has a different meaning written with upper- and lower-case letters.

 

How does our current machinery interpret aromaticity in SMIRKS?

  • Unknown. The guanidinium above does NOT show up as aromatic

 

Initial implementation

  • Bronze medal: Just solving the symmetric H’s problem is a big help and will reduce FF size by 2x.

  • Gold medal: Successfully deduplicating in the face of aromaticity will help in some remaining edge cases

 

IMPORTANTLY – False NEGATIVES are OK. Saying that two SMIRKS aren’t equivalent when they really are will just be an inconvenience in our planned workflows. However, false POSITIVES are really bad, since they’ll have us deleting parameters that don’t actually have a replacement/aren’t really redundant with another one.
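One way to bias a deduplication pass toward false negatives is sketched below. It assumes the last matching parameter wins, so it is always the earlier copy that gets dropped. The `deduplicate` function, the three-valued `proven_equivalent` callback and the toy lookup table are all hypothetical. Requiring identical parameter values on top of proven equivalence is an extra guard added here, not something decided in the meeting.

```python
from typing import Callable, Dict, List, Optional

Param = Dict[str, str]  # e.g. {"smirks": ..., "length": ..., "k": ...}


def values(param: Param) -> Dict[str, str]:
    """Everything except the SMIRKS pattern itself."""
    return {key: val for key, val in param.items() if key != "smirks"}


def deduplicate(
    params: List[Param],
    proven_equivalent: Callable[[str, str], Optional[bool]],
) -> List[Param]:
    """Drop a parameter only if a LATER one is proven equivalent with identical values.

    proven_equivalent may return None for "unknown"; anything other than True
    keeps the parameter (a false negative, which is only an inconvenience).
    """
    kept_reversed: List[Param] = []
    for candidate in reversed(params):
        shadowed = any(
            proven_equivalent(candidate["smirks"], later["smirks"]) is True
            and values(candidate) == values(later)
            for later in kept_reversed
        )
        if not shadowed:
            kept_reversed.append(candidate)
    return list(reversed(kept_reversed))


# Toy checker (hypothetical): a lookup of pairs already proven equivalent.
KNOWN_EQUIVALENT = {frozenset({"A", "A_swapped_H"})}


def toy_checker(a: str, b: str) -> Optional[bool]:
    if a == b or frozenset({a, b}) in KNOWN_EQUIVALENT:
        return True
    return None  # unknown, so the caller must keep both parameters


params = [
    {"smirks": "A", "k": "680.0"},
    {"smirks": "B", "k": "680.0"},
    {"smirks": "A_swapped_H", "k": "680.0"},
]
deduped = deduplicate(params, toy_checker)
print([p["smirks"] for p in deduped])
```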

Action items

Decisions