2020-09-03 Wagner Thompson Cheminformatics Meeting notes

Date

Sep 3, 2020

Participants

@Jeffrey Wagner
@Matt Thompson

Discussion topics

Item	Notes

Can we do bottom-to-top matching?	JW: Two categories, depending on whether we know how many “slots” the final topology will have.
	Yes, for cases where the slots can be determined entirely from the Topology: bonds, angles, propers, vdW, GBSA (probably).
	No for impropers (unknown n_slots), and maybe also no for library charges (unknown overlaps), chargeincrements (unknown overlaps; one atom can be part of multiple chargeincrements), and vsites (unknown n_slots, unknown overlaps, double assignments to the same slot if the names differ).
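A minimal sketch of the bottom-to-top idea, assuming the SMIRNOFF convention that the last matching parameter wins and that every slot is known up front (as for bonds). The function names, the toy molecule, and the element-pair matchers are hypothetical stand-ins for real SMIRKS matching, not toolkit API.

```python
from typing import Callable, Dict, List, Tuple

Slot = Tuple[int, ...]             # atom indices, e.g. one bond
Matcher = Callable[[Slot], bool]   # hypothetical stand-in for a SMIRKS match
Parameter = Tuple[str, Matcher]    # (parameter id, matcher), in file order

# Toy molecule (assumption for illustration): C0-C1, C1-H2, C1-O3
ELEMENTS = {0: "C", 1: "C", 2: "H", 3: "O"}
BONDS: List[Slot] = [(0, 1), (1, 2), (1, 3)]


def element_pair(a: str, b: str) -> Matcher:
    """Match a bond whose two atoms have elements a and b, in either order."""
    return lambda slot: sorted(ELEMENTS[i] for i in slot) == sorted((a, b))


def assign_top_to_bottom(slots: List[Slot], parameters: List[Parameter]) -> Dict[Slot, str]:
    """Reference behaviour: try every parameter, later matches overwrite earlier ones."""
    assigned: Dict[Slot, str] = {}
    for pid, matches in parameters:
        for slot in slots:
            if matches(slot):
                assigned[slot] = pid
    return assigned


def assign_bottom_to_top(slots: List[Slot], parameters: List[Parameter]) -> Dict[Slot, str]:
    """Walk parameters from last to first; the first match claims a slot for good."""
    assigned: Dict[Slot, str] = {}
    remaining = set(slots)
    for pid, matches in reversed(parameters):
        if not remaining:          # every slot is filled, skip the rest of the file
            break
        for slot in [s for s in remaining if matches(s)]:
            assigned[slot] = pid
            remaining.discard(slot)
    return assigned


PARAMETERS: List[Parameter] = [
    ("b1", lambda slot: True),         # generic: any bond
    ("b2", element_pair("C", "H")),
    ("b3", element_pair("C", "O")),
]

top_down = assign_top_to_bottom(BONDS, PARAMETERS)
bottom_up = assign_bottom_to_top(BONDS, PARAMETERS)
print(bottom_up)
```

The early exit depends on `remaining` being complete before matching starts, which is the reason handlers with an unknown number of slots (impropers, vsites) do not fit this scheme.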
Possible performance gains	Currently (very approximate): 5 AAs + 115 waters = 45 seconds.
	Bottom-to-top matching (2x)
	Remove redundant/symmetric parameters (2x)
	Multithreaded SMIRKS matching (10x)
	Calling OE+RDKit and asking them to make it faster (10x)
	Topology recognition of polymer subunits (100x), but lots of effort
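For a rough sense of scale, the snippet below applies each estimated factor on its own to the 45-second baseline. It deliberately does not multiply the factors together, since the notes do not say whether the gains are independent; the dictionary keys are just labels for the items above.

```python
BASELINE_SECONDS = 45.0  # very approximate: 5 AAs + 115 waters

# Estimated speedup factors from the notes, each considered in isolation
SPEEDUPS = {
    "bottom-to-top matching": 2,
    "remove redundant/symmetric parameters": 2,
    "multithreaded SMIRKS matching": 10,
    "OE/RDKit made faster": 10,
    "polymer subunit recognition": 100,
}

estimated = {name: BASELINE_SECONDS / factor for name, factor in SPEEDUPS.items()}

for name, seconds in estimated.items():
    print(f"{name}: ~{seconds:.2f} s")
```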
SMIRKS equivalence checking	Example of two equivalent SMIRKS (they differ only in which of the two hydrogens on the tagged carbon carries the `:2` label):
	`<Bond smirks="[H][C@@]([C]=O)([C:1]([H:2])([H])[S])[N][H]" length="1.09 * angstrom" k="680.0 * angstrom*-2 mole*-1 kilocalorie" id="A14SB-MainChain_CYX-2C_H1"></Bond>`
	`<Bond smirks="[H][C@@]([C]=O)([C:1]([H])([H:2])[S])[N][H]" length="1.09 * angstrom" k="680.0 * angstrom*-2 mole*-1 kilocalorie" id="A14SB-MainChain_CYX-2C_H1"></Bond>`
	Concerns about aromaticity in SMIRKS: more of a vague concern that an easy solution would overlook something important with respect to aromaticity.
	Closest example of a concrete problem with aromaticity in protein SMARTS: the parameter below didn’t match a structure of ARG in a different resonance form.
	`<Proper smirks="[H][C@@]([C]=O)([C:1]([H])([H])[C:2]([H])([H])[C:3]([H:4])([H])[N+](=C(N([H])[H])N([H])[H])[H])[N][H]" periodicity1="3" phase1="0.0 * degree" id="A14SB-MainChain-ARG-C8_C8_C8_H1" k1="0.1556 * mole*-1 kilocalorie" idivf1="1.0"></Proper>`
	I’m not sure whether this problem is directly related to one we’d encounter in SMARTS deduplication. Really it’s a question of what we expect from different representations at different steps in the protein FF porting/parameter application pipeline.
	“Guanidinium” (https://en.wikipedia.org/wiki/Guanidine): the `[N+](=C(N([H])[H])N([H])[H])[H]` portion of the SMIRKS above.
	`from openforcefield.topology import Molecule`
	`Molecule.from_smiles("[N+](=C(N([H])[H])N([H])[H])[H]")`
	`mol = Molecule.from_smiles("[N+]([H])([H])(=C(N([H])[H])N([H])[H])")`
	`mol.to_smiles()`
	Difference between interpreting the above as SMILES vs. tagged SMARTS:
	If it’s a SMILES, then writing it in “upper case” creates a representation that another tool can read and safely interpret as aromatic (because a SMILES describes an entire molecule).
	If it’s a SMIRKS/tagged SMARTS, however, it may represent part of a molecule with additional bonds/atoms that would make it NON-aromatic, so the SMIRKS has a different meaning when written with upper-case versus lower-case letters.
	How does our current machinery interpret aromaticity in SMIRKS? Unknown. The guanidinium above does NOT show up as aromatic.
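One way to frame the symmetric-hydrogen case is as a graph isomorphism that must preserve atom labels, bond labels, and map indices. The sketch below is a toy under several assumptions: the patterns are hand-built graphs of just the tagged CH2 group rather than parsed SMIRKS, and stereochemistry and aromaticity are ignored entirely, so it is not safe against false positives in those cases. `Pattern` and `equivalent` are hypothetical names, not toolkit API.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, int]        # (atom label, map index; 0 means untagged)
Bond = Tuple[int, int, str]   # (atom index, atom index, bond label)


class Pattern:
    """Hand-built stand-in for a parsed SMIRKS (assumption for illustration)."""

    def __init__(self, atoms: List[Atom], bonds: List[Bond]) -> None:
        self.atoms = atoms
        self.adj: Dict[int, Dict[int, str]] = {i: {} for i in range(len(atoms))}
        for a, b, label in bonds:
            self.adj[a][b] = label
            self.adj[b][a] = label


def equivalent(p: Pattern, q: Pattern) -> bool:
    """True if some atom relabelling maps p onto q, keeping labels, tags, and bonds."""
    n = len(p.atoms)
    if n != len(q.atoms) or sorted(p.atoms) != sorted(q.atoms):
        return False
    mapping: Dict[int, int] = {}  # atom index in p -> atom index in q

    def extend(i: int) -> bool:
        if i == n:
            return True
        for j in range(n):
            if j in mapping.values() or p.atoms[i] != q.atoms[j]:
                continue
            if len(p.adj[i]) != len(q.adj[j]):
                continue
            # Bonds (or their absence) to already-mapped atoms must agree.
            if all(p.adj[i].get(k) == q.adj[j].get(mapping[k]) for k in mapping):
                mapping[i] = j
                if extend(i + 1):
                    return True
                del mapping[i]  # backtrack
        return False

    return extend(0)


# Central carbon (tag 1) bonded to two H, one S, and one untagged carbon
BONDS: List[Bond] = [(0, 1, "-"), (0, 2, "-"), (0, 3, "-"), (0, 4, "-")]
first = Pattern([("C", 1), ("H", 2), ("H", 0), ("S", 0), ("C", 0)], BONDS)
second = Pattern([("C", 1), ("H", 0), ("H", 2), ("S", 0), ("C", 0)], BONDS)
tagged_s = Pattern([("C", 1), ("H", 0), ("H", 0), ("S", 2), ("C", 0)], BONDS)

print(equivalent(first, second))    # tag moved between the two hydrogens
print(equivalent(first, tagged_s))  # tag moved onto sulfur instead
```

Comparing labels by exact equality sidesteps the upper-case/lower-case aromaticity question rather than answering it: `C` and `c` would simply be treated as different labels.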
Initial implementation	Bronze medal: Just solving the symmetric H’s problem is a big help and will reduce FF size by 2x.
	Gold medal: Successfully deduplicating in the face of aromaticity will help in some remaining edge cases.
	IMPORTANTLY: False NEGATIVES are OK. Saying that two SMIRKS aren’t equivalent when they really are will just be an inconvenience in our planned workflows. However, false POSITIVES are really bad, since they’ll have us deleting parameters that don’t actually have a replacement/aren’t really redundant with another one.
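A sketch of how that asymmetry could shape a deduplication pass, assuming the SMIRNOFF last-match-wins ordering: a parameter is dropped only when a later one is proven equivalent, and anything unproven is kept. `drop_shadowed` and `exact_match` are hypothetical names, and plain string equality stands in for a real equivalence check; it can produce false negatives but never a false positive.

```python
from typing import Callable, List, Tuple

Param = Tuple[str, str]  # (smirks, parameter id), in file order


def exact_match(a: str, b: str) -> bool:
    """Most conservative check: only identical strings count as equivalent."""
    return a == b


def drop_shadowed(
    params: List[Param],
    proven_equivalent: Callable[[str, str], bool] = exact_match,
) -> List[Param]:
    """Keep the last parameter of each proven-equivalent group, drop earlier ones."""
    kept: List[Param] = []
    for smirks, pid in reversed(params):
        if any(proven_equivalent(smirks, later) for later, _ in kept):
            continue  # a later parameter already covers exactly these matches
        kept.append((smirks, pid))
    kept.reverse()
    return kept


params = [
    ("[#6:1]-[#1:2]", "b1"),
    ("[#6:1]-[#8:2]", "b2"),
    ("[#6:1]-[#1:2]", "b3"),
]
deduplicated = drop_shadowed(params)
print(deduplicated)
```

Swapping in a stronger `proven_equivalent` only ever removes more parameters, so the checker can be improved incrementally from the bronze-medal case upward.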
