Fun with SMARTS / SMIRKS grammar
When I first started programmatically modifying type definitions, I used very little insight into the structure of the SMARTS / SMIRKS language and mostly just glued textual representations of primitives together: .
Recently I’ve started eyeballing a formal grammar for the SMIRKS subset I find most relevant for distinguishing chemical environments. It currently lives here: .
I have yet to handle recursive SMARTS, although I know how to proceed there.
start : environment
// atomic primitives
atomic_primitive : wildcard
| atomic_number
| valence
| connectivity
| degree
| h_count
| ring_size
| ring_membership
| formal_charge
| aromatic
| aliphatic
wildcard : "*"
atomic_number : "#" NUMBER
degree : "D" NUMBER
valence : "v" NUMBER
connectivity : total_connectivity | ring_connectivity
total_connectivity : "X" NUMBER
ring_connectivity : "x" NUMBER
h_count : total_h_count | implicit_h_count
total_h_count : "H" NUMBER
implicit_h_count : "h" NUMBER
ring_size : "r" NUMBER?
ring_membership : "R" NUMBER?
formal_charge : "+" NUMBER | "-" NUMBER
aromatic : "a"
aliphatic : "A"
// logic
not : "!"
connective : and | or
and : "&" | ";"
or : ","
// composition
atomic_environment : atomic_primitive
| atomic_environment connective atomic_environment
| not atomic_environment
| atomic_environment atomic_environment*
atom : atomic_environment
| "[" [atomic_environment] "]"
| "[" [atomic_environment] ":" NUMBER "]"
// bond primitives
bond_primitive : any_bond
| single_bond
| double_bond
| triple_bond
| aromatic_bond
| any_ring_bond
any_bond : "~"
any_ring_bond : "@"
single_bond : "-"
double_bond : "="
triple_bond : "#"
aromatic_bond : ":"
// composition
bond : bond_primitive | bond connective bond | not bond | bond bond*
environment : atom
| environment bond environment
| environment "(" (bond environment) ")"
%import common.NUMBER
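To make the atomic-primitive and logic rules above concrete, here is a minimal sketch in plain Python (standard library only) that walks the inside of one bracket atom and spells out each primitive in words. It is not generated from the grammar and does not use lark. The names `describe_atom`, `PRIMITIVE_NAMES`, `NUMBERLESS` and `LOGIC` are hypothetical, and the English glosses are just one reading of the rule names.

```python
import re

# Glosses keyed by each primitive's leading symbol (illustrative wording only).
PRIMITIVE_NAMES = {
    "#": "atomic number",
    "D": "degree",
    "v": "valence",
    "X": "total connectivity",
    "x": "ring connectivity",
    "H": "total H count",
    "h": "implicit H count",
    "r": "ring size",
    "R": "ring membership",
    "+": "positive formal charge",
    "-": "negative formal charge",
}
NUMBERLESS = {"*": "any atom", "a": "aromatic", "A": "aliphatic"}
LOGIC = {"!": "not", "&": "and", ";": "and", ",": "or"}

TOKEN = re.compile(
    r"(?P<sym>[#DvXxHhrR+\-])(?P<num>\d*)|(?P<bare>[*aA])|(?P<logic>[!&;,])"
)


def describe_atom(body):
    """Describe the inside of one bracket atom, e.g. '#6X4'.

    `body` excludes the square brackets and any ':n' label.
    Adjacent primitives (implicit "and") are simply listed one after another.
    """
    words = []
    pos = 0
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"unrecognized primitive at position {pos} in {body!r}")
        if m.group("sym"):
            sym, num = m.group("sym"), m.group("num")
            # Only 'r' and 'R' may appear without a number in the grammar above.
            if not num and sym not in "rR":
                raise ValueError(f"primitive {sym!r} needs a number in {body!r}")
            name = PRIMITIVE_NAMES[sym]
            words.append(f"{name} {num}" if num else name)
        elif m.group("bare"):
            words.append(NUMBERLESS[m.group("bare")])
        else:
            words.append(LOGIC[m.group("logic")])
        pos = m.end()
    return words


print(describe_atom("#6X4"))
print(describe_atom("*;r3"))
```

A real translation would walk the parse tree rather than a flat token stream, so that operator precedence (`&` vs. `;`) is respected; this sketch ignores precedence entirely.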
Aside from being a nice learning exercise, I think writing down such a grammar is useful for a few reasons:
(1) to help specify what subset of the SMARTS / SMIRKS language we accept, in a toolkit-independent way. Currently there are some inconsistencies in how the different toolkits (RDKit vs. OpenEye) interpret SMARTS / SMIRKS. These include differing semantics for some atomic primitives. Even restricting to the (large) set of consensus atomic primitives, there are some differences in grammar between OpenEye and RDKit SMARTS, of unclear practical relevance. For example, OpenEye will accept patterns with trailing bond symbols (e.g. `'[*:1]~'`, `'[*:1]-'`, `'[*:1]='`, `'[*:1]#'`, `'[*:1]@'`, `'[*:1]:'`), even though I don't think these can be derived from the grammar rules in the Daylight spec (the RDKit parser correctly rejects them). These strings are also rejected before they reach any toolkit by string-based checks performed in the `ChemicalEnvironment` constructors, but are happily accepted by the `chemical_environment_matches` function. Writing down a formal grammar of which primitives can be used in which combinations seems a more systematic way to proceed.
(2) to translate between SMIRKS and other chemical environment representations more amenable to sampling/optimization. Directly working with string representations is a bit of a nightmare, and I am still wrapping my head around how best to use the environment representations in chemper. We could consider representations that are closely related to the structure of parse-trees of SMIRKS, or we could consider representations that are further departures from the SMIRKS syntax.
(3) to automatically translate from the cryptic shorthand of the SMIRKS syntax to more human-readable representations.
(4) to allow me to better measure (and penalize) the "complexity" of a SMIRKS pattern. Initially we applied no penalty to the complexity of a SMIRKS pattern. Later, I applied a data-dependent penalty to the behavior of a whole decision-tree of SMIRKS (essentially penalizing trees whose behavior produces highly imbalanced leaves). Measuring aspects of the parse tree of each individual SMIRKS may provide a route to measuring and penalizing overly complex chemical environment descriptors.
(5) to mine for repeated subtrees / chemical environment “sub-concepts” used in SMIRNOFF force fields so far, possibly allowing for performance optimization relative to generic SMIRKS-matching algorithms when applying simple queries to large databases. After printing out and staring at a big list of the SMIRKS patterns used in Parsley, it seems apparent that there are many repeated sub-expressions across the 300-ish SMIRKS patterns we look for in each molecule. These could each be computed once, then reused many times, if we had a way to automatically extract repeated sub-expressions from the SMIRNOFF file and then “schedule” the computations that rely on them.
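Points (4) and (5) can be roughed out even before a full parser exists. The sketch below uses only regular expressions on bracket atoms: it counts primitives and explicit logical operators as a crude stand-in for parse-tree size, and tallies bracket-atom expressions that recur across a list of patterns. The function names are hypothetical, the three example patterns are made up for illustration (they are not copied from Parsley), and recursive SMARTS and bond expressions are ignored.

```python
import re
from collections import Counter

# One bracket atom; captures the expression and drops an optional ':n' label.
# Assumes no nested brackets, i.e. no recursive SMARTS.
BRACKET_ATOM = re.compile(r"\[([^\[\]:]*)(?::\d+)?\]")
PRIMITIVE = re.compile(r"[#DvXxHhrR+\-]\d*|[*aA]")
OPERATOR = re.compile(r"[!&;,]")


def atom_expressions(smirks):
    """Return the label-free expression inside each bracket atom, in order."""
    return BRACKET_ATOM.findall(smirks)


def crude_complexity(smirks):
    """Count atomic primitives plus explicit logical operators inside bracket atoms.

    Bonds and implicit "and" between adjacent primitives are not counted.
    """
    score = 0
    for expr in atom_expressions(smirks):
        score += len(PRIMITIVE.findall(expr))
        score += len(OPERATOR.findall(expr))
    return score


def repeated_atom_expressions(patterns, min_count=2):
    """Map each bracket-atom expression seen at least `min_count` times to its count."""
    counts = Counter(expr for p in patterns for expr in atom_expressions(p))
    return {expr: n for expr, n in counts.items() if n >= min_count}


# Illustrative patterns only.
patterns = [
    "[#6X4:1]-[#1:2]",
    "[#6X4:1]-[#6X4:2]",
    "[*:1]~[#6X4:2]-[#8X2:3]",
]

print(crude_complexity("[*:1]~[*:2]"))
print(repeated_atom_expressions(patterns))
```

Counting whole bracket atoms only finds the coarsest kind of repetition. Mining repeated subtrees of a real parse tree would also catch shared fragments that span several atoms and bonds.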
One oddity that jumped out during this exercise is that I don't know what to do with syntax like `[*:1]1~[*:2]~[*:3]1`, which appears in the definition of angle a3 in Parsley: https://github.com/openforcefield/openforcefields/blob/7300a486581feff508b9401241443f941924783f/openforcefields/offxml/openff-1.0.0.offxml#L103 This syntax is not described on the Daylight page about SMARTS syntax, nor on the OpenEye page about SMARTS syntax. My assumption is that it describes ring closure as in SMILES (so atom :1 is bonded to atom :2, which is bonded to atom :3, which is bonded back to atom :1).
[*;r3:1]1~;@[*;r3:2]~;@[*;r3:3]1
        ^                      ^
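Under the assumption stated above (the digits are SMILES-style ring-closure labels), the bonds implied by such a pattern can be listed with a short scanner. This is a sketch, not how any toolkit does it: `bonded_pairs` is a hypothetical name, only bracket atoms, single-digit ring closures and parenthesized branches are handled, and atoms are numbered by position (0-based), so position 0 is the atom labelled `:1`.

```python
import re

TOKEN = re.compile(
    r"(?P<atom>\[[^\]]*\])"        # bracket atom, e.g. [*;r3:1]
    r"|(?P<digit>\d)"              # single-digit ring-closure label
    r"|(?P<open>\()|(?P<close>\))" # branch delimiters
    r"|(?P<bond>[~@\-=#:!&;,]+)"   # bond expression, e.g. ~;@
)


def bonded_pairs(smirks):
    """Return (i, j, bond) triples; bond is '' when no bond symbol was written."""
    pairs = []
    open_rings = {}     # ring label -> (atom position, bond text before the label)
    branch_stack = []   # atoms to return to when a branch closes
    prev = None         # most recent atom on the current chain
    pending_bond = ""   # bond expression seen since the last atom
    n_atoms = 0
    pos = 0
    while pos < len(smirks):
        m = TOKEN.match(smirks, pos)
        if m is None:
            raise ValueError(f"cannot tokenize {smirks!r} at position {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "atom":
            idx = n_atoms
            n_atoms += 1
            if prev is not None:
                pairs.append((prev, idx, pending_bond))
            prev, pending_bond = idx, ""
        elif kind == "bond":
            pending_bond = m.group()
        elif kind == "digit":
            if prev is None:
                raise ValueError("ring-closure label before any atom")
            label = m.group()
            if label in open_rings:
                # Second occurrence of the label: close the ring.
                start, bond = open_rings.pop(label)
                pairs.append((start, prev, pending_bond or bond))
            else:
                # First occurrence: remember where the ring opens.
                open_rings[label] = (prev, pending_bond)
            pending_bond = ""
        elif kind == "open":
            branch_stack.append(prev)
        else:  # close
            prev = branch_stack.pop()
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return pairs


print(bonded_pairs("[*;r3:1]1~;@[*;r3:2]~;@[*;r3:3]1"))
```

For the a3 pattern this yields the two explicit `~;@` bonds plus a third pair between the first and last atoms with an empty bond string, since no bond symbol is written next to either ring-closure digit.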
A note about useful tools
I’ve been using the lark parser-generator library for Python, which generates fast, correct parsers for context-free grammars. I like this tool a lot!