2022-02-28 Chemical Perception meeting notes

Participants

  • @Trevor Gokey

  • @Chapin Cavender

  • @David Mobley

  • @Christopher Bayly

  • @Tobias Huefner

Goals

Discussion topics

Item

Presenter

Notes

 

@Trevor Gokey

  • TG - Comparing SMARTS is easy for bonds but hard for whole molecules. Encode the differences in binary, then describe a bond using the graph union (equivalent to a bitwise OR).

  • TG - When graphs have different cardinalities, provide a SMARTS that matches chemistries with either number of nodes.

    • DM - What’s the difference between [] and [*]?

    • TG - The difference comes from the union operator. [] OR [#1] gives [#1] but [*] OR [#1] gives [*]

  • TG - The union of all matched SMARTS gives the most general parameter; removing bits from it then gives more specific parameters.

    • CB - this seems to depend on what chemical environments are present in the training set. Will this still work in practice when we can encounter new chemical environments?

    • TG - wildcards are dangerous, we should think carefully about what they mean when we specify parameters

  • TG - The objective function is composed of a physical-space term (SSR for observables) and a chemical-space term (an exponential function of the number of unique chemical environments and the number of bits needed to represent those environments).

  • TG - For alkanes, the data-driven parameter set (this work) outperforms Sage while adding only one extra parameter.

    • CC - how do you read these graphs?

    • TG - Start with the smallest set of parameters at the left of the graph. In each iteration, identify candidate splits/merges with the scoring function, fully optimize the 10 best candidates, then plot the new physical/chemical objective.

  • TG - The ground-truth best choice ([!r4:1]~[*:2]) ranked only 11th with my scoring function, but targeting only the single best candidate gets pretty close.

  • CB - it seems like your objective function favors small gains on parameters prevalent in the training dataset over large gains on sparse parameters

  • CB - I’m still worried that SMARTS like [!r4] will be too general for the wider chemical world outside of your training dataset. Can you generalize your approach to binary-encoded SMARTS to identify how to specify these SMARTS in a bigger space, e.g. Sage?

    • DM - you could give your encoder exactly the Sage training set and ask it to differentiate this parameter

    • TG - this is hard because you have to identify at what stage in the hierarchy you want to search

    • CB - this is equivalent to the wizardry of human chemical intuition. Wizards (chemists) always make some allegedly generalizable choice that is not actually generalizable. This is the problem we want to solve by automation at some point.

    • TG - has this problem been solved, at least locally, in bespokefit?

    • DM - no, because bespokefit is not meant to be transferable
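A minimal sketch of the union semantics TG described, added for illustration only. It assumes a toy encoding in which each atom primitive is a bitmask over a small, hypothetical element vocabulary (ELEMENTS, BIT, EMPTY, WILDCARD, union, and to_smarts are all made-up names, not TG's implementation). Under that assumption, [] is the all-zeros pattern and acts as the identity for OR, while [*] is the all-ones pattern and absorbs anything OR-ed with it. The last lines show the "union first, then remove bits" idea for going from the most general parameter to a more specific one.

```python
from functools import reduce

# Hypothetical vocabulary: one bit per element primitive (illustrative only).
ELEMENTS = ["#1", "#6", "#7", "#8"]
BIT = {name: 1 << i for i, name in enumerate(ELEMENTS)}

EMPTY = 0                             # "[]": no bits set, identity for OR
WILDCARD = (1 << len(ELEMENTS)) - 1   # "[*]": every bit set, absorbs under OR


def union(a, b):
    """Graph-node union in this toy encoding is a plain bitwise OR."""
    return a | b


def to_smarts(mask):
    """Render a bitmask as a SMARTS-like atom string."""
    if mask == EMPTY:
        return "[]"
    if mask == WILDCARD:
        return "[*]"
    return "[" + ",".join(n for n in ELEMENTS if mask & BIT[n]) + "]"


hydrogen = BIT["#1"]
print(to_smarts(union(EMPTY, hydrogen)))     # [#1]
print(to_smarts(union(WILDCARD, hydrogen)))  # [*]

# Most general parameter: OR together every pattern matched in the data.
observed = [BIT["#1"], BIT["#6"], BIT["#8"]]
most_general = reduce(union, observed, EMPTY)
print(to_smarts(most_general))               # [#1,#6,#8]

# A more specific parameter: clear one bit from the general pattern.
more_specific = most_general & ~BIT["#8"]
print(to_smarts(more_specific))              # [#1,#6]
```

Note that in this sketch the most general pattern only covers elements that appear in `observed`, which is one way to picture CB's concern about chemical environments absent from the training set.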
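The iteration TG described (rank candidate splits/merges with a cheap scoring function, fully optimize only the best few, keep the winner) can be sketched as below. This is an illustration under assumptions, not the actual code: search_step, cheap, and full are hypothetical names, the numbers are invented toy values, and lower is taken to mean better for both the cheap score and the full objective. The toy data is arranged so that the truly best candidate ranks 11th by the cheap score, mirroring the [!r4:1]~[*:2] observation above.

```python
def search_step(candidates, cheap_score, full_objective, top_k=10):
    """Shortlist the top_k candidates by a cheap score, then return the
    shortlisted candidate with the lowest fully optimized objective."""
    shortlist = sorted(candidates, key=cheap_score)[:top_k]
    return min(shortlist, key=full_objective)


# Toy data (invented): 12 hypothetical candidate splits named c0..c11.
names = [f"c{i}" for i in range(12)]

# Cheap score ranks candidates in name order, so c10 is 11th best.
cheap = {name: float(i) for i, name in enumerate(names)}

# Full objective after optimization: c10 is truly best, c0 is second best.
full = {name: 1.0 + i for i, name in enumerate(names)}
full["c0"] = 0.5
full["c10"] = 0.0

print(search_step(names, cheap.get, full.get, top_k=10))  # c0
print(search_step(names, cheap.get, full.get, top_k=11))  # c10
```

With a shortlist of 10 the search misses the best candidate and settles for a close second; widening the shortlist by one recovers it, at the cost of one more full optimization per iteration.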

Action items

Decisions