2020-08-28 Chemical Perception meeting notes

Date

Aug 28, 2020

Participants

  • @Hyesu Jang

  • @Trevor Gokey

Goals

  •  

Discussion topics

Item

Presenter

Notes


Current state of chemical perception efforts

Trevor Gokey

  • TG – I’d like to start looking at automating chemical perception. Like merging bond types and stuff.

  • JF – I had done some initial work looking at merging/splitting GB types. I used SMIRKS typing trees, but ran out of steam. I could improve documentation of what’s there.

  • DM – IIRC, we could get this to work for simple typing, but if it were expanded to bonds+angles, the combinatorial space gets really big.

  • CCB – Yes, the conclusion from the SMIRKY paper is that chemical space is REALLY big. It would be effectively impossible to run that much QM.

  • JW – Would 100x cheaper QM calcs (e.g. ANI) unblock the SMIRKY problems?

    • CCB – No, the search space is WAY too big. Even without running ANY QM, it took a week to get anything useful done.

  • (General) – Hard to know how to take steps in chemical space

  • TH – My idea: parameter type definitions should to some degree resemble the underlying physics, so could QM results inform typing? I'm looking at an atoms-in-molecules (AIM) approach, which considers things like the electron density distributed around atoms, bond critical points (the electron density maximum between atoms), and bond orders. This seems like it would give us minimal bias.

    • DM – How would this be coupled to substructure searches? Would there be a clustering step at the end of this?

    • TH – Yes, there would be a clustering step at the end of this. So, like, if we look at all C-O-C angles, and do AIM analysis of all of them, then we could cluster them into different subsets. Then we’d be tasked with finding patterns that make the same distinction, and I’d like to use Chemper for that.

    • CCB – That’s exactly what Chemper is built for. One of our big barriers was automating the clustering in such a way that it has a little chemical intuition, since big datasets will have really complex, borderline cases near the boundary between types.

    • DM – One idea is to exclude the borderline points from the initial pattern generation, and then do the last bit by hand

    • CIB – Looking at the example of the amide bond, I’d (lost the trail)

      • (something something) gradient of force constant in the fitting. So when FB is doing a fit, it has a gradient, so it sees whether each instance of a parameter wants to make the value bigger or smaller. We could then build a decision tree to try to distinguish instances that have positive and negative gradient contributions (or a great many decision trees, i.e. a random forest). And/or we could look at the maximal instances of parameters that want to make the torsion bigger/smaller, and start generalizing them until they cover the other cases as well. (A minimal sketch of this idea follows at the end of these notes.)

    • CCB – In Chemper, once you have the descriptors, you could cluster however you want. So in principle that would work.

    • TG – In my mind, I’d be using bit vectors, which are analogous to the gradient-contribution-sign grouping. There’s probably some overlap, but I’ll need to flesh it out.

    • JF – Really like the idea of using gradient information. It would be cool to couple it to, instead of a decision tree, a list of “proposals” for how the pattern could be modified. Then, instead of uniformly selecting a random proposal, we could be a little informed about which “proposals” are most likely to be helpful in separating parameter types.

  • MKG – On one hand, if you look at doing random sampling, it’s guaranteed to be unbiased. But if we do nonuniform sampling, we might wind up with biased results. Are we concerned about that?

    • DM – Agree that results will be biased, but we currently lack any systematic method to improve typing, so we’ll take the first thing that works

  • DM – Summary –

    • TH’s approach – Let’s do typing from scratch

    • TG’s and others’ approaches – Let’s look at how we can change our existing typing

  • Continue fleshing out proposals and we’ll discuss further at subsequent meetings.
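A minimal sketch of the gradient-sign/decision-tree idea discussed above, using scikit-learn on made-up toy data: each occurrence of a torsion parameter is labeled by the sign of its gradient contribution from a ForceBalance-style fit and described by a few hand-picked environment features, and a shallow tree is asked to separate the two groups. The feature names and data are purely illustrative, not from the meeting; a split that cleanly separates the signs would suggest where the parameter type could be divided, and Chemper could then be asked to find SMIRKS patterns reproducing that grouping.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical per-instance environment features for one torsion type:
# [is_amide_nitrogen, nitrogen_in_ring, neighbor_is_carbonyl]
features = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
    [0, 1, 1],
])

# Sign of each instance's gradient contribution in a (hypothetical) fit:
# +1 = "wants a larger force constant", -1 = "wants it smaller".
grad_sign = np.array([+1, +1, -1, -1, +1, -1])

# Fit a shallow tree and print the learned splits.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(features, grad_sign)
print(export_text(tree, feature_names=[
    "is_amide_nitrogen", "nitrogen_in_ring", "neighbor_is_carbonyl"]))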

Fixing amide issue update

Hyesu Jang

  • HJ presents slides

  • CB – What’s in the optimization set that’s looking for a non-planar amide?

  • DM – Might it be a perception issue? Are planar amides getting lumped together with something that shouldn’t be planar?

    • JW – Might be in Trevor’s plots.

    • TG shows plots of systematic changes between s99F and openff-1.0.0

    • DM – Maybe we could look at the parameter assignments (see the labeling sketch at the end of these notes)

  • CIB – There seems to be a fundamental disconnect between what our optimized parameters do and what we’re fitting to. One other way to deal with this would be to dump a bunch of amides into the fitting data. So, for example, would our fitting machinery be able to fit planar amides if that’s all it’s trained on?

  • MKG – Could construct a minimal planar toy system, and then once we understand its behavior, add one that’s nonplanar due to sterics and see how that’s handled in the fitting.

  • HJ – I’ll look into minimal working examples of this.

  • CIB – We may also want to check PROPER torsions to ensure that patterns involving carbonyl carbons and amide nitrogens all evaluate to planar.

  • HJ shows plots of t70 angle distributions. Roughly 90% of points appear to group at 0° and ±180°.

  • CCB – Have we refit improper angles?

    • HJ – No. But we do have improper patterns specifically to match the amide carbon and nitrogen.

  • DM – Would like to look at the molecules that are caught by the involved parameters, and ensure that they’re all (trailed off)

  • Summary – We think it’s one (or more) of the following:

    • There’s a chemical perception problem and some non-planar nitrogens are getting lumped in with amides

    • There’s a dataset composition problem – We just don’t have enough planar amides in the training data

    • There’s an objective function weight problem – The data is there, but the objective function contributions are too small and get overwhelmed by other terms.
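As a follow-up to the parameter-assignment suggestions above, a minimal sketch of how the molecules hit by a given torsion could be listed with the OpenFF toolkit's label_molecules. The SMILES below are placeholder amides, not the actual training set; the import paths assume the current openff.toolkit namespace with the openff-forcefields package installed, and "t70" is the torsion id from the slides.

from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

forcefield = ForceField("openff-1.0.0.offxml")

# Placeholder amide SMILES; the real check would run over the training set.
for smiles in ["CC(=O)NC", "CC(=O)N(C)C", "O=C(c1ccccc1)N(C)C"]:
    molecule = Molecule.from_smiles(smiles)
    # label_molecules returns one dict per molecule, keyed by parameter tag.
    labels = forcefield.label_molecules(molecule.to_topology())[0]
    hits = [
        (atom_indices, parameter.smirks)
        for atom_indices, parameter in labels["ProperTorsions"].items()
        if parameter.id == "t70"
    ]
    if hits:
        print(smiles, "->", hits)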

Carbyne issue

Chris Bayly

 

Action items

Decisions