2020-07-23 Force Field Release meeting notes

Date

Jul 23, 2020

Participants

  • @Hyesu Jang

  • @Christopher Bayly

  • @David Mobley

  • @Jeffrey Wagner

  • @Lee-Ping Wang

  • @Jessica Maat (Deactivated)

Goals

  •  

Discussion topics

Item

Presenter

Notes


Resonance issue in the usage of SMARTS

 

  • CB – This is called the “gauche” and “stereoelectronic” effect. FCCF and OCCO want to be gauche. If there are three of the same substituent, they compete to be gauche and flatten the energy landscape. In sugars, this effect is manifested in another way.

  • CB – What we need is a parameter that says “any of these end groups”: [O|F|C|Cl|N:1]-[C:2](-[C|H])(-[C|H])-[C:3](-[C|H])(-[C|H])-[O|F|C|Cl|N:4]

    • CB – Gauche effect is only dominant in these cases

    • CB – Nitrogen would need to be aliphatic – May need more detail there.

    • CB – Bromine may or may not need to be included in 1- and 4-spots. Need to look at the Pauling electronegativity.

  • CB – In version 1.2, when we compare t5 and t6, their values are close. Maybe these could be merged.

  • CB – Imagine you have a torsion that’s OCCF – Then we want to capture that effect.

  • DM – This will lump together cases with the gauche effect and cases without.

  • CB – True, but that should be the minority of cases

  • CB – It’ll be good to look specifically at whether the gauche effect is captured in the FF benchmarking

  • DM – Gauche-effect parameter could be far down in the FF, to make sure it only applies when necessary.

  • DM – How important is this relative to other priorities?

  • CB – This effect will be seen a lot in pharmaceutical molecules. It can be a major factor in driving cyclohexane rings away from the chair conformation. It may be low-effort to simply throw in a parameter for this and assume that our test set has some representatives of this.

  • HJ – I notice that terminal groups in this effect are sp3. For some reason they have a periodicity of 2.

  • CB – I don’t recall. I think that you have a better feel for this now.

  • CB – Remember that you want to un-favor the trans conformation.

  • LPW – Re the twofold term: if you have three of these terms that are all offset by 120 degrees, then they may all cancel out.

  • LPW – I’m interested in checking for representatives of this

  • CB – Do we freeze core electrons? We may need these unfrozen to capture the effect for I and Br.

  • DM – Not sure that the accuracy gain from this is the lowest-hanging fruit to pursue right now.

  • CB – This will win us lots of med chem cred

  • LPW – I’m interested in building a QM dataset of molecules with linearly independent energy contributions. I’m getting concerned that our torsion parameters keep occurring in the same combination (like, the same three torsions always occur around the same central bond), so they’re not really fitted independently.

    • LPW – Eg. A dataset of ethane and chloroethane – The two torsion profiles are “linearly independent”, since you can get HCCH and HCCCl parameters very accurately. But it’s not clear that the HCCCl parameter would do a good job of describing trichloroethane.

    • CB – In eg. OPLS3, everything gets a bespoke parameter. But in a general FF, we’re “binning” bespoke parameters to make them broadly applicable. So our question right now is “where do we set the edges of the bin?”. The high-level answer is “where the parameters co-occur the most”. But the chemical universe is so big that rigorously defining that is intractable. We want to end up with a model that is “right”, and the reductionist buildup (characterized by the linearly independent approach) is not guaranteed to get us the most “right” answers in med chem space.

    • DM – Which type of dataset gives a better FF is an open question, but it is something we can answer with our infrastructure.

    • LPW – I wouldn’t argue that the “reductionist buildup” is the right way to go from the beginning. But I’m not asking whether general parameters need to be split. I’m asking whether certain parameters always occur together.

    • CB – Good question. Straddles cheminformatics and physics.

    • CB – Maybe an analysis of correlation between parameter occurrences.

    • HJ – I did this analysis for the 2nd gen training set. Many parameters always co-occurred around the same central bond. I’ll continue this analysis with a formal correlation analysis.

    • LPW – I could see this working out. We’d have a new QM dataset that is built to ensure maximal independence between torsion parameters.

    • CB – My experience with AM1BCC is that these things are never independent, and the individual torsion terms we fit will end up being fit to correct for the non-linearities that are really inherent in the system.

    • LPW – We want to find the sweet spot between treating these as being “completely linearly independent” and “impossible to combine in a linear fashion”, and I’d like to ensure we have representatives of both homogeneous torsions and inhomogeneous torsions in molecules.

    • DM – This could turn into a lot of research projects as we move forward. It’s not clear that there’s consensus on how to do this in FF design.
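A minimal sketch of the kind of co-occurrence check discussed above (whether certain torsion parameters always appear together around the same central bond). This is not HJ's actual analysis. The data layout, the parameter labels (tA, tB, ...), and the choice of the Jaccard index as the co-occurrence measure are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical input: one entry per central bond in a training set, mapped to
# the set of torsion parameter labels applied around that bond.
# The bond keys and labels (tA, tB, ...) are invented for this example.
bond_params = {
    "mol1:bond(1,2)": {"tA", "tB", "tC"},
    "mol2:bond(3,4)": {"tA", "tB", "tC"},
    "mol3:bond(0,1)": {"tA", "tB", "tC"},
    "mol4:bond(2,5)": {"tA", "tD"},
    "mol5:bond(1,3)": {"tD"},
}


def jaccard_cooccurrence(bond_params):
    """Return {(p, q): Jaccard index} for every pair of parameters.

    The index is (# bonds using both p and q) / (# bonds using p or q).
    A value of 1.0 means the two parameters never occur apart in this set.
    """
    bonds_with = {}
    for bond, params in bond_params.items():
        for p in params:
            bonds_with.setdefault(p, set()).add(bond)

    scores = {}
    for p, q in combinations(sorted(bonds_with), 2):
        both = bonds_with[p] & bonds_with[q]
        either = bonds_with[p] | bonds_with[q]
        scores[(p, q)] = len(both) / len(either)
    return scores


scores = jaccard_cooccurrence(bond_params)

# Pairs that always occur together are candidates for "not fitted independently".
always_together = sorted(pair for pair, s in scores.items() if s == 1.0)
print(always_together)
```

In this toy set, tB and tC never occur apart, so any fit sees only their sum. Adding a molecule where one occurs without the other would break that degeneracy, which is the idea behind building a dataset for maximal independence.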

Design of a new informative set

DLM

  • DM – Jordan Ehrman has previously crafted an “informative” set by finding molecules that MM-minimized to confs far from QM minimum.

    • DM – But, as we pick off problems in this set, we may want to make a NEW informative set that finds the new weaknesses in our FF.

  • Use QCSubmit/QCEngine? Now has GAFF, GAFF2, MMFF (via RDKit)

  • With fragmentation?

  • Consider some of Enamine, not just eMolecules?

    • DM + CB – Enamine is almost entirely “pharmaceutically relevant”, probably a subset of relevant space. eMolecules may include non-relevant mols.

    • CB – Fully agree that “informative” sets are the biggest value we can find. JE’s set is “divergent”, where force fields lead in different directions. Those divergent sets are going to end up focusing on weaknesses in MMFF and GAFF, and those aren’t going to change. So we need to start targeting OPLS3. This is hard because they will do bespoke fitting, and we can’t match that with a general FF. But we can make up some of the gap by focusing on Enamine chemistry, which is more likely to be pharmaceutically relevant, and we can use eMolecules as a set of guardrails.

    • DM – So, we want to reduce how much we compare ourselves to other FFs. We should focus on good coverage of everything in Enamine, and lesser coverage of everything in eMolecules. And then we want to find divergence from QM, not MM, since as CB says, deficiencies in MM force fields will become constant relative to our performance.

    • CB – We’ve started off as physics people, but our SMARTS work has become more and more cheminformatics. Automating the search for outlier SMARTS will be very helpful.

    • DM – VL looked into which parameters are overrepresented in molecules with high errors. Maybe more work along those lines would be good. Could do clustering+fingerprinting, and see which clusters have high errors.

    • CB – MACCS keys should be a good method for this.

  • Personnel: Recruit undergrad? Do we have one on hand?
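A rough sketch of the "which parameters are overrepresented in molecules with high errors" idea mentioned above. This is not VL's actual method. The record layout, the error cutoff, the parameter labels, and the simple frequency-ratio enrichment score are all assumptions chosen to keep the example small.

```python
from collections import Counter

# Hypothetical records: (molecule id, error vs. QM in arbitrary units,
# set of parameter labels applied to that molecule). All values are invented.
records = [
    ("m1", 0.2, {"tA", "tB"}),
    ("m2", 2.5, {"tA", "tD"}),
    ("m3", 3.1, {"tD"}),
    ("m4", 0.4, {"tB", "tC"}),
    ("m5", 0.3, {"tA", "tC"}),
    ("m6", 2.8, {"tD", "tC"}),
]


def enrichment(records, error_cutoff):
    """Return {parameter: enrichment ratio}.

    The ratio is the fraction of high-error molecules using the parameter
    divided by the fraction of all molecules using it. Values above 1 mean
    the parameter is overrepresented among high-error molecules.
    """
    all_counts = Counter()
    high_counts = Counter()
    n_high = 0
    for _, err, params in records:
        all_counts.update(params)
        if err > error_cutoff:
            high_counts.update(params)
            n_high += 1
    if n_high == 0:
        return {}
    n_all = len(records)
    return {
        p: (high_counts[p] / n_high) / (all_counts[p] / n_all)
        for p in all_counts
    }


# The cutoff of 1.0 is an arbitrary choice for this toy data.
ratios = enrichment(records, error_cutoff=1.0)
ranked = sorted(ratios, key=ratios.get, reverse=True)
print(ranked[0], ratios[ranked[0]])
```

The same bookkeeping would apply if the parameter sets were replaced by cluster labels from a fingerprint-based clustering (e.g. MACCS keys), in which case the ratio would flag clusters rather than parameters.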

Action items

Decisions