2020-09-18 Chemical Perception meeting notes

Date

Sep 18, 2020

Participants

  • @Hyesu Jang

Goals

  • Standardization of datasets and infrastructure for testing various atom-typing approaches

Discussion topics

Item

Notes

standardization of datasets and infrastructure

TG – Limit to the simplest possible chemical space: hydrocarbons plus some oxygenated molecules (CHO compounds).

DM – Start from Josh’s dataset / AlkEthOH. AlkEthOH is a dataset of molecules that CIB drew; they weren’t selected for having associated experimental data.

  • TH – What is the heavy atom count for these mols?

    • CIB – <18

  • TG – Most interested to see if automated typing can identify ethers vs other contexts for O

  • CCB – There’s an expanded test set in the SI of the SMIRKY/chemical perception paper (https://datadryad.org/stash/dataset/doi:10.7280/D1CD4C), “PhEthOH”, which also includes aromatic compounds.
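TG's question above, whether automated typing can tell ethers apart from other oxygen contexts, can be illustrated with a minimal sketch. It assumes a hand-written molecular graph (an element list plus bond pairs) and classifies each oxygen purely by the elements of its neighbors. The function names and the graph format are hypothetical and are not taken from any toolkit discussed in the meeting; a real workflow would more likely use SMARTS patterns in a cheminformatics toolkit.

```python
from collections import Counter


def build_graph(elements, bonds):
    """Return {atom index: (element, sorted neighbor indices)}.

    `elements` is a list of element symbols and `bonds` a list of index pairs.
    This toy representation is an assumption made for illustration only.
    """
    neighbors = {i: [] for i in range(len(elements))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return {i: (elements[i], sorted(neighbors[i])) for i in neighbors}


def classify_oxygen(graph, idx):
    """Label one oxygen atom by the elements directly bonded to it."""
    element, nbrs = graph[idx]
    if element != "O":
        raise ValueError(f"atom {idx} is {element}, not O")
    nbr_elements = Counter(graph[n][0] for n in nbrs)
    if nbr_elements == Counter({"C": 2}):
        return "ether"      # C-O-C (connectivity alone cannot exclude esters)
    if nbr_elements == Counter({"C": 1, "H": 1}):
        return "hydroxyl"   # C-O-H
    if nbr_elements == Counter({"C": 1}):
        return "carbonyl"   # single carbon neighbor, assumed C=O in neutral CHO
    return "other"


def oxygen_contexts(graph):
    """Map every oxygen index in the graph to its context label."""
    return {i: classify_oxygen(graph, i)
            for i, (element, _) in graph.items() if element == "O"}


# Dimethyl ether, CH3-O-CH3, with explicit hydrogens.
dimethyl_ether = build_graph(
    ["C", "O", "C", "H", "H", "H", "H", "H", "H"],
    [(0, 1), (1, 2), (0, 3), (0, 4), (0, 5), (2, 6), (2, 7), (2, 8)],
)

# Ethanol, CH3-CH2-OH, with explicit hydrogens.
ethanol = build_graph(
    ["C", "C", "O", "H", "H", "H", "H", "H", "H"],
    [(0, 1), (1, 2), (2, 3), (0, 4), (0, 5), (0, 6), (1, 7), (1, 8)],
)

print(oxygen_contexts(dimethyl_ether))
print(oxygen_contexts(ethanol))
```

First-neighbor connectivity is deliberately crude: an ester oxygen (C-O-C next to a C=O) would also be labeled "ether" here, which is the kind of distinction a richer typing scheme would need to capture.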

TH – What reference quantities for these molecules are being used? Hydration free energies?

JF – Undecided. Would be good to recover bond spring constants or lengths.

TG – Strategy on ring structures? Is it OK if types are shared by different ring sizes, so some have correct geometry but high strain energy?

  • CIB – Not clear. I’ve experienced both approaches. My rule is “Make models from the outset that are robust to this sort of variability”. I encourage the use of datasets for initial testing instead of individual molecules.

  • TG – I’ll submit PhEthOH to QCA

    • TH – Could QC molecules contain electron density? That would help me do my atoms-in-molecules approach. I’ve been able to pull down optimized geometries from QCA and regenerate their e- densities.

      • (TH uses psi4 already, so it will be easy to modify his workflow once QCA stores wavefunctions)

    • TG – There are some hard blockers on wavefunction storage on QCA. So it will come eventually, but it’s not here now.

    • CIB – One utility of the approach: looking at the conformational variability of the matrix within a molecule.

    • TH – This is a new topic to me, I’ll look into the conformation dependence of the QM properties I calculate in this dataset

      • CIB – Especially look at how conformation affects conjugated vs. nonconjugated molecules.

    • CIB – We’re well aware of conjugation/charge/e- density transfer and how it affects conformational energetics of molecules. Note that AIM analysis/bond critical points may differ in results from WBO. We should try to store Mulliken charges in QCA.

    • CIB – Would be good to label the parameters that would be assigned to this set and record parameter coverage. Then the newly-derived SMARTS could be compared to this typing distribution.

  • CIB – Very valuable to TH’s approach will be data involving conjugation.

    • TH – Maybe Parsley training set? Then I’ll have optimized geometries

    • (General) Parsley training sets are pretty large, may not be totally suitable here

    • TH – I don’t want super large datasets, instead want small diverse set (such as conjugated systems)

    • CIB – Substructure search, SMARTS string search, or the MACCS keys method.

    • TH will find interesting chemistry out of the large Parsley training data sets, and will share the list with TG and JF

  • JW – What comparisons will these methods be building toward? We should establish expectations for how these newly proposed typing schemes will be benchmarked, and decide how to select which ones to use in main line FFs.

    • (General) – More than CHO

    • TG – Maybe Parsley training and test sets?

    • CIB – Our datasets have grown organically into one large dataset. But maybe now it would be useful to have intermediate sets. Each of these methods is substantially different. Maybe they need different datasets. For example, TH could focus on methyl acetate and N-methyl acetamide which exhibit geometry-dependent conjugation.

    • TG – I’ll look at picking up Chaya’s biphenyl set. Especially biphenyls consisting of CHO.

    • TH – I’ll look at using Chaya’s biphenyl set. And I’ll look at how other properties correlate with AM1-Wiberg bond order.

  • CIB – TG, I expect that you’ll find that C=O groups hugely expand the chemical space of CHO compounds.

    • TG – Agree. Also interested to add conjugated

  • JW – Let’s make sure we set expectations appropriately – Which of these approaches are most amenable to quickly merging into the main line of FF, versus which require lots of work. The people proposing new schemes should know what the criteria are for their results to be incorporated. Also, the parameters are deeply interconnected. The meaning of one parameter depends on other, potentially overlapping, parameters above and below it in the same FF.

    • CIB – Agree that typing changes will be subtle and deeply interconnected.

    • DM –

    • TG – I’m hoping that the binary tree approach that I’m taking will explicitly consider this dependence.

    • CIB – Agree that we’re in kinda a “local minimum” of FF space, and it’s very hard to distinguish how different two FFs are. So there’s no such thing as a “small FF change”

    • TH – This is likely to be a problem for my approach – AIM typing approach will be hard to convert into something that goes into an OFF file.

      • JW – We should be able to plan this development in a timely way, as long as we’re continuously benchmarking and try to predict when we want to fold these changes in.

  • TH – For the conformationally dependent properties, how sensitive is the FF performance to the conformational dependence? Given the way they are clustered using a Gaussian mixture model, the conformational dependency might be captured in the model.

  • CIB – Similar question to what Chaya faced in her research (WBO dependence).

  • TH – Besides the dataset, there might be overlap between codes. Maybe in the next session we should talk about how we can optimize the pipeline: a high-level discussion on how to structure the code.

  • CIB – Would be good to have Jeff on the call.

  • Link to Chaya’s manuscript: https://chayast.github.io/wbo-manuscript/
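CIB's suggestion in the discussion above, to label the parameters that would be assigned to the set and record parameter coverage, can be sketched roughly as follows. This is a minimal illustration only: it assumes the assignments have already been produced elsewhere and are available as a mapping from molecule name to a list of parameter IDs. The function name, the molecule names, and IDs such as "b1" are placeholders, not actual force field assignments.

```python
from collections import Counter


def parameter_coverage(assignments, all_parameter_ids):
    """Count how often each parameter is used across a dataset.

    assignments: {molecule name: [parameter id, ...]} (hypothetical format)
    all_parameter_ids: every parameter id defined in the force field
    Returns (usage counts, molecules per parameter, sorted unused ids).
    """
    usage = Counter()
    molecules_per_param = Counter()
    for param_ids in assignments.values():
        usage.update(param_ids)                  # every occurrence
        molecules_per_param.update(set(param_ids))  # once per molecule
    unused = sorted(set(all_parameter_ids) - set(usage))
    return usage, molecules_per_param, unused


# Placeholder assignments for two unnamed molecules; the ids are illustrative.
example_assignments = {
    "mol_A": ["b1", "b1", "b2"],
    "mol_B": ["b1", "b3"],
}
example_parameter_ids = ["b1", "b2", "b3", "b4"]

usage, molecules_per_param, unused = parameter_coverage(
    example_assignments, example_parameter_ids
)
for pid in example_parameter_ids:
    print(pid, usage[pid], molecules_per_param[pid])
print("unused:", unused)
```

A table like this gives a baseline typing distribution; the same counting could then be run on the types produced by newly derived SMARTS so that the two distributions can be compared side by side.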

Action items

Decisions