DM – Split by diversity = diverse molecules in both sets, or …
LW – For the validation set, I tried to pick a broad set across the whole dataset
CBy – When you say “no more than 10 molecules each”, could you have chosen fewer? Could you have chosen 0?
LW – What Riniker’s paper did was represent every atom environment as an atom pair fingerprint. Then they pooled those with different limited pool sizes.
CBy – So, if one pool had fewer than 10 molecules, they’d put them all in?
LW – Yes
DM – So, if there’s something hard to avoid, like an aromatic carbon ring, that would be included several times in the pools for other groups, right?
LW – Yes
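The pooling behaviour discussed above (a cap per atom environment, small pools taken whole, common environments reappearing across pools) can be sketched roughly as follows. This is a simplified illustration and not the procedure from Riniker’s paper: it fills each pool in sorted order instead of using fingerprint-based diversity, and the function name `build_pools`, the environment labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def build_pools(mol_envs, pool_size=10):
    """Group molecules by atom environment, capping each pool at pool_size.

    mol_envs maps a molecule label to the set of atom environments it contains.
    ASSUMPTION: pools are filled first-come in sorted order; a real selection
    would pick diverse members (e.g. by fingerprint similarity).
    """
    pools = defaultdict(list)
    for mol in sorted(mol_envs):
        for env in sorted(mol_envs[mol]):
            if len(pools[env]) < pool_size:
                pools[env].append(mol)
    return dict(pools)


# Hypothetical toy data: 12 molecules with an aromatic carbon,
# one of which also carries a rare sulfonate sulfur.
mol_envs = {f"mol{i:02d}": {"aromatic_C"} for i in range(12)}
mol_envs["mol03"] = {"aromatic_C", "sulfonate_S"}

pools = build_pools(mol_envs, pool_size=10)
selected = set().union(*pools.values())
print({env: len(mols) for env, mols in pools.items()})
```

Here the rare environment yields a pool with a single molecule (everything available goes in), while the common environment is capped at 10, and the same molecule can sit in more than one pool.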
CBy – Would this be missing any data? Are there some atom environments that were very poorly sampled?
LW – Yes, that would be from the initial dataset we looked at.
CBy – With AM1BCC, there’s a whole pathology when you have an internal hydrogen bond, so that’s when ELF is used. Did you do something ELF-like?
LW – Kinda…
SB – When we picked conformers, we used an ELF method
DM – Also worth keeping in mind that this talk is focused on picking the graph net.
MS – Is it possible that Hs are weird because they only have one connection to the graph? So they have fewer connections to their environment?
LW – Possibly. I followed Riniker’s method pretty closely. I think that should account for diversity
MS – Also, if the Hs are off, shouldn’t another atom be off as well? Or maybe it’s offset to other hydrogens in the other direction?
LW – The ambertools molecules are still computing, so those may have outliers once the data comes in.
SB – (detailed question, see video, around 25 minutes)
(poorly performing molecule slide)
PB – Did you use the high-energy conformations from SPICE?
LW – These just took graphs from SPICE, then recomputed using the toolkit
CBy – There are a lot of sulfonates/sulfonic acids represented here.
LW – I wonder if those are underrepresented in training
SB – When I did this originally, I also found sulfur to be really tricky. And that was on the industry test set, not the SPICE set.
SB – Also, I found that some of our sets had a lot of sulfonic acids, so I filtered them to keep just sulfonates, and the issue persisted. So I don’t think it was sulfonic acid in the protonation state causing problems.
LW – So you filtered sulfonic acids out of the test set?
SB – Yes
CBy – I’m also seeing a lot of carboxylic acids instead of carboxylates. That’s a frequent source of user input error if they don’t try to keep things close to physiologic pH
CBy – Re: population histograms - The more polar a bond is, the more the effect on polar atoms/hydrogens. So those may be the biggest contributors to the max charge difference plots.
CBy – When you have a monopole with a carboxylate (like RCO2-), you can get things pretty wrong as long as the charge is on the C or one of the Os, and it’s fine from an outside perspective. So is the question whether we get the point-centered charge right, or should we do something like look at freesolv, and do PB calculations… then use that as a comparison.
CBy –
I really like this work, like becoming conformationally independent
I know a lot about the systematic deficiencies in AM1BCC - They’ll underestimate aldehydes, high-valent sulfurs. So training to AM1BCC is great, but it has these known defects. So in the long run we may benefit from training to an ESP-like model.
I think fitting directly to electrostatic potentials would avoid a lot of the numerical instabilities/problems that charge method training usually encounters.
Because we use AM1BCC right now, the shortest gap is to produce a replacement that performs the same (and that’s what would be made here). The better thing would be to train to ESPs.
When I look at how the hidden layers talk to each other, I start to worry about the effects of delocalization. PB made a dataset that looks at aromatic groups with electron donating/withdrawing substituents. So AM1 can account for delocalization. Can the GNN do this? Does it?
LW – The architecture can. I haven’t probed this question directly. So I could run on PB’s dataset as a test.
SB – Some trickiness there. For one, you can only look so many neighbors out. So we by default look 4 bonds out, which can miss things.
SB – But I think LW is looking directly at some other delocalized situations, that should be informative
CBy – Para-substituted benzenes are the most direct test.
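To make the “4 bonds out” limit concrete for the para-substituted benzene case, here is a small sketch that counts bonds between atoms with a breadth-first search. It assumes that looking N bonds out means an atom only receives information from atoms at most N bonds away; the atom numbering and the heavy-atom-only graph X–C6H4–Y are hypothetical illustrations, not the model’s actual featurization.

```python
from collections import deque


def hop_distances(adjacency, start):
    """Breadth-first search: number of bonds from start to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


# Hypothetical heavy-atom graph of a para-disubstituted benzene:
# atoms 0-5 are the ring carbons, 6 is substituent X on C0, 7 is substituent Y on C3.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6), (3, 7)]
adjacency = {i: set() for i in range(8)}
for a, b in bonds:
    adjacency[a].add(b)
    adjacency[b].add(a)

N_HOPS = 4  # "4 bonds out", as mentioned in the discussion
from_x = hop_distances(adjacency, 6)
visible_from_x = {atom for atom, d in from_x.items() if d <= N_HOPS}
print(from_x[7], 7 in visible_from_x)
```

In this graph the two substituent atoms are 5 bonds apart, so under the stated assumption neither sees the other directly, although every ring carbon is within 4 bonds of both.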
SB – Direct ESP fitting: That was looked at previously, but the computational expense to generate the training set was quite high. So when we trained+tested on small fragments the performance was quite bad.
SB – What you’d kinda think is that the NN would learn a distinction between buried and unburied atoms, but it didn’t seem to do that in my case.
SB – In terms of architectures, edge+global features would be really useful, and especially giving total charge would be cool. There’s a recent paper in JCIM where they looked at solvation free energies using NNs, and it looked a lot like what we’re doing, so that may be cool to look at. There’s also a model recently released by Twitter that improves on Stable Diffusion, and they claim it can look 10 hops, so that would be really promising.
MS – In terms of distance, we’ll eventually want to capture conjugation effects. So I wonder if there’s some way to include some long-distance features but not others. Maybe it would start by designing a dataset to capture this.
SB – I think one way to do this would be to include specific features that capture this - For example we enumerate resonance forms and average formal charge. But the issue here is that resonance enumeration is combinatorial, so while I have some tricks to reduce this, it still hits limits in big systems.
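As a rough illustration of the resonance-averaged formal charge feature and why enumeration grows combinatorially, the sketch below averages formal charges over explicitly listed forms. It is a naive toy, not the enumeration code SB describes: the forms are hand-written for a carboxylate, the helper names are hypothetical, and the groups are assumed to be independent.

```python
from itertools import product


def average_formal_charges(forms):
    """Average each atom's formal charge over a list of resonance forms."""
    n_atoms = len(forms[0])
    return [sum(form[i] for form in forms) / len(forms) for i in range(n_atoms)]


# Hand-written formal charges for one carboxylate, atom order (C, O1, O2).
CARBOXYLATE_FORMS = [(0, -1, 0), (0, 0, -1)]


def enumerate_forms(n_groups):
    """Naively combine the local forms of n independent carboxylates."""
    return [sum(combo, ()) for combo in product(CARBOXYLATE_FORMS, repeat=n_groups)]


print(average_formal_charges(CARBOXYLATE_FORMS))
for n in (1, 2, 10):
    print(n, "groups ->", len(enumerate_forms(n)), "forms")
```

Each oxygen ends up with an averaged formal charge of -0.5, and the number of forms doubles with every added group, which is the scaling problem that shortcuts have to work around.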
CBy – Could we look at atoms and say “oh, this is conjugated”. So maybe each atom could have some opinion on whether it’s conjugated, and is electron donating/withdrawing.
LW – I think this is where edge features would come in really handy, though calculating partial bond orders is a slow thing unto itself.
SB – The MGilson method may capture this
CBy – Maybe aromaticity could fill this gap
LW – Agree - Fitting to ESPs is a good ultimate goal. But as an intermediate goal, training to AM1BCC is a good bridge
CBy – “Blocker to rolling this out” – Shouldn’t we look at how these charges perform on coulombic interactions? Like, look at ddEs compared to AM1BCC?
DM – Maybe not “the same”, but “as good”
JW – Agree
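One way to frame the “as good” comparison is to compute the Coulomb interaction energy between two fragments with each charge set and look at the difference. The sketch below assumes that this difference is what is meant by comparing energies against AM1BCC; the coordinates and charges are made-up toy values, and the constant is the usual approximate conversion factor for charges in e and distances in Å.

```python
import math

# Approximate Coulomb constant in kcal*Angstrom/(mol*e^2).
COULOMB_KCAL = 332.0637


def coulomb_interaction(coords_a, charges_a, coords_b, charges_b):
    """Pairwise Coulomb energy (kcal/mol) between point charges of fragments A and B."""
    energy = 0.0
    for ra, qa in zip(coords_a, charges_a):
        for rb, qb in zip(coords_b, charges_b):
            energy += COULOMB_KCAL * qa * qb / math.dist(ra, rb)
    return energy


# Hypothetical two-atom fragments on a line (coordinates in Angstrom).
coords_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
coords_b = [(4.0, 0.0, 0.0), (5.0, 0.0, 0.0)]

# Toy charge sets: a "reference" set and a slightly different "model" set.
ref_a, ref_b = [-0.50, 0.50], [-0.50, 0.50]
model_a, model_b = [-0.45, 0.45], [-0.55, 0.55]

e_ref = coulomb_interaction(coords_a, ref_a, coords_b, ref_b)
e_model = coulomb_interaction(coords_a, model_a, coords_b, model_b)
print(f"reference: {e_ref:.3f}  model: {e_model:.3f}  difference: {e_model - e_ref:.3f}")
```

A real comparison would use actual geometries and charges from both methods; the point here is only the shape of the calculation.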
MS – How does runtime look?
LW – Much faster than AM1BCC, probably a fraction of a second for even “big” small molecules
DM – Oh, potential legal problem - We may need to get approval from OpenEye since we’re using their data as a reference. I’ll reach out to Ant and Jeffs.
JW – Can we enumerate problems with rolling this out to a FF?
DM – Sulfur containing molecules -
JW – Troublesome hydrogens?
DM – Legal rights for training
MT – Ensure that a charge assignment engine can be packaged.
CBy – Runtime may be an issue here.
JW – Could we cut down runtime by using a lighter-weight ML model application engine (lighter than pytorch)
SB – I think it should be possible. Also look out for blockers related to DGL getting onto conda forge
LW – Are the people with power interested in getting DGL on conda forge?
SB – A lot of the issue is that DGL vendors a lot of subpackages that need to get ported to c-f. During my time the issue was metis. Now it may be something else.
MT – I think we shouldn’t rely on DGL getting ported, try to find a slimmed-down pytorch that we can use today.
MT – Needs to be a SMIRNOFF EP.
JW – I don’t think it needs to be put in the SMIRNOFF spec - It’ll just be a piece of software that says “I can provide AM1 charges”
JW – Would be good to put an entire protein through to check runtime and librarycharge-like outcome
MS – Crosslinked polymer that’s 100k atoms?
CBy – Maybe a big conjugated polymer?
SB – I did GLUx300 earlier, this took 5-10 seconds, most of it was resonance enumeration.
SB – Guardrails on really weird inputs - Like checking for hydrogens with a +1.01 charge, and issuing a warning
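A guardrail of this kind could look roughly like the sketch below. The function name `check_hydrogen_charges` and the 1.0 e threshold are hypothetical choices for illustration; the notes only give +1.01 on a hydrogen as an example of a value that should trigger a warning.

```python
import warnings


def check_hydrogen_charges(elements, charges, h_limit=1.0):
    """Warn about hydrogens whose partial charge magnitude exceeds h_limit.

    ASSUMPTION: h_limit=1.0 e is an illustrative threshold, not an agreed value.
    Returns the list of warning messages so callers can also inspect them.
    """
    problems = []
    for index, (element, charge) in enumerate(zip(elements, charges)):
        if element == "H" and abs(charge) > h_limit:
            problems.append(
                f"atom {index} (H) has suspicious partial charge {charge:+.2f}"
            )
    for message in problems:
        warnings.warn(message, RuntimeWarning)
    return problems


# Toy input: one hydrogen carries an implausible +1.01 charge.
issues = check_hydrogen_charges(["C", "H", "H"], [-1.11, 1.01, 0.10])
print(issues)
```

Returning the messages as well as warning keeps the check usable both interactively and inside automated pipelines.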
MS – Maybe comparison to gasteiger?
CBy – MMFF isn’t too bad either.
JW – I think RDKit and OpenEye offer both.
LW – That’s a good idea. But running MMFF on a big molecule using my macbook.
SB – Even a population analysis would be good.
JW – We should reach out to Yuanqing and see how we can align efforts with espaloma.
JW – Other datasets?
SB – Chodera lab did NCI AM1BCC, maybe some others
DM – Maybe Enamine?
SB – Enamine technically says “don’t pull all our stuff down”, but we emailed them and they said it’s fine. I’ll forward you the email.