DM – Split by diversity = diverse molecules in both sets, or …
LW – For the validation set, I tried to pick a broad set across the whole dataset
CBy – When you say “no more than 10 molecules each”, could you have chosen fewer? Could you have chosen 0?
LW – What Riniker’s paper did is compute the atom environments as atom-pair fingerprints. Then they pooled them with different limited pool sizes.
CBy – So, if one pool had less than 10 molecules, they’d put them all in?
LW – Yes
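The capped pooling described in this exchange can be sketched roughly as follows. This is only an illustration, not the procedure from Riniker’s paper or LW’s actual code: the environment labels, the `select_capped_pools` helper, and the first-come selection order are all assumptions (the real selection was fingerprint-based).

```python
from collections import defaultdict


def select_capped_pools(mol_envs, max_per_env=10):
    """Group molecules by atom environment, keeping at most max_per_env per pool.

    mol_envs maps a molecule name to the set of (hypothetical) atom-environment
    labels found in that molecule. Molecules are taken in input order, which is
    an assumption made purely for illustration.
    """
    pools = defaultdict(list)
    for mol, envs in mol_envs.items():
        for env in sorted(envs):
            if len(pools[env]) < max_per_env:
                pools[env].append(mol)
    return dict(pools)


# Made-up example data.
example = {
    "benzene": {"aromatic C", "aromatic H"},
    "toluene": {"aromatic C", "aromatic H", "methyl C"},
    "phenol": {"aromatic C", "aromatic H", "hydroxyl O"},
}
pools = select_capped_pools(example, max_per_env=2)

# The final set is the union of all pools.
selected = sorted({mol for members in pools.values() for mol in members})
```

In this toy case the "hydroxyl O" pool has fewer molecules than the cap, so it is taken whole, and phenol enters the selection through that pool even though the "aromatic C" pool was already full. That is the effect DM asks about in the next question: common environments end up represented many times via the pools for other groups.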
DM – So, if there’s something hard to avoid, like an aromatic carbon ring, that would be included several times in the pools for other groups, right?
LW – Yes
CBy – Would this be missing any data? Are there some atom environments that were very poorly sampled?
LW – Yes, that would be from the initial dataset we looked at.
CBy – With AM1BCC, there’s a whole pathology when you have an internal hydrogen bond, so that’s when ELF is used. Did you do something ELF-like?
LW – Kinda…
SB – When we picked conformers, we used an ELF method
DM – Also worth keeping in mind that this talk is focused on picking the graph net.
MS – Is it possible that Hs are weird because they only have one connection to the graph? So they have fewer connections to their environment?
LW – Possibly. I followed Riniker’s method pretty closely. I think that should account for diversity
MS – Also, if the Hs are off, shouldn’t another atom be off as well? Or maybe it’s offset to other hydrogens in the other direction?
LW – The AmberTools molecules are still computing, so those may have outliers once the data comes in.
SB – (detailed question, see video, around 25 minutes)
(poorly performing molecule slide)
PB – Did you use the high-energy conformations from SPICE?
LW – These just took graphs from SPICE, then recomputed using the toolkit
CBy – There are a lot of sulfonates/sulfonic acids represented here.
LW – I wonder if those are underrepresented in training
SB – When I did this originally, I also found sulfur to be really tricky. And that was on the industry test set, not the SPICE set.
SB – Also, I found that some of our sets had a lot of sulfonic acids, so I filtered those out to keep just the sulfonates, and the issue persisted. So I don’t think it was the sulfonic acid protonation state causing problems.
LW – So you filtered sulfonic acids out of the test set?
SB – Yes
CBy – I’m also seeing a lot of carboxylic acids instead of carboxylates. That’s a frequent source of user input error if they don’t try to keep things close to physiologic pH
CBy – Re: population histograms - The more polar a bond is, the more the effect on polar atoms/hydrogens. So those may be the biggest contributors to the max charge difference plots.
CBy – When you have a monopole with a carboxylate (like RCO2-), you can get things pretty wrong as long as the charge is on the C or one of the Os, and it’s fine from an outside perspective. So is the question whether we get the point-centered charge right, or should we do something like look at FreeSolv, and do PB calculations… then use that as a comparison.
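A toy calculation illustrates the point about per-atom charges versus the outside view. This is a sketch under stated assumptions: the three-atom geometry, both charge sets, and the probe position are made up, units are arbitrary, and a single probe point stands in for a real ESP grid or PB calculation.

```python
import math


def esp_at_point(charges, coords, probe):
    """Coulomb potential (arbitrary units) at `probe` from a set of point charges."""
    return sum(q / math.dist(xyz, probe) for q, xyz in zip(charges, coords))


# Made-up carboxylate-like geometry: C at the origin, two O atoms.
coords = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (1.0, -1.0, 0.0)]
charges_a = [0.0, -0.5, -0.5]  # charge shared evenly between the oxygens
charges_b = [0.0, -0.8, -0.2]  # same total charge, distributed differently
probe = (10.0, 2.0, 0.0)       # a point well outside the group

# Largest per-atom disagreement between the two charge sets.
max_atom_diff = max(abs(a - b) for a, b in zip(charges_a, charges_b))

# Disagreement in the potential felt at the outside probe point.
esp_diff = abs(
    esp_at_point(charges_a, coords, probe) - esp_at_point(charges_b, coords, probe)
)
```

Here the two charge sets differ by 0.3 on each oxygen, yet the potential at the distant probe differs by only about 0.0015 in the same units. A per-atom charge metric would flag this as a large error, while an ESP-style comparison would barely register it.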
CBy –
I really like this work, like becoming conformationally independent
I know a lot about the systematic deficiencies in AM1BCC - They’ll underestimate aldehydes, high-valent sulfurs. So training to AM1BCC is great, but it has these known defects. So in the long run we may benefit from training to an ESP-like model.
I think fitting directly to electrostatic potentials would avoid a lot of the numerical instabilities/problems that charge method training usually encounters.
Because we use AM1BCC right now, the shortest gap is to produce a replacement that performs the same (and that’s what would be made here). The better thing would be to train to ESPs.
When I look at how the hidden layers talk to each other, I start to worry about the effects of delocalization. PB made a dataset that looks at aromatic groups with electron donating/withdrawing substituents. So AM1 can account for delocalization. Can the GNN do this? Does it?
LW – The architecture can. I haven’t probed this question directly. So I could run on PB’s dataset as a test.
SB – Some trickiness there. For 1, you can only look so many neighbors out. So we by default look 4 bonds out, which can miss things.
SB – But I think LW is looking directly at some other delocalized situations, that should be informative
CBy – Para-substituted benzenes are the most direct test.
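To see why para-substituted benzenes probe the “4 bonds out” limit, one can count bonds between the two substituents. The sketch below uses a hand-built graph in which `X` and `Y` are placeholder labels for the substituent atoms attached to the ring. Treating “4 bonds out” as a simple graph-distance cutoff is an assumption made for illustration, not a description of the actual model.

```python
from collections import deque


def bond_distances(adjacency, start):
    """Breadth-first search returning the number of bonds from `start` to each atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nbr in adjacency[atom]:
            if nbr not in dist:
                dist[nbr] = dist[atom] + 1
                queue.append(nbr)
    return dist


# Heavy-atom graph of a para-disubstituted benzene: X on C1, Y on C4.
bonds = [
    ("X", "C1"), ("C1", "C2"), ("C2", "C3"), ("C3", "C4"),
    ("C4", "C5"), ("C5", "C6"), ("C6", "C1"), ("C4", "Y"),
]
adjacency = {}
for a, b in bonds:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

dist = bond_distances(adjacency, "X")
n_hops = 4  # the default look-out distance mentioned by SB
visible = {atom for atom, d in dist.items() if d <= n_hops}
```

The substituent atom `Y` sits 5 bonds from `X`, so with a 4-bond cutoff `X` sees the whole ring but not the para substituent itself.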
SB – Direct ESP fitting: That was looked at previously, but the computational expense to generate the training set was quite high. So when we trained+tested on small fragments the performance was quite bad.
SB – What you’d kinda think is that the NN would learn a distinction between buried and unburied atoms, but it didn’t seem to do that in my case.
SB – In terms of architectures, edge+global features would be really useful, and especially giving total charge would be cool. There’s a recent paper in JCIM where they looked at solvation free energies using NNs, and it looked a lot like what we’re doing, so that may be cool to look at. There’s also a model recently released by Twitter that improves on Stable Diffusion, and they claim it can look 10 hops, so that would be really promising.
MS – In terms of distance, we’ll eventually want to capture conjugation effects. So I wonder if there’s some way to include some long-distance features but not others. Maybe it would start by designing a dataset to capture this.
SB – I think one way to do this would be to include specific features that capture this - For example we enumerate resonance forms and average formal charge. But the issue here is that resonance enumeration is combinatorial, so while I have some tricks to reduce this, it still hits limits in big systems.
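The averaging step of such a feature is simple once the resonance forms are known. The sketch below assumes the forms are supplied by hand as lists of per-atom formal charges. The `average_formal_charges` helper is hypothetical and is not SB’s implementation. Enumerating the forms is the combinatorial part and is not shown.

```python
def average_formal_charges(resonance_forms):
    """Average per-atom formal charges over a list of resonance forms.

    Each form is a list of formal charges in the same atom order.
    """
    n_forms = len(resonance_forms)
    n_atoms = len(resonance_forms[0])
    return [
        sum(form[i] for form in resonance_forms) / n_forms for i in range(n_atoms)
    ]


# Carboxylate, atom order C, O1, O2: the negative charge sits on either oxygen.
forms = [
    [0, -1, 0],
    [0, 0, -1],
]
avg = average_formal_charges(forms)
```

For the carboxylate this gives 0 on the carbon and -0.5 on each oxygen, which is the kind of symmetric per-atom input feature the discussion has in mind.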
CBy – Could we look at atoms and say “oh, this is conjugated”. So maybe each atom could have some opinion on whether it’s conjugated, and is electron donating/withdrawing.
LW – I think this is where edge features would come in really handy, though calculating partial bond orders is a slow thing unto itself.
SB – The MGilson method may capture this
CBy – Maybe aromaticity could fill this gap
LW – Agree - Fitting to ESPs is a good ultimate goal. But as an intermediate goal, training to AM1BCC is a good bridge
CBy – “Blocker to rolling this out” – Shouldn’t we look at how these charges perform on coulombic interactions? Like, look at ddEs compared to AM1BCC?
DM – Maybe not “the same”, but “as good”
JW – Agree
MS – How does runtime look?
LW – Much faster than AM1BCC, probably a fraction of a second for even “big” small molecules
DM – Oh, potential legal problem - We may need to get approval from OpenEye since we’re using their data as a reference. I’ll reach out to Ant and Jeffs.
JW – Can we enumerate problems with rolling this out to a FF?
DM – Sulfur containing molecules -
JW – Troublesome hydrogens?
DM – Legal rights for training
MT – Ensure that a charge assignment engine can be packaged.
CBy – Runtime may be an issue here.
JW – Could we cut down runtime by using a lighter-weight ML model application engine (lighter than PyTorch)?
SB – I think it should be possible. Also look out for blockers related to DGL getting onto conda-forge
LW – Are the people with power interested in getting DGL on conda-forge?
SB – A lot of the issue is that DGL vendors a lot of subpackages that need to get ported to c-f. During my time the issue was metis. Now it may be something else.
MT – I think we shouldn’t rely on DGL getting ported, try to find a slimmed-down PyTorch that we can use today.
MT – Needs to be a SMIRNOFF EP.
JW – I don’t think it needs to be put in the SMIRNOFF spec - It’ll just be a piece of software that says “I can provide AM1 charges”
JW – Would be good to put an entire protein through to check runtime and librarycharge-like outcome
MS – Crosslinked polymer that’s 100k atoms?
CBy – Maybe a big conjugated polymer?
SB – I did GLUx300 earlier, this took 5-10 seconds, most of it was resonance enumeration.
SB – Guardrails on really weird inputs - Like checking for hydrogens with a +1.01 charge, and issuing a warning
MS – Maybe comparison to Gasteiger?
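A guardrail of this kind could look roughly like the following. This is a minimal sketch: the `check_charges` helper, the 1.0 limit on hydrogen charges, and the total-charge tolerance are all assumptions for illustration and were not agreed in the meeting.

```python
import warnings


def check_charges(elements, charges, total_charge, h_limit=1.0, tol=1e-3):
    """Return a list of problems found in a set of assigned partial charges.

    Flags hydrogens whose charge magnitude exceeds `h_limit` and charge sets
    that do not sum to `total_charge` within `tol`. Each problem is also
    emitted as a Python warning.
    """
    problems = []
    if abs(sum(charges) - total_charge) > tol:
        problems.append(
            f"charges sum to {sum(charges):+.3f}, expected {total_charge:+.3f}"
        )
    for i, (element, q) in enumerate(zip(elements, charges)):
        if element == "H" and abs(q) > h_limit:
            problems.append(f"atom {i} (H) has charge {q:+.2f}")
    for message in problems:
        warnings.warn(message)
    return problems


# Made-up charges containing the kind of outlier SB describes.
issues = check_charges(["O", "H", "H"], [-0.83, 1.01, -0.18], total_charge=0)
```

Returning the list of problems as well as warning makes it easy for a caller to decide whether to fall back to another charge method.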
CBy - MMFF isn’t too bad either.
JW – I think RDKit and OpenEye offer both.
LW – That’s a good idea. But running MMFF on a big molecule using my macbook.
SB – Even a population analysis would be good.
JW – We should reach out to Yuanqing and see how we can align efforts with espaloma.
JW – Other datasets?
SB – Chodera lab did NCI AM1BCC, maybe some others
DM – Maybe Enamine?
SB – Enamine technically says “don’t pull all our stuff down”, but we emailed them and they said it’s fine. I’ll forward you the email.