2023-01-16 Espaloma/GNN meeting

Participants

  • @Lily Wang

  • @John Chodera

  • @Jeffrey Wagner

  • Yuanqing Wang

  • @Michael Shirts

  • @David Mobley

  • @Pavan Behara

  • Timotej Bernat

  • @Matt Thompson

Discussion topics

Item

Presenter

Notes

Context

  • JC – YW graduates in a few months and is looking at future plans. He will be an independent researcher at NYU and may continue to work on this. Thinking this could be a new ToolkitWrapper plus an AmberTools-style charge generation method. Thinking that the goalposts should be "no greater than the difference between OE and AT".

Current Espaloma performance

JC/YQW

  • YW shows espaloma-charge

  • YW – Espaloma-charge is pip-installable; some more work to go for the conda-forge package (it needs the dgl package, which isn't on conda-forge yet). Some additional things to iron out as well.

  • JC – Should we consume OFFMols directly instead of RDKit mols?

  • YW – There may be some risk of mangling/misinterpretation if we go through RDKit, so it could be cleaner to read OFFMols directly.

  • JC – The ToolkitWrapper method is called "espaloma-am1bcc". Do you have a standard naming system, or a specific way to structure it, e.g. "toolkitwrapper-am1bcc", or versioning, etc.?

    • JW – No standard yet. We just want to avoid users calling “am1bcc” and getting something they don’t expect.

    • JC – Could also use a local PyTorch model when instantiating the ToolkitWrapper (usage sketch below).
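
A usage sketch based on the espaloma_charge README around this time (entry points may change as the package evolves; the SMILES and the explicit-hydrogen handling are illustrative assumptions):

      # Assign charges through the OpenFF Toolkit via the "espaloma-am1bcc" method
      from openff.toolkit.topology import Molecule
      from espaloma_charge.openff_wrapper import EspalomaChargeToolkitWrapper

      molecule = Molecule.from_smiles("CCO")
      molecule.assign_partial_charges(
          "espaloma-am1bcc", toolkit_registry=EspalomaChargeToolkitWrapper()
      )
      print(molecule.partial_charges)

      # Or charge an RDKit molecule directly
      from rdkit import Chem
      from espaloma_charge import charge

      # explicit hydrogens assumed to be required here
      print(charge(Chem.AddHs(Chem.MolFromSmiles("CCO"))))  # per-atom charges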

  • MS – So, you load the model in - does this include the way that the model was created? How do you generate the fingerprint vectors from the graph?

    • YW – That’s written in the package.

    • MS – So the model file (.pt) contains the vectorization rules?

    • YW – No, the vectorization rules are fixed in the package; the model file only contains the weights.
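
A minimal PyTorch sketch of what "weights only" implies (the model class and file name are hypothetical stand-ins; the real architecture and featurization live in the espaloma_charge package code):

      import torch

      class ChargeModel(torch.nn.Module):  # stand-in for the packaged GNN architecture
          def __init__(self):
              super().__init__()
              self.readout = torch.nn.Linear(128, 1)  # illustrative layer

      model = ChargeModel()
      # The .pt file holds only a state dict (weights); the featurization rules are
      # fixed in code, so loading fails loudly if the architecture has changed.
      model.load_state_dict(torch.load("model.pt", map_location="cpu"))
      model.eval()  # inference mode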

  • JC – I think total charge is still defaulting to 0…

  • YW – Time performance of espaloma-charge with the CPU backend only breaks 1 second around Ala100; on GPU it never goes over 1 second.
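
A rough way to reproduce that timing claim (hypothetical script; the polyalanine SMILES and the need for explicit hydrogens are assumptions):

      import time
      from rdkit import Chem
      from espaloma_charge import charge

      # ~alanine decapeptide; scale the repeat count toward Ala100
      smiles = "CC(N)C(=O)" + "NC(C)C(=O)" * 9 + "O"
      mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
      t0 = time.perf_counter()
      charge(mol)
      print(f"{mol.GetNumAtoms()} atoms: {time.perf_counter() - t0:.2f} s")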

  • YW – shows rmse_total.png

    • YW – Espaloma-charge is comparable to AT. For very high charges EC has a higher error than AT, but it also has fewer outliers.

    • LW – Which dataset?

    • YW – SPICE dataset, with enumerated protonation states.

    • JC – Used OE to enumerate protonation states. The SPICE set was built to be representative of peptides, small molecules, and small biomolecules. Which elements are covered? Features are hard-coded, so we may need to do some future-proofing if later models will support more elements. This may add some fragility to deployment.

    • YW –

  • JC will post HFE (hydration free energy) plots here

  • JC – So, we wanted to promote this as an alternative to sqm. Do folks want to see more? Is there something more that should be added to the paper?

  • LW – We’ve been looking at RMSE over the electrostatic potential. It would be good to run this over SPICE, since FreeSolv isn’t super representative. I have some code I can share.

    • JC – More details? How do you measure deviation?

    • LW – Just RMSE over the ESP grid points right now. So: same conformer, same grid, just measuring the difference in ESP values computed from different charge methods (see the sketch after this exchange).

    • JC – So, same grid, same conformer, different charge method. And we compare charge method deviations to the AT-OE deviation.

    • DM – Bayly also suggested this.

    • JC – It’d be good to look at the RESP-AM1BCC deviation as a baseline. Was this looked at before in any papers? That could be one of the early justifications, since AM1BCC was introduced as a drop-in replacement for RESP.

    • LW – I’m not up to date on this; it may be that this was the justification for a subsequent paper that refit the BCCs.
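
A minimal numpy sketch of the comparison as described (same conformer, same grid; function and variable names are illustrative, units are atomic units throughout):

      import numpy as np

      def esp_on_grid(coords, charges, grid):
          """ESP at each grid point from atom-centered point charges.
          coords (n_atoms, 3) and grid (n_grid, 3) in bohr; charges in e."""
          r = np.linalg.norm(grid[:, None, :] - coords[None, :, :], axis=-1)  # (n_grid, n_atoms)
          return (charges[None, :] / r).sum(axis=-1)

      def esp_rmse(coords, q_a, q_b, grid):
          """RMSE between the ESPs of two charge sets on the same conformer and grid."""
          diff = esp_on_grid(coords, q_a, grid) - esp_on_grid(coords, q_b, grid)
          return np.sqrt(np.mean(diff ** 2))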

  • MS – LW’s previous work was also nice because it tried to classify outliers by chemistry.

    • JC – We tried to do this using HFEs. But I’m not sure what the “gold standard” is for HFE measurement methods. Would 10 ns of simulation be enough?

    • DM – Looking at differences between models, there’s no need to be restricted to FreeSolv. FS is basically a “best case” dataset; most molecules have just one highly polarized component. So other datasets may be more representative.

    • JC – Suggestions for another set? We just use FS because other people do.

    • LW – I’ve curated a set of poorly performing molecules from different metrics, and by classifying difficult functional groups. A lot of this was manual/opportunistic; I haven’t automated this curation yet.

    • JC – I can find mols with large charge differences. Could go through SPICE to find this.

    • YW – I have a breakdown by RMSE for each molecule. Could provide that.

    • JC – High scorers on that would be a useful dataset. But the most believable metric will be HFE values.

    • DM – Talking to people like Swope, Nerenberg, and Bayly, they’re more interested in ESP RMSE.

    • MS – ML people often want to see discussion about outliers. ML models can sometimes perform well overall but have bad outliers so this is an important component to put in the paper.

    • JC – LW, could you share your method for generating ELF10 conformers and averaging ESP error between two charge models?

      • LW – Can do. This relies on having conformers saved already; otherwise they may take a while to recompute.
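
A sketch of the conformer side of that workflow with the OpenFF Toolkit (API as of early 2023; the molecule, cutoff value, and caching file are illustrative):

      from openff.toolkit.topology import Molecule
      from openff.units import unit

      mol = Molecule.from_smiles("c1ccccc1O")
      mol.generate_conformers(n_conformers=500, rms_cutoff=0.05 * unit.nanometer)
      mol.apply_elf_conformer_selection()  # keep the ELF10 subset (up to 10 conformers)
      # cache to disk so conformers need not be recomputed
      # (conformer handling in SDF output may vary by toolkit version)
      mol.to_file("phenol_elf10.sdf", file_format="SDF")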

    • LW – My data looks similar to the espaloma charge data. Low average deviation but bigger outliers.

    • JC – Maybe this is due to some facet of the training? We use a squared penalty function. Maybe we should penalize the maximum deviation rather than the average (a sketch follows this exchange). How bad are outliers, and should we additionally penalize them?

    • YW – I don’t have a strong intuition here; I think squared loss should be appropriate without further penalties on outliers.

    • JC – Maybe we could weight by chemistry? E.g., deviations on aliphatic carbons might not be a big deal, but on charged/polar groups might be worse?

    • YW – But if we explicitly list a bunch of chemistries to apply different penalty weights…

    • JC – In this case maybe I’m worried about one outlier driving the optimization at the cost of everything else.

    • YW – Not sure; whether L1/L2/cosine/other losses behave differently requires further experimentation before saying one is solidly better or worse than another. Not sure whether any of these would affect the overall result.
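
For concreteness, a PyTorch sketch of the options being debated (plain squared loss versus adding a max-deviation penalty; the weighting term lam is hypothetical, not part of espaloma-charge's actual training):

      import torch

      def charge_loss(q_pred, q_ref, lam=0.0):
          """Mean squared loss on per-atom charges, optionally adding a penalty
          on the single worst deviation (lam > 0) to rein in outliers."""
          err = q_pred - q_ref
          loss = (err ** 2).mean()
          if lam > 0.0:
              loss = loss + lam * err.abs().max()
          return loss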

  • DM – Since we’re running low on time, let’s get back to the paper and whether anything else needs to be contributed, and what would need to be done before adding this to the toolkit.

    • JC – If LW works with us on the RMSE measurement she could be added as coauthor.

    • JC – Integrating into the toolkit is up to OpenFF - Let us know what we can do if there’s something you need from us.

  • MS – We should probably figure out the overall plans for GNN charges in OpenFF.

    • DM – NAGL takes a similar but alternate approach. So the big question there is whether one is significantly better.

    • LW – That’s up to JW to decide how many models we want to support and maintain. Being conda installable is a big bonus.

    • JC – But we should decide on the charge model for Rosemary if it will be GNN-based.

    • JW – For product ownership reasons, Rosemary 3.0.0 will use librarycharges.

    • LW – Beyond Rosemary…

    • JC + MS – …

    • LW – We’re not going to rely on graph charges for 3.0.0. 3.1.0 (or whichever minor release has GNNs) will target AM1BCC charges. Targeting something better than AM1BCC isn’t in the current roadmap.

    • DM

    • MS – For 3.1, we could have multiple GNNs…

    • JC – I don’t think so; we want to have just one implementation for reproducibility.

    • MS – So questions are about infrastructure and dataset.

    • JW – I think we should let anything that we certify to provide AM1BCC charges provide them.

    • MS –

    • DM – We currently have “am1bcc” and that can be either AT or OE

    • JC – But that’s bad

    • DM – We could put out a series of benchmarks and certify anything that passes.

    • MS – We need to clearly state what you get when you generate am1bcc charges

    • JW – That would result in a lot of work reimplementing scripts for companies and other users

    • MS – I’m uncomfortable with the possibility that different charge providers could have significant outliers like the ones we’re still digging up here. Until this improves I don’t think the current GNNs would pass an equivalence check (sketch below). Even before then, it’s not clear that the AT-OE difference should have been accepted to begin with.
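
One way such an equivalence check could look (illustrative numpy sketch; the acceptance criterion mirrors the "no greater than the AT-OE difference" goalpost from the context above):

      import numpy as np

      def rmse(a, b):
          return np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2))

      def passes_equivalence(q_candidate, q_at, q_oe):
          """Accept a candidate charge set for a molecule if it deviates from
          OpenEye AM1BCC by no more than AmberTools does."""
          return rmse(q_candidate, q_oe) <= rmse(q_at, q_oe)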

  • MS – Where should we continue this discussion?

    • LW – We could have another meeting on this…

    • (which date did we decide on?)

    • next FFR meeting

  • MS – My student Tim is on this call; I’d like to loop him in so he can contribute as well.

    • JC – I could coordinate TB and YW getting up to speed on espaloma and espalomax.

    • MS – Would be cool to continue CD work on polymers - Could use the same polymers.

Espaloma publication

JC/YQW

  • Action items

    • Lily will send Yuanqing code for generating conformers and comparing ESPs

    • Lily will set up a new agenda for the force field release meeting and invite all participants here to this week’s meeting

GNN roadmap discussion

LW

  • Integrate GNNs for partial charge calculation into the OpenFF Toolkit if performance lies within the AmberTools range across SPICE and OpenFF benchmark datasets

  • Improve NAGL usability and extensibility

Action items

Decisions