2021-12-06 Core Developers meeting notes

Participants

  • @Pavan Behara

  • @Matt Thompson

  • @Chapin Cavender

  • @Jeffrey Wagner

  • @Simon Boothroyd

Discussion topics

Item

Notes

General updates

  • JW – For Rosemary charges, PIs are leaning toward a strategy of “graph net charges based on AM1, followed by SMARTS-based BCCs”, with “normal library charges” as a backup. Mentioning it here so we make sure it comes up in the FF-release call. I plan to raise it there, but in case I don’t, CC, you may want to ask any PIs present about it.

    • CC – Surprised this didn’t come up in roadmap planning meeting.

    • SB – That seems right at a high level. Their thinking is that, even if we can’t get graph-based charges, we can have something that handles nonstandard AAs and post-translational modifications. Relies on getting graph charges that are consistent/high enough quality.

    • PB – Plan for vsites?

    • SB – My plan for that would be that the graph net would capture AM1 charges, and then we’d train BCCs on top of that. So then training VSites would be like training BCCs.

    • MT – Would this lead to a hard dep on pytorch?

    • SB – In principle, yes. The goal would be to have this handled through DGL, so it wouldn’t be a hard dep on pytorch; it could eventually use jax or tf.

    • MT – Last I looked into this, DGL wasn’t getting adoption across all the ML packages (like jax).

    • SB – There is some uncertainty, but even if we only support pytorch, there’s a simple CPU package that can get the job done.

    • MT – I’d be concerned that users may just rely on librarycharges if they have trouble installing pytorch.

    • SB – The Rosemary FF itself will specify only one charge method, either librarycharges or graph net. So the FF will EITHER use GCNs and be able to handle PTMs, or it will use librarycharges and not be able to handle PTMs.
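
The "base charges, then BCCs on top" strategy discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `apply_bccs`, the charge values, and the sign convention (the correction is added to the first atom of a bond and subtracted from the second) are all hypothetical, and the SMARTS matching step that would assign a correction to each bond is replaced here by a plain dictionary keyed on atom-index pairs.

```python
from typing import Dict, List, Tuple


def apply_bccs(
    base_charges: List[float],
    bonds: List[Tuple[int, int]],
    bcc_by_bond: Dict[Tuple[int, int], float],
) -> List[float]:
    """Add bond charge corrections on top of a set of base charges.

    Each correction moves charge across one bond: it is added to the
    first atom and subtracted from the second, so the total molecular
    charge is unchanged. (Assumed convention, for illustration only.)
    """
    charges = list(base_charges)
    for i, j in bonds:
        correction = bcc_by_bond.get((i, j), 0.0)
        charges[i] += correction
        charges[j] -= correction
    return charges


# Made-up base charges standing in for graph-net predictions of AM1 charges.
base = [-0.30, 0.10, 0.10, 0.10]
bonds = [(0, 1), (0, 2), (0, 3)]
# Made-up corrections; in practice these would come from SMARTS matches.
bccs = {(0, 1): -0.05, (0, 2): -0.05, (0, 3): -0.05}

corrected = apply_bccs(base, bonds, bccs)
print(corrected, sum(corrected))
```

Because every correction is applied with opposite signs to the two bonded atoms, the net charge of the molecule is preserved no matter which base charge model is used underneath.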

Individual updates

  • SB

    • Mostly centered around convolutional charge models - nagl package. Kinda like espaloma, but geared more specifically toward charge prediction. Working on how to label ~100,000 mols with partial charges and fractional bond orders, scale up storage, train in a distributed way.

    • Potential limitation of GCN models is resonance. This is because atom featurization looks at one resonance structure, which may read too far into specific formal charges and graph bond orders. So I’ve been looking into how to handle multiple resonance forms. One way was from Gilson’s vcharge paper, which I’ve made some progress toward implementing, but enumeration of resonance structures is really hard (been working with MG on this).

    • Worked with Horton on AbSolv, used for fitting to liquid properties. This does parameter fitting for vsites and custom LJ forms.

    • JW – This is great, thanks for being so forward-looking. Having this scouted out will give us a viable path forward, even if leadership hasn’t made a decision by then.

  • CC

    • Submitted dipeptide torsiondrive set to QCA. Running on TSCC. Completed about 7k optimizations. This is what I’d expect for a 1D torsiondrive; I’m interested to see how many more the 2D aspect requires.

      • JW – Do we expect this to run in less than n_torsion_steps^2 time?

      • CC – That’s what I’m trying to answer. Because it’s a multidimensional torsiondrive, it approaches each point from multiple directions, which can lead to the same grid point having multiple optimizations kicked off. But many of the torsions (like sidechains) are constrained, so I’m optimistic that we won’t need to do every possible optimization.

      • CC – Is there any way to query how close the torsiondrives are to converging?

      • PB – You can query the torsiondrive result in QCA, not sure if you can look right at the convergence status, but there may be other trends that you can pick up.

      • SB – Did you consider doing preoptimization experiments with a cheaper method before doing heavy QM?

      • CC – This dataset is an experiment to see whether that’s needed

      • PB – Could pre-optimize by running XTB torsiondrives, then submit full QM.

      • CC – I pre-optimized this dataset using Sage.

      • SB – Could also preoptimize using ffnnSB

      • CC – We were hoping to get rough timing info from this. So, like, if this all finished in a week, then we wouldn’t need to do anything more advanced for the full FF fit. But now it seems like XTB preoptimization could be good.

      • CC – Currently experimenting with number of grid points, sidechain configurations.

      • PB – Looking at dataset, 700 optimizations per day seems pretty low.

      • CC – Have two managers running on TSCC (16 tasks). DDotson has a manager on PRP.

      • PB – What’s your job limit?

      • CC – Unlimited on preemptible queue.

      • PB – Could be good to increase that.

      • SB – It’s good to learn how much pre-emption ends up happening when you make larger requests. I had trouble when I maxed out pre-emptible queue usage and had tons of jobs get killed, then they took a day to get re-submitted to the queue.

      • PB – Error cycling runs 5x per day now.

      • CC – I’ll check on this.

    • Received one of the outstanding sections for the paper, still one outstanding. Will update later.

    • Working on deciding which datasets to use for a solvent-aware electrostatics dataset for rosemary. Will share plans on confluence and open it up for feedback.

  • MT

    • Upcoming API breaks (v0.11.0, ~February)

      • simtk namespace dropped

        • SB – Did you make any progress on Yank+OpenMM issue? May want to keep compatibility with old simtk until this can be resolved.

        • MT – I recall SB and LW working on this, but I didn’t track where it went.

        • SB – There were two issues -

          • One was YANK+simtk NAN’ing out. (SB having audio trouble, can send JW the text of this if it’s important/wrong here)

          • The other was that yank+openmm7.6 on cuda gives incorrect energies.

        • MT – Are those the same issue?

        • SB – Not sure. The second one is reproducible.

        • MT – I’ll bump this up on my to-do list.

        • SB – Had repro case posted in #developers+old meeting notes. It requires CUDA.

        • MT – I have an NVIDIA card, will try to reproduce.

          • Issue now raised:

      • openmm now “optional” / only required when interacting with OpenMM API

      • Units now provided by openff-units

      • Elements now provided by mendeleev

      • MDTraj no longer needed (was not used in core functionality)

    • All/most OpenFF packages will need to be updated and re-released. MT can help.

    • OFF-EPs in the works

      • No impact on users yet, maybe big changes in ~2-3 months

      • JW –

      • MT – Big upcoming one has yet to be drafted, will touch on how nonbonded stuff is encoded in SMIRNOFF spec. Right now it’s very openmm-centric. This will take a little while, should be able to handle automatic upconversion of existing FFs.

    • Psi4 upstreams on conda-forge coming along … slowly. Until we get more resources in this direction, this will move slowly and will depend on Lori’s availability. The big push will be for libint2.

      • JW – Seems like big blockers are Lori’s time and libint2.

        • LB’s

        • MT – psi4’s use of libint2 and its upstreams has a lot of debt/cruft that may make it hard to get c-f approval.

  • PB

    • Mostly worked on WBO interpolated torsions. I was looking at the applicability of a biaryl-no-ortho-subs SMARTS pattern on public datasets like JACS 2015, ChEMBL, QCA. I generated some QM data locally for 71 molecules, with 20 conformers each, from JACS ligands. Out of around 38K matches on ChEMBL (sdf file from the BindingDB website), I picked around 140 molecules, generated conformers, and ran an optimization dataset locally. Across all the public datasets the range of WBOs is still concentrated around 1.0, and including a WBO-interpolated parameter for this SMARTS pattern doesn't seem to significantly improve the force field.

      • Another issue we looked at was how the WBOs from toolkit-generated conformers differ from QM WBOs. There is a moderate difference in some cases, and AM1 WBOs generated at the QM optimized geometries seem to have a bimodal distribution (they seem to be shifted by a constant value).

    • While doing this work I looked at how Sage is doing with respect to biaryls from the Rowley/OPLS sets. There is still room for improvement, as only three parameters match the overall biaryl set: t43, t47, t74 (Sage numbering). All of these parameters have wildcards on both ends. Looking at the torsion profiles, I picked a specific SMARTS derived from t43. Will try to dig deep and do a fit.

    • Regarding fitting experiments, for the triple bond parameters with high force constants, one HMR test is failing. David Mobley suggested manually changing the C#C (b28) parameter to see where it fails, then choosing the last working value as the starting point. The final optimized value, where it was failing, is 2500 units; checking a range of 1600-2400 passes all HMR tests. A refit with 2400 as the starting point optimized to ~2430 units, but again one HMR test failed, so I have to redo it with a smaller value.

      • MT – I realized we have a “yes, we support HMR” policy that’s not really written down but is a big constraint in our work. Same with “no, we don’t make windows builds”.

      • JW – I’ll be working on documenting practices/decisions/policies in January. Adding this to my to-do list for then.

    • Some old stuff: refitting Sage with more amide targets from the WBO conjugated series and aniline torsion sets didn't help. Will check how to improve that.

    • This week, look at how XTB WBOs compare with QM WBOs, also have to work with Meghan Osato on torsion multiplicity.

    • PB – I’ve generated some amount of useful QM data locally. Can we have nonstandard datasets uploaded to QCArchive in a way that doesn’t take days/require lots of approvals?

      • SB – This may be more of a MolSSI question - We’ve asked them about accepting uploaded datasets, but they seem to handle these on a case-by-case basis.

      • PB – More specifically, I’m thinking of whether we can have standards-compliant and non-compliant datasets. So, I want to submit non-standards-compliant datasets for computation in a way that doesn’t require approval.

      • SB – If it’s an OpenFF project, it should get approval from OpenFF team to make sure it’s on the radar for resource allocation. But if it’s all your compute then it should maybe go directly through MolSSI rather than qca-dataset-submission.

      • JW – I think the core issue is that the qca-dataset-submission repo uses Dotson’s QCA API token.

      • PB – I have manager credentials. That may be the same as write access?

      • JW – Not sure, let’s ask Dotson at the next QC* meeting.

  • JW

    • Worked with MT to make a bit more progress on topology refactor performance

      • SB – Wondering about using this - I want to make molecules from sequence, is that in scope?

      • JW – This isn’t in scope. Could be done using AmberTools/RDKit.

      • SB – RDKit doesn’t let me control protonation.

      • JW – AT should. I’ll make a code snippet to go from sequence --> OFFMol and send it to you.

    • SMIRNOFF EP feedback

    • Some user support

    • Talked with Meghan Osato in the Mobley lab, who’s trying to put together a few workflows using interchange

  • PB – If I load a big multi-molecule SDF file via oemolistream, it goes a lot faster than OFFMol.from_file.

    • SB – I’ve also noticed that this is slow. from_openeye is very slow.

    • JW – I haven’t tried to optimize this. Unfortunately we’re short on cheminformatics experience and so I’m hesitant about code changes to our cheminformatics code.

    • SB – If you want both a test set and a loading-time-benchmarking set, things like the enamine set, chembl, NCI250k are what I usually use.

    • MT – So, goal would be to get a reference output for one of those big sets, and then tinker with implementation to see if we can get the same outputs, but faster?

    • SB – Yes

    • MT – I can put this on my to-do list, kinda on the backburner compared to other things.

    • JW – Previous logic was “if it’s not slower than AM1, it doesn’t need optimizing”. But this doesn’t hold for molecules with existing partial charges, or if we get a graph net charge method.

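The "reference output first, then a faster implementation that reproduces it" approach from the file-loading discussion could look roughly like the sketch below. Everything here is a stand-in: `reference_loader` and `candidate_loader` are hypothetical placeholders for the current and the optimized loading paths, and the three SMILES-like strings stand in for a large benchmarking set. No real toolkit calls are used.

```python
import time
from typing import Callable, List, Tuple

Summary = Tuple[str, int]


def benchmark(
    loader: Callable[[str], Summary], records: List[str]
) -> Tuple[List[Summary], float]:
    """Run a loader over every record; return its outputs and wall time."""
    start = time.perf_counter()
    outputs = [loader(record) for record in records]
    elapsed = time.perf_counter() - start
    return outputs, elapsed


def reference_loader(record: str) -> Summary:
    # Hypothetical stand-in for the existing (slow) loading path.
    return record.strip(), len(record.strip())


def candidate_loader(record: str) -> Summary:
    # Hypothetical stand-in for an optimized path; strips only once.
    cleaned = record.strip()
    return cleaned, len(cleaned)


# Tiny placeholder for a large molecule set.
records = ["CCO\n", "c1ccccc1\n", "CC(=O)O\n"]

reference_output, reference_time = benchmark(reference_loader, records)
candidate_output, candidate_time = benchmark(candidate_loader, records)

# The candidate is only acceptable if it reproduces the reference exactly.
outputs_match = reference_output == candidate_output
print(f"match={outputs_match} ref={reference_time:.6f}s new={candidate_time:.6f}s")
```

Storing the reference output once and comparing against it after each change keeps the correctness check separate from the timing, which fits the concern raised above about making changes to cheminformatics code.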

Action items

Decisions

Related pages