2021-07-21 Industry benchmarks meeting notes

Participants

Goals

Sage performance
Updates from team
openff-benchmark release blockers
Season 1 retrospective

Discussion topics

Item	Presenter	Notes
Updates from team		DH: failing Sage RC optimizations I looked manually, failing cases had a C#C, C#N, cubane. DD – Did we see the same problem with earlier FFs? DH – No, for the others, less than 1% of optimizations failed. LD – With sage rc1 the failures are around 5%. DH: also checked successful optimizations, there are also C#N cases; so, not all C#N appear to be failing DD – I’m also curious how the result plots are looking. LD – (shows plots, sage rc1 is better than 1.2.1 in TFD, but worse in RMSD and dE) DH – This comparison may be biased by bin size. It may be good to add standard deviations or something like that. DD: looks like we may have recovered some of the RMSD accuracy we lost between 1.2.1 and 1.3.0; gained quite a bit of accuracy in torsion prediction SB: have to take result with a grain of salt, even if encouraging; hard to tell if we’re getting a few really good cases and a number of slightly worse ones, for example JW – Would be good to have cumulative distribution plots, could use small bins in that case. DH – Good idea. Reviewers on the Lim paper asked for smaller bins as well. LD – Yesterday, I had a call with DH on refactor plans. There was one cases where a minimum wasn’t “caught”, and it was because we took the intersection and that removed the minimum from consideration. DH – One big update – Vyutas has run the sage rc for relative free energy calcs. I expect the data very soon. SB – DM chatted with Ant Nicholls at OE, and we’ve started a conversation about different ways to do the analysis. He’s backlogged now so we’ll hear from him later. JH – Nothing to report DD – We’re taking a two-pronged approach to the MM on the public spec. One way was to add additional specs on the existing industry dataset. We’ve run into difficulties on this approach because “engineX” needed a larger size limit. Now we’re hitting a new class of error, something deep inside the `requests` module. I’m trying to figure out why, but I don’t have a good conclusion for this yet. It’s likely a size/quantity limit. Basically, QC* is good at handling small-ish datasets But one thing is that collections need to get shuttled back and forth between client and server, and they grow over time. So now these are really big. The other problem is that there are factors that multiply the size/complexity of the objects, and we either end up with really large objects or tons of requests. So, I’ve been retrying these submissions a bunch of times, and sometimes they go through SB – So, the goal is to move toward a more paginated post? DD – Yes, so the aim is to be able to apply a spec to an entire collection. The molecules/coordinates are small, there’s no problem uploading them. The goal is to let the server take a spec and apply it to a whole collection. SB – I like fastAPI a lot, would recommend if it will plug in. DD – Agree, Doaa at MolSSI has been poking around with ways to resolve this but MolSSI is developer-time-constrained right now. DD – So I’m going to be automating attempting this submission into a loop, and hopefully these will go through. Currently some specs are running/complete already. DD – Roughly 3.5% error rate on the burn in set using sage rc1. Checked the method performance on the burn-in set, (shows RMSD, dE, and TFD plots) – No conclusions yet, set isn’t great scientifically. JW – Worked with DD on release prep, decided to delay pending Swope/Lucas analysis code and rc1 investigation Will start gathering refactor requirements soon e.g. from ad board meeting, partners would love a way to run Thomas Fox’s analysis on their mols How should we handle the rc1 optimization failures? New FF release or infrastructure changes? What’s root cause of the failures? We’ll want to see the final coordinates that the trajectory reached. LD – We’ll look at this in the next meeting. Next openff-benchmark release? want to get new analyses into new release to avoid partners needing to repeatedly do installs DD – Details of SB’s locally-run optimizations? SB – Ran on the “industry dataset 1.0”, which had the problematic implicit-H mols from Merck. The error rate didn’t seem hugely problematic. DD – I could try to run this on the current public set with the current benchmarking infrastructure. JW – This may not be necessary if we’ve already observed a similar error rate in the burn-in set. DD – I may try to run the sage rc1 jobs on QCA/locally anyway. DD – Can we further delay this new analysis from being in a release? DH – Gary will begin writing witht he current data, so the new analysis isn’t super critical. JW will email partners to tell them that we’re delaying the release pending further analysis of the release candidate and infrastructure.
Sage RC release?	Jeff	JW: assuming Sage RC gives between a 3.5% - 5% error rate, do we want to: put out a second release candidate? change how we are running optimizations? What is different from our approach and that used for testing by Simon? JW: changing how we are running optimizations to address Sage RC issues might look like we’re gaming things, open us up to criticism SB: we do need to understand why the failures are occurring; when running the ~60k public molecules myself did not see widespread failure SB: visualization of inidividual cases is next step; see if things are flying off into space, explode, etc. SB: if these are nitriles, don’t think parameters changed much from 1.3 values if it is functional groups, get report on which ones, will inform next steps JW: do we include the sage RC1 in `openff-benchmark` release? SB: recommend delay; gather data on what’s happening first given that we are observing problems ourselves JW: agree; happy to send an email to benchmarking partners to this effect [decision] we will delay next release of `openff-benchmark` SB: believe this is the process working, definitely pleased we’re doing this 😄
Season 1 retrospective		Given what we know now, having run Season 1, what could have been handled better/differently? JW: think we should have a dedicated call after Season 1 is finally through for this DD: sounds good, can do

Participants

Goals

Discussion topics

Action items

Decisions