2021-07-21 Industry benchmarks meeting notes

Participants

  • @Simon Boothroyd

  • @David Dotson

  • @Lorenzo D'Amore

  • @Jeffrey Wagner

  • @David Hahn

  • @Joshua Horton

Goals

  • Sage performance

  • Updates from team

  • openff-benchmark release blockers

  • Season 1 retrospective

Discussion topics

Updates from team

  • DH: failing Sage RC optimizations

    • I looked through them manually; the failing cases contained C#C, C#N, or cubane.

    • DD – Did we see the same problem with earlier FFs?

      • DH – No, for the others, less than 1% of optimizations failed.

      • LD – With Sage RC1 the failures are around 5%.

    • DH: also checked the successful optimizations; there are C#N cases there too, so not all C#N molecules appear to be failing (a substructure-check sketch appears after this list)

    • DD – I’m also curious how the result plots are looking.

    • LD – (shows plots; Sage RC1 is better than 1.2.1 in TFD, but worse in RMSD and dE)

    • DH – This comparison may be biased by bin size. It may be good to add standard deviations or something like that.

    • DD: looks like we may have recovered some of the RMSD accuracy we lost between 1.2.1 and 1.3.0; gained quite a bit of accuracy in torsion prediction

    • SB: have to take result with a grain of salt, even if encouraging; hard to tell if we’re getting a few really good cases and a number of slightly worse ones, for example

    • JW – Would be good to have cumulative distribution plots, could use small bins in that case.

    • DH – Good idea. Reviewers on the Lim paper asked for smaller bins as well.

  • LD – Yesterday, I had a call with DH on refactor plans. There was one case where a minimum wasn't "caught", and it was because we took the intersection and that removed the minimum from consideration.

  • DH – One big update – Vyutas has run the Sage RC for relative free energy calcs. I expect the data very soon.

  • SB – DM chatted with Ant Nicholls at OE, and we’ve started a conversation about different ways to do the analysis. He’s backlogged now so we’ll hear from him later.

  • JH – Nothing to report

  • DD – We're taking a two-pronged approach to getting the MM specs onto the public set.

    • One way was to add additional specs to the existing industry dataset. We've run into difficulties with this approach because "engineX" needed a larger size limit. Now we're hitting a new class of error, something deep inside the requests module. I'm trying to figure out why, but I don't have a good conclusion yet. It's likely a size/quantity limit.

      • Basically, QC* is good at handling small-ish datasets. But collections need to get shuttled back and forth between client and server, and they grow over time, so by now these are really big.

      • The other problem is that there are factors that multiply the size/complexity of the objects, and we either end up with really large objects or tons of requests.

      • So, I’ve been retrying these submissions a bunch of times, and sometimes they go through

    • SB – So, the goal is to move toward a more paginated post?

      • DD – Yes, so the aim is to be able to apply a spec to an entire collection. The molecules/coordinates are small, there’s no problem uploading them. The goal is to let the server take a spec and apply it to a whole collection.

      • SB – I like FastAPI a lot, would recommend it if it will plug in.

      • DD – Agree, Doaa at MolSSI has been poking around with ways to resolve this but MolSSI is developer-time-constrained right now.

    • DD – So I'm going to wrap these submission attempts in an automated retry loop, and hopefully they will go through (a sketch of such a loop appears after this list). Currently some specs are running/complete already.

    • DD – Roughly 3.5% error rate on the burn-in set using Sage RC1.

    • Checked the method performance on the burn-in set (shows RMSD, dE, and TFD plots); no conclusions yet, the set isn't great scientifically

  • JW –

    • Worked with DD on release prep, decided to delay pending Swope/Lucas analysis code and RC1 investigation

    • Will start gathering refactor requirements soon

      • e.g. from ad board meeting, partners would love a way to run Thomas Fox’s analysis on their mols

  • How should we handle the RC1 optimization failures? New FF release or infrastructure changes?

    • What’s root cause of the failures?

      • We’ll want to see the final coordinates that the trajectory reached.

      • LD – We’ll look at this in the next meeting.

  • Next openff-benchmark release?

    • want to get new analyses into new release to avoid partners needing to repeatedly do installs

  • DD – Details of SB’s locally-run optimizations?

    • SB – Ran on the “industry dataset 1.0”, which had the problematic implicit-H mols from Merck. The error rate didn’t seem hugely problematic.

    • DD – I could try to run this on the current public set with the current benchmarking infrastructure.

    • JW – This may not be necessary if we’ve already observed a similar error rate in the burn-in set.

    • DD – I may try to run the Sage RC1 jobs on QCA/locally anyway.

  • DD – Can we further delay putting this new analysis into a release?

    • DH – Gary will begin writing with the current data, so the new analysis isn't super critical.

  • JW will email partners to tell them that we’re delaying the release pending further analysis of the release candidate and infrastructure.
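
Aside: a minimal sketch (not from the meeting itself) of how failing cases could be flagged for the C#C / C#N substructures discussed above, using RDKit SMARTS matching. The SMILES list and function name below are placeholders, not the actual benchmark data or infrastructure code.

    # Hypothetical sketch: flag molecules containing alkyne (C#C) or nitrile (C#N)
    # substructures. The input SMILES are placeholders, not benchmark data.
    from rdkit import Chem

    PATTERNS = {
        "alkyne": Chem.MolFromSmarts("C#C"),
        "nitrile": Chem.MolFromSmarts("C#N"),
    }

    def flag_substructures(smiles_list):
        """Return, for each SMILES string, the set of flagged substructure names."""
        flags = {}
        for smi in smiles_list:
            mol = Chem.MolFromSmiles(smi)
            if mol is None:
                flags[smi] = {"unparseable"}
                continue
            flags[smi] = {name for name, patt in PATTERNS.items()
                          if mol.HasSubstructMatch(patt)}
        return flags

    failing = ["CC#N", "C#CCO", "c1ccccc1"]  # placeholder molecules
    for smi, hits in flag_substructures(failing).items():
        print(smi, sorted(hits) or "no flagged groups")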
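
Aside: a minimal sketch (not from the meeting itself) of the kind of automated retry loop DD describes for the flaky, size-limited submissions. submit_spec is a hypothetical stand-in for whatever call performs the submission; in practice one would catch the client's specific exception types rather than a bare Exception, and the limits and delays are illustrative only.

    # Hypothetical sketch of a retry-with-backoff loop for flaky submissions.
    import random
    import time

    def submit_with_retries(submit_spec, max_attempts=10, base_delay=30.0):
        """Call submit_spec until it succeeds, with exponential backoff and jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return submit_spec()
            except Exception as exc:  # in practice, catch the client's specific errors
                if attempt == max_attempts:
                    raise
                delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, base_delay)
                print(f"Attempt {attempt} failed ({exc!r}); retrying in {delay:.0f} s")
                time.sleep(delay)

    # Usage (placeholder): submit_with_retries(lambda: client.add_compute(...))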

Sage RC release? (Jeff)

  • JW: assuming Sage RC gives between a 3.5% and 5% error rate, do we want to:

    • put out a second release candidate?

    • change how we are running optimizations? What is different between our approach and the one Simon used for testing?

  • JW: changing how we are running optimizations to address Sage RC issues might look like we're gaming things and open us up to criticism

  • SB: we do need to understand why the failures are occurring; when I ran the ~60k public molecules myself I did not see widespread failure

    • SB: visualization of individual cases is the next step; see if structures are flying off into space, exploding, etc.

    • SB: if these are nitriles, I don't think the parameters changed much from the 1.3 values

      • if specific functional groups are implicated, get a report on which ones; that will inform next steps

  • JW: do we include Sage RC1 in the openff-benchmark release?

    • SB: recommend delay; gather data on what’s happening first given that we are observing problems ourselves

    • JW: agree; happy to send an email to benchmarking partners to this effect

    • [decision] we will delay next release of openff-benchmark

    • SB: believe this is the process working, definitely pleased we’re doing this

Season 1 retrospective



  • Given what we know now, having run Season 1, what could have been handled better/differently?

    • JW: think we should have a dedicated call for this after Season 1 is finally through

    • DD: sounds good, can do

Action items

@David Dotson will continue working to get the MM additions to the industry benchmark set past the errors
@David Dotson will create the MM industry benchmark set from the end of QM
@Lorenzo D'Amore will finish up the Swope and Lucas analyses and work with @David Dotson on the PR merge
@David Hahn will share the results of Vyutas' free energy calculations with Sage
@Lorenzo D'Amore will report on Sage behavior and the results of a more detailed analysis of the trajectories

Decisions