2021-06-02 Benchmarking for Industry Partners - Development Meeting notes

Participants

  • @David Hahn

  • @Joshua Horton

  • @Lorenzo D'Amore

  • @David Dotson

  • @Simon Boothroyd

  • @Jeffrey Wagner

Goals

  • Needs for partner workshop

  • Public industry dataset status and v1.1

  • Mobley's idea of coverage differences between `gaff` and `openff`

  • Updates from team

Discussion topics

Updates from team

  • LD –

    • WBOs, related to the aniline series with para EWG/EDG substituents. Got feedback and code snippets from Jessica Maat about it.

    • Pushed analysis tool for Bill and Xavier’s analysis.

      • some bugs to work out yet

    • Contacted Hannah Bruce Macdonald about arsenic repo. Asked whether HBM has needs for other functionality/info re: best practices.

      • DD – This is a good idea, to ask her for ideas/directions

      • LD – Agree, I’m going to try to have a call on this.

  • DH

    • Reached out to Kaushik about testing the Schrodinger benchmarking commands. No reply yet.

      • Christina Schindler also asked whether the Schrodinger command tree is ready for use. I offered help/a call.

    • LD and I talked with Gary about benchmarking. We probably want to plan a call soon. GT would like to have it in a publication, even if it doesn’t cover the whole dataset.

      • DD – This makes sense. I think it’s time for a partner call. I’ll reach out to Gary.

    • Next steps for schrodinger command tree? Wait for testing, or merge+release+announce?

      • DD – Last week, we had proposed waiting for testing from Kaushik, then merging afterwards

      • DH – Should we update openff-benchmark to use the new namespace?

        • JW – I don’t think so.

      • JW: could put the imports in try…except to accommodate the namespace change, but this would be dangerous – it would leave the door open to folks updating their environments and possibly mangling their datasets by using different toolkit versions on the backend. (A minimal sketch of this import pattern is included after the discussion notes.)

        • JW – So, let’s keep the season 1 branch totally separate from master until we finalize the dataset. We can merge the changes into master at a later date.

    • DH: Mobley question on coverage: “As part of the benchmarking efforts with industry are we collecting coverage data? There’s some interest in knowing whether we are covering more chemistry than GAFF/GAFF2 and I’m curious if that’s something we’re gathering stats on.” https://openforcefieldgroup.slack.com/archives/C8P8MLALD/p1622131786000700

    • LD: missing molecules in the output could be due to convergence or other failures; it isn’t exclusively a matter of coverage failures

    • JW: not sure that just having access to a workflow that’s already run would tell us about gaff coverage – Molecules could have failed validation that would have been successfully handled in GAFF.

    • DD – Even if we can’t do this on the current dataset, we could do it in a season 2.

    • JH: thinking about ways we can get a coverage report for GAFF

    • SB: just because we cover more chemistry doesn’t mean we actually cover it in a meaningful sense; we need to carefully define what coverage means, since we could have catch-all params in the force field that do cover the elements but are absolute garbage

    • JW: this definitely makes sense as a thrust for Season 2; there are nuances here that would need to be addressed. We’d need to audit our workflow+logic to ensure that we fairly count coverage for each method. (A minimal per-molecule coverage-check sketch is included after the discussion notes.)

    • SB – It’d be good to make reports for “molecules that OpenFF can cover” and “molecules that only GAFF can cover” – separating the final dataset like this would enable finer-grained analysis.

      • seeing where unparameterized molecules fall on the ddE distribution would make it clear where the garbage params are in the FF

  • SB: for the workshop, due to time constraints, I will use the scripts I’ve been using and the latest force field status; my needs are met from the benchmarking perspective at this time

    • So, the idea is to do a compute expansion on the public datasets to include MM optimizations?

    • DD – We will start two series of MM runs – One from initial geometry, another from final geometry.

      • SB – Doing the MM starting from FINAL QM geometry should be higher priority. Doing the MM starting from the INITIAL geometry is lower-priority.

      • DD – I’ll try to get the fixed Merck set submitted today. Then I’ll take the QM endpoints and use them as a starting point for MM. I may put both together into a dataset and then prioritize the ones starting from final QM geometries

      • DD – Do you need the MM datasets for the workshop?

        • SB – I could use the GAFF ones. What is the plan for the MM optimizations?

        • DD – “season 1:2” = GAFF, openffs

        • SB – I’ll get all the numbers I need for the workshop locally.

    • SB – Torsiondrive benchmarking – What are the plans for season 2, in terms of capability and timing?

      • JW – Uncertain whether there will be a season 2. Infrastructure planning tentatively reserves some of DD’s time in Oct+Nov to initiate it.

      • DD – We could begin speccing this at any time

      • SB – Torsion profiles would be useful to implement, and I’d be interested in getting feedback from partners on “interesting torsions”.

      • DD – There’s some question as to whether to have people use their season 1 set, or to provide more specific guidance on the molecule selection

      • LD – Did we provide any guidelines for the number of rotatable bonds in season 1?

        • DD – We gave guidance on n_heavy_atoms, but not explicitly on n_rotatable_bonds

        • LD – I wouldn’t restrict the season 2 dataset based on season 1 choices.

      • JW: agree that the Season 2 dataset shouldn’t be required to be the same as the Season 1 set

        • the computational cost alone will be high (~24x), so there’s not much benefit in being able to reuse the season 1 optimizations

      • SB – Agree, we shouldn’t have any requirements on the season 2 dataset regarding its similarity to the season 1 set.

      • DD – Agree, we don’t know why partners picked specific datasets, and whether those factors will still be important in several months.

  • JH

    • Nothing to report.

  • JW:

    • Nothing additional to report; already discussed the Schrodinger branch merge into season 1.

  • DD

    • Speccing out PLBenchmarks on F@H. Not open for review yet, but will ask for feedback later.

    • SB – Ideally, we’d want to be able to benchmark FFs that don’t have an official release.

      • SB – So, could there be a way to submit this through something other than the main PLBenchmarks repo? Then each small repo could have a different FF

      • DD – I think so. The work server would start up when there’s new work to do, and shut down otherwise, so this wouldn’t be suitable for submission from multiple repos. But if we left the work server on, this could work.

      • SB – It would be great if the scheduling + approval could be handled on AWS. So a new job could come in and a maintainer could press “approve”.

        • DD – I was just talking about starting the work server itself.

        • SB – That sounds good. Could use heroku to wake on new jobs.

        • DD – One challenge is the constraints of the work server – it requires a host with a very large, performant file system. This is a constraint of the existing F@H infrastructure.

        • SB – could put a REST API or frontend service on heroku that deals with managing the backend infrastructure (a minimal sketch of this idea is included below)
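
A purely illustrative sketch of the frontend idea above: a small service (e.g. deployable on heroku) that queues incoming benchmark jobs and lets a maintainer approve them before the backend F@H work server is started. The routes, payload shape, and the `start_work_server` hook are all hypothetical; this is not an existing OpenFF or F@H API.

```python
# Illustrative sketch only: a tiny approval frontend for benchmark job submissions.
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
pending_jobs = {}  # job_id -> submitted payload; in-memory placeholder store


@app.route("/jobs", methods=["POST"])
def submit_job():
    """A partner submits a job, e.g. a repo URL plus the force field to benchmark."""
    payload = request.get_json()
    job_id = uuid.uuid4().hex
    pending_jobs[job_id] = payload
    return jsonify({"job_id": job_id, "status": "pending approval"}), 202


@app.route("/jobs/<job_id>/approve", methods=["POST"])
def approve_job(job_id):
    """A maintainer approves a pending job; this is where the backend would be kicked off."""
    job = pending_jobs.pop(job_id, None)
    if job is None:
        return jsonify({"error": "unknown job"}), 404
    # start_work_server(job)  # hypothetical hook that starts the F@H work server
    return jsonify({"job_id": job_id, "status": "approved"}), 200
```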
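
For reference, a minimal sketch of the try…except import pattern JW mentions in the schrodinger command tree discussion (and which was deliberately not adopted for the season 1 branch). It assumes the legacy `openforcefield` namespace and the newer `openff.toolkit` namespace; the exact modules imported by openff-benchmark may differ.

```python
# Illustrative only: fall back to the legacy namespace if the new one is not installed.
# As noted above, this pattern was NOT adopted for season 1, since mixed environments
# could silently change toolkit behavior on the backend.
try:
    from openff.toolkit.topology import Molecule  # newer openff-toolkit releases
except ImportError:
    from openforcefield.topology import Molecule  # legacy namespace
```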
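
A minimal sketch of the kind of per-molecule coverage check discussed under the Mobley coverage question, assuming the season 1 input SDF files are available locally. The directory path and force field release are placeholders, and, per LD’s point above, a failure here conflates true coverage gaps with validation or other errors; a corresponding GAFF-side check (e.g. via antechamber/openmmforcefields) would be needed for the actual comparison.

```python
# Sketch only: count which input molecules an OpenFF force field can parameterize.
from glob import glob

from openff.toolkit.topology import Molecule
from openff.toolkit.typing.engines.smirnoff import ForceField

force_field = ForceField("openff_unconstrained-1.3.0.offxml")  # placeholder release


def openff_covers(sdf_path: str) -> bool:
    """Return True if the molecule in `sdf_path` can be typed by the force field."""
    molecule = Molecule.from_file(sdf_path, allow_undefined_stereo=True)
    try:
        force_field.create_openmm_system(molecule.to_topology())
        return True
    except Exception:
        # Conflates parameter-coverage gaps with other failures (see LD's point above).
        return False


results = {path: openff_covers(path) for path in glob("season1_inputs/*.sdf")}
covered = sum(results.values())
print(f"OpenFF covered {covered}/{len(results)} molecules")
```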

Action items

@David Dotson will create an issue for gaff vs. openff coverage metric (see Mobley question in #benchmarks); label for season 2
@David Hahn will continue coordinating partner testing of schrodinger command tree
@David Dotson will notify Gary that we are ready to schedule the next partner call for benchmarking
@David Dotson will add MM compute spec to v1.1 industry set
@David Dotson will factor in ability to benchmark unreleased FFs as part of PL benchmarking automation

Decisions