Science Roadmap 2020 (working version)

This scientific roadmap includes the next two planned force field releases and a list of scientific studies which need to be performed in 2020. Each study has a priority assigned to it. This roadmap can be continuously updated, but the overall status and priorities will be revised and updated in June 2020.

Force Fields

Upcoming force field versions:

Version	Codename	Features	Expected release date	Comment / Blocker
`openff-1.2.0`	Parsley	Redesigned QM dataset for parameterization with better/broader coverage; parameter fixes	May 2020 / June 2020
`openff-2.0.0`	Sage	LJ refit (based on the ongoing feasibility study); limited WBO torsion interpolation for systems for which data already exists (more torsional data is needed for wide-ranging application)	Late 2020 (November)	How fast can we do WBO interpolations (Pavan)? Simon Boothroyd needs to get in touch with David Hahn and folks from the Chodera lab to discuss free energy benchmarking after LJ fitting. Late 2020/early 2021 still feasible.

Scientific studies

The list below covers the scientific studies that need to be performed in 2020. It will be updated every 3 months, as suggested in the science project management workflow. Each study should be linked to its Confluence page with more information about study design, execution and results. The study design should be submitted before the study begins.

Where possible, estimate start and end dates before a study starts. Record the actual start and end dates for each study below the estimated dates.

Labels

Category	Labels
Priority	HIGH | MEDIUM | LOW
Effort	HIGH | MEDIUM | LOW
Status	NOT STARTED | IN PROGRESS | PROTOTYPE | COMPLETED | BLOCKED

Study	Priority	Effort	Science dependencies	Infrastructure dependencies	Comment	Start date	End date	Status	Driver/Team
Chemical perception
Addition of new parameters – manually fixing problems	HIGH	HIGH		Made easier by benchmarking dashboard (Optional)	Made easier by benchmarking dashboard (Optional)			IN PROGRESS	Hyesu Jang David Mobley Jessica Maat (Deactivated) Victoria Lim (Deactivated)
Automated typing inference from scratch	MEDIUM	HIGH			Organise a meeting to coordinate efforts. Update: Tobias Huefner is doing some basic research, but no timeline has been defined yet. Perhaps a more specific study looking at typing issues, similar to Schauperl’s work on LJ typing.			IN PROGRESS (slowly)	Full-time person needed – to be discussed further. Work of Josh Fass (Deactivated) and Tobias Huefner may assist here. Owen Madin interested. Trevor Gokey is also actively working in this area.
Mixture Properties
Binary Mixture Data Feasibility Study	HIGH				In the writing stage.			COMPLETED	Driver: Simon Boothroyd Team: Michael Shirts Owen Madin
Non-bonded optimization	HIGH	HIGH			Parent study; in a long-term progress stage.			IN PROGRESS	Driver: Simon Boothroyd Team: Michael Shirts Owen Madin
Chemical potential-like properties	MEDIUM		Non-bonded optimization	Implementation in `Evaluator`	Need to evaluate the data first (testing needed). Add Confluence page here.			IN PROGRESS PROTOTYPE	Simon Boothroyd SPINOFF
Solvent-solvent partition coefficients	MEDIUM			Implementation in `Evaluator`	Data needed, harder problem. Update: Access to solubility phase, data is less of a problem now (MNSOL)			NOT STARTED	Simon Boothroyd SPINOFF
Data coverage and availability	HIGH		Feasibility studies		Check the available data and identify missing data points. How to address the gaps will be decided later; we will use what we have for Sage.			Ongoing	Simon Boothroyd Owen Madin Michael Shirts
QM Data Generation
QM dataset selection (training data) for OpenFF-1.2.0	HIGH				Need to expand to benchmarking set.			COMPLETED	David Mobley Jessica Maat (Deactivated) Hyesu Jang
QM dataset selection for OpenFF-2.0.0	HIGH							IN PROGRESS	David Mobley Jessica Maat (Deactivated) Hyesu Jang
Benchmarking/re-evaluating our choice of QM theory	HIGH			(Optional) QC Dataset submission infrastructure	Test of the whole torsiondrive. Keep within 10-50 torsiondrives. More is better. Some datasets are ready, but analysis is still required (Hyesu Jang). Pavan might help with this.			IN PROGRESS	Hyesu Jang (lead), Lee-Ping Wang. Hyesu Jang is also leading molecule set selection with help from Jessica Maat (Deactivated) and Victoria Lim (Deactivated)
Protomer/tautomer enumerated molecules	HIGH		QM level of theory validation (QMLoTV)	Protonation/tautomer enumeration integration (Joshua Horton doing OE version in toolkit; there’s currently no good protonation state enumeration with RDKit – see )	Enumeration currently works only with OpenEye			PROTOTYPE	Joshua Horton
Data selection for ionic species					What kind of experimental data would we need to include charged molecules?			NOT STARTED
Data on molecules with nonzero formal charges	HIGH		QM level of theory validation (QMLoTV)	(Optional) QC Dataset submission infrastructure				NOT STARTED	Pavan
Enamine REAL fragment coverage	MEDIUM			Automated fragmentation integration Joshua Horton				IN PROGRESS	Trevor Gokey
Ligand Expo fragment coverage	MEDIUM			Automated fragmentation integration Joshua Horton	Ligand Expo has higher priority than Enamine Real.			NOT STARTED
Richer torsion data for WBO fitting	LOW			WBO torsion implementation	What data to generate and				(person needed to continue work of Chaya Stern (Deactivated); probably Pavan with input from Jessica Maat (Deactivated) or vice versa. Overseen by Simon Boothroyd?)
Biopolymer data selection (ensure sidechain data is available in QCA)	HIGH				One dataset ready, but a lot more data needs to be generated if we want sidechain sampling			IN PROGRESS	David Cerutti (Deactivated)
Biopolymer data computation	MEDIUM			(Optional) QC Dataset submission infrastructure				IN PROGRESS	David Cerutti (Deactivated) David Dotson
More efficient torsion sampling with fewer grid points during scan	LOW								SPINOFF
Fitting
Addition of new parameters – manually fixing problems	HIGH	HIGH				Ongoing		IN PROGRESS	Hyesu Jang David Mobley Jessica Maat (Deactivated) Victoria Lim (Deactivated)
LJ refitting (Sage)	HIGH		Non-bonded optimization					IN PROGRESS	Simon Boothroyd and Owen Madin
WBO refitting (Sage)	HIGH		More torsion data	WBO torsion implementation. Done.	Implement what Chaya has already done. As soon as infrastructure is ready. Done.	After May meeting	Late 2020 (Sep 2020)	IN PROGRESS	Jessica Maat (Deactivated) Hyesu Jang Pavan
BCC refitting	HIGH		LJ refit; patterns for BCCs (could start with something simple like bond SMARTS).	ChargeIncrementModel implementation (early May)				IN PROGRESS	Simon Boothroyd Owen Madin
Study how to set prior widths and weights for different sorts of data during FF optimization	LOW								Lee-Ping Wang Hyesu Jang Spinoff?
Value of data generated “incidentally” during torsiondrive in fitting, e.g. optimization snapshots, gradients, energies (low control over these data points)	LOW			Some parts of Bespoke workflow	Once we have more people working on fitting, someone can run this study				Joshua Horton SPINOFF
Benchmarking
Small reference system for fast testing of FE infrastructure – 5-10 small reference systems, possibly subset of SAMPL challenges, for comparison of different free energy methods to avoid using large P-L systems for test calculations	LOW	LOW			Should use SAMPLing challenge systems plus a couple more similar ones.		ASAP	NOT STARTED
Benchmarking/re-evaluating our choice of QM theory	HIGH	MEDIUM						NOT STARTED	Lee-Ping Wang
CCDC data selection/release	LOW								SPINOFF
Create a list of tests to judge the “quality” of biopolymer FF with our scientific advisory board	HIGH				Organise the meeting with our IAB, invite to May meeting. Done. DC and MS will start conversations to get this going.		April / May	IN PROGRESS	David Cerutti (Deactivated) Michael Shirts
`openff-1.2.0` (Parsley) benchmarking			Minor release of Parsley	Benchmarking dashboard	Done in preprint form, but no benchmarking dashboard. Still need torsion benchmarking; utilize work just done for OpenFF 1.0 paper. JDC is trying to get a complete FE set run by D. Rufa.		Mid 2020	Done-ish
`openff-2.0.0` (Sage) benchmarking			Release of Sage	Benchmarking dashboard			Late 2020	NOT STARTED
Biopolymers
Which quantum method should we use for biopolymers (should it be the same as small molecules)?	MEDIUM		QM benchmarking study		Short term – using the same method and same level of theory as ANI (wB97D)			NOT STARTED	Lee-Ping Wang David Cerutti (Deactivated)
Feasibility/benchmarking studies of torsional CMAPs	MEDIUM		After protein FF implementation	CMAP support in OFFTK				NOT STARTED	David Cerutti (Deactivated)
Feasibility/benchmarking studies of other cross-terms	LOW			Support for cross-terms in OFFTK	MS – Importance of cross-terms will be related to a number of types			NOT STARTED
Charges
GCN charge model	HIGH				In a few steps: (1) conda-installable tool to assign charges; (2) integration of the tool into OFFTK under the ChargeIncrementModel keyword (and exposure of relevant keywords)			IN PROGRESS	John Chodera Yuanqing Wang Josh Fass (Deactivated) (maybe John Herr)
Off-site charge SMIRKS definition/fitting/benchmarking	MEDIUM	HIGH		VirtualSite support in OFFTK	Helpful discussion in Slack: https://openforcefieldgroup.slack.com/archives/C1907SGET/p1590251452068100 Infrastructure expected in September 2020			NOT STARTED	SPINOFF (but interface with David Cerutti (Deactivated) work?)
Bayesian inference and surrogate modeling
Testing Bayesian inference on an analytical model	MEDIUM	LOW			Nearing completion			IN PROGRESS	Owen Madin
Generalizing analytical model for Bayesian inference and testing methods	LOW	MEDIUM			Proof-of-concept work to give us an analytical form for early testing			IN PROGRESS (slower)	Owen Madin (and a student)
Constructing full Bayesian architecture with reweighting and simulation to build surrogate models	LOW	HIGH	Analytical Bayesian inference testing		ForceBalance → pytorch, torchMD (timemachine)			NOT STARTED	John Herr Owen Madin (science, not software)
Automated typing inference from scratch	HIGH	HIGH			Bayesian-based typing (Josh Fass’s work)				Josh Fass (Deactivated) → Tobias Huefner
Other
Water co-optimization planning study (to be executed later) – discuss with Lee-Ping Wang	LOW	HIGH			Lack of bandwidth, potentially Bill Swope could help advise with data selection.				SPINOFF
Thinking about metals / ions / salts	LOW	HIGH			Biologically relevant, will become high priority at some point				SPINOFF
Thinking about ionic liquids
Alchemical force fields (for alchemical free energies)	LOW/MEDIUM				Soft core potentials. JDC might have people in his lab working on it; MS is interested in joining the effort.
Continuous (smearnoff) typing					ESPALOMA				Yuanqing Wang