Science Roadmap 2020 (working version)

This scientific roadmap includes the next two planned force field releases and a list of scientific studies which need to be performed in 2020. Each study has a priority assigned to it. This roadmap can be continuously updated, but the overall status and priorities will be revised and updated in June 2020.

Force Fields

Upcoming force field versions:

Version	Codename	Features	Expected release date
`openff-1.2.0`	Parsley	Redesigned QM dataset for parameterization with better/broader coverage; parameter fixes	May 2020
`openff-2.0.0`	Sage	LJ refit (based on the ongoing feasibility study); limited WBO torsion interpolation for systems for which data already exist (more torsional data needed for broad applicability)	Late 2020 (November)

Scientific studies

The scientific studies that need to be performed in 2020 are listed below. The list will be updated every 3 months, as suggested in the science project management workflow. Each study should be linked to its Confluence page with more information about study design, execution and results. The study design should be submitted before the study begins.

Where possible, estimate the start and end dates before a study starts. Record the actual start and end dates for each study below the estimated dates.

Labels

Category	Labels
Priority	HIGH | MEDIUM | LOW
Effort	HIGH | MEDIUM | LOW
Status	NOT STARTED | IN PROGRESS | COMPLETED

Study	Priority	Effort	Science dependencies	Infrastructure dependencies	Comment	Start date	End date	Status	Driver/Team
Chemical perception
Addition of new parameters – manually fixing problems	HIGH	HIGH		(Optional) Benchmarking dashboard	Made easier by the benchmarking dashboard			IN PROGRESS	Hyesu Jang, David Mobley, Jessica Maat, Victoria Lim
Automated typing inference from scratch	MEDIUM	HIGH			Organise a meeting to coordinate efforts.				Full-time person needed – to be discussed further. Work of Josh Fass and Tobias Huefner may assist here. Owen Madin is interested.
Mixture Properties
Binary Mixture Data Feasibility Study	HIGH							IN PROGRESS	Driver: Simon Boothroyd; Team: Michael Shirts, Owen Madin
Non-bonded optimization	HIGH	HIGH						IN PROGRESS	Driver: Simon Boothroyd; Team: Michael Shirts, Owen Madin
Chemical potential-like properties	MEDIUM		Non-bonded optimization	Implementation in `Evaluator`	Need to evaluate the data first (testing needed)			IN PROGRESS	Simon Boothroyd; spinoff (student)
Octanol-water partition coefficients	MEDIUM			Implementation in `Evaluator`	Data needed; harder problem			NOT STARTED	Simon Boothroyd; spinoff (student)
Data coverage and availability	HIGH		Feasibility studies		Check the available data and identify missing data points. How to address the gaps will be decided later; we will use what we have for Sage.			IN PROGRESS	Simon Boothroyd, Owen Madin, Michael Shirts
QM Data Generation
QM dataset selection (training data)	HIGH				Need to expand to benchmarking set.			IN PROGRESS	David Mobley, Jessica Maat, Hyesu Jang
Benchmarking/re-evaluating our choice of QM theory	HIGH			(Optional) QC Dataset submission infrastructure	Test of the whole torsiondrive. Keep within 10-50 torsiondrives. More is better.			IN PROGRESS	Hyesu Jang (lead), Lee-Ping Wang; Hyesu Jang is also leading molecule set selection with help from Jessica Maat and Victoria Lim
Protomer/tautomer enumerated molecules	HIGH		QM level of theory validation (QMLoTV)	Protonation/tautomer enumeration integration (Joshua Horton doing OE version in toolkit; there’s currently no good protonation state enumeration with RDKit – see )
Data on molecules with nonzero formal charges	HIGH		QM level of theory validation (QMLoTV)	(Optional) QC Dataset submission infrastructure
Enamine REAL fragment coverage	MEDIUM			Automated fragmentation integration (Joshua Horton)					Trevor Gokey
Ligand Expo fragment coverage	MEDIUM			Automated fragmentation integration (Joshua Horton)	Ligand Expo has higher priority than Enamine REAL.
Richer torsion data for WBO fitting	LOW			WBO torsion implementation					Person needed to continue the work of Chaya Stern; probably Pavan with input from Jessica Maat, or vice versa. Overseen by Simon Boothroyd?
Biopolymer data selection (ensure sidechain data is available in QCA)	HIGH					ASAP			David Cerutti
Biopolymer data computation	MEDIUM			(Optional) QC Dataset submission infrastructure					David Cerutti, David Dotson
More efficient torsion sampling with fewer grid points during scan	LOW								SPINOFF
Fitting
Addition of new parameters – manually fixing problems	HIGH	HIGH				Ongoing		IN PROGRESS	Hyesu Jang, David Mobley, Jessica Maat, Victoria Lim
LJ refitting (Sage)	HIGH		Non-bonded optimization					IN PROGRESS	Simon Boothroyd and Owen Madin
WBO refitting (Sage)	HIGH		More torsion data	WBO torsion implementation	Implement what Chaya has already done, as soon as the infrastructure is ready.	After May meeting	Late 2020 (Sep 2020)		Jessica Maat, Hyesu Jang; someone else to continue where Chaya left off
BCC refitting	HIGH		LJ refit; patterns for BCCs (could start with something simple like bond SMARTS)	ChargeIncrementModel implementation (early May)					Person needed (SPINOFF); David Mobley can help
Study how to set prior widths and weights for different sorts of data during FF optimization	LOW								Lee-Ping Wang, Hyesu Jang; spinoff?
Value of data generated “incidentally” during torsiondrive in fitting, e.g. optimization snapshots, gradients, energies (low control over these data points)	LOW			Some parts of Bespoke workflow					Joshua Horton; SPINOFF
Benchmarking
Small reference systems for fast testing of FE infrastructure – 5-10 small reference systems, possibly a subset of SAMPL challenges, for comparing different free energy methods without using large protein-ligand (P-L) systems for test calculations	HIGH	LOW			Should use SAMPLing challenge systems plus a couple more similar ones.		ASAP	NOT STARTED	David Mobley, Michael Gilson, John Chodera, David Hahn (owner)
Benchmarking/re-evaluating our choice of QM theory	HIGH	MEDIUM						NOT STARTED	Lee-Ping Wang
CCDC data selection/release	LOW								SPINOFF
Create a list of tests to judge the “quality” of biopolymer FF with our scientific advisory board	MEDIUM				Organise the meeting with our IAB, invite to May meeting		April / May		David Cerutti
`openff-1.2.0` (Parsley) benchmarking			Minor release of Parsley	Benchmarking dashboard	Done in preprint form, but no benchmarking dashboard. Still need torsion benchmarking; utilize work just done for OpenFF 1.0 paper.		Mid 2020	Done-ish
`openff-2.0.0` (Sage) benchmarking			Release of Sage	Benchmarking dashboard			Late 2020
Biopolymers
Which quantum method should we use for biopolymers (should it be the same as small molecules)?	MEDIUM		QM benchmarking study						Lee-Ping Wang, David Cerutti
Feasibility/benchmarking studies of torsional CMAPs	MEDIUM		After protein FF implementation	CMAP support in OFFTK					David Cerutti
Feasibility/benchmarking studies of other cross-terms	LOW			Support for cross-terms in OFFTK
Charges
GCN charge model	HIGH				In a few steps: (1) conda-installable tool to assign charges; (2) integration of the tool into OFFTK under the ChargeIncrementModel keyword (and exposure of relevant keywords)			IN PROGRESS	John Chodera, Yuanqing Wang
Off-site charge SMIRKS definition/fitting/benchmarking	MEDIUM	HIGH		VirtualSite support in OFFTK	Helpful discussion in Slack: https://openforcefieldgroup.slack.com/archives/C1907SGET/p1590251452068100				SPINOFF (but interface with David Cerutti's work?)
Bayesian inference and surrogate modeling
Testing Bayesian inference on an analytical model	MEDIUM	LOW						IN PROGRESS	Owen Madin
Generalizing analytical model for Bayesian inference and testing methods	MEDIUM	MEDIUM			We don’t need the Bayesian framework to work immediately			IN PROGRESS	Simon Boothroyd
Constructing full Bayesian architecture with reweighting and simulation to build surrogate models	MEDIUM	HIGH	Analytical Bayesian inference testing					NOT STARTED	Simon Boothroyd, Owen Madin, Matt Thompson?
Automated typing inference from scratch	HIGH	HIGH							Full-time person needed – to be discussed further. Work of Josh Fass and Tobias Wulsdorf may assist here.
Other
Water co-optimization planning study (to be executed later) – discuss with Lee-Ping Wang	LOW	HIGH							spinoff
Thinking about metals / ions / salts / ionic liquids	LOW	HIGH							Owen Madin, Matt Thompson; spinoff