Training Dataset Selection

Objective

Develop a tool for training data set selection for upcoming force field (FF) releases.

Objectives:

  • Determine training data set for upcoming May FF release (deadline Friday March 20, 2020)

Team

Team members: @Hyesu Jang @David Mobley @Jessica Maat (Deactivated) @Lee-Ping Wang @Jeffrey Wagner @Joshua Horton

Timeline to March 20th for data set submission

 

 


Goal

Subgoal

Status & notes

A. Test case: create a submission data set from a single data set

Deadline: Fri March 6, 2020

 

in progress

 

Start a branch on the QCA data set submission repository for Hyesu & Jessica to collaborate on a Jupyter notebook

 

 

Start with DLM’s Jupyter notebook and work on clustering for a smaller test case. Use graph similarity differences to choose the most disparate molecules.
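The similarity-difference step can be sketched as a pairwise distance matrix of 1 minus Tanimoto similarity over molecular fingerprints. This is a minimal pure-Python sketch; the bit-set fingerprints below are toy placeholders for whatever real fingerprints (e.g. MACCS keys) the notebook uses.

```python
# Sketch: pairwise distance matrix (1 - Tanimoto similarity) from
# fingerprints. The fingerprints here are toy sets of on-bits standing
# in for real MACCS/Morgan fingerprints.

def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def distance_matrix(fingerprints):
    """Symmetric matrix of Tanimoto distances (1 - similarity)."""
    n = len(fingerprints)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - tanimoto(fingerprints[i], fingerprints[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Toy example: three "molecules" with overlapping on-bits
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}]
dist = distance_matrix(fps)
print(dist[0][1])  # 1 - 2/4 = 0.5
print(dist[0][2])  # disjoint fingerprints -> distance 1.0
```

The most disparate molecules are then the ones at the largest distances in this matrix, and the matrix can be fed directly to a clustering algorithm.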

@Jessica Maat (Deactivated) get code to work for an example data set (Jupyter notebook: https://github.com/openforcefield/qca-dataset-submission/pull/85)
@Jessica Maat (Deactivated) separate the procedure in the example notebook for torsion drive and optimization data set generation

Completed

 

Jessica & Hyesu meet to discuss ideas for data set clustering (March 5th @ 1-2pm P.S.T.)

Complete

Notes from meeting:

Separate data filtering for the torsion drive and optimization data sets.

Hyesu will handle the torsion drive data set

Jessica will focus on the other terms (bond and angle) for the optimization data set

Which data sets do we want to use? Roche set, eMolecules discrepancy set, Pfizer set, Bayer set

 

@Hyesu Jang research clustering method and similarity score metric

Complete

Notes:

Discussed current clustering with @David Mobley and he suggests using DBSCAN because of the following points:

  • Do not need to specify number of clusters

  • Can specify a cutoff distance and allow for separation of outliers

  • Most other methods require you to specify the number of clusters and choose a cutoff yourself; DBSCAN therefore reduces the amount of required human input

Clustering notes from @Jessica Maat (Deactivated) :

  • I compared DBSCAN to K-means and hierarchical agglomerative clustering (HAC):

    • K-means does not handle outliers as seamlessly as DBSCAN

    • K-means performs poorly on non-globular cluster shapes and variable densities because of its issues with handling outliers

    • HAC can be sensitive to noise and outliers, favors only globular clusters, and has difficulty breaking up large clusters

    • Once a merge decision is made in HAC, it cannot be undone

    • DBSCAN is a density-based clustering algorithm

    • DBSCAN is resistant to noise and can handle clusters of various shapes and sizes

      • This is true when the correct input parameters (eps and min_samples) are selected.

    • DBSCAN has limitations when clusters have varying densities and with high-dimensional data
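The DBSCAN workflow above can be sketched with scikit-learn, assuming the pairwise Tanimoto distances have already been computed. The eps = 0.5 and min_samples = 2 values are the ones chosen in the March 18 meeting notes below; the distance matrix here is a toy example, not real fingerprint data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Sketch: cluster molecules with DBSCAN over a precomputed Tanimoto
# distance matrix. eps=0.5 / min_samples=2 come from the meeting notes;
# the 4x4 matrix below is a toy example (molecules 0,1 similar; 2,3 similar).
D = np.array([
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
])

labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit(D).labels_
print(labels)  # [0 0 1 1]; DBSCAN labels outliers -1, no cluster count needed
```

Note that `metric="precomputed"` lets DBSCAN consume the distance matrix directly, which matches the fingerprint-distance approach: no coordinate embedding of the molecules is required.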

 

 

@Hyesu Jang determine how to combine final filtered data sets (should we combine the filtered results from different data sets?)

Completed

@Lee-Ping Wang suggests combining the final filtered data sets so that the chemical space of each individual set is preserved.

 

@Hyesu Jang implement code for torsion drive data set and filtering out rotatable bonds for torsion drive data set - Code link: https://github.com/openforcefield/qca-dataset-submission/pull/85

 

 

Jessica & Hyesu meet March 6th @ 12 pm PST to discuss updates

Complete

Currently, when selecting a molecule from a cluster, Jessica selects the smallest molecule in the cluster. The code should be updated to choose a molecule that is more representative of the cluster, for example the centroid.

 

 

 

 

Jessica finish example optimization DS

Select the centroid molecule of the cluster (note: DBSCAN does not define a cluster centroid)
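Since DBSCAN does not define centroids, one common workaround (an assumption here, not a decision recorded in these notes) is to pick the cluster medoid: the member with the smallest total distance to all other members, computed from the same precomputed distance matrix.

```python
import numpy as np

def medoid_index(dist, members):
    """Return the cluster member whose summed distance to the other
    members is smallest -- a stand-in for the (undefined) DBSCAN centroid."""
    sub = dist[np.ix_(members, members)]  # restrict matrix to the cluster
    return members[int(np.argmin(sub.sum(axis=1)))]

# Toy distance matrix: molecule 1 sits "between" molecules 0 and 2
dist = np.array([
    [0.0, 0.2, 0.4],
    [0.2, 0.0, 0.2],
    [0.4, 0.2, 0.0],
])
print(medoid_index(dist, [0, 1, 2]))  # 1: smallest summed distance (0.4)
```

Unlike a centroid, the medoid is always an actual molecule in the cluster, so it can be submitted directly.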

 

 

Hyesu finish torsion drive data set filter/generation by 3-8-20

Done

B. Filter molecules & submit data sets to QCA: run the code in the Jupyter notebook for all training data sets

Deadline: Fri March 13, 2020

 

 

 

Data sets to filter (suggestions from @David Mobley & @Hyesu Jang )

Submit & filter these data sets as soon as they are processed:

High priority = orange

Lower priority = green

Bayer data set
Roche
eMolecules discrepancy set
Coverage set
Pfizer 100 fragment discrepancy set
SiliconTx [tentative, check with Daniel if this is finished.]
NCI250k [tentative]
DrugBank FDA drugs
Maybe the Sellers fragment set

Notes: Check the coverage of the data sets before filtering molecules. Make a heat map to check similarity between data sets.
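The heat-map check can be sketched as a matrix of mean pairwise similarities between data sets; those numbers are what a heat map would display. The similarity function and the toy fingerprints below are placeholders for the project's real fingerprints.

```python
# Sketch: data-set-vs-data-set mean Tanimoto similarity, i.e. the values
# a similarity heat map would show. Fingerprints are toy on-bit sets.

def tanimoto(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0

def mean_cross_similarity(set_a, set_b):
    """Average similarity over all molecule pairs drawn from two data sets."""
    total = sum(tanimoto(fa, fb) for fa in set_a for fb in set_b)
    return total / (len(set_a) * len(set_b))

# Two toy "data sets" of fingerprints (names are illustrative only)
datasets = {
    "roche": [{1, 2, 3}, {1, 2, 4}],
    "bayer": [{7, 8}, {1, 2, 3}],
}
names = list(datasets)
heat = {
    (a, b): round(mean_cross_similarity(datasets[a], datasets[b]), 3)
    for a in names for b in names
}
print(heat[("roche", "bayer")])  # 0.375
```

A high off-diagonal value would flag two data sets as covering overlapping chemical space, which is useful to know before deciding whether to filter them separately or together.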

 

Meeting March 18th (Jeff, Hyesu, Jessica)

  • Selection of the smallest molecule; DBSCAN does not have a centroid

  • Generation of .json input: should I create separate .json files of the filtered sets for each data set, or a single combined .json?

    • Single directory or all filtered sets in separate directories?

  • How are we naming data sets?

  • Should I set a single eps and min samples or scale them based on filter results?

  • MACCS keys are based on SMARTS patterns; LINGO is based on SMILES strings

  • ~2,000 optimization molecules were used in the previous fitting

 

Meeting notes-

  • Cluster based on parameters; we want maximum chemical diversity with respect to the parameters

  • Use a single eps and min_samples value, and after clustering look at the number of molecules for each parameter. If the number is too large or too small, adjust and rerun.

  • eps = 0.5 and min_samples = 2

  • If a cluster has fewer than 3 molecules, select randomly
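The selection rule agreed above could look like the following sketch: group molecules by DBSCAN label, pick randomly from clusters with fewer than 3 members, and use a representative (e.g. the medoid) for larger ones. The function name and the placeholder "first member" choice for large clusters are illustrative assumptions, not the notebook's actual code.

```python
import random
from collections import defaultdict

def select_representatives(labels, rng=random.Random(0)):
    """Pick one molecule index per DBSCAN cluster.

    Clusters with fewer than 3 members are sampled randomly, per the
    meeting decision; larger clusters use a placeholder choice (the
    first member) where a medoid/representative pick would go.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    picks = {}
    for label, members in clusters.items():
        if label == -1:          # DBSCAN noise label: keep outliers separate
            continue
        if len(members) < 3:
            picks[label] = rng.choice(members)
        else:
            picks[label] = members[0]  # placeholder: medoid selection here
    return picks

# Toy labels: cluster 0 has 2 members, cluster 1 has 3, one noise point
print(select_representatives([0, 0, 1, 1, 1, -1]))
```

Seeding the random generator (here via `random.Random(0)`) keeps the random picks reproducible across notebook reruns.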

Dataset formatting/organization

  • HJ – I separate by the source of each data set to keep the chemical space of each input molecule set separate.

  • HJ – Dataset organization

    • Best to separate by original data set: Bayer/Roche discrepancy set, Pfizer set…

    • For the GH repo, I make sure to include relevant IPython notebooks, and include utils-<something>, which includes handy functions (e.g. to generate the JSON)

    • JW – also include the output of conda env export > environment_exact.yml
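The one-JSON-per-source-data-set organization described above could look like the following sketch. The directory layout, file names, and record fields are illustrative assumptions, not the actual qca-dataset-submission schema.

```python
import json
from pathlib import Path

# Sketch: write one filtered-set JSON per source data set, each in its
# own directory, matching the "separate by original data set" convention.
# The record layout and placeholder SMILES below are illustrative only.
filtered = {
    "roche": ["CCO", "c1ccccc1"],  # placeholder filtered molecules (SMILES)
    "bayer": ["CC(=O)O"],
}

out_root = Path("filtered_sets")
for source, smiles in filtered.items():
    subdir = out_root / source
    subdir.mkdir(parents=True, exist_ok=True)
    with open(subdir / f"{source}_filtered.json", "w") as f:
        json.dump({"source": source, "molecules": smiles}, f, indent=2)

print(sorted(p.name for p in out_root.rglob("*.json")))
```

Keeping one directory per source data set makes it easy to trace each submitted molecule back to the chemical space it came from, which is the motivation HJ gives above.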

Finish all data sets by March 20th.

Deadline: Fri March 20, 2020


Notes:

  • The goal of the project is to make use of the parameters in the training data set more evenly and to broaden the chemistry we are fitting to

  • The finalized data set should contain roughly the same order of magnitude of target molecules as the recent Parsley fit

  • If filtering data sets individually and then combining the results leads to too large a data set, we can instead combine the data sets before filtering to reduce the training data set size