Training Dataset Selection

Objective

Develop a tool for selecting training data sets for upcoming force field (FF) releases.

Objectives:

  • Determine the training data set for the upcoming May FF release (deadline: Friday, March 20, 2020)

Team

Team members: @Hyesu Jang @David Mobley @Jessica Maat (Deactivated) @Lee-Ping Wang @Jeffrey Wagner @Joshua Horton

Timeline to March 20th for data set submission

 

 


Goal | Subgoal | Status & notes

A. Test case: create submission data set from a single data set

Deadline: Fri March 6, 2020

 

in progress

 

Start a branch on the QCA dataset submission repository for Hyesu & Jessica to collaborate on a Jupyter notebook

 

 

Start with DLM’s Jupyter notebook and work on clustering for a smaller test case. Use graph similarity differences to choose the most disparate molecules.

@Jessica Maat (Deactivated) get code to work for the example data set (Jupyter notebook: https://github.com/openforcefield/qca-dataset-submission/pull/85)
@Jessica Maat (Deactivated) separate the procedure in the example notebook into torsion drive and optimization data set generation

Completed

 

Jessica & Hyesu meet to discuss ideas for data set clustering (March 5th @ 1-2 pm PST)

Complete

Notes from meeting:

  • Separate the data filtering for the torsion drive and optimization data sets.

  • Hyesu will work on the torsion drive data set.

  • Jessica will focus on the other terms (bond and angle) for the optimization data set.

  • Which data sets do we want to use? Roche set, eMolecules discrepancy set, Pfizer set, Bayer set.

 

@Hyesu Jang research clustering method and similarity score metric

Complete

Notes:

Discussed the current clustering approach with @David Mobley; he suggests using DBSCAN for the following reasons:

  • No need to specify the number of clusters

  • A cutoff distance can be specified, allowing outliers to be separated out

  • Most other methods require specifying the number of clusters and choosing a cutoff yourself, so DBSCAN reduces the amount of required human input

Clustering notes from @Jessica Maat (Deactivated) :

  • I compared DBSCAN to k-means and hierarchical agglomerative clustering (HAC):

    • k-means does not handle outliers as seamlessly as DBSCAN

    • k-means struggles with non-globular cluster shapes and variable densities, and has issues handling outliers

    • HAC can be sensitive to noise and outliers, favors globular clusters, and has difficulty breaking up large clusters

    • Once a merge decision is made in HAC, it cannot be undone

    • DBSCAN is a density-based clustering algorithm

    • DBSCAN is resistant to noise and can handle clusters of various shapes and sizes

      • This holds when appropriate input parameters are selected (eps and min_samples; see the sketch after this list)

    • DBSCAN has limitations when clusters have varying densities and with high-dimensional data
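
As a reference point for the notes above, here is a minimal, illustrative sketch of DBSCAN clustering on a precomputed Tanimoto distance matrix (RDKit + scikit-learn). The SMILES and parameter values are placeholders, not the project notebook; the eps/min_samples values used in production were settled later (see the March 18th meeting notes below).

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys
from sklearn.cluster import DBSCAN

# Placeholder molecules; in practice these come from the source data sets
smiles = ["CCO", "CCCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)OC1=CC=CC=C1C(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = [MACCSkeys.GenMACCSKeys(m) for m in mols]

# Pairwise Tanimoto distance (1 - similarity), fed to DBSCAN as "precomputed"
n = len(fps)
dist = np.zeros((n, n))
for i, fp in enumerate(fps):
    dist[i] = 1.0 - np.array(DataStructs.BulkTanimotoSimilarity(fp, fps))

labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # label -1 marks outliers; no cluster count is needed up front
```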

 

 

@Hyesu Jang determine how to combine final filtered data sets (should we combine the filtered results from different data sets?)

Completed

@Lee-Ping Wang suggests filtering each data set separately and then combining the filtered results, so that the chemical space of each individual set is preserved.

 

@Hyesu Jang implement code for torsion drive data set generation, filtering on rotatable bonds - Code link: https://github.com/openforcefield/qca-dataset-submission/pull/85
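
The actual implementation is in the PR linked above; the following is only a minimal sketch of the general idea, identifying rotatable bonds with the rotatable-bond SMARTS used by RDKit's Lipinski module so that only molecules with a torsion worth driving are kept. Function and variable names are illustrative.

```python
from rdkit import Chem

# Rotatable-bond pattern: a single, non-ring bond between two non-terminal
# heavy atoms that are not part of a triple bond
ROTATABLE_SMARTS = Chem.MolFromSmarts("[!$(*#*)&!D1]-&!@[!$(*#*)&!D1]")

def rotatable_bond_atom_pairs(smiles):
    """Return atom-index pairs of rotatable bonds for one molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    return list(mol.GetSubstructMatches(ROTATABLE_SMARTS))

# Keep only molecules that have at least one rotatable bond to drive
candidates = ["CCOC(=O)c1ccccc1", "C1CC1", "c1ccccc1"]
torsion_candidates = [s for s in candidates if rotatable_bond_atom_pairs(s)]
print(torsion_candidates)
```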

 

 

Jessica & Hyesu meet March 6th @ 12 pm PST to discuss updates

Complete

When selecting a molecule from a cluster, Jessica currently selects the smallest molecule in the cluster. The code should be updated to choose a molecule that is more representative of the cluster, for example the centroid.

 

 

 

 

Jessica finish example optimization DS

Select the centroid molecule of each cluster. Note: there is no centroid with DBSCAN (see the medoid-based sketch below).
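
Since DBSCAN does not produce centroids, one common stand-in is the medoid: the cluster member with the smallest total distance to the other members. A minimal sketch, assuming a precomputed pairwise distance matrix (the function name is hypothetical):

```python
import numpy as np

def pick_representative(dist, members):
    """Return the member of one cluster with the smallest total distance to the
    other members (the medoid, i.e. the closest DBSCAN analogue of a centroid)."""
    sub = dist[np.ix_(members, members)]
    return members[int(np.argmin(sub.sum(axis=1)))]

# Toy 4x4 distance matrix; molecules 0-2 form one cluster
dist = np.array([[0.0, 0.2, 0.3, 0.9],
                 [0.2, 0.0, 0.1, 0.8],
                 [0.3, 0.1, 0.0, 0.7],
                 [0.9, 0.8, 0.7, 0.0]])
print(pick_representative(dist, [0, 1, 2]))  # -> 1
```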

 

 

Hyesu finish torsion drive data set filtering/generation by March 8, 2020

Done

B. Filter molecules & submit data sets to QCA: run the code in the Jupyter notebook for all training data sets

Deadline: Fri March 13, 2020

 

 

 

Data sets to filter (suggestions from @David Mobley & @Hyesu Jang)

Submit & filter these data sets as soon as they are processed:

High priority = orange

Lower priority = green

  • Bayer data set
  • Roche
  • eMolecules discrepancy set
  • Coverage set
  • Pfizer 100 fragment discrepancy set
  • SiliconTx [tentative; check with Daniel whether this is finished]
  • NCI250k [tentative]
  • DrugBank FDA drugs

Notes: Check the coverage of each data set before filtering molecules. Make a heat map to check similarity between the data sets (see the sketch below).
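
A minimal sketch of one way to build such a heat map: compute the mean pairwise Tanimoto similarity (MACCS keys here, as one option) between every pair of data sets and plot it with matplotlib. The data set names and contents below are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys

def maccs_fps(smiles_list):
    """MACCS fingerprints for a list of SMILES, skipping unparsable entries."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [MACCSkeys.GenMACCSKeys(m) for m in mols if m is not None]

# Placeholder contents; in practice these would be the Roche, Bayer, ... sets
datasets = {
    "set_a": ["CCO", "CCCO", "CCN"],
    "set_b": ["c1ccccc1", "c1ccccc1O"],
}
names = list(datasets)
fps = {name: maccs_fps(smiles) for name, smiles in datasets.items()}

# Mean cross-set Tanimoto similarity for every pair of data sets
heat = np.zeros((len(names), len(names)))
for i, a in enumerate(names):
    for j, b in enumerate(names):
        sims = [DataStructs.BulkTanimotoSimilarity(fp, fps[b]) for fp in fps[a]]
        heat[i, j] = np.mean(sims)

plt.imshow(heat, cmap="viridis")
plt.xticks(range(len(names)), names, rotation=45)
plt.yticks(range(len(names)), names)
plt.colorbar(label="mean Tanimoto similarity")
plt.tight_layout()
plt.show()
```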

 

Meeting March 18th (Jeff, Hyesu, Jessica)

  • Selection of the smallest molecule; DBSCAN does not have a centroid

  • Generation of the .json input: should I make separate .json files of the filtered sets for each data set, or a single combined .json?

    • A single directory, or all filtered sets in separate directories?

  • How are we naming data sets?

  • Should I set a single eps and min_samples value, or scale them based on the filter results?

  • MACCS keys are based on SMARTS; LINGO is based on SMILES

  • ~2,000 optimization molecules were used in the previous fitting

 

Meeting notes:

  • Clustering is based on force field parameters; we want maximum chemical diversity for each parameter

  • Use a single eps and min_samples value; after clustering, look at the number of molecules for each parameter. If the number is too large or small, adjust and rerun.

  • eps = 0.5 and min_samples = 2

  • For clusters with fewer than 3 molecules, select randomly (see the sketch below)
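
Putting the meeting decisions together, a sketch of one possible selection policy: cluster a precomputed distance matrix with eps = 0.5 and min_samples = 2, then take one molecule per cluster, choosing randomly for clusters with fewer than 3 molecules and the medoid otherwise. Keeping every DBSCAN noise point (label -1) as its own selection is an assumption here, not something decided in the notes.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def select_per_cluster(dist, eps=0.5, min_samples=2, small=3, seed=0):
    """Return one molecule index per DBSCAN cluster of a precomputed distance
    matrix: a random member for clusters with fewer than `small` molecules,
    the medoid otherwise. All noise points (label -1) are kept (an assumption)."""
    rng = np.random.default_rng(seed)
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(dist)
    picked = []
    for label in set(labels):
        members = np.flatnonzero(labels == label)
        if label == -1:
            picked.extend(int(i) for i in members)    # keep every outlier
        elif len(members) < small:
            picked.append(int(rng.choice(members)))   # random pick, per the meeting
        else:
            sub = dist[np.ix_(members, members)]
            picked.append(int(members[np.argmin(sub.sum(axis=1))]))
    return sorted(picked)
```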

Dataset formatting/organization

  • HJ – I separate by the source of each data set to keep the chemical space of each input molecule set separate.

  • HJ – Dataset organization

    • Best to separate by original data set: Bayer/Roche discrepancy set, Pfizer set, …

    • For the GH repo, I make sure to include the relevant IPython notebooks, plus a utils-<something> module with handy functions (e.g., for generating the json; see the sketch after this list)

    • JW – also include the output of conda env export > environment_exact.yml
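
As a rough illustration of the "one directory per source data set" organization with a json-generating helper, here is a minimal sketch. The directory names, file names, and JSON fields are all hypothetical; the real submission schema and utilities live in the qca-dataset-submission repository.

```python
import json
from pathlib import Path
from rdkit import Chem

# Hypothetical filtered results keyed by source data set (names are illustrative)
filtered = {
    "bayer_filtered": ["CCO", "CCN"],
    "roche_filtered": ["c1ccccc1O"],
}

for name, smiles_list in filtered.items():
    out_dir = Path(name)                      # one directory per source data set
    out_dir.mkdir(exist_ok=True)
    canonical = [Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in smiles_list]
    payload = {"dataset_name": name, "smiles": canonical}  # fields are placeholders
    with open(out_dir / f"{name}.json", "w") as handle:
        json.dump(payload, handle, indent=2)
```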

Finish all data sets by March 20th.

Deadline: Fri March 20, 2020

Notes:

  • The goal of the project is to exercise the parameters in the training data set more evenly and to broaden the chemistry we are fitting to

  • The finalized data set should contain roughly the same order of magnitude of target molecules as the recent Parsley fit

  • If filtering data sets individually and then combining the results leads to too large a data set, we can try combining data sets before filtering to reduce the training data set size