Training Dataset Selection
Objective: Develop a tool for training data set selection for upcoming FF releases. Objectives:
| Team members: @Hyesu Jang @David Mobley @Jessica Maat @Lee-Ping Wang @Jeffrey Wagner @Joshua Horton |
Timeline to March 20th for data set submission |
Goal | Subgoal | Status & notes |
---|---|---|
A. Test case: create a submission data set from a single data set. Deadline: Fri March 6, 2020 |
| in progress |
| Start a branch on the QCA data set submission repository for Hyesu & Jessica to collaborate on a Jupyter notebook |
|
| Start with DLM’s Jupyter notebook and work on clustering for a smaller test case. Use graph similarity differences to choose the most disparate molecules. @Jessica Maat: get the code to work for an example data set (Jupyter notebook: https://github.com/openforcefield/qca-dataset-submission/pull/85). @Jessica Maat: separate the procedure in the example notebook into torsion drive and optimization data set generation. | Completed |
| Jessica & Hyesu meet to discuss ideas for data set clustering (March 5th @ 1-2 pm PST) | Complete. Notes from meeting: Separate the data filtering for the torsion and optimization data sets. Hyesu will focus on the torsion drive data set. Jessica will focus on the other terms (bond and angle) for the optimization data set. Roche set, discrepancy eMolecules, Pfizer, Bayer set: which data sets do we want to use? |
| @Hyesu Jang: research clustering methods and similarity score metrics | Complete. Notes: Discussed the current clustering with @David Mobley, who suggests using DBSCAN because of the following points:
Clustering notes from @Jessica Maat:
|
| @Hyesu Jang: determine how to combine the final filtered data sets (should we combine the filtered results from different data sets?) | Completed. @Lee-Ping Wang suggests combining the final filtered data sets to preserve individual chemical space. |
| @Hyesu Jang implement code for torsion drive data set and filtering out rotatable bonds for torsion drive data set - Code link: https://github.com/openforcefield/qca-dataset-submission/pull/85 |
|
| Jessica & Hyesu meet March 6th @ 12 pm PST to discuss updates | Complete. When selecting a molecule from a cluster, Jessica currently selects the smallest molecule in the cluster. The code should be updated to choose a molecule that is more representative of the cluster, for example the centroid.
|
| Jessica: finish the example optimization DS. Select the centroid molecule of each cluster. Note: there is no centroid when using DBSCAN. |
|
| Hyesu finish torsion drive data set filter/generation by 3-8-20 | Done |
B. Filter molecules & submit data sets to QCA: run the code in the Jupyter notebook for all training data sets. Deadline: Fri March 13, 2020 |
|
|
| Data sets to filter (suggestions from @David Mobley & @Hyesu Jang). Filter & submit these data sets as soon as they are processed. High priority = orange, lower priority = green. Bayer data set; Roche; eMolecules discrepancy set; Coverage set; Pfizer 100 fragment discrepancy set; SiliconTx [tentative, check with Daniel if this is finished]; NCI250k [tentative]; DrugBank FDA drugs; maybe Sellers fragment set: https://github.com/openforcefield/qca-dataset-submission/issues/63 | Notes: Check the coverage of the DS before filtering molecules. Make a heat map to check similarity between DS. |
| Meeting March 18th (Jeff, Hyesu, Jessica) |
Meeting notes:
Dataset formatting/organization
|
Finish all data sets by March 20th. Deadline: Fri March 20, 2020 |
|
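Since DBSCAN does not provide a centroid, one possible stand-in for "the most representative molecule" is the cluster medoid: the member with the smallest total distance to the other members of its cluster. The sketch below is only an illustration, not the code in the notebook. It assumes a precomputed pairwise distance matrix (for example 1 - similarity) and one cluster label per molecule, with -1 marking noise as in scikit-learn's DBSCAN output. The function name `cluster_medoids` and the toy numbers are hypothetical.

```python
from collections import defaultdict


def cluster_medoids(distance, labels, noise_label=-1):
    """Return {cluster label: index of that cluster's medoid}.

    distance: square, symmetric matrix (list of lists) of pairwise distances.
    labels:   one cluster label per molecule; noise_label marks unclustered points.
    """
    # Group molecule indices by cluster, skipping noise points.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        if label != noise_label:
            members[label].append(idx)

    medoids = {}
    for label, idxs in members.items():
        # Medoid = member with the smallest summed distance to its cluster mates.
        medoids[label] = min(idxs, key=lambda i: sum(distance[i][j] for j in idxs))
    return medoids


# Toy data (hypothetical): five molecules, distance = 1 - similarity.
distance = [
    [0.0, 0.2, 0.3, 0.9, 0.8],
    [0.2, 0.0, 0.2, 0.9, 0.9],
    [0.3, 0.2, 0.0, 0.8, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.1, 0.0],
]
labels = [0, 0, 0, 1, -1]  # molecule 4 is treated as noise

medoids = cluster_medoids(distance, labels)
print(medoids)
```

Unlike a centroid, a medoid is always an actual molecule from the cluster, so it can be submitted directly. Whether noise points should also be kept as singleton selections is a separate choice that this sketch leaves open.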
Notes:
The goal of the project is to use the parameters in the training data set more evenly and to broaden the chemistry we are fitting to.
The finalized data set should contain around the same order of magnitude of target molecules as the recent Parsley fit.
If filtering data sets individually and then combining the results leads to too large a data set, we can try combining the data sets before filtering to reduce the training data set size.
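To illustrate why the order of filtering and combining can change the final size, here is a minimal sketch. It does not use the clustering from the notebook. As an assumption, it swaps in a simple greedy similarity filter, and it represents each molecule by a set of integer feature identifiers as a stand-in for a real fingerprint. The names `tanimoto` and `diversity_filter`, the 0.5 threshold, and the toy sets are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of feature identifiers."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def diversity_filter(molecules, threshold=0.5):
    """Greedy filter: keep a molecule only if its similarity to every
    already-kept molecule is below `threshold`."""
    kept = []
    for name, features in molecules:
        if all(tanimoto(features, kept_features) < threshold
               for _, kept_features in kept):
            kept.append((name, features))
    return kept


# Two toy data sets (hypothetical); b1 is a near-duplicate of a1.
set_a = [("a1", {1, 2, 3, 4}), ("a2", {1, 2, 3, 5}), ("a3", {7, 8, 9})]
set_b = [("b1", {1, 2, 3, 4, 6}), ("b2", {10, 11, 12})]

# Strategy 1: filter each data set individually, then combine the results.
separately = [name for name, _ in diversity_filter(set_a) + diversity_filter(set_b)]

# Strategy 2: combine the data sets first, then filter once.
combined = [name for name, _ in diversity_filter(set_a + set_b)]

print(separately)
print(combined)
```

In this toy case, filtering individually keeps b1 because it is never compared against a1, while filtering the combined set drops it. The trade-off is that combining first no longer guarantees that each source data set keeps its own representatives.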