Objective: Develop a tool for training data set selection for upcoming FF releases. Objectives:
| Team members: Hyesu Jang, David Mobley, Jessica Maat, Lee-Ping Wang, Jeffrey Wagner, Joshua Horton |
Timeline to March 20th for data set submission | ||
---|---|---|
Goal | Subgoal | Status & notes |
A. Test case: create a submission data set from a single data set. Deadline: Fri March 6, 2020 | In progress | |
Start a branch on the QCA data set submission repository for Hyesu & Jessica to collaborate on a Jupyter notebook | ||
Start with DLM’s Jupyter notebook and work on clustering on a smaller test case. Use graph similarity differences to choose the most disparate molecules.
| Completed | |
Jessica & Hyesu meet to discuss ideas for data set clustering (March 5th @ 1-2 pm PST) | Complete. Notes from meeting: Filter data separately for the torsion and optimization data sets. Hyesu will work on the torsion drive data set; Jessica will focus on the other terms (bond and angle) for the optimization data set. Open question: which data sets do we want to use (Roche set, discrepancy eMolecules, Pfizer, Bayer set)? | |
| Complete. Notes: Discussed the current clustering with David Mobley, who suggests using DBSCAN because of the following points:
Clustering notes from Jessica Maat:
| |
| Completed. Lee-Ping Wang suggests combining the final filtered data sets to preserve the chemical space of each individual data set. | |
| ||
Jessica & Hyesu meet March 6th @ 12 pm PST to discuss updates | Complete. Jessica currently selects the smallest molecule from each cluster. The code should be updated to consider choosing a molecule that is more representative of the cluster, for example the centroid. | |
Jessica to finish the example optimization data set
| ||
Hyesu to finish the torsion drive data set filter/generation by March 8, 2020 | Done | |
B. Filter molecules & submit data sets to QCA: run the code in the Jupyter notebook for all training data sets. Deadline: Fri March 13, 2020 | ||
Data sets to filter (suggestions from David Mobley & Hyesu Jang). Filter & submit these data sets as soon as they are processed. High priority = orange; lower priority = green.
| Notes: Check the coverage of each data set before filtering molecules. Make a heat map to check similarity between data sets. | |
Meeting March 18th (Jeff, Hyesu, Jessica) |
Meeting notes-
Dataset formatting/organization
| |
Finish all data sets by March 20th. Deadline: Fri March 20, 2020 | ||
Notes:
The goal of the project is to make use of the parameters in the training data set more evenly and to broaden the chemistry we are fitting to.
The finalized data set should contain around the same order of magnitude of target molecules as the recent Parsley fit.
If filtering the data sets individually and then combining the results leads to too large a data set, we can instead combine the data sets before filtering to reduce the training data set size.
...
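The notes above describe two possible orders of operation (filter each data set and then combine, or combine first and then filter), along with the idea of picking a representative molecule per cluster. The sketch below illustrates how those two routes can differ in size on toy data. Everything in it is an assumption made for illustration: the molecule names, the bond-type feature sets standing in for real fingerprints, the `cutoff` value, and the simple distance-threshold clustering (connected components, which is not DBSCAN and is not the notebook's actual method). The representative is the cluster medoid, used here as a stand-in for the "centroid" mentioned in the meeting notes.

```python
from itertools import combinations


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def threshold_clusters(fps: dict, cutoff: float) -> list:
    """Group molecules into connected components of pairs within `cutoff`.

    Hypothetical stand-in for the real clustering step (e.g. DBSCAN).
    """
    parent = {name: name for name in fps}

    def find(name):
        # Follow parent links to the component root, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(fps, 2):
        if tanimoto_distance(fps[a], fps[b]) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in fps:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def medoid(cluster: list, fps: dict) -> str:
    """Pick the member with the smallest total distance to its cluster."""
    return min(
        cluster,
        key=lambda m: sum(tanimoto_distance(fps[m], fps[o]) for o in cluster),
    )


def select_representatives(fps: dict, cutoff: float = 0.5) -> list:
    """Cluster one collection of molecules and keep one medoid per cluster."""
    return sorted(medoid(c, fps) for c in threshold_clusters(fps, cutoff))


def filter_then_combine(datasets: dict, cutoff: float = 0.5) -> list:
    """Filter every data set on its own, then merge the selections."""
    picked = set()
    for fps in datasets.values():
        picked.update(select_representatives(fps, cutoff))
    return sorted(picked)


def combine_then_filter(datasets: dict, cutoff: float = 0.5) -> list:
    """Merge all data sets first, then filter the merged pool once."""
    merged = {}
    for fps in datasets.values():
        merged.update(fps)
    return select_representatives(merged, cutoff)


# Toy input: hypothetical molecules described by made-up bond-type features.
datasets = {
    "set_a": {
        "mol1": frozenset({"C-C", "C-O", "C-H"}),
        "mol2": frozenset({"C-C", "C-O", "C-H", "O-H"}),
        "mol3": frozenset({"C-N", "N-H"}),
    },
    "set_b": {
        "mol4": frozenset({"C-C", "C-O", "C-H"}),  # same features as mol1
        "mol5": frozenset({"C-S", "S-H"}),
    },
}

print("filter then combine:", filter_then_combine(datasets))
print("combine then filter:", combine_then_filter(datasets))
```

In this toy case the first route keeps four molecules and the second keeps three, because `mol1` and `mol4` share the same features and only collapse into one cluster when the sets are merged before filtering. A real workflow would replace the feature sets with proper molecular fingerprints and the threshold clustering with whichever method the notebook settles on.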