Second Generation Torsion "Test Set" Design Scheme
Contributors: @Hyesu Jang, @Lee-Ping Wang
Goal: Build two separate test sets:
Neighboring set: torsions that are chemically similar to those in the training set;
Diverse set: torsions that cover a broader range of chemistry.
*Changes were made based on feedback from @David Mobley and @Jessica Maat.
Selection Scheme:
1. Neighboring set selection scheme
Uses the same clusters that were determined during training set generation (https://openforcefield.atlassian.net/wiki/x/d4DHDw).
Clustering scheme:
For each torsion parameter, list the effective torsions that match the parameter;
Cluster each list using MACCS keys and DBSCAN, adjusting the DBSCAN parameters as needed:
For each list of molecules, calculate the distance matrix from MACCS keys using the Tanimoto similarity measure;
Cluster the distance matrix using DBSCAN.
Default parameters: epsilon = 0.4, min_samples = 2
If the number of clusters is < 2, re-cluster the matrix with epsilon varied from 0.5 to 0.1 until the number of clusters is >= 2.
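The clustering steps above can be sketched as follows. This is a minimal illustration, not the code that produced the sets. It assumes that fingerprints are given as sets of on-bit indices (for example MACCS key numbers), that distance is taken as 1 minus Tanimoto similarity, and that the epsilon scan uses steps of 0.1. The page specifies none of these, and it does not say what happens if no epsilon yields two clusters; here the default result is kept. The small DBSCAN below is a stand-in so that the example runs without dependencies. In practice a library implementation such as scikit-learn's DBSCAN with metric="precomputed" would be the natural choice.

```python
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # indices of the "on" bits (assumed representation)


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """Return 1 - Tanimoto similarity of two bit sets."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def distance_matrix(fps: Sequence[Fingerprint]) -> List[List[float]]:
    """Full pairwise distance matrix."""
    return [[tanimoto_distance(a, b) for b in fps] for a in fps]


def dbscan(dist: Sequence[Sequence[float]], eps: float, min_samples: int) -> List[int]:
    """Minimal DBSCAN on a precomputed distance matrix; label -1 marks noise."""
    unvisited, noise = -2, -1
    n = len(dist)
    labels = [unvisited] * n
    # A point counts as its own neighbor, so min_samples includes the point itself.
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != unvisited:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = noise
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] != unvisited:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
        cluster += 1
    return labels


def count_clusters(labels: Sequence[int]) -> int:
    """Number of clusters, ignoring noise."""
    return len({label for label in labels if label >= 0})


def cluster_with_fallback(
    dist: Sequence[Sequence[float]],
    eps: float = 0.4,
    min_samples: int = 2,
    fallback_eps: Sequence[float] = (0.5, 0.4, 0.3, 0.2, 0.1),  # assumed step of 0.1
) -> List[int]:
    """Cluster with the default epsilon, then scan epsilon if fewer than 2 clusters."""
    labels = dbscan(dist, eps, min_samples)
    if count_clusters(labels) >= 2:
        return labels
    for trial_eps in fallback_eps:
        trial = dbscan(dist, trial_eps, min_samples)
        if count_clusters(trial) >= 2:
            return trial
    return labels  # assumption: keep the default result if the scan fails
```

Calling cluster_with_fallback(distance_matrix(fps)) on the fingerprints of all molecules matched by one torsion parameter returns one label per molecule, with -1 for noise.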
For each cluster, select the one molecule that is most similar to (but not the same as) a molecule selected for the training set, based on the Tanimoto similarity measure. (Molecules that differ from the training-set molecule only in stereochemistry were also excluded.)
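A minimal sketch of this per-cluster pick is shown below. The Mol record and its flat_id field (an identifier with stereochemistry removed, so that stereoisomers compare equal) are hypothetical names introduced for illustration. The sketch also assumes that the reference is the training-set molecule drawn from the same cluster.

```python
from typing import FrozenSet, NamedTuple, Optional, Sequence


class Mol(NamedTuple):
    name: str
    flat_id: str           # hypothetical: identifier without stereochemistry
    bits: FrozenSet[int]   # on-bit indices of the fingerprint


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_neighbor(cluster: Sequence[Mol], trained: Mol) -> Optional[Mol]:
    """Return the cluster member most similar to the training molecule.

    Members with the same flat_id are skipped, which removes both the
    training molecule itself and its stereoisomers.
    """
    candidates = [m for m in cluster if m.flat_id != trained.flat_id]
    if not candidates:
        return None  # nothing left to pick in this cluster
    return max(candidates, key=lambda m: tanimoto(m.bits, trained.bits))
```

Comparing on a stereo-free identifier handles both exclusions with a single test. A cluster that contains only the training molecule and its stereoisomers contributes nothing to the neighboring set.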
2. Diverse set selection scheme
- One approach, which uses clusters:
For each cluster, after selecting a molecule for the neighboring set, randomly select one of the remaining molecules that is not the same as any molecule in the training set or any molecule selected for the neighboring set.
All molecules that DBSCAN assigns to noise are added to the diverse set.
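The cluster-based approach can be sketched as below, assuming one molecule is drawn at random per cluster. The function name, its arguments, and the fixed seed are illustrative. Noise molecules are added without filtering, following the text as written; whether they should also be checked against the training set is not stated.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def select_diverse(
    names: Sequence[str],
    labels: Sequence[int],
    excluded: Set[str],
    seed: int = 0,
) -> List[str]:
    """Pick the diverse set from cluster labels (-1 marks noise).

    excluded holds the molecules already in the training or neighboring set.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    noise: List[str] = []
    remaining: Dict[int, List[str]] = defaultdict(list)
    for name, label in zip(names, labels):
        if label == -1:
            noise.append(name)  # all noise molecules are kept
        elif name not in excluded:
            remaining[label].append(name)
    picks = [rng.choice(remaining[label]) for label in sorted(remaining)]
    return noise + picks
```

A cluster whose members are all in the training or neighboring set never appears in remaining, so it contributes no molecule.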
- Another approach, without using clusters:
L-P raised a concern about the size of the diverse set when it includes every torsion that can be generated from the input molecule set: depending on the size of the input set, the diverse set can become too large to handle. One way to address this is to (1) randomly select a certain portion of the molecules from the input molecule set and (2) generate all possible torsions from the selected molecules. This keeps the size of the test set under control.
Comment from @David Mobley : “I think for selecting the diverse set, it would be quite reasonable to simply pick random molecules from our input sets until we reach the desired number of molecules or – if parameter coverage is a concern – to pick random molecules utilizing targeted parameters. I don’t think there is any reason we have to use chemical similarity/clustering to pick the diverse set since that’s handled by the neighboring set; the point of the diverse set is to get diversity.”
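A minimal sketch of the cluster-free approach is given below. The fraction argument, the seed, and the enumerate_torsions callable are all hypothetical. The callable is a placeholder for whatever routine lists the torsions of a molecule, which this page does not name.

```python
import random
from typing import Callable, Iterable, List, Sequence, Tuple, TypeVar

M = TypeVar("M")
Torsion = Tuple[int, int, int, int]  # four atom indices


def sample_molecules(molecules: Sequence[M], fraction: float, seed: int = 0) -> List[M]:
    """Step (1): randomly pick a portion of the input molecules."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Keep at least one molecule, but never ask for more than are available.
    k = min(len(molecules), max(1, round(fraction * len(molecules))))
    return rng.sample(list(molecules), k)


def build_diverse_set(
    molecules: Sequence[M],
    fraction: float,
    enumerate_torsions: Callable[[M], Iterable[Torsion]],  # hypothetical helper
    seed: int = 0,
) -> List[Tuple[M, Torsion]]:
    """Step (2): generate every torsion of each sampled molecule."""
    picked = sample_molecules(molecules, fraction, seed)
    return [(mol, torsion) for mol in picked for torsion in enumerate_torsions(mol)]
```

The fraction sets the number of molecules directly, which bounds the size of the test set. To target a fixed molecule count, as the comment above suggests, the fraction can be replaced by a count passed straight to the sampler.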
Selected sets:
Input molecule set: Roche set
training set (142 1-D): https://drive.google.com/open?id=1bk1KX-rc3Xmhf5cdVWXLCm-EjZ4a7_66
neighboring set (81 1-D): https://drive.google.com/open?id=1leNfbeGpnHk8lDhTwKRAohm4S_qULv82
diverse set (146 1-D): https://drive.google.com/open?id=1wfD0MXbdEohIBvIYPfcw7Mz8WdDoE8Gg
(Note that the diverse set shared here was generated using the first approach.)