Contributors: David Mobley, Lee-Ping Wang, Hyesu Jang, Jeff Wagner, Chris Bayly, Josh Horton, Daniel Smith, Chaya Stern, Jessica Maat
Background:
The Open Force Field Initiative is developing optimization training data sets using a fingerprint and clustering method.
The aim of this project is to pull chemically diverse molecules from a range of data sets to survey a larger chemical space for our May release force field.
The Bayer set contains 5054 large, flexible, pharmaceutically relevant molecules ranging from 12 to 30 heavy atoms.
Aim:
Limit the number of conformers taken from a patented Bayer data set for use in the optimization data set.
Reduce the data set to ~3 conformers/molecule.
Problem:
Current fingerprint & clustering methods result in 525 molecules & 16,242 conformers.
Hypothesized contributors to large number of conformers:
Large molecule size
Excessive rotatable bonds
Approach:
1. Try several size-filtering strategies that aim to preserve chemical diversity, and measure the # of molecules and conformers each one yields.
2. If #1 is not successful, move on to rotatable bond filtering.
3. If #1 and #2 are not successful, move on to fragmentation.
Experimental notes:
Clustering method: DBSCAN eps = 0.3, min_samples = 4
Fingerprint method: MACCS (supported by previous experiments from Hyesu Jang)
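
To make the clustering parameters concrete, here is a minimal, dependency-free sketch of DBSCAN with eps = 0.3 and min_samples = 4. Several things in it are assumptions rather than details from these notes: the distance metric (Tanimoto distance between fingerprints), the representation of each MACCS fingerprint as a Python set of on-bit indices, the toy fingerprints themselves, and the scikit-learn convention that min_samples counts the point itself. A real run would more likely use library implementations of the fingerprint and the clustering.

```python
def tanimoto_distance(a, b):
    """1 - Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def dbscan(fps, eps=0.3, min_samples=4):
    """Return one label per fingerprint; -1 marks noise."""
    n = len(fps)
    # Neighborhoods include the point itself (distance 0 <= eps).
    neighbors = [
        [j for j in range(n) if tanimoto_distance(fps[i], fps[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


# Hypothetical toy fingerprints: four near-identical bit sets and one outlier.
base = set(range(10))
fingerprints = [base, base | {10}, base | {11}, base - {0}, {100, 101, 102}]
labels = dbscan(fingerprints, eps=0.3, min_samples=4)
print(labels)
```

Under the Tanimoto assumption, eps = 0.3 means two molecules are neighbors when their fingerprints are at least 70% similar, and a cluster needs at least four such mutually close molecules to form.
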
Method | # of molecules | # of conformers | notes |
---|---|---|---|
Randomized size selection | 524 | 10454 | Randomly select molecules from clusters and accept or reject each molecule depending on whether its # of atoms is less than 25. This method still led to an excessive # of conformers. |
Select smallest molecule from each cluster | 524 | 3600 | This method helped reduce the # of conformers, but the final number was still excessive. It also lowers chemical diversity: if a molecule utilizes multiple parameters and is the smallest in several clusters, it will be selected multiple times. |
Set conformers cut off to 4 | 436 | 1850 | This method worked best because it reduced the # of conformers to a reasonable amount. It was implemented by setting max conformers in Fragmenter to 5. |
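
As a quick check against the ~3 conformers/molecule target, the counts reported above can be converted to conformers per molecule. The numbers are taken from this page; the dictionary keys and the `TARGET` constant are only illustrative names.

```python
# (molecules, conformers) as reported on this page.
results = {
    "initial fingerprint & clustering": (525, 16242),
    "randomized size selection": (524, 10454),
    "smallest molecule per cluster": (524, 3600),
    "conformer cutoff": (436, 1850),
}
TARGET = 3.0  # desired conformers per molecule


def conformers_per_molecule(n_molecules, n_conformers):
    return n_conformers / n_molecules


ratios = {name: conformers_per_molecule(*counts) for name, counts in results.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f} conformers/molecule ({ratio / TARGET:.1f}x target)")
```

The conformer cutoff is the only method that lands close to the target, at roughly 4.2 conformers/molecule, compared with about 31 for the initial selection.
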
Conclusion:
Setting the max conformers was the most effective method for maintaining chemical diversity when selecting molecules from clusters while reducing the number of conformers. Other methods, such as selecting the smallest molecule in each cluster, might reduce the # of conformers but also reduce chemical diversity. The main goal of training data set selection is to increase chemical diversity and thereby the chemical space our force field covers.
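
The diversity loss from smallest-molecule selection can be illustrated with a small sketch. The cluster labels, molecule IDs, and heavy-atom counts below are hypothetical, and the input layout (a mapping from cluster label to its members, where one molecule may appear in several clusters) is an assumption about how the clusters are organized, not a description of the actual pipeline.

```python
# Hypothetical clusters: label -> list of (molecule_id, heavy_atom_count).
# "mol_1" appears in two clusters, as a molecule using several parameters might.
clusters = {
    "cluster_a": [("mol_1", 14), ("mol_2", 22)],
    "cluster_b": [("mol_1", 14), ("mol_3", 27)],
    "cluster_c": [("mol_4", 18), ("mol_5", 30)],
}


def smallest_per_cluster(clusters):
    """Pick the molecule with the fewest heavy atoms from each cluster."""
    return {
        label: min(members, key=lambda member: member[1])[0]
        for label, members in clusters.items()
    }


picks = smallest_per_cluster(clusters)
unique_picks = sorted(set(picks.values()))
print(picks)
print(f"{len(clusters)} clusters -> {len(unique_picks)} unique molecules")
```

Here three clusters yield only two distinct molecules, because the same small molecule wins in two of them.
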