Contributors: David Mobley, Lee-Ping Wang, Hyesu Jang, Jeff Wagner, Chris Bayly, Josh Horton, Daniel Smith, Chaya Stern, Jessica Maat
Background:
The Open Force Field Initiative is developing training data sets for force field optimization using a fingerprint and clustering method.
The aim of this project is to pull chemically diverse molecules from a range of data sets to survey a larger chemical space for our May release force field.
The Bayer set contains 5054 large, flexible, pharmaceutically relevant molecules with 12 to 30 heavy atoms each.
Aim:
Limit the number of conformers taken from a patented Bayer data set for use in the optimization data set.
Reduce the data set to ~3 conformers/molecule.
Problem:
The current fingerprint and clustering methods yield 525 molecules and 16,242 conformers, roughly 31 conformers/molecule.
Hypothesized contributors to the large number of conformers:
Large molecule size
Excessive rotatable bonds
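To make the size of the gap concrete, the short calculation below compares the counts from the Problem statement with the ~3 conformers/molecule target from the Aim. It is plain arithmetic on the numbers quoted in these notes; the variable names are only illustrative.

```python
# Counts quoted in the Problem statement above.
n_molecules = 525
n_conformers = 16_242

# Target from the Aim: roughly 3 conformers per molecule.
target_per_molecule = 3

current_per_molecule = n_conformers / n_molecules
target_total = n_molecules * target_per_molecule
reduction_factor = n_conformers / target_total

print(f"current: {current_per_molecule:.1f} conformers/molecule")  # current: 30.9 conformers/molecule
print(f"target total: {target_total} conformers")                  # target total: 1575 conformers
print(f"required reduction: ~{reduction_factor:.1f}x")             # required reduction: ~10.3x
```

In other words, if the molecule count stayed at 525, the conformer count would need to fall by about a factor of ten to reach the target.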
Approach:
1. Try several size-filtering strategies that aim to preserve chemical diversity, and record the number of molecules and conformers for each.
2. If #1 is not successful, move on to rotatable bond filtering.
3. If #1 and #2 are not successful, move on to fragmentation.
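As an illustration of what one size-filtering strategy could look like, the sketch below draws a random sample of molecules from each heavy-atom-count bin, so that no single size dominates the selection. This is a hypothetical example, not the procedure used for the results in these notes: the function name `sample_by_size`, its parameters, and the toy molecule ids are all assumptions, and binning by size alone does not by itself guarantee chemical diversity.

```python
import random
from collections import defaultdict


def sample_by_size(heavy_atom_counts, per_size=2, max_heavy_atoms=30, seed=0):
    """Pick up to `per_size` molecule ids at random for each heavy-atom count.

    `heavy_atom_counts` maps a molecule id to its heavy-atom count
    (hypothetical input; in practice this would come from a cheminformatics
    toolkit). Molecules above `max_heavy_atoms` are dropped.
    """
    rng = random.Random(seed)  # seeded so the selection is reproducible

    # Group molecule ids by heavy-atom count, applying the size cutoff.
    by_size = defaultdict(list)
    for mol_id, n_heavy in heavy_atom_counts.items():
        if n_heavy <= max_heavy_atoms:
            by_size[n_heavy].append(mol_id)

    # Sample each size bin independently.
    selected = []
    for n_heavy in sorted(by_size):
        ids = sorted(by_size[n_heavy])
        selected.extend(rng.sample(ids, min(per_size, len(ids))))
    return selected


# Toy input: ids and counts are made up for illustration.
toy_counts = {
    "mol_a": 12, "mol_b": 12, "mol_c": 12,
    "mol_d": 20, "mol_e": 30, "mol_f": 35,
}
selected = sample_by_size(toy_counts)
print(selected)
```

After a filter like this, the number of molecules and conformers remaining would be recorded for comparison against the unfiltered set.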
Experimental notes:
Clustering method: DBSCAN eps = 0.3, min_samples = 4
Fingerprint method: MACCS (supported by previous experiments from Hyesu Jang)
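For readers unfamiliar with the `eps` and `min_samples` settings, the following is a dependency-free sketch of DBSCAN applied to bit-vector fingerprints. Several things here are assumptions rather than details taken from these notes: fingerprints are represented as Python sets of on-bit indices, the distance is taken to be 1 minus Tanimoto similarity, and the toy fingerprints are invented. A production workflow would more likely generate MACCS keys with a cheminformatics toolkit and use a library DBSCAN implementation.

```python
from collections import deque


def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two sets of on-bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def dbscan(fingerprints, eps=0.3, min_samples=4):
    """Return one label per fingerprint: a cluster id, or -1 for noise."""
    n = len(fingerprints)
    # Each neighborhood includes the point itself, so min_samples counts it too.
    neighbors = [
        [j for j in range(n)
         if tanimoto_distance(fingerprints[i], fingerprints[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may be claimed as a border point
            continue
        # Point i is a core point: start a new cluster and grow it.
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels


# Toy fingerprints (invented): four near-identical bit sets and one outlier.
shared = set(range(9))
toy_fps = [shared | {9}, shared | {10}, shared | {11}, shared | {12}, {50, 51, 52}]
labels = dbscan(toy_fps, eps=0.3, min_samples=4)
print(labels)  # [0, 0, 0, 0, -1]
```

With `min_samples = 4`, a group needs at least four mutually close fingerprints to form a cluster, which is why the single outlier is labeled as noise.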
Method | # of molecules | # of conformers | Notes |
---|---|---|---|
Randomized size selection | 524 | 10,454 | WIP |