Goal

Generation of a simple-molecule-only torsion scan dataset

Scheme

(1) For each molecule in the input molecule set, identify its substituents using scaffolds.

(2) List all substituents.

(3) Filter out complex substituents (by checking the number of rotatable bonds and the number of rings).

(4) In the enumeration stage, instead of enumerating molecules by adding substituents to a scaffold, combine two substituents into a molecule. The bond formed during the combination becomes the center bond, which is rotated during the torsion scan.
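Steps (1) to (3) of the scheme can be sketched with RDKit as follows. This is a minimal illustration, not the pipeline actually used: the function names, the example scaffold, and the thresholds (at most 1 rotatable bond and 1 ring) are assumptions chosen for the example.

```python
from rdkit import Chem
from rdkit.Chem import rdMolDescriptors


def identify_substituents(mol_smiles, scaffold_smiles):
    """Return substituent SMILES ('*' marks the attachment point)."""
    mol = Chem.MolFromSmiles(mol_smiles)
    core = Chem.MolFromSmiles(scaffold_smiles)
    if mol is None or core is None or not mol.HasSubstructMatch(core):
        return []
    # ReplaceCore removes the scaffold and leaves labelled dummy atoms
    # where the side chains were attached.
    side_chains = Chem.ReplaceCore(mol, core)
    if side_chains is None:
        return []
    substituents = []
    for frag in Chem.GetMolFrags(side_chains, asMols=True):
        dummies = [a for a in frag.GetAtoms() if a.GetAtomicNum() == 0]
        if len(dummies) != 1:
            continue  # skip fragments attached to the scaffold at several points
        dummies[0].SetIsotope(0)  # drop the position label so duplicates merge
        substituents.append(Chem.MolToSmiles(frag))
    return substituents


def is_simple(substituent_smiles, max_rotatable_bonds=1, max_rings=1):
    """Filter on rotatable bonds and rings; thresholds are hypothetical."""
    frag = Chem.MolFromSmiles(substituent_smiles)
    n_rot = rdMolDescriptors.CalcNumRotatableBonds(frag)
    n_rings = rdMolDescriptors.CalcNumRings(frag)
    return n_rot <= max_rotatable_bonds and n_rings <= max_rings


# Collect the unique simple substituents over an input set.
inputs = [("BrCCc1cncnc1C(=O)O", "c1cncnc1")]  # (molecule, scaffold) pairs
unique_substituents = sorted(
    {s for m, c in inputs for s in identify_substituents(m, c) if is_simple(s)}
)
```

Using a set for the collected substituents covers step (2), since the same substituent found in several input molecules is listed only once.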

Identification of substituents from an input molecule using a scaffold.

Combination of a single pair of substituents into a molecule:

[1*]c1ccccc1 + [*:1]Nc1ccccc1 → c1cc[c:1](cc1)[NH:2]c2ccccc2
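The combination step can be sketched with RDKit as below. This is one possible way to join two single-attachment fragments, not necessarily the code used for the dataset; the function name `combine_substituents` is hypothetical. The two atoms of the newly formed bond are tagged with atom map numbers 1 and 2 so that the center bond can be found later.

```python
from rdkit import Chem


def combine_substituents(smiles_a, smiles_b):
    """Join two fragments at their dummy atoms and tag the new center bond."""
    frag_a = Chem.MolFromSmiles(smiles_a)
    frag_b = Chem.MolFromSmiles(smiles_b)
    combo = Chem.RWMol(Chem.CombineMols(frag_a, frag_b))

    dummies = [a.GetIdx() for a in combo.GetAtoms() if a.GetAtomicNum() == 0]
    if len(dummies) != 2:
        raise ValueError("each fragment must carry exactly one attachment point")

    # The atom bonded to each dummy is an end of the future center bond.
    anchors = []
    for idx in dummies:
        neighbors = combo.GetAtomWithIdx(idx).GetNeighbors()
        if len(neighbors) != 1:
            raise ValueError("attachment point must have exactly one neighbor")
        anchors.append(neighbors[0].GetIdx())

    for map_num, idx in enumerate(anchors, start=1):
        combo.GetAtomWithIdx(idx).SetAtomMapNum(map_num)
    combo.AddBond(anchors[0], anchors[1], Chem.BondType.SINGLE)

    # Remove the higher index first so the remaining index stays valid.
    for idx in sorted(dummies, reverse=True):
        combo.RemoveAtom(idx)

    product = combo.GetMol()
    Chem.SanitizeMol(product)
    return product


product = combine_substituents("[1*]c1ccccc1", "[*:1]Nc1ccccc1")
print(Chem.MolToSmiles(product))
```

Keeping the map numbers on the two bonded atoms means the torsion to scan can be recovered from the SMILES alone, without storing atom indices separately.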

1. Generation of substituent list

2. Generation of molecule set

3. Curation of molecule set

3.1. Remove similar molecules using MACCS keys fingerprints and check coverage of torsion parameters

(1) List the molecules matching each torsion parameter.

(2) Using MACCS keys fingerprints, cluster each molecule list into ~20 clusters.

(3) Pick one molecule per cluster to generate a subset of about 20 molecules for each torsion parameter.

| Selection method | Number of uncovered substituents | Number of missing scaffolds |
| initial molecule set | | 23 |
| method1. select a center molecule | 38 | 26 |
| method2. select the smallest molecule | 65 | 28 |
| method3. random selection | ~26 | 27 |
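Steps (2) and (3), together with the three selection methods, can be sketched as follows. The notes do not say which clustering algorithm was used, so this example assumes Butina clustering on Tanimoto distances and scans a few distance cutoffs to get as close as possible to the target number of clusters. The function names and the cutoff grid are hypothetical.

```python
import random

from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys
from rdkit.ML.Cluster import Butina


def maccs_fingerprints(smiles_list):
    return [MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(s)) for s in smiles_list]


def cluster_to_target(fps, target=20, cutoffs=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)):
    """Butina clustering; keep the cutoff whose cluster count is closest to target."""
    # Lower-triangle distance list in the order Butina.ClusterData expects.
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)

    best = None
    for cutoff in cutoffs:
        clusters = Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)
        if best is None or abs(len(clusters) - target) < abs(len(best) - target):
            best = clusters
    return [list(c) for c in best]


def pick_representative(cluster, smiles_list, method="center", rng=None):
    """Return the index of one molecule from a cluster of molecule indices."""
    if method == "center":
        return cluster[0]  # Butina lists the cluster centroid first
    if method == "smallest":
        return min(
            cluster,
            key=lambda i: Chem.MolFromSmiles(smiles_list[i]).GetNumHeavyAtoms(),
        )
    if method == "random":
        return (rng or random).choice(cluster)
    raise ValueError(f"unknown method: {method}")


smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1", "Cc1ccccc1", "CC(=O)O", "CCC(=O)O"]
clusters = cluster_to_target(maccs_fingerprints(smiles), target=3)
subset = [smiles[pick_representative(c, smiles, "center")] for c in clusters]
```

Because Butina clustering is controlled by a distance cutoff rather than a cluster count, the number of clusters only approximates the target, which matches the "~20" wording in step (2).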

(4) Check the coverage of torsion parameters (using method 1).
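A coverage check of this kind could look like the sketch below, which counts how many molecules in a selected subset match each torsion pattern. The parameter IDs and SMARTS patterns are made up for illustration and are not taken from any force field. The sketch also reports every pattern a molecule matches, whereas a force field normally assigns a single parameter to each torsion by its own precedence rules.

```python
from rdkit import Chem

# Hypothetical torsion parameters: id -> SMARTS for the four torsion atoms.
TORSION_SMARTS = {
    "t_CC": "[*:1]-[#6X4:2]-[#6X4:3]-[*:4]",
    "t_CN": "[*:1]~[#6X3:2]-[#7X3:3]~[*:4]",
    "t_OO": "[*:1]-[#8X2:2]-[#8X2:3]-[*:4]",
}


def torsion_coverage(smiles_list, torsion_smarts):
    """Map each parameter id to the molecules containing a matching torsion."""
    patterns = {pid: Chem.MolFromSmarts(s) for pid, s in torsion_smarts.items()}
    coverage = {pid: [] for pid in patterns}
    for smi in smiles_list:
        # Explicit hydrogens are needed so that H-terminated torsions match.
        mol = Chem.AddHs(Chem.MolFromSmiles(smi))
        for pid, pattern in patterns.items():
            if mol.HasSubstructMatch(pattern):
                coverage[pid].append(smi)
    return coverage


subset = ["CC", "c1ccccc1Nc1ccccc1", "O"]
coverage = torsion_coverage(subset, TORSION_SMARTS)
uncovered = [pid for pid, mols in coverage.items() if not mols]
```

Parameters left in `uncovered` are the ones for which the subset would need additional molecules or substituents.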

TODO

  1. Remove molecules that can potentially form H-bonds (using SB’s idea)

  2. Generate SMILES using RDKit (for the dataset validation step)

  3. Regenerate the list of substituents (protonation and/or adding more substituents to increase parameter coverage)

  4. Reduce the size to ~2000 torsions (~10 targets per parameter)
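For the last item, one simple way to cap the number of targets per parameter is a seeded random subsample, sketched below. The function name, the input layout (parameter id mapped to a list of candidate targets), and the use of random sampling are assumptions; a cluster-based choice as in section 3.1 could be substituted.

```python
import random


def cap_targets_per_parameter(matches, max_per_parameter=10, seed=0):
    """Keep at most max_per_parameter targets for each torsion parameter."""
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    capped = {}
    for pid, targets in matches.items():
        if len(targets) <= max_per_parameter:
            capped[pid] = list(targets)
        else:
            capped[pid] = rng.sample(list(targets), max_per_parameter)
    total = sum(len(t) for t in capped.values())
    return capped, total


matches = {"t1": [f"mol{i}" for i in range(25)], "t2": ["molA", "molB", "molC"]}
capped, total = cap_targets_per_parameter(matches)
```

With about 10 targets per parameter, the total stays near 2000 torsions only if roughly 200 parameters are covered, so `total` is worth checking after the cap is applied.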