Report 2 (2021-04-08)

Goal

Generation of a simple-molecule-only torsion scan dataset

  • The issue of torsion parameter contamination by large internal non-bonded interactions has been raised repeatedly in past discussions.

  • Because of the nature of the Roche set, many molecules used to generate the current torsion parameter training set have a phenyl group with an ortho substituent, which causes large steric hindrance. In addition, molecular complexity was not considered when the current training dataset was designed.

  • A new torsion parameter training set that excludes (1) complex molecules and (2) molecules with high steric hindrance is therefore needed.

Scheme

(1) For each molecule in the input molecule set, identify substituents using scaffolds.

(2) List all substituents.

(3) Filter out complex substituents (by checking the number of rotatable bonds and the number of rings).

(4) In the enumeration stage, instead of enumerating molecules by adding substituents to a scaffold, combine two substituents into one molecule. The bond formed by the combination becomes the center bond, which is rotated during the torsion scan.

Identification of substituents from the input molecule using its scaffold.

Combination of a single pair of substituents into a molecule:

([1*])c1ccccc1 + [*:1]Nc1ccccc1 → c1cc[c:1](cc1)[NH:2]c2ccccc2

1. Generation of substituent list

  • Removed phenyls with ortho-substituents

  • Filter: kept cyclic substituents with (1) zero rotatable bonds and (2) exactly one ring, and acyclic substituents with fewer than 2 rotatable bonds

  • Combined lists from the Roche, Coverage, Pfizer, and Bayer sets: 361 substituents (1. acyclic aliphatic: 183, 2. aliphatic rings: 100, 3. 6-membered aromatic rings: 50, 4. 5-membered aromatic rings: 28)
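The complexity filter above can be sketched as a small predicate. This is a minimal illustration, not the code used for the report: the `Substituent` record and `is_simple` function are hypothetical names, and the ring and rotatable-bond counts are hand-assigned illustrative values that would in practice be computed with a cheminformatics toolkit such as RDKit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    smiles: str             # fragment SMILES with one attachment point
    n_rings: int            # number of rings (assumed precomputed)
    n_rotatable_bonds: int  # number of rotatable bonds (assumed precomputed)


def is_simple(sub: Substituent) -> bool:
    """Return True if the substituent passes the complexity filter."""
    if sub.n_rings > 0:
        # Cyclic: exactly one ring and no rotatable bonds.
        return sub.n_rings == 1 and sub.n_rotatable_bonds == 0
    # Acyclic: fewer than two rotatable bonds.
    return sub.n_rotatable_bonds < 2


# Illustrative candidates; the descriptor values are hand-assigned, not computed.
candidates = [
    Substituent("[*:1]c1ccccc1", n_rings=1, n_rotatable_bonds=0),
    Substituent("[*:1]CC", n_rings=0, n_rotatable_bonds=0),
    Substituent("[*:1]CCCC", n_rings=0, n_rotatable_bonds=2),
    Substituent("[*:1]c1ccc(cc1)c1ccccc1", n_rings=2, n_rotatable_bonds=1),
]

kept = [s.smiles for s in candidates if is_simple(s)]
print(kept)
```

Keeping the descriptors as plain integers separates the filter logic from the toolkit, so the thresholds can be changed without touching the descriptor calculation.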

2. Generation of molecule set

  • Combine two substituents into one molecule.

  • From the 361 substituents, 59086 molecules were generated.
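As a rough sanity check on the count above, the sketch below enumerates unordered substituent pairs with the standard library. It assumes that pairs are unordered and that a substituent may be combined with itself; neither is stated in the report, and `enumerate_pairs` is a hypothetical helper. Under those assumptions 361 substituents give at most 65341 pairs, so the reported 59086 molecules sit below this bound (the report does not say which pairs were dropped).

```python
from itertools import combinations_with_replacement
from math import comb


def enumerate_pairs(substituents):
    """Yield each unordered pair once, including a substituent paired with itself."""
    return combinations_with_replacement(substituents, 2)


n_substituents = 361

# Closed form: distinct pairs plus self-pairs.
upper_bound = comb(n_substituents, 2) + n_substituents

# Same number obtained by explicit enumeration over placeholder indices.
n_enumerated = sum(1 for _ in enumerate_pairs(range(n_substituents)))

print(upper_bound, n_enumerated)  # 65341 65341
```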

3. Curation of molecule set

  • Before clustering, two filters will be added to exclude (1) molecules that form internal H-bonds and (2) molecules that are chemically non-synthesizable.

3.1. Remove similar molecules using MACCS keys fingerprints and check coverage of torsion parameters

(1) List the molecules matching each torsion parameter

(2) Using MACCS keys fingerprints, cluster each molecule list into ~20 clusters

(3) Pick one molecule per cluster to generate a subset of around 20 molecules for each torsion parameter

                                         Number of uncovered substituents   Number of missing scaffolds
initial molecule set                                                        23
method1. select a center molecule        38                                 26
method2. select the smallest molecule    65                                 28
method3. random selection                ~26                                27
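The three selection methods compared above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the report does not name the clustering algorithm, so farthest-first seeding with nearest-seed assignment is used here as a stand-in; fingerprints are toy sets of on-bit indices rather than real MACCS keys; and all function names, molecule labels, and heavy-atom counts are hypothetical.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def cluster(fps, k):
    """Farthest-first seeding, then assign each molecule to its most similar seed."""
    names = sorted(fps)
    seeds = [names[0]]
    while len(seeds) < min(k, len(names)):
        rest = [n for n in names if n not in seeds]
        # Next seed: the molecule least similar to its closest existing seed.
        seeds.append(min(rest, key=lambda n: max(tanimoto(fps[n], fps[s]) for s in seeds)))
    clusters = {s: [] for s in seeds}
    for n in names:
        best = max(seeds, key=lambda s: tanimoto(fps[n], fps[s]))
        clusters[best].append(n)
    return [members for members in clusters.values() if members]


def pick_center(members, fps):
    """Method 1: the member with the highest total similarity to its cluster."""
    return max(members, key=lambda m: sum(tanimoto(fps[m], fps[o]) for o in members))


def pick_smallest(members, sizes):
    """Method 2: the member with the fewest heavy atoms."""
    return min(members, key=lambda m: sizes[m])


def pick_random(members, rng):
    """Method 3: a random member."""
    return rng.choice(members)


# Toy fingerprints and heavy-atom counts (illustrative values only).
fps = {
    "A": {1, 2, 3}, "B": {1, 2, 3, 4}, "C": {1, 2, 4},
    "D": {10, 11, 12}, "E": {10, 11, 13}, "F": {10, 11, 12, 13},
}
sizes = {"A": 7, "B": 9, "C": 6, "D": 8, "E": 5, "F": 10}

clusters = cluster(fps, k=2)
centers = [pick_center(c, fps) for c in clusters]
smallest = [pick_smallest(c, sizes) for c in clusters]
rng = random.Random(0)
randoms = [pick_random(c, rng) for c in clusters]

print(clusters)
print(centers, smallest, randoms)
```

Because the three pickers share the same clusters, differences in substituent and scaffold coverage can be attributed to the selection rule alone.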

(4) (method1) coverage of torsion parameters

  • 2433 torsions selected

  • 43 torsions (out of 167) are uncovered:

    • torsions about double bonds, triple bonds, or in-ring bonds, plus 15 other torsions

    • 15 torsions: t30-33 ([*:1]-[#6X4;r3:2]-[#6X3:3]-[*:4]), t50 ([*:1]-[#6X4:2]-[#7X4:3]-[*:4]), t51 ([*:1]-[#6X4:2]-[#7X3:3]-[*:4]), t58 ([*:1]-[#7X4:2]-[#6X3:3]~[*:4]), t68 ([*:1]~[#7X3,#7X2-1:2]-[#6X3:3]~[*:4]), t104 ([*:1]=[#8X2+1:2]-[#6:3]~[*:4]), t117 ([*:1]-[#8:2]-[#8H1:3]-[*:4]), t136 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X4,#7X3:3]-[#1:4]), t138 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X4,#7X3:3]-[#6X4:4]), t141 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X3:3]-[#6X3:4])
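The coverage check itself reduces to simple bookkeeping once each torsion parameter has a list of selected targets. The sketch below is illustrative only: `coverage_report` is a hypothetical helper, and the parameter ids and SMILES in the toy input are placeholders rather than data from the report.

```python
def coverage_report(all_params, matches):
    """Split parameters into covered/uncovered and count the selected torsions.

    all_params: iterable of torsion parameter ids
    matches:    dict mapping parameter id -> list of selected target molecules
    """
    covered = [p for p in all_params if matches.get(p)]
    uncovered = [p for p in all_params if not matches.get(p)]
    n_torsions = sum(len(matches[p]) for p in covered)
    return covered, uncovered, n_torsions


# Toy input: a parameter with an empty list or no entry counts as uncovered.
all_params = ["t1", "t2", "t50", "t51"]
matches = {
    "t1": ["CCO", "CCN"],
    "t2": ["CC=O"],
    "t50": [],
}

covered, uncovered, n_torsions = coverage_report(all_params, matches)
print(covered, uncovered, n_torsions)  # ['t1', 't2'] ['t50', 't51'] 3
```

With the figures reported above, 167 - 43 = 124 parameters have at least one target.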

 

TODO

  1. Remove potential H-bond-forming molecules (using SB’s idea)

  2. Generate SMILES using RDKit (for the dataset validation step)

  3. Regenerate the list of substituents (protonation and/or adding more substituents to increase parameter coverage)

  4. Reduce the size to ~2000 torsions (~10 targets per parameter)