Report 2 (2021-04-08)
Goal
Generation of a simple-molecule-only torsion scan dataset
Contamination of torsion parameters by large internal non-bonded interactions has come up repeatedly in past discussions.
Because of the nature of the Roche set, many molecules used to generate the current torsion parameter training set contain a phenyl group with an ortho substituent, which causes large steric hindrance. Molecular complexity was also not considered when the current training dataset was designed.
A new torsion parameter training set is therefore needed, one that excludes (1) complex molecules and (2) molecules with high steric hindrance.
Scheme
(1) For each molecule in an input molecule set, identify substituents using scaffolds.
(2) List all substituents.
(3) Filter out complex substituents (by checking the number of rotatable bonds and the number of rings).
(4) In the enumeration stage, instead of enumerating molecules by adding substituents to a scaffold, combine two substituents into a molecule. The bond formed by the combination becomes the center bond, which is rotated during the torsion scan.
Identification of substituents from the input molecule using its scaffold.
Combination of a single pair of substituents into a molecule:
([1*])c1ccccc1 + [*:1]Nc1ccccc1 → c1cc[c:1](cc1)[NH:2]c2ccccc2
1. Generation of substituent list
Removed phenyls with ortho substituents
Filter: keep cyclic substituents with (1) zero rotatable bonds and (2) exactly one ring, and acyclic substituents with fewer than 2 rotatable bonds
Combined lists from the Roche, Coverage, Pfizer, and Bayer sets: 361 substituents (1. acyclic aliphatic: 183, 2. aliphatic rings: 100, 3. 6-membered aromatic rings: 50, 4. 5-membered aromatic rings: 28)
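The complexity filter above can be sketched as a small predicate. This is only an illustration: the `Substituent` record, its field names, and the hand-entered ring and rotatable-bond counts are assumptions made for the example, not the descriptors actually computed for the report. The ortho-substituent removal is a separate step and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    """Hypothetical record; the counts would normally come from a cheminformatics toolkit."""
    smiles: str              # fragment SMILES with one attachment point
    n_rings: int             # number of rings in the fragment
    n_rotatable_bonds: int   # number of rotatable bonds in the fragment


def is_simple(sub: Substituent) -> bool:
    """Return True if the substituent passes the complexity filter."""
    if sub.n_rings > 0:
        # Cyclic: exactly one ring and no rotatable bond.
        return sub.n_rings == 1 and sub.n_rotatable_bonds == 0
    # Acyclic: fewer than two rotatable bonds.
    return sub.n_rotatable_bonds < 2


# Illustrative entries only; the descriptor values are typed in by hand.
candidates = [
    Substituent("frag_single_ring", n_rings=1, n_rotatable_bonds=0),
    Substituent("frag_ring_with_chain", n_rings=1, n_rotatable_bonds=2),
    Substituent("frag_fused_rings", n_rings=2, n_rotatable_bonds=0),
    Substituent("frag_short_chain", n_rings=0, n_rotatable_bonds=1),
    Substituent("frag_long_chain", n_rings=0, n_rotatable_bonds=3),
]

kept = [s.smiles for s in candidates if is_simple(s)]
print(kept)
```

Keeping the rule in one predicate makes it easy to tighten or relax the thresholds when the substituent list is regenerated.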
2. Generation of molecule set
Combine two substituents into one molecule
From the 361 substituents, generated 59086 molecules
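As a sanity check on the size of the enumeration, the number of possible pairs can be counted directly. The sketch assumes each substituent has a single attachment point and that a substituent may be paired with itself; both are assumptions made for the example. Under them, 361 substituents give at most n(n+1)/2 unordered pairs. The reported 59086 molecules is below this bound; the report does not state which pairs were dropped.

```python
from itertools import combinations_with_replacement


def enumerate_pairs(substituents):
    """Return all unordered pairs of substituents, self-pairs included."""
    return list(combinations_with_replacement(substituents, 2))


n_substituents = 361
# Integer indices stand in for the actual substituent fragments.
pairs = enumerate_pairs(range(n_substituents))
upper_bound = len(pairs)  # equals n * (n + 1) // 2

print(upper_bound)
```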
3. Curation of molecule set
Before clustering, two filters will be added to exclude (1) molecules that form internal H-bonds and (2) molecules that are not chemically synthesizable.
3.1. Remove similar molecules using MACCS keys fingerprints and check coverage of torsion parameters
(1) List the molecules matching each torsion parameter
(2) Using MACCS keys fingerprints, cluster each molecule list into ~20 clusters
(3) Pick one molecule per cluster to obtain a subset of around 20 molecules for each torsion parameter
| | number of uncovered substituents | number of missing scaffolds |
|---|---|---|
| initial molecule set | | 23 |
| method1. select a center molecule | 38 | 26 |
| method2. select the smallest molecule | 65 | 28 |
| method3. random selection | ~26 | 27 |
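The three ways of picking one molecule per cluster can be sketched as follows. This is a minimal illustration under stated assumptions: fingerprints are represented as plain sets of on-bit indices, "center molecule" is read as the cluster medoid under Tanimoto similarity, and "smallest" is read as fewest heavy atoms. The dictionary keys and the toy cluster are hypothetical and are not taken from the actual workflow.

```python
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_center(cluster):
    """method1 (assumed meaning): the member most similar to all members, i.e. the medoid."""
    return max(
        cluster,
        key=lambda m: sum(tanimoto(m["bits"], other["bits"]) for other in cluster),
    )


def pick_smallest(cluster):
    """method2 (assumed meaning): the member with the fewest heavy atoms."""
    return min(cluster, key=lambda m: m["n_heavy"])


def pick_random(cluster, rng):
    """method3: a uniformly random member."""
    return rng.choice(cluster)


# Toy cluster with made-up fingerprints and heavy-atom counts.
cluster = [
    {"name": "mol_a", "bits": frozenset({1, 2, 3}), "n_heavy": 12},
    {"name": "mol_b", "bits": frozenset({1, 2, 3, 4}), "n_heavy": 14},
    {"name": "mol_c", "bits": frozenset({2, 3, 4, 5}), "n_heavy": 9},
]

center = pick_center(cluster)
smallest = pick_smallest(cluster)
chosen = pick_random(cluster, random.Random(0))
print(center["name"], smallest["name"], chosen["name"])
```

Passing an explicit `random.Random` instance keeps the random selection reproducible between runs, which helps when comparing coverage numbers across methods.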
(4) Coverage of torsion parameters (method1)
2433 torsions selected
43 torsions (out of 167) are uncovered: torsions around double or triple bonds, in-ring rotations, and 15 other torsions
15 torsions:
t30-33 ([*:1]-[#6X4;r3:2]-[#6X3:3]-[*:4])
t50 ([*:1]-[#6X4:2]-[#7X4:3]-[*:4])
t51 ([*:1]-[#6X4:2]-[#7X3:3]-[*:4])
t58 ([*:1]-[#7X4:2]-[#6X3:3]~[*:4])
t68 ([*:1]~[#7X3,#7X2-1:2]-[#6X3:3]~[*:4])
t104 ([*:1]=[#8X2+1:2]-[#6:3]~[*:4])
t117 ([*:1]-[#8:2]-[#8H1:3]-[*:4])
t136 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X4,#7X3:3]-[#1:4])
t138 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X4,#7X3:3]-[#6X4:4])
t141 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X3:3]-[#6X3:4])
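The coverage bookkeeping can be sketched in a few lines. The `matches` mapping is a toy stand-in (hypothetical parameter-to-molecule assignments); only the totals 167, 43, and 2433 come from the numbers above. With 124 covered parameters, 2433 selected torsions works out to roughly 20 per covered parameter, in line with the ~20 clusters per parameter used in the selection step.

```python
def uncovered_parameters(matches):
    """Return the ids of parameters that no molecule in the set exercises."""
    return [pid for pid, molecules in matches.items() if not molecules]


# Toy mapping: parameter id -> molecules matched to it (illustrative only).
matches = {
    "t1": ["mol_a", "mol_b"],
    "t50": [],
    "t117": [],
}
missing = uncovered_parameters(matches)
print(missing)

# Totals reported for method1.
n_parameters = 167
n_uncovered = 43
n_selected_torsions = 2433

n_covered = n_parameters - n_uncovered
per_covered_parameter = n_selected_torsions / n_covered
print(n_covered, round(per_covered_parameter, 1))
```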
TODO
remove potential H-bond forming molecules (using SB’s idea)
generate SMILES using RDKit (for the dataset validation step)
regenerate the list of substituents (protonation and/or adding more substituents to increase parameter coverage)
reduce the size to ~2000 torsions (~10 targets per parameter)