
Goal

Generation of a simple-molecule-only torsion scan dataset

The contamination of torsion parameters by large internal non-bonded interactions has come up repeatedly in past discussions.
Because of the nature of the Roche set, many molecules used to generate the current torsion parameter training set contain a phenyl group with an ortho substituent, which causes large steric hindrance. In addition, molecular complexity was not considered when the current training dataset was designed.
A new torsion parameter training set is therefore needed, one that excludes (1) complex molecules and (2) molecules with high steric hindrance.

Scheme

(1) For each molecule in the input molecule set, identify its substituents using scaffolds.

(2) List all substituents.

(3) Filter out complex substituents (by checking the number of rotatable bonds and the number of rings).

(4) In the enumeration stage, instead of building molecules by adding substituents to a scaffold, combine two substituents into one molecule. The bond formed by the combination becomes the central bond, which is rotated during the torsion scan.

Figure (Report 2, 2021-04-08): Identification of substituents from the input molecule using scaffolds.

Figure: Combination of a single pair of substituents into a molecule:

([1*])c1ccccc1 + [*:1]Nc1ccccc1 → c1cc[c:1](cc1)[NH:2]c2ccccc2
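
The pairing step above can be sketched with RDKit's `Chem.molzip`, which joins two fragments at dummy atoms that share an atom-map number. This is only a minimal sketch under stated assumptions, not the script used for this report: `combine_substituents` is a hypothetical helper, the phenyl substituent is written here as `[*:1]c1ccccc1`, and `molzip` needs a reasonably recent RDKit release (2021.03 or later, as far as I know).

```python
from rdkit import Chem


def combine_substituents(smiles_a: str, smiles_b: str) -> str:
    """Join two substituents at their [*:1] attachment points.

    Hypothetical helper (not from the report); returns the canonical
    SMILES of the combined molecule.
    """
    frag_a = Chem.MolFromSmiles(smiles_a)
    frag_b = Chem.MolFromSmiles(smiles_b)
    if frag_a is None or frag_b is None:
        raise ValueError("could not parse one of the substituent SMILES")
    # molzip bonds the neighbours of the two dummy atoms that carry the
    # same atom-map number and removes the dummy atoms themselves.
    product = Chem.molzip(frag_a, frag_b)
    return Chem.MolToSmiles(product)


if __name__ == "__main__":
    # Phenyl + anilino substituent; the newly formed C-N bond is the
    # central bond that would be driven in the torsion scan.
    print(combine_substituents("[*:1]c1ccccc1", "[*:1]Nc1ccccc1"))
```

Tagging the two atoms of the new bond (as in the `[c:1]` and `[NH:2]` labels above) would need an extra step that is not shown here.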

1. Generation of substituent list

Removed phenyls with ortho substituents.
Filter: keep cyclic substituents with (1) zero rotatable bonds and (2) exactly one ring, or acyclic substituents with fewer than 2 rotatable bonds.
Combined lists from the Roche, Coverage, Pfizer, and Bayer sets: 361 substituents (1. acyclic aliphatic: 183; 2. aliphatic rings: 100; 3. 6-membered aromatic rings: 50; 4. 5-membered aromatic rings: 28).
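
The filter rules above can be written as a small predicate. The sketch below assumes the ring and rotatable-bond counts have already been computed elsewhere (for example with RDKit descriptors) and reads the rules as keep conditions; the `Substituent` record, the `is_simple` name, and the descriptor values in the example are illustrative assumptions, not data from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    """Hypothetical record holding precomputed descriptors."""
    smiles: str
    n_rings: int
    n_rotatable_bonds: int  # assumed to exclude the attachment bond
    has_ortho_substituted_phenyl: bool = False


def is_simple(sub: Substituent) -> bool:
    """Return True if the substituent passes the filter described above."""
    if sub.has_ortho_substituted_phenyl:
        return False
    if sub.n_rings > 0:
        # Cyclic: exactly one ring and no rotatable bonds.
        return sub.n_rings == 1 and sub.n_rotatable_bonds == 0
    # Acyclic: fewer than two rotatable bonds.
    return sub.n_rotatable_bonds < 2


# Hand-assigned, illustrative descriptor values.
candidates = [
    Substituent("[*:1]c1ccccc1", n_rings=1, n_rotatable_bonds=0),
    Substituent("[*:1]C", n_rings=0, n_rotatable_bonds=0),
    Substituent("[*:1]CCCC", n_rings=0, n_rotatable_bonds=2),
    Substituent("[*:1]c1ccc(cc1)c1ccccc1", n_rings=2, n_rotatable_bonds=1),
    Substituent("[*:1]c1ccccc1C", n_rings=1, n_rotatable_bonds=0,
                has_ortho_substituted_phenyl=True),
]
kept = [s.smiles for s in candidates if is_simple(s)]
print(kept)
```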

2. Generation of molecule set

Combine two substituents into one molecule.
From the 361 substituents, 59086 molecules were generated.
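
As a rough sanity check on the size of the enumeration, the sketch below counts every unordered pair of substituents, self-pairs included, assuming one attachment point per substituent. This gives 361 · 362 / 2 = 65341 pairs, an upper bound on the number of distinct molecules. The reported 59086 is lower, and these notes do not say which pairs were dropped. `enumerate_pairs` is a hypothetical helper and the integer labels stand in for the real substituent list.

```python
from itertools import combinations_with_replacement


def enumerate_pairs(substituents):
    """Yield every unordered pair, including a substituent paired with itself."""
    return combinations_with_replacement(substituents, 2)


n_substituents = 361
labels = range(n_substituents)  # placeholder for the real substituent list
n_pairs = sum(1 for _ in enumerate_pairs(labels))
print(n_pairs)
```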

3. Curation of molecule set

Before clustering, two filters will be added to exclude (1) molecules that form internal H-bonds and (2) molecules that are not chemically synthesizable.

3.1. Remove similar molecules using MACCS keys fingerprints and check coverage of torsion parameters

(1) List the molecules matching each torsion parameter.

(2) Using MACCS keys fingerprints, cluster each molecule list into ~20 clusters.

[Figure: image-20210408-155512.png]

(3) Pick one molecule from each cluster to generate a subset of around 20 molecules for each torsion parameter.

selection method                          number of uncovered substituents    number of missing scaffolds
initial molecule set                                                          23
method 1: select a center molecule        38                                  26
method 2: select the smallest molecule    65                                  28
method 3: random selection                ~26                                 27
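
The three selection methods compared above can be sketched as follows. This is an illustrative sketch only: fingerprints are represented as sets of on-bit indices instead of real MACCS keys, "center" is taken to mean the member with the highest summed Tanimoto similarity to the rest of its cluster, and "smallest" is taken to mean the fewest heavy atoms. All three readings, the function names, and the toy data are assumptions.

```python
import random


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_center(cluster: dict) -> str:
    """Method 1: member most similar, in total, to the other members."""
    def total_similarity(name):
        return sum(tanimoto(cluster[name], fp)
                   for other, fp in cluster.items() if other != name)
    return max(sorted(cluster), key=total_similarity)


def pick_smallest(cluster: dict, sizes: dict) -> str:
    """Method 2: member with the smallest size (e.g. heavy-atom count)."""
    return min(sorted(cluster), key=lambda name: sizes[name])


def pick_random(cluster: dict, rng: random.Random) -> str:
    """Method 3: uniformly random member."""
    return rng.choice(sorted(cluster))


# Toy cluster: molecule name -> on-bits of a hypothetical fingerprint.
cluster = {
    "A": frozenset({1, 2, 3}),
    "B": frozenset({1, 2, 3, 4}),
    "C": frozenset({1, 2, 3, 4, 5, 6}),
}
sizes = {"A": 7, "B": 9, "C": 15}  # hypothetical heavy-atom counts

print(pick_center(cluster), pick_smallest(cluster, sizes),
      pick_random(cluster, random.Random(0)))
```

Passing a seeded `random.Random` keeps method 3 reproducible, which matters when the coverage numbers of different runs are compared.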

(4) (method 1) Coverage of torsion parameters

2433 torsions selected.
43 torsions (out of 167) are uncovered:
- double bond, triple bond, and in-ring rotation torsions, plus 15 other torsions
- the 15 torsions:
  t30-33 ([*:1]-[#6X4;r3:2]-[#6X3:3]-[*:4])
  t50 ([*:1]-[#6X4:2]-[#7X4:3]-[*:4])
  t51 ([*:1]-[#6X4:2]-[#7X3:3]-[*:4])
  t58 ([*:1]-[#7X4:2]-[#6X3:3]~[*:4])
  t68 ([*:1]~[#7X3,#7X2-1:2]-[#6X3:3]~[*:4])
  t104 ([*:1]=[#8X2+1:2]-[#6:3]~[*:4])
  t117 ([*:1]-[#8:2]-[#8H1:3]-[*:4])
  t136 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X4,#7X3:3]-[#1:4])
  t138 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X4,#7X3:3]-[#6X4:4])
  t141 ([#6X3:1]-[#16X4,#16X3+0:2]-[#7X3:3]-[#6X3:4])
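
The coverage bookkeeping can be sketched as a mapping from parameter ID to the molecules selected for it. The sketch assumes that "selected torsions" counts (parameter, molecule) selections; `summarize_coverage`, the parameter IDs, and the toy data are hypothetical. Applying the same arithmetic to the figures reported above gives 167 - 43 = 124 covered parameters and 2433 / 124 ≈ 19.6 selections per covered parameter, which is consistent with the target of around 20 per parameter.

```python
from typing import Dict, List


def summarize_coverage(matches: Dict[str, List[str]]) -> dict:
    """Summarize which parameters have at least one selected molecule."""
    uncovered = sorted(pid for pid, mols in matches.items() if not mols)
    n_selected = sum(len(mols) for mols in matches.values())
    n_covered = len(matches) - len(uncovered)
    return {
        "n_parameters": len(matches),
        "n_selected": n_selected,
        "uncovered": uncovered,
        "mean_per_covered": n_selected / n_covered if n_covered else 0.0,
    }


# Toy data: hypothetical parameter IDs mapped to selected molecules.
toy_matches = {"tA": ["CCO", "CCN"], "tB": [], "tC": ["CCC"]}
summary = summarize_coverage(toy_matches)
print(summary)

# Same arithmetic on the figures reported above.
reported_mean = 2433 / (167 - 43)
print(round(reported_mean, 1))
```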

TODO

remove molecules that can form H-bonds (using SB’s idea)
generate SMILES using RDKit (for the dataset validation step)
regenerate the list of substituents (protonation and/or adding more substituents to increase parameter coverage)
reduce the size to ~2000 torsions (~10 targets per parameter)