Generation of simple molecule set using Constructure
Goal
Generation of a “simple molecule” torsiondrive dataset
TODO (2021-04-01)
Approach 2. substituent+substituent → new molecule
(1) For each molecule, identify substituents using scaffolds;
(2) Compile a list of all substituents;
(3) Filter out complex substituents (by checking the number of rotatable bonds and the number of rings);
(4) In the enumeration stage, instead of enumerating molecules by adding substituents to a scaffold, combine two substituents into a molecule. The bond formed by the combination becomes the central bond, which is rotated during the torsion scan.
Difference from Approach 1: Approach 1 (described below) generates molecules using the enumerate_combinations function from the Constructure package, which attaches a set of substituents to a scaffold. That approach has its own benefits, but it produces very complex molecules with many rotatable bonds, so we came up with the second approach, which generates much simpler molecules with one effective (interesting) rotatable bond.
Test generation of a molecule set using Roche set only
1. Using the concept of scaffold, generate a list of substituents and filter complex ones.
Set | # mols | filter1 (harsh filter): # rot bonds = 0, # rings <= 1 | filter2: (# rot bonds = 0 and # rings = 1) or (# rot bonds < 2 and # rings = 0) |
---|---|---|---|
Roche | 468 | 106 | 139 |
Tested two filters with different thresholds.
filter1:
Harsh filter; removes any substituent with one or more rotatable bonds;
1. aliphatic chains (acyclic substituents): 28 (26.4 %), 2. aliphatic rings: 32, 3. 6-membered aromatic rings: 24, 4. 5-membered aromatic rings: 22
filter2:
Includes longer chains to decrease the proportion of rings in the substituent list;
1. aliphatic chains: 61 (44 %), 2. aliphatic rings: 32, 3. 6-membered aromatic rings: 24, 4. 5-membered aromatic rings: 22
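The two filters can be written as simple predicates. The following is a minimal sketch, assuming the rotatable-bond and ring counts have already been computed for each substituent; the `Substituent` record, its field names, and the example entries are hypothetical and only illustrate the thresholds from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    """Hypothetical record; descriptor values are assumed to be precomputed."""
    label: str
    n_rotatable_bonds: int
    n_rings: int


def passes_filter1(s: Substituent) -> bool:
    # Harsh filter: no rotatable bonds and at most one ring.
    return s.n_rotatable_bonds == 0 and s.n_rings <= 1


def passes_filter2(s: Substituent) -> bool:
    # Either a single ring without rotatable bonds,
    # or an acyclic chain with fewer than two rotatable bonds.
    rigid_ring = s.n_rotatable_bonds == 0 and s.n_rings == 1
    short_chain = s.n_rotatable_bonds < 2 and s.n_rings == 0
    return rigid_ring or short_chain


# Illustrative entries only (not taken from the Roche set).
candidates = [
    Substituent("rigid_chain", 0, 0),
    Substituent("chain_1_rotor", 1, 0),
    Substituent("chain_2_rotors", 2, 0),
    Substituent("single_ring", 0, 1),
    Substituent("ring_with_rotor", 1, 1),
    Substituent("fused_two_rings", 0, 2),
]

kept_by_filter1 = [s.label for s in candidates if passes_filter1(s)]
kept_by_filter2 = [s.label for s in candidates if passes_filter2(s)]
print(kept_by_filter1)
print(kept_by_filter2)
```

With these thresholds, everything kept by filter1 is also kept by filter2; the only additions are acyclic chains with exactly one rotatable bond, which is consistent with the aliphatic chain count being the only category that changes between the two filters.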
2. From the filtered list of substituents, combine two substituents into a molecule.
Filter out trivial molecules; apply a cutoff on the number of heavy atoms if needed (to control the molecule set size)
Acyclic structures (from combining two acyclic substituents): 1533
Full molecule set: 8794
* Note that the list of substituents generated using the Roche set only includes phenyls with ortho substituents. → “will remove phenyls with ortho substituents and add phenyls with meta/para substituents”
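The pairing step in Approach 2 amounts to enumerating unordered pairs of substituents, optionally followed by a size cutoff. The sketch below shows only that bookkeeping, under stated assumptions: substituents are represented by a name and a heavy-atom count, the cutoff value is made up for illustration, and the actual bond formation (which would be done with a cheminformatics toolkit) and the "trivial molecule" filter are left out.

```python
from itertools import combinations_with_replacement

# Hypothetical filtered substituents: name -> number of heavy atoms.
substituents = {"methyl": 1, "hydroxyl": 1, "ethyl": 2, "phenyl": 6}

# Assumed cutoff, chosen for illustration; not a value from these notes.
MAX_HEAVY_ATOMS = 8


def enumerate_pairs(subs, max_heavy_atoms):
    """Return unordered pairs (a, b) whose combined size is within the cutoff.

    The bond joining a and b is the central bond that would be driven
    in the torsion scan.
    """
    pairs = []
    for a, b in combinations_with_replacement(sorted(subs), 2):
        if subs[a] + subs[b] <= max_heavy_atoms:
            pairs.append((a, b))
    return pairs


pairs = enumerate_pairs(substituents, MAX_HEAVY_ATOMS)
for a, b in pairs:
    print(f"{a} + {b}")
```

Using unordered pairs with replacement means that A+B and B+A are counted once and that a substituent may be paired with itself, so n substituents give at most n(n+1)/2 molecules before any filtering.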
Approach 1. Usage of Constructure enumeration method
Combination of a set of substructures and a scaffold → new molecule
1. Scaffolds - used 151 scaffolds obtained from Constructure
2. Substituents (functional groups): To obtain a reasonable list of substituents, I generated the list from existing molecule sets (Roche set, Pfizer discrepancy set, and eMolecules discrepancy set).
(1) Generation of list of substituents
For each molecule set (Roche, Pfizer)
Fragment each molecule using each scaffold, identify substituents, and filter out complex substituents (harsh filter: # of rotatable bonds = 0, # of rings <= 1)
Set | # mols | filter1: # rot bonds <= 1, # rings <= 1 | harsh filter: # rot bonds = 0, # rings <= 1 |
---|---|---|---|
Roche | 468 | 271 | 106 |
Pfizer | 100 | 102 | 57 |
eMolecules | 2904 | 2343 | 748 |
Combine the lists into a single list of substituents
In total, 137 substituents from Roche and Pfizer (29 aliphatic chains, 45 aliphatic rings, 34 6-membered aromatic rings, 29 5-membered aromatic rings)
3. Enumerate combinations using the Constructure tool
Determination of substituents
If an r_group is set by default to take only halogens, only halogen substituents are allowed at that site
For equivalent attachment sites, such as R1 and R2 in the ketone scaffold, define substituents = {1: substituents (w/ 137), 2: [SUBSTITUENTS['hydrogen'], SUBSTITUENTS['methyl']]} for simplicity.
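To illustrate why the second site is restricted, the sketch below enumerates every assignment of substituents to r-group sites with a plain Cartesian product. It is not the Constructure API; the helper `enumerate_assignments` and the placeholder labels are hypothetical and stand in for the real substituent entries.

```python
from itertools import product

# Placeholder labels; the notes use the full list of 137 substituents for site 1.
full_list = ["sub_a", "sub_b", "sub_c"]
small_list = ["hydrogen", "methyl"]

# Same shape as the dictionary described above: r-group index -> allowed substituents.
substituents = {1: full_list, 2: small_list}


def enumerate_assignments(substituents):
    """Yield one {site: substituent} dict per combination of site choices."""
    sites = sorted(substituents)
    for choice in product(*(substituents[site] for site in sites)):
        yield dict(zip(sites, choice))


assignments = list(enumerate_assignments(substituents))
print(len(assignments))
for assignment in assignments:
    print(assignment)
```

With n substituents on site 1 and two on site 2, a scaffold yields at most 2n assignments instead of n² when both sites take the full list; for equivalent sites the full product would also contain many duplicate molecules.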
(1) Scaffolds w/ 1 r_group (54, such as aldehyde, alcohol, ...)
number of molecules generated: 6501
(2) Scaffolds w/ 2 r_groups (37, such as ketone, oxime, …)
number of molecules generated: 120243
(3) Scaffolds w/ # of r_groups > 2