Generation of simple molecule set using Constructure

Goal

Generation of a “simple molecule” torsiondrive dataset

TODO (2021-04-01)

1. Remove ortho substituents from substituent list, add ones with meta/para substituents
2. Remove similar molecules using MACCS keys fingerprints
3. Check coverage of torsion parameters → generate a draft of molecule set (~3000 entries)
4. Add an intramolecular H-bond filter using SMIRKS pattern matching
5. Check the coverage of problematic substituents, which showed large discrepancies in Pavan’s 1.3.0 benchmarks (there is a notebook listing the parameters that are over-represented in the larger-discrepancy molecules)
6. Report the range of WBOs for each training data subset, i.e. for the list of scans that train a given torsion parameter
7. Add torsion scans that rotate double bonds
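The similarity-based pruning in the TODO list could look roughly like the sketch below. It is a minimal illustration, not the procedure actually used: fingerprints are represented as plain sets of on-bit indices (in practice they might come from RDKit's `MACCSkeys.GenMACCSKeys`), and both the greedy strategy and the 0.9 Tanimoto threshold are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def drop_similar(fingerprints: Dict[str, FrozenSet[int]], threshold: float = 0.9) -> List[str]:
    """Greedy pruning: keep a molecule only if its similarity to every
    already-kept molecule is below `threshold` (an assumed cutoff)."""
    kept: List[str] = []
    for name, fp in fingerprints.items():
        if all(tanimoto(fp, fingerprints[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy fingerprints (hypothetical bit indices, not real MACCS keys).
toy_fingerprints = {
    "mol_a": frozenset({1, 5, 9, 42}),
    "mol_b": frozenset({1, 5, 9, 42}),    # same bits as mol_a, so it is dropped
    "mol_c": frozenset({2, 7, 9, 100}),   # shares a single bit with mol_a, so it is kept
}
selected = drop_similar(toy_fingerprints, threshold=0.9)
```

The greedy pass depends on input order: the first molecule of a similar group is the one that survives.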

Approach 2. substituent+substituent → new molecule

  • (1) For each molecule, identify substituents using scaffolds; (2) list all substituents; (3) filter out complex substituents (by checking the number of rotatable bonds and the number of rings); (4) in the enumeration stage, instead of enumerating molecules by adding substituents to a scaffold, combine two substituents into a molecule. The bond formed by the combination becomes the central bond, which is rotated during the torsion scan.

  • Difference from Approach 1: Approach 1 (described below) generates molecules using the enumerate_combinations function from the Constructure package, which attaches a set of substituents to a scaffold. While that has its own benefits, it produces highly complex molecules with many rotatable bonds, so we came up with this second approach, which generates much simpler molecules with one effective (interesting) rotatable bond.

Test generation of a molecule set using Roche set only

1. Using the concept of scaffold, generate a list of substituents and filter complex ones.



          # mols   filter1 (harsh filter):          filter2:
                   # rot bonds = 0, # rings <= 1    (# rot bonds = 0 and # rings = 1) + (# rot bonds < 2 and # rings = 0)
Roche     468      106                              139

  • Tested two filters with different thresholds.

  • filter1:

    • Harsh filter: removes any substituent with one or more rotatable bonds;

    • 1. aliphatic chains (acyclic substituents): 28 (26.4 %), 2. aliphatic rings: 32, 3. 6-membered aromatic rings: 24, 4. 5-membered aromatic rings: 22

  • filter2:

    • Includes longer chains to decrease the proportion of rings in the substituent list;

    • 1. aliphatic chains: 61 (44 %), 2. aliphatic rings: 32, 3. 6-membered aromatic rings: 24, 4. 5-membered aromatic rings: 22
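The two filters can be written as simple predicates on the rotatable-bond and ring counts. The sketch below is only an illustration: the `Substituent` class and the descriptor values of the toy entries are hypothetical (in practice the counts would be computed with a cheminformatics toolkit), while the thresholds and category counts are the ones quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    """Hypothetical record holding precomputed descriptors of a substituent."""
    name: str
    n_rot_bonds: int
    n_rings: int


def passes_filter1(s: Substituent) -> bool:
    """Harsh filter: no rotatable bonds and at most one ring."""
    return s.n_rot_bonds == 0 and s.n_rings <= 1


def passes_filter2(s: Substituent) -> bool:
    """Rigid single rings, plus acyclic chains with fewer than two rotatable bonds."""
    ring_case = s.n_rot_bonds == 0 and s.n_rings == 1
    chain_case = s.n_rot_bonds < 2 and s.n_rings == 0
    return ring_case or chain_case


# Toy entries; the descriptor values are illustrative, not computed.
toy = [
    Substituent("chain_rigid", n_rot_bonds=0, n_rings=0),
    Substituent("chain_one_rot", n_rot_bonds=1, n_rings=0),
    Substituent("ring_rigid", n_rot_bonds=0, n_rings=1),
    Substituent("ring_one_rot", n_rot_bonds=1, n_rings=1),
    Substituent("two_rings", n_rot_bonds=0, n_rings=2),
]
filter1_names = [s.name for s in toy if passes_filter1(s)]
filter2_names = [s.name for s in toy if passes_filter2(s)]

# Category counts quoted above, used to check the totals and chain shares.
filter1_counts = {"aliphatic chains": 28, "aliphatic rings": 32,
                  "6-membered aromatic rings": 24, "5-membered aromatic rings": 22}
filter2_counts = {**filter1_counts, "aliphatic chains": 61}
filter1_chain_share = 100 * filter1_counts["aliphatic chains"] / sum(filter1_counts.values())
filter2_chain_share = 100 * filter2_counts["aliphatic chains"] / sum(filter2_counts.values())
```

Everything that passes filter1 also passes filter2, so the only difference between the two lists is the additional chains with one rotatable bond (61 instead of 28), which matches the totals of 106 and 139.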

2. From the filtered list of substituents, combine two substituents into a molecule.

  • Filter out trivial molecules, and apply a cutoff on the number of heavy atoms if needed (to control the molecule set size)

  • Acyclic structures (from combining two acyclic substituents): 1533

  • Full molecule set: 8794

* Note that the list of substituents generated using the Roche set only includes phenyls with ortho substituents. → These will be removed, and phenyls with meta/para substituents will be added.
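The pairing step in item 2 can be sketched as an enumeration of substituent pairs. The sketch assumes unordered pairs, allows a substituent to be paired with itself, and works on names only; the actual bond formation would be done with a cheminformatics toolkit and is not shown. The helper names are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence, Tuple


def enumerate_pairs(substituents: Sequence[str]) -> List[Tuple[str, str]]:
    """All unordered pairs of substituents, including self-pairs (an assumption)."""
    return list(combinations_with_replacement(substituents, 2))


def n_pairs(n: int) -> int:
    """Number of unordered pairs with repetition for n substituents: n(n+1)/2."""
    return n * (n + 1) // 2


# Toy list with placeholder names.
toy_pairs = enumerate_pairs(["methyl", "ethyl", "phenyl"])

# Upper bounds before any filtering, using the filter2 counts quoted above.
upper_bound_full = n_pairs(139)     # all filter2 substituents
upper_bound_acyclic = n_pairs(61)   # acyclic chains only
```

Under these assumptions the bounds are 9730 and 1891 pairs. The reported set sizes (8794 and 1533) fall below them, which would be consistent with removing trivial molecules afterwards, although the notes do not state the exact procedure.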


Approach 1. Usage of Constructure enumeration method

Combination of a set of substructures and a scaffold → new molecule 

1. Scaffolds - used 151 scaffolds obtained from Constructure

2. Substituents (functional groups)

: To obtain a reasonable list of substituents, I generated the list from the existing molecule sets (Roche set, Pfizer discrepancy set, and eMolecules discrepancy set).

(1) Generation of list of substituents 

  • For each molecule set (Roche, Pfizer) 

  • Fragment each molecule using each scaffold, identify the substituents, and filter out complex substituents (harsh filter: # of rotatable bonds = 0, # of rings <= 1)



             # mols   filter1:                          harsh filter:
                      # rot bonds <= 1, # rings <= 1    # rot bonds = 0, # rings <= 1
Roche        468      271                               106
Pfizer       100      102                               57
eMolecules   2904     2343                              748

  • Combine the lists into a single list of substituents

    • In total, 137 substituents from Roche and Pfizer (29 aliphatic chains, 45 aliphatic rings, 34 6-membered aromatic rings, 29 5-membered aromatic rings)
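Merging the per-set lists amounts to a union that drops duplicates. The sketch below assumes each substituent is stored as a canonical SMILES string, so that identical substituents compare equal; the function name and the toy strings are hypothetical.

```python
from typing import Iterable, List


def merge_substituents(*lists: Iterable[str]) -> List[str]:
    """Union of several substituent lists, keeping first-seen order and dropping duplicates."""
    return list(dict.fromkeys(s for lst in lists for s in lst))


# Toy lists with placeholder SMILES ("*" marks the attachment point).
roche_toy = ["*C", "*c1ccccc1", "*C1CC1"]
pfizer_toy = ["*C", "*C1CCCC1"]
merged_toy = merge_substituents(roche_toy, pfizer_toy)

# Category counts quoted above for the merged Roche + Pfizer list.
merged_counts = {"aliphatic chains": 29, "aliphatic rings": 45,
                 "6-membered aromatic rings": 34, "5-membered aromatic rings": 29}
merged_total = sum(merged_counts.values())
```

If the harsh-filter lists (106 from Roche and 57 from Pfizer) are the ones being merged, a total of 137 would imply 26 substituents shared between the two sets.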

3. Enumerate combinations using Constructure tool

  • Determination of substituents 

  • If an r_group is set by default to take only halogens, only halogen substituents are allowed

  • For equivalent attachment sites, like R1 and R2 in the ketone scaffold, define substituents = {1: substituents (all 137), 2: [SUBSTITUENTS['hydrogen'], SUBSTITUENTS['methyl']]} for simplicity.
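The effect of restricting the second site can be seen by counting combinations per scaffold. The following is a plain-Python counting sketch that does not call Constructure; the placeholder substituent names and the `n_combinations` helper are hypothetical, and the real number of molecules per scaffold can differ because of r_group restrictions and duplicate or invalid products.

```python
from math import prod
from typing import Dict, List


def n_combinations(substituents_by_site: Dict[int, List[str]]) -> int:
    """Number of raw combinations: product of the option counts at each site."""
    return prod(len(options) for options in substituents_by_site.values())


# Placeholder names standing in for the 137 merged substituents.
full_list = [f"substituent_{i}" for i in range(137)]

# Site 2 restricted to hydrogen/methyl, as described above.
restricted_sites = {1: full_list, 2: ["hydrogen", "methyl"]}
# Both sites open to the full list, for comparison.
unrestricted_sites = {1: full_list, 2: full_list}

n_restricted = n_combinations(restricted_sites)
n_unrestricted = n_combinations(unrestricted_sites)
```

For a single two-site scaffold this gives 274 raw combinations instead of 18769, which is the simplification the restriction is meant to achieve.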

(1) Scaffolds w/ 1 r_group (54, such as aldehyde, alcohol, ...)

  • Number of molecules generated: 6501

(2) Scaffolds w/ 2 r_groups (37, such as ketone, oxime, …)

  • Number of molecules generated: 120243

(3) scaffolds w/ # of r_group > 2