...
...
...
...
Goal
Generation of a “simple molecule” torsiondrive dataset
TODO (2021-04-01)
- 1. Remove ortho substituents from substituent list, add ones with meta/para substituents
- 2. Remove similar molecules using MACCS keys fingerprints
- 3. Check coverage of torsion parameters → generate a draft of molecule set (~3000 entries)
- 4. Add an intramolecular H-bond filter using SMIRKS pattern matching
- 5. Check the coverage of problematic substituents that showed large discrepancies in Pavan’s 1.3.0 benchmarks (there is a notebook listing the parameters over-represented in the larger-discrepancy molecules)
- 6. Range of WBOs of each training data subset, a list of scans training a certain torsion parameter
- 7. Add torsion scans that rotate about double bonds
- 8.
Approach 2. substituent+substituent → new molecule
(1) For each molecule, identify substituents using scaffolds;
(2) List all substituents;
(3) Filter out complex substituents (by checking the number of rotatable bonds and the number of rings);
(4) In the enumeration stage, instead of building molecules by adding substituents to a scaffold, combine two substituents into a molecule. The bond formed by the combination becomes the central bond, which is the one rotated in the torsion scan.
Difference from Approach 1: Approach 1 (described below) generates molecules using the enumerate_combinations function from the Constructure package, which attaches a set of substituents to a scaffold. While that has its own benefits, it makes the resulting molecules very complex, with many rotatable bonds. We therefore came up with the second approach, which generates much simpler molecules with one effective (interesting) rotatable bond.
Test generation of a molecule set using Roche set only
1. Using the concept of scaffold, generate a list of substituents and filter complex ones.
Set | # mols | filter1 (harsh filter): # rot bonds = 0, # rings <= 1 | filter2: (# rot bonds = 0 and # rings = 1) or (# rot bonds < 2 and # rings = 0) |
---|---|---|---|
Roche | 468 | 106 | 139 |
Tested two filters with different thresholds.
filter1:
Harsh filter; filters out any substituent having one or more rotatable bonds.
1. aliphatic chains (acyclic substituents): 28 (26.4 %)
2. aliphatic rings: 32
3. 6-membered aromatic rings: 24
4. 5-membered aromatic rings: 22
filter2:
Includes longer chains to decrease the proportion of rings in the substituent list.
1. aliphatic chains: 61 (44 %)
2. aliphatic rings: 32
3. 6-membered aromatic rings: 24
4. 5-membered aromatic rings: 22
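The two filters can be written as simple predicates over per-substituent counts. This is a minimal sketch, assuming each substituent has already been reduced to its number of rotatable bonds and number of rings. The `Substituent` record and the example entries (with hand-assigned counts) are hypothetical and not taken from the Roche set. filter2 is read here as the union of the two conditions in the table header.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    # Hypothetical record; in practice the counts would come from a
    # cheminformatics toolkit rather than being typed in by hand.
    name: str
    n_rot_bonds: int
    n_rings: int


def passes_filter1(sub: Substituent) -> bool:
    """Harsh filter: no rotatable bonds and at most one ring."""
    return sub.n_rot_bonds == 0 and sub.n_rings <= 1


def passes_filter2(sub: Substituent) -> bool:
    """Ring substituents need zero rotatable bonds; acyclic ones may have one."""
    rigid_single_ring = sub.n_rot_bonds == 0 and sub.n_rings == 1
    short_chain = sub.n_rot_bonds < 2 and sub.n_rings == 0
    return rigid_single_ring or short_chain


# Illustrative entries only; the counts are assumptions made for this example.
examples = [
    Substituent("methyl", 0, 0),
    Substituent("ethyl", 1, 0),
    Substituent("phenyl", 0, 1),
    Substituent("benzyl", 1, 1),
    Substituent("naphthyl", 0, 2),
]

kept1 = [s.name for s in examples if passes_filter1(s)]
kept2 = [s.name for s in examples if passes_filter2(s)]
```

In this toy list, the only difference between the two filters is the one-rotatable-bond acyclic entry, which mirrors how filter2 raises the aliphatic-chain count while leaving the ring counts unchanged.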
2. From the filtered list of substituents, combine two substituents into a molecule.
Filter out trivial molecules and, if needed, apply a cutoff on the number of heavy atoms (to control the molecule set size).
acyclic structures (by combining two acyclic substituents) (1533)
...
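The pairing step can be sketched as an enumeration of unordered substituent pairs with an optional heavy-atom cutoff. This is only an illustration: the substituent tuples, the `MAX_HEAVY_ATOMS` value, and the `enumerate_pairs` helper are hypothetical. Treating A+B and B+A as the same molecule is an assumption, and the removal of trivial molecules is not modelled.

```python
from itertools import combinations_with_replacement

# Hypothetical substituents: (name, heavy-atom count, is_acyclic).
substituents = [
    ("methyl", 1, True),
    ("hydroxyl", 1, True),
    ("amino", 1, True),
    ("phenyl", 6, False),
    ("cyclohexyl", 6, False),
]

MAX_HEAVY_ATOMS = 8  # assumed cutoff, chosen only for this example


def enumerate_pairs(subs, max_heavy_atoms):
    """Yield unordered substituent pairs whose combined size is within the cutoff."""
    for left, right in combinations_with_replacement(subs, 2):
        n_heavy = left[1] + right[1]
        if n_heavy <= max_heavy_atoms:
            # The bond joining left and right is the central bond to be scanned.
            yield left[0], right[0], n_heavy


pairs = list(enumerate_pairs(substituents, MAX_HEAVY_ATOMS))

# Subset built from two acyclic substituents, as in the "acyclic structures" group.
acyclic_names = {name for name, _, acyclic in substituents if acyclic}
acyclic_pairs = [p for p in pairs if p[0] in acyclic_names and p[1] in acyclic_names]
```

Lowering the cutoff shrinks the enumerated set quickly, because the number of unordered pairs grows roughly with the square of the substituent list length.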
* Note that the list of substituents generated using the Roche set alone includes phenyls with ortho substituents. → “will remove phenyls with ortho substituents and add phenyls with meta/para substituents”
...
Approach 1. Usage of Constructure enumeration method
Combination of a set of substructures and a scaffold → new molecule
1. Scaffolds - used 151 scaffolds obtained from Constructure
2. Substituents (functional groups): to obtain a reasonable list of substituents, I generated the list from the existing molecule sets (Roche set, Pfizer discrepancy set, and eMolecules discrepancy set).
...
harsh filter: # of rotatable bonds = 0, # of rings <= 1
Set | # mols | filter1: # rot bonds <= 1, # rings <= 1 | harsh filter: # rot bonds = 0, # rings <= 1 |
---|---|---|---|
Roche | 468 | 271 | 106 |
Pfizer | 100 | 102 | 57 |
eMolecules | 2904 | 2343 | 748 |
Combine the lists into a single list of substituents.
In total, 137 substituents from Roche and Pfizer (29 aliphatic chains, 45 aliphatic rings, 34 6-membered aromatic rings, 29 5-membered aromatic rings)
3. Enumerate combinations using the Constructure tool
Determination of substituents
If an r_group is set by default to take only halogens, only halogen substituents are allowed.
For equivalent attachment sites, such as R1 and R2 in the ketone scaffold, define substituents = {1: substituents (the list of 137), 2: [SUBSTITUENTS['hydrogen'], SUBSTITUENTS['methyl']]} for simplicity.
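To see why restricting one of two equivalent sites keeps the enumeration manageable, the number of substituent assignments can be counted directly. The sketch below does not call Constructure. The placeholder names (`sub_001` and so on) and the `count_combinations` helper are hypothetical, and it counts raw assignments only, ignoring any duplicates that arise from symmetry.

```python
from itertools import product

# Placeholder names standing in for the 137 combined substituents.
all_substituents = [f"sub_{i:03d}" for i in range(1, 138)]

# Position 1 gets the full list; the equivalent position 2 is restricted.
restricted = {1: all_substituents, 2: ["hydrogen", "methyl"]}
# For comparison: both positions get the full list.
unrestricted = {1: all_substituents, 2: all_substituents}


def count_combinations(substituents_by_position):
    """Count the (R1, R2, ...) assignments for a scaffold by explicit enumeration."""
    positions = sorted(substituents_by_position)
    choices = (substituents_by_position[p] for p in positions)
    return sum(1 for _ in product(*choices))


n_restricted = count_combinations(restricted)      # 137 * 2
n_unrestricted = count_combinations(unrestricted)  # 137 * 137
```

Under these assumptions, the restricted mapping yields 274 assignments for a two-site scaffold, compared with 18769 when both sites take the full list.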
...