1. Generation of substituent list
Removed phenyls with ortho-substituents
Filter: keep cyclic substituents with (1) zero rotatable bonds and (2) exactly one ring, or acyclic substituents with fewer than 2 rotatable bonds
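A minimal sketch of this filter in Python, assuming the ring and rotatable-bond counts have already been computed with a cheminformatics toolkit. The `Substituent` class, the function name, and the example counts are hypothetical illustrations, not the script actually used.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Substituent:
    smiles: str
    n_rings: int            # number of rings in the substituent
    n_rotatable_bonds: int  # rotatable bonds inside the substituent


def passes_filter(sub: Substituent) -> bool:
    """Keep rigid single-ring substituents or short acyclic ones."""
    if sub.n_rings > 0:
        # Cyclic: exactly one ring and no rotatable bonds.
        return sub.n_rings == 1 and sub.n_rotatable_bonds == 0
    # Acyclic: fewer than two rotatable bonds.
    return sub.n_rotatable_bonds < 2


# Illustrative entries only; the counts are assumed, not toolkit output.
candidates: List[Substituent] = [
    Substituent("c1ccccc1", n_rings=1, n_rotatable_bonds=0),
    Substituent("c1ccc2ccccc2c1", n_rings=2, n_rotatable_bonds=0),
    Substituent("CCC", n_rings=0, n_rotatable_bonds=1),
    Substituent("CCCCC", n_rings=0, n_rotatable_bonds=3),
    Substituent("CCC1CCCCC1", n_rings=1, n_rotatable_bonds=1),
]

kept = [s.smiles for s in candidates if passes_filter(s)]
print(kept)
```

Keeping the predicate separate from the descriptor calculation means the same rule can be reused whichever toolkit supplies the counts.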
Combined Roche, Coverage, Pfizer, Bayer: 361 substituents ((1) acyclic aliphatic: 183, (2) aliphatic rings: 100, (3) 6-membered aromatic rings: 50, (4) 5-membered aromatic rings: 28)
Roche set
(1) acyclic aliphatic: 61, (2) aliphatic rings: 21, (3) 6-membered aromatic rings: 5, (4) 5-membered aromatic rings: 11
Coverage set
(1) acyclic aliphatic: 76, (2) aliphatic rings: 2, (3) 6-membered aromatic rings: 6, (4) 5-membered aromatic rings: 2
Pfizer set
(1) acyclic aliphatic: 24, (2) aliphatic rings: 9, (3) 6-membered aromatic rings: 6, (4) 5-membered aromatic rings: 7
eMolecules set (okay not to include the eMolecules set?)
(1) aliphatic chain: 148, (2) aliphatic rings: 75, (3) 6-membered aromatic rings: 90, (4) 5-membered aromatic rings: 51
Bayer set
(1) acyclic aliphatic: 116, (2) aliphatic rings: 86, (3) 6-membered aromatic rings: 42, (4) 5-membered aromatic rings: 16
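As a quick consistency check on the counts above, the sketch below compares the combined set with the four source sets it was built from (Roche, Coverage, Pfizer, Bayer). The variable names are illustrative. The "dropped" figure is simply the per-category sum minus the combined count; whether those entries were duplicates across sets or were removed by the filter is not stated in these notes.

```python
CATEGORIES = (
    "acyclic aliphatic",
    "aliphatic rings",
    "6-membered aromatic rings",
    "5-membered aromatic rings",
)

# Counts copied from the notes, in CATEGORIES order.
per_set = {
    "Roche": (61, 21, 5, 11),
    "Coverage": (76, 2, 6, 2),
    "Pfizer": (24, 9, 6, 7),
    "Bayer": (116, 86, 42, 16),
}
combined = (183, 100, 50, 28)

# The four categories should add up to the reported total.
total = sum(combined)

dropped = {}
for i, name in enumerate(CATEGORIES):
    counts = [c[i] for c in per_set.values()]
    # A merged set can be no smaller than its largest source
    # and no larger than the sum of its sources.
    assert max(counts) <= combined[i] <= sum(counts)
    dropped[name] = sum(counts) - combined[i]

print(total)
print(dropped)
```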
...
Using the 361 substituents, generated 59086 molecules (aligned by molecular weight)
3. Curation of molecule set
3.1. Remove similar molecules using MACCS keys fingerprints and check coverage of torsion parameters
...
Picking a center molecule (the one with the largest sum of similarity indices) or the simplest molecule → does this consistently favor certain substituents?
When choosing center molecules or simple molecules → need to check coverage of the substituent list
Choosing center molecules
38 substituents out of 361 not included (26 scaffolds missing)
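The center-molecule choice can be sketched as follows, assuming each fingerprint is represented as the set of its on-bits (as one would get from MACCS keys) and that similarity is the Tanimoto index. The function names and the toy fingerprints are hypothetical; the actual clustering code is not shown in these notes.

```python
from typing import Dict, FrozenSet


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto index between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_center(fps: Dict[str, FrozenSet[int]]) -> str:
    """Return the member with the largest summed similarity to the others."""
    def summed_similarity(name: str) -> float:
        return sum(
            tanimoto(fps[name], fps[other]) for other in fps if other != name
        )
    return max(fps, key=summed_similarity)


# Toy cluster of three similar molecules (bit indices are made up).
cluster = {
    "mol_a": frozenset({1, 2, 3, 4}),
    "mol_b": frozenset({1, 2, 3}),
    "mol_c": frozenset({2, 3, 4, 5}),
}

center = pick_center(cluster)
print(center)
```

Because the center is by construction the most "typical" member of each cluster, repeating this over many clusters can keep selecting the same common substituents, which is why the coverage numbers are worth tracking.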
Choosing simple molecules
65 substituents out of 361 not included. (28 scaffolds missing)
Random picking
~26 substituents out of 361 not included. (27 scaffolds missing)
maybe, if the molecule covers all scaffolds (3-4 PM)
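The "substituents not included" counts above amount to a set difference between the full substituent list and the substituents present in the selected molecules. A minimal sketch, assuming each selected molecule is already annotated with the substituents it was built from (the names and the mapping are hypothetical):

```python
from typing import Dict, Iterable, List, Set


def uncovered(substituents: Iterable[str],
              selected: Dict[str, Set[str]]) -> List[str]:
    """List substituents that appear in none of the selected molecules."""
    covered: Set[str] = set()
    for parts in selected.values():
        covered |= parts
    return sorted(set(substituents) - covered)


# Toy data: four substituents, two selected molecules.
all_substituents = ["methyl", "ethyl", "phenyl", "pyridyl"]
picked = {
    "mol_1": {"methyl", "phenyl"},
    "mol_2": {"phenyl", "ethyl"},
}

missing = uncovered(all_substituents, picked)
print(missing)
```

Running the same function on the output of each picking strategy gives directly comparable coverage numbers.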
(5) check coverage of torsion parameter (missing torsions)
→ Generate torsiondrive dataset to submit
...
3.2. Molecules forming internal H bonds: better SMIRKS needed. How to account for the spatial arrangement of the 1-n chain?
(1) Test filtering w/ oversimplified SMIRKS
...
(2) More specific SMIRKS patterns
Filter
[n,N,o,O,F]([H])[!#1][!#1]~!@[!#1;r]([#7X2;r])
# molecules matched: 1430 (out of 59086, about 2.4%)
The right-hand-side molecules don't seem to form an internal H bond
Filter
[n,N,o,O,F]([H])[!#1]~!@[!#1]~!@[!#1;r]([#7X2;r])
# molecules matched: 1060 (about 1.8% of the total)
How to exclude the right-hand molecule?
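Both patterns above connect the donor atom to the acceptor nitrogen through a path of four bonds. As a rough, toolkit-free illustration of that topological part only, the sketch below measures the bond separation between two atoms with a breadth-first search. The adjacency dictionary and atom indices are hypothetical, and bond separation says nothing about the actual 3D geometry, so this is not a substitute for a conformer-based distance check.

```python
from collections import deque
from typing import Dict, List


def bond_separation(adjacency: Dict[int, List[int]],
                    start: int, goal: int) -> int:
    """Shortest number of bonds between two atoms; -1 if not connected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for neighbor in adjacency[atom]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return -1


# Hypothetical heavy-atom graph:
# donor (0) - atom (1) - atom (2) - ring atom (3) - acceptor N (4)
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

separation = bond_separation(chain, 0, 4)
print(separation)
```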
...
TODO (2021-04-01)
- 1. Remove ortho substituents from substituent list, add ones with meta/para substituents
- 2. Remove similar molecules using MACCS keys fingerprints
- 3. Check coverage of torsion parameters → generate a draft of molecule set (~3000 entries)
- 4. Addition of intra H bond filter: by using SMIRKS pattern matching
- 5. Check the coverage of problematic substituents, which showed large discrepancies in Pavan’s 1.3.0 benchmarks
- 6. Range of WBOs of each training data subset, a list of scans training a certain torsion parameter
- 7. Addition of torsion scans that rotate about double bonds
...
Installation notes
1. conda create --name constructure -c conda-forge -c openeye -c omnia pydantic openeye-toolkits cmiles ipykernel python=3.8
2. python setup.py develop (run in the constructure repository)
3. python setup.py install (run in the fragmenter repository)
4. conda install -c conda-forge pyyaml
* Additionally, openforcefield was installed