Report 1 (2021-04-05)
1. Generation of substituent list
Removed phenyls with ortho-substituents
Filter: keep cyclic substituents with (1) zero rotatable bonds and (2) exactly one ring, or acyclic substituents with fewer than 2 rotatable bonds
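A minimal sketch of the filter above, assuming the ring count and rotatable-bond count of each substituent have already been computed with a cheminformatics toolkit. The `Substituent` class, the labels, and the descriptor values are hypothetical placeholders for illustration, not data from the actual substituent sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Substituent:
    """Precomputed descriptors for one substituent (hypothetical container)."""
    label: str
    n_rings: int
    n_rotatable_bonds: int


def passes_filter(sub: Substituent) -> bool:
    """Apply the rule from the notes.

    Cyclic: exactly one ring and zero rotatable bonds.
    Acyclic: fewer than two rotatable bonds.
    """
    if sub.n_rings > 0:
        return sub.n_rings == 1 and sub.n_rotatable_bonds == 0
    return sub.n_rotatable_bonds < 2


# Placeholder descriptor values, chosen only to exercise each branch.
candidates = [
    Substituent("acyclic_rot0", n_rings=0, n_rotatable_bonds=0),
    Substituent("acyclic_rot1", n_rings=0, n_rotatable_bonds=1),
    Substituent("acyclic_rot3", n_rings=0, n_rotatable_bonds=3),
    Substituent("one_ring_rot0", n_rings=1, n_rotatable_bonds=0),
    Substituent("one_ring_rot1", n_rings=1, n_rotatable_bonds=1),
    Substituent("two_rings_rot0", n_rings=2, n_rotatable_bonds=0),
]

kept = [s.label for s in candidates if passes_filter(s)]
print(kept)
```

The result depends on how rotatable bonds are counted, which differs between toolkits, so the counts should come from the same toolkit used elsewhere in the workflow.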
Combined Roche, Coverage, Pfizer, Bayer: 361 substituents (1. Acyclic aliphatic: 183, 2. Aliphatic rings: 100, 3. 6-membered aromatic rings: 50, 4. 5-membered aromatic rings: 28)
Roche set
1. Acyclic aliphatic: 61, 2. Aliphatic rings: 21, 3. 6-membered aromatic rings: 5, 4. 5-membered aromatic rings: 11
Coverage set
1. Acyclic aliphatic: 76, 2. Aliphatic rings: 2, 3. 6-membered aromatic rings: 6, 4. 5-membered aromatic rings: 2
Pfizer set
1. Acyclic aliphatic: 24, 2. Aliphatic rings: 9, 3. 6-membered aromatic rings: 6, 4. 5-membered aromatic rings: 7
eMolecules set (open question: is it okay not to include the eMolecules set?)
1. Aliphatic chain: 148, 2. Aliphatic rings: 75, 3. 6-membered aromatic rings: 90, 4. 5-membered aromatic rings: 51
Bayer set
1. Acyclic aliphatic: 116, 2. Aliphatic rings: 86, 3. 6-membered aromatic rings: 42, 4. 5-membered aromatic rings: 16
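As a quick consistency check on the counts above, the per-set totals can be tallied as in the sketch below. The numbers are copied from the notes; the variable names are illustrative. The combined list (361) is smaller than the sum of the four included sets, presumably because some substituents appear in more than one set, although the notes do not say so explicitly.

```python
# Category order: acyclic aliphatic, aliphatic rings,
# 6-membered aromatic rings, 5-membered aromatic rings.
counts = {
    "Roche": (61, 21, 5, 11),
    "Coverage": (76, 2, 6, 2),
    "Pfizer": (24, 9, 6, 7),
    "eMolecules": (148, 75, 90, 51),
    "Bayer": (116, 86, 42, 16),
}
combined = (183, 100, 50, 28)  # Roche + Coverage + Pfizer + Bayer, as reported

totals = {name: sum(per_category) for name, per_category in counts.items()}
combined_total = sum(combined)

# Sum over the four sets that went into the combined list (eMolecules excluded).
included = ("Roche", "Coverage", "Pfizer", "Bayer")
sum_of_included = sum(totals[name] for name in included)

for name, total in totals.items():
    print(f"{name}: {total}")
print(f"Combined: {combined_total} (sum of included sets: {sum_of_included})")
```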
2. Generation of molecule set
Using 361 substituents, generated 59086 molecules
3. Curation of molecule set
3.1. Remove similar molecules using MACCS keys fingerprints and check coverage of torsion parameters
(1) List the molecules matching each torsion parameter
(2) Using MACCS keys fingerprints, cluster each molecule list into ~20 clusters
(3) Pick one molecule per cluster to generate a subset of around 20 molecules for each torsion parameter
Pick the center molecule (the one with the largest sum of similarity indices) or the simplest molecule? → Would either rule consistently choose the same substituents?
Choosing center molecules
38 substituents out of 361 not included. (26 scaffolds missing)
Choosing simple molecules
65 substituents out of 361 not included. (28 scaffolds missing)
Random picking
~26 substituents out of 361 not included. (27 scaffolds missing)
(5) Check coverage of torsion parameters (missing torsions)
→ Generate torsiondrive dataset to submit
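The "center molecule" rule and the substituent-coverage count from 3.1 can be sketched as follows. This is an illustration under stated assumptions, not the script used for the numbers above: fingerprints are represented as sets of on-bit indices (in practice these would be MACCS keys from a toolkit), similarity is assumed to be Tanimoto, cluster membership is taken as given, and all molecule and substituent names are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Sequence, Set


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_center(cluster: Sequence[str], fps: Dict[str, FrozenSet[int]]) -> str:
    """Return the member with the largest summed similarity to the other members."""
    def total_similarity(mol: str) -> float:
        return sum(tanimoto(fps[mol], fps[other]) for other in cluster if other != mol)
    return max(cluster, key=total_similarity)


def missing_substituents(
    picked: Iterable[str],
    substituents_of: Dict[str, Set[str]],
    all_substituents: Iterable[str],
) -> List[str]:
    """List substituents that appear in none of the picked molecules."""
    used: Set[str] = set()
    for mol in picked:
        used.update(substituents_of[mol])
    return sorted(set(all_substituents) - used)


# Toy data: on-bit sets standing in for MACCS keys (hypothetical values).
fps = {
    "mol_a": frozenset({1, 2, 3}),
    "mol_b": frozenset({2, 3, 4}),
    "mol_c": frozenset({2, 3}),
    "mol_d": frozenset({7, 8}),
}
clusters = [["mol_a", "mol_b", "mol_c"], ["mol_d"]]
substituents_of = {
    "mol_a": {"s1", "s2"},
    "mol_b": {"s2", "s3"},
    "mol_c": {"s2", "s4"},
    "mol_d": {"s5"},
}

picked = [pick_center(cluster, fps) for cluster in clusters]
missing = missing_substituents(picked, substituents_of, ["s1", "s2", "s3", "s4", "s5"])
print(picked, missing)
```

Running the same coverage count with a different picking rule (simplest molecule, random pick) only requires swapping `pick_center` for another selection function, which makes the three strategies directly comparable.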
3.2. Molecules forming internal H bonds: better SMIRKS needed. How to account for the spatial arrangement of a 1-n chain?
(1) Test filtering w/ oversimplified SMIRKS
(2) More specific SMIRKS patterns
Filter
[n,N,o,O,F]([H])[!#1][!#1]~!@[!#1;r]([#7X2;r])
# molecules matched : 1430 (out of 59086)
The right-hand-side molecules do not seem to form an internal H bond
Filter
[n,N,o,O,F]([H])[!#1]~!@[!#1]~!@[!#1;r]([#7X2;r])
# molecules matched : 1060 (<2% of total)
How to exclude the right-hand molecule?
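One possible way to bring spatial arrangement into the filter, offered only as a sketch and not as something these notes have settled on: after a SMIRKS match, generate a 3D conformer and test the donor-H...acceptor geometry directly. The function names and the cutoffs below (H...acceptor distance of at most 2.5 Å, donor-H...acceptor angle of at least 120°) are assumptions for illustration and would need tuning against known examples.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def angle_deg(a: Vec, b: Vec, c: Vec) -> float:
    """Angle a-b-c in degrees, measured at point b."""
    v1 = tuple(ai - bi for ai, bi in zip(a, b))
    v2 = tuple(ci - bi for ci, bi in zip(c, b))
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos_theta = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_theta))


def looks_like_hbond(
    donor: Vec,
    hydrogen: Vec,
    acceptor: Vec,
    max_h_acceptor: float = 2.5,  # assumed cutoff, in angstroms
    min_angle: float = 120.0,     # assumed cutoff, in degrees
) -> bool:
    """Geometric test on one conformer: short H...A contact and near-linear D-H...A."""
    close_enough = math.dist(hydrogen, acceptor) <= max_h_acceptor
    linear_enough = angle_deg(donor, hydrogen, acceptor) >= min_angle
    return close_enough and linear_enough


# Toy coordinates (angstroms), chosen only to exercise each outcome.
donor = (0.0, 0.0, 0.0)
hydrogen = (1.0, 0.0, 0.0)
print(looks_like_hbond(donor, hydrogen, (2.9, 0.0, 0.0)))  # short and linear
print(looks_like_hbond(donor, hydrogen, (1.0, 2.0, 0.0)))  # short but bent to 90 degrees
print(looks_like_hbond(donor, hydrogen, (5.0, 0.0, 0.0)))  # linear but too far
```

A check like this depends on which conformers are sampled, so it complements the SMIRKS pattern rather than replacing it.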
TODO (2021-04-01)
Installation (troublesome)
1. conda create --name constructure -c conda-forge -c openeye -c omnia pydantic openeye-toolkits cmiles ipykernel python=3.8
2. python setup.py develop (run in the constructure repository)
3. python setup.py install (run in the fragmenter repository)
4. conda install -c conda-forge pyyaml
* Additionally, openforcefield has been installed