Missing parameter coverage
- 1 t129
- 1.1 Industry data
- 2 t164
- 2.1 Possible Training Data
- 2.1.1 From ChEBI
- 2.1.2 From ChEMBL
- 2.1.2.1 Not #15X4
t129
As reported in Known issues and bugs (@striketeam), t129 ([*:1]-[#8X2r5:2]-;@[#7X2r5:3]~[*:4]) in Sage 2.1.0 has a suspicious force constant, V = -19.907. Based on the investigation in Redundant parameters in Sage 2.1, this force constant looks reasonable for the aromatic rings in the training data. As shown below, however, the pattern also applies to some non-aromatic rings in the industry dataset, where such a value is more suspicious. We should probably add at least some of these to the training set for Sage 2.2.
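To put the size of that force constant in context, here is a minimal sketch that evaluates a single periodic torsion term of the SMIRNOFF form k * (1 + cos(n * phi - phase)) over a full rotation. Only the value -19.907 comes from this page. The periodicity and phase below are hypothetical placeholders, not the actual t129 values, and the unit (kcal/mol) is an assumption. The peak-to-trough height of a single term is 2|k| whatever the periodicity and phase, so the placeholders do not affect the result.

```python
import math


def torsion_energy(phi_deg, k, periodicity, phase_deg):
    """Energy of one periodic torsion term: k * (1 + cos(n * phi - phase))."""
    phi = math.radians(phi_deg)
    phase = math.radians(phase_deg)
    return k * (1.0 + math.cos(periodicity * phi - phase))


K_T129 = -19.907    # value quoted above; unit assumed to be kcal/mol
PERIODICITY = 2     # hypothetical placeholder, not taken from the force field
PHASE_DEG = 180.0   # hypothetical placeholder, not taken from the force field

# Scan the dihedral angle in 1-degree steps over a full rotation.
energies = [
    torsion_energy(phi, K_T129, PERIODICITY, PHASE_DEG)
    for phi in range(-180, 181)
]

# Peak-to-trough height of the term, equal to 2 * |k|.
barrier = max(energies) - min(energies)
print(round(barrier, 3))  # 39.814
```

Under these assumptions the term alone spans roughly 40 kcal/mol, which may be plausible for keeping a planar aromatic ring flat but is a lot to impose on a non-aromatic ring.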
Industry data
The remaining question is whether this pattern applies only to aromatic rings like these. It covers 295 molecules in the industry benchmarking dataset, as shown in the series of images below.
At least a few of these, shown below, are not aromatic.
Molecule | SMILES |
---|---|
mol00.png | [H]c1c(c(c(c(c1[H])[H])N2C(=NOS2)C3=C(C(=O)Oc4c3c(c(c(c4[H])[H])[H])[H])[H])[H])[H] |
mol16.png | [H]c1c(c(c(c(c1C2=NO[C@@]3([C@]2(C(N(C3([H])[H])C(=O)OC(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])([H])[H])[H])[H])[H])[H])Cl)[H] |
mol24.png | [H]c1c(c(c(c(c1[H])c2c(c3c(c(c2[H])Cl)N(C(=N3)C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])F)[H])[H] |
mol43.png | [H]c1c(c(c(c(c1C(=O)C2=C([N-]N(C2=O)C([H])([H])[H])[H])C([H])([H])[H])C3=NOC(C3([H])[H])([H])[H])S(=O)(=O)C([H])([H])[H])[H] |
mol55.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(n2)N=C(N3[H])C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])C(F)(F)F)[H])[H] |
mol91.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(c2C([H])([H])[H])N(C(=N3)C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])C([H])([H])[H])[H])C(F)(F)F)[H])[H] |
mol106.png | [H]c1c(c(c(c(c1[H])c2c(c3c(c(c2[H])Br)N(C(=N3)C4=NOC5(C4([H])[H])C(C(OC(C5([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])C(F)(F)F)[H])[H] |
mol107.png | [H]c1c(c(c(c(c1[H])c2c(c3c(c(c2[H])Cl)N=C(N3[H])C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])(C([H])([H])[H])C([H])([H])[H])([H])[H])([H])[H])[H])C(F)(F)F)[H])[H] |
mol110.png | [H]c1c(c(c2c(c1[H])C3=NOC@([H])C([H])([H])N4C(C(N(C(C4([H])[H])([H])[H])C([H])([H])C([H])([H])[H])([H])[H])([H])[H])[H])[H] |
mol123.png | [H]c1c(c(c(c(c1[H])F)c2c(c(c3c(c2[H])N=C(N3[H])C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])OC(F)(F)F)[H] |
mol139.png | [H]c1c(c(c(c(c1[H])[H])[C@]2(C(C(=NO2)c3c(c(c4c(c3[H])N=C(N(C4=O)[H])[H])[H])[H])([H])[H])C([H])([H])[H])[H])[H] |
mol156.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(c2[H])N(C(=N3)C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])C(F)(F)F)[H])Cl)[H])[H] |
mol185.png | [H]c1c(c(c(c(c1[H])[H])[C@@]2(C(C(=NO2)c3c(c(c4c(c3[H])C(C(C(=C4[H])[H])([H])[H])([H])[H])[H])[H])([H])[H])[H])[H])[H] |
mol199.png | [H]c1c(c(c(c(c1[H])[H])C2=NOC@([H])c3c(c(c4c(c3[H])C(=O)N(C(=N4)[H])[H])[H])[H])[H])[H] |
mol223.png | [H]c1c(c(c(c(c1C2=NOC@@([H])C3=Nc4c(c(c(c(c4C(=O)O3)[H])Cl)[H])[H])[H])[H])Cl)[H] |
mol239.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(c2[H])N=C(N3[H])C4=NOC5(C4([H])[H])C(C(OC(C5([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])Cl)[H])[H] |
t164
t164 is only covered by these three conformations of one molecule: [H]C1=C(N(C(=C1[H])C(=S)N=P(N(C([H])([H])[H])C([H])([H])[H])(N(C([H])([H])[H])C([H])([H])[H])N(C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])[H]
It also doesn’t apply to any molecules in the industry dataset (both by my checks and as mentioned in the Sage 2.0 paper), so it either needs much more training data, or, more likely, it needs to be refined and/or separated into multiple parameters.
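The kind of check behind a statement like "three conformations of one molecule" can be sketched as a simple tally. The record layout (parameter id, molecule identifier, conformer index), the function names, and the sample records below are all hypothetical and only illustrate the bookkeeping. In practice the records would come from labeling the training set with the force field.

```python
from collections import defaultdict


def coverage_by_parameter(records):
    """Return {parameter_id: (n_unique_molecules, n_conformers)}.

    Each record is a (parameter_id, molecule_id, conformer_index) tuple.
    """
    molecules = defaultdict(set)
    conformers = defaultdict(int)
    for parameter_id, molecule_id, _conformer_index in records:
        molecules[parameter_id].add(molecule_id)
        conformers[parameter_id] += 1
    return {pid: (len(molecules[pid]), conformers[pid]) for pid in molecules}


def thinly_covered(coverage, min_molecules):
    """List parameter ids trained on fewer than min_molecules unique molecules."""
    return sorted(
        pid for pid, (n_mols, _n_confs) in coverage.items()
        if n_mols < min_molecules
    )


# Hypothetical records: "mol_A" stands in for the single t164 molecule.
records = [
    ("t164", "mol_A", 0),
    ("t164", "mol_A", 1),
    ("t164", "mol_A", 2),
    ("t129", "mol_B", 0),
    ("t129", "mol_C", 0),
]

coverage = coverage_by_parameter(records)
print(coverage["t164"])                            # (1, 3)
print(thinly_covered(coverage, min_molecules=2))   # ['t164']
```

Counting unique molecules separately from conformers matters here: three conformers look like three data points, but they sample only one chemical environment.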
Possible Training Data
From ChEBI
Name | SMILES |
---|---|
trimethyl(phenylimino)phosphorane | P(=NC1=CC=CC=C1)(C)(C)C |
P,P-diphenylphosphinimidic amide | NP(=N)(c1ccccc1)c1ccccc1 |
N,N',P,P-tetraphenylphosphinimidic amide | N(c1ccccc1)P(=Nc1ccccc1)(c1ccccc1)c1ccccc1 |
apholate | C1CN1P1(=NP(=NP(=N1)(N1CC1)N1CC1)(N1CC1)N1CC1)N1CC1 |
phosphenodiimidic amide | P(N)(=N)=N |
hexakis(2,2,3,3-tetrafluoropropoxy)cyclotriphosphazene | FC(F)C(F)(F)COP1(OCC(F)(F)C(F)F)=NP(OCC(F)(F)C(F)F)(OCC(F)(F)C(F)F)=NP(OCC(F)(F)C(F)F)(OCC(F)(F)C(F)F)=N1 |
From ChEMBL
These SMILES should be in the same order as the images below.
Not #15X4
Only two of these molecules, shown below, contain a nitrogen-phosphorus double bond in which P does not have four substituents. So the original bug report suggesting #15X4 as the more appropriate pattern may be correct, but chemistry that falls outside it is very rare.