Missing parameter coverage

t129

As reported in Known issues and bugs (@striketeam), t129 ([*:1]-[#8X2r5:2]-;@[#7X2r5:3]~[*:4]) in Sage 2.1.0 has a suspicious value of V = -19.907. Based on the investigation in Redundant parameters in Sage 2.1, this force constant looks reasonable for the aromatic rings in the training data. However, as shown below, the pattern also matches some non-aromatic rings in the industry dataset, where the value is more questionable. We should probably add at least some of these molecules to the training set for Sage 2.2.

Industry data

The remaining question is whether this pattern applies only to aromatic rings like these. It matches 295 molecules in the industry benchmarking dataset, as shown in the series of images below.

At least a few of these, shown below, are not aromatic.

| Molecule | SMILES |
|---|---|
| mol00.png | [H]c1c(c(c(c(c1[H])[H])N2C(=NOS2)C3=C(C(=O)Oc4c3c(c(c(c4[H])[H])[H])[H])[H])[H])[H] |
| mol16.png | [H]c1c(c(c(c(c1C2=NO[C@@]3([C@]2(C(N(C3([H])[H])C(=O)OC(C([H])([H])[H])(C([H])([H])[H])C([H])([H])[H])([H])[H])[H])[H])[H])[H])Cl)[H] |
| mol24.png | [H]c1c(c(c(c(c1[H])c2c(c3c(c(c2[H])Cl)N(C(=N3)C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])F)[H])[H] |
| mol43.png | [H]c1c(c(c(c(c1C(=O)C2=C([N-]N(C2=O)C([H])([H])[H])[H])C([H])([H])[H])C3=NOC(C3([H])[H])([H])[H])S(=O)(=O)C([H])([H])[H])[H] |
| mol55.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(n2)N=C(N3[H])C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])C(F)(F)F)[H])[H] |
| mol91.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(c2C([H])([H])[H])N(C(=N3)C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])C([H])([H])[H])[H])C(F)(F)F)[H])[H] |
| mol106.png | [H]c1c(c(c(c(c1[H])c2c(c3c(c(c2[H])Br)N(C(=N3)C4=NOC5(C4([H])[H])C(C(OC(C5([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])C(F)(F)F)[H])[H] |
| mol107.png | [H]c1c(c(c(c(c1[H])c2c(c3c(c(c2[H])Cl)N=C(N3[H])C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])(C([H])([H])[H])C([H])([H])[H])([H])[H])([H])[H])[H])C(F)(F)F)[H])[H] |
| mol110.png | [H]c1c(c(c2c(c1[H])C3=NOC@([H])C([H])([H])N4C(C(N(C(C4([H])[H])([H])[H])C([H])([H])C([H])([H])[H])([H])[H])([H])[H])[H])[H] |
| mol123.png | [H]c1c(c(c(c(c1[H])F)c2c(c(c3c(c2[H])N=C(N3[H])C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])OC(F)(F)F)[H] |
| mol139.png | [H]c1c(c(c(c(c1[H])[H])[C@]2(C(C(=NO2)c3c(c(c4c(c3[H])N=C(N(C4=O)[H])[H])[H])[H])([H])[H])C([H])([H])[H])[H])[H] |
| mol156.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(c2[H])N(C(=N3)C4=NOC5(C4([H])[H])C(C(C(C(C5([H])[H])([H])[H])([H])[H])([H])[H])([H])[H])[H])C(F)(F)F)[H])Cl)[H])[H] |
| mol185.png | [H]c1c(c(c(c(c1[H])[H])[C@@]2(C(C(=NO2)c3c(c(c4c(c3[H])C(C(C(=C4[H])[H])([H])[H])([H])[H])[H])[H])([H])[H])[H])[H])[H] |
| mol199.png | [H]c1c(c(c(c(c1[H])[H])C2=NOC@([H])c3c(c(c4c(c3[H])C(=O)N(C(=N4)[H])[H])[H])[H])[H])[H] |
| mol223.png | [H]c1c(c(c(c(c1C2=NOC@@([H])C3=Nc4c(c(c(c(c4C(=O)O3)[H])Cl)[H])[H])[H])[H])Cl)[H] |
| mol239.png | [H]c1c(c(c(c(c1[H])c2c(c(c3c(c2[H])N=C(N3[H])C4=NOC5(C4([H])[H])C(C(OC(C5([H])[H])([H])[H])([H])[H])([H])[H])[H])[H])Cl)[H])[H] |
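One way to triage the industry matches is to ask, for each SMILES, whether t129 matches and whether the matched O-N bond is aromatic under a conventional aromaticity model. The following is a minimal sketch using RDKit, not the procedure used to produce the list above. It assumes that matching should be done under the MDL aromaticity model (the model SMIRNOFF force fields specify), in which isolated five-membered rings are not flagged as aromatic, and that RDKit's default model is an acceptable stand-in for "aromatic" in the chemical sense. The function name classify_t129_ring is hypothetical.

```python
from rdkit import Chem

# t129 SMIRKS as quoted in the text above.
T129 = "[*:1]-[#8X2r5:2]-;@[#7X2r5:3]~[*:4]"


def classify_t129_ring(smiles, smirks=T129):
    """Return None if the pattern does not match; otherwise 'aromatic' or
    'non-aromatic', judged on the matched O-N bond under RDKit's default
    aromaticity model."""
    mol = Chem.MolFromSmiles(smiles)  # default RDKit aromaticity
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")

    # Assumption: match on a copy re-perceived with the MDL model, so that
    # five-membered heteroaromatic rings are kekulized and the O-N bond is
    # an explicit single bond, as the '-' in the SMIRKS requires.
    mdl = Chem.Mol(mol)
    Chem.Kekulize(mdl, clearAromaticFlags=True)
    Chem.SetAromaticity(mdl, Chem.AromaticityModel.AROMATICITY_MDL)

    matches = mdl.GetSubstructMatches(Chem.MolFromSmarts(smirks))
    if not matches:
        return None

    # Atom indices are shared between the two copies, so the matched
    # O and N can be looked up in the default-aromaticity molecule.
    for _, o_idx, n_idx, _ in matches:
        bond = mol.GetBondBetweenAtoms(o_idx, n_idx)
        if not bond.GetIsAromatic():
            return "non-aromatic"
    return "aromatic"


if __name__ == "__main__":
    for smi in ["c1ccon1", "C1CC=NO1", "c1ccccc1"]:
        print(smi, classify_t129_ring(smi))
```

Isoxazole and 2-isoxazoline are used here only as small test inputs; they are not taken from the training or industry sets. SMILES that RDKit cannot parse raise ValueError rather than being silently skipped.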

t164

t164 is covered by only three conformations of a single molecule: [H]C1=C(N(C(=C1[H])C(=S)N=P(N(C([H])([H])[H])C([H])([H])[H])(N(C([H])([H])[H])C([H])([H])[H])N(C([H])([H])[H])C([H])([H])[H])C([H])([H])[H])[H]

It also does not match any molecule in the industry dataset (according to both my checks and the Sage 2.0 paper), so it either needs much more training data or, more likely, should be refined or split into multiple parameters.

Possible Training Data

From ChEBI

| Name | SMILES |
|---|---|
| trimethyl(phenylimino)phosphorane | P(=NC1=CC=CC=C1)(C)(C)C |
| P,P-diphenylphosphinimidic amide | NP(=N)(c1ccccc1)c1ccccc1 |
| N,N',P,P-tetraphenylphosphinimidic amide | N(c1ccccc1)P(=Nc1ccccc1)(c1ccccc1)c1ccccc1 |
| apholate | C1CN1P1(=NP(=NP(=N1)(N1CC1)N1CC1)(N1CC1)N1CC1)N1CC1 |
| phosphenodiimidic amide | P(N)(=N)=N |
| hexakis(2,2,3,3-tetrafluoropropoxy)cyclotriphosphazene | FC(F)C(F)(F)COP1(OCC(F)(F)C(F)F)=NP(OCC(F)(F)C(F)F)(OCC(F)(F)C(F)F)=NP(OCC(F)(F)C(F)F)(OCC(F)(F)C(F)F)=N1 |

From ChEMBL

These SMILES should be in the same order as the images below.

Not #15X4

Only two of these molecules, shown below, contain a nitrogen-phosphorus double bond in which P does not have four substituents. The original bug report's suggestion that #15X4 would be more appropriate may therefore be correct, but chemistry outside that pattern is very rare.
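A check like the one described above can be sketched with RDKit by finding every N=P double bond in a candidate molecule and reporting the connectivity (the SMARTS X value) of each phosphorus involved. This is an illustrative sketch under stated assumptions: the query [#7]=[#15] is a generic N=P pattern chosen for this example and is not the t164 SMIRKS, and the function name phosphorus_connectivity is hypothetical.

```python
from rdkit import Chem

# Illustrative query for any N=P double bond; NOT the t164 SMIRKS.
N_P_DOUBLE = Chem.MolFromSmarts("[#7]=[#15]")


def phosphorus_connectivity(smiles):
    """Return the sorted connectivities (X values) of all P atoms that
    take part in at least one N=P double bond."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")

    # Work on the Kekule form so that any ring perceived as aromatic
    # still exposes explicit double bonds to the '=' in the query.
    Chem.Kekulize(mol, clearAromaticFlags=True)

    # Each match is (N index, P index); collect the distinct P atoms.
    p_atoms = {p_idx for _, p_idx in mol.GetSubstructMatches(N_P_DOUBLE)}
    return sorted(mol.GetAtomWithIdx(i).GetTotalDegree() for i in p_atoms)


if __name__ == "__main__":
    # Two of the ChEBI entries listed earlier on this page.
    for smi in ["P(=NC1=CC=CC=C1)(C)(C)C", "P(N)(=N)=N"]:
        x_values = phosphorus_connectivity(smi)
        print(smi, x_values, "all X4" if set(x_values) <= {4} else "has non-X4 P")
```

Molecules for which the returned list contains anything other than 4 are the ones a #15X4 restriction would exclude.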