Dataset design for a simple fit

Goal: fit the torsion of the central bridging bond in a very specific biaryl series with no ortho substituents (to avoid steric effects).

SMARTS iterations:

iter0: "[#6X3aH1:1]~[#6X3aH1:2]-[#6X3aH1:3]:-[#6X3aH1:4]"
iter1: "[#6X3aH1:1]~[#6X3a:2](~[#6X3aH1])-[#6X3a:3](~[#6X3aH1])~[#6X3aH1:4]"
iter2: "[#6X3H1:1]~[#6X3:2](~[#6X3H1])-[#6X3:3](~[#6X3H1])~[#6X3H1:4]"

iter0: does not enforce the absence of ortho substituents

iter1: enforces no ortho substituents, but requires the rings to be strictly aromatic under the MDL aromaticity model (which causes issues with pyrroles and others)

iter2: covers other aromatic rings like pyrrole, pyridazine, furan, thiophene, etc.
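The step from iter1 to iter2 amounts to dropping the aromatic primitive `a` from every atom expression, so the pattern only requires trivalent carbons rather than atoms perceived as aromatic. A minimal sketch of that relationship using plain string substitution; `drop_aromatic_flag` is a hypothetical helper name, not part of any toolkit, and checking the patterns against real molecules would still need a cheminformatics toolkit (not shown here).

```python
import re

# The iter1 and iter2 patterns, copied from the list above.
ITER1 = "[#6X3aH1:1]~[#6X3a:2](~[#6X3aH1])-[#6X3a:3](~[#6X3aH1])~[#6X3aH1:4]"
ITER2 = "[#6X3H1:1]~[#6X3:2](~[#6X3H1])-[#6X3:3](~[#6X3H1])~[#6X3H1:4]"


def drop_aromatic_flag(smarts: str) -> str:
    """Remove the `a` primitive that directly follows `[#6X3` in each atom."""
    return re.sub(r"(\[#6X3)a", r"\1", smarts)


# Relaxing iter1 this way reproduces iter2 character for character.
relaxed = drop_aromatic_flag(ITER1)
print(relaxed)
print("matches iter2:", relaxed == ITER2)
```

Because the relaxed pattern no longer depends on aromaticity perception, it also becomes less selective, so matches are worth inspecting by eye.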

Around 28 molecules from the QCA datasets match this pattern, and all of their WBO values fall around 1.01 (except one at 1.3). The QCA torsiondrive IDs are:
['18536057', '18886238', '1762109', '21272381', '21272382', '21272387', '21272388', '21272390', '21272397', '21272401', '21272416', '21272430', '21272431', '21272436', '4269703', '4269704', '4269705', '4269706', '4269711', '21540395', '21540558', '21540569', '21540577', '21540578', '21540582', '21540585', '21540588', '21540589']
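The skew described here can be put into numbers before deciding which molecules to add. The sketch below is only an illustration: the `wbos` list holds placeholder values, not the real WBOs of the records listed above, and the tolerance, bin start, and bin width are arbitrary assumptions.

```python
from collections import Counter
from statistics import median

# Placeholder WBO values for illustration only; NOT the real dataset values.
wbos = [1.01, 1.00, 1.02, 1.01, 1.01, 1.30, 1.00, 1.02]


def summarize_skew(values, tol=0.05):
    """Return the median, the fraction of values within `tol` of it, and the rest."""
    center = median(values)
    outliers = [v for v in values if abs(v - center) > tol]
    near_fraction = 1 - len(outliers) / len(values)
    return center, near_fraction, outliers


def bin_counts(values, start=0.95, width=0.1):
    """Count values per bin; bin i covers [start + i*width, start + (i+1)*width)."""
    return Counter(int((v - start) // width) for v in values)


center, near_fraction, outliers = summarize_skew(wbos)
counts = bin_counts(wbos)
print(f"median={center:.2f}, near median={near_fraction:.1%}, outliers={outliers}")
for index in sorted(counts):
    print(f"bin {index}: {'#' * counts[index]}")
```

A dataset with good WBO coverage would show counts spread over several bins rather than piled into the first one.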

Here’s an attempt to improve the range of WBOs covered, using Simon’s constructure with different scaffolds and substituents. Even with a lot of electron-donating (EDG) and electron-withdrawing (EWG) groups, most of the WBO values of the central torsion bond still fall around 1.01, skewing the dataset. So, choosing a conservative set of strongly electron-donating and electron-withdrawing substituents, here is a minimal set:

Histogram of WBO values (updated; excludes a few molecules that have issues in Omega conformer generation, will update later if there’s any change):

Jupyter notebook: