Several concerns were raised in a private email by Paul Labute (CCG) about Sage 2.1:
Table of Contents |
---|
t49 vs t84
t49 "*~[#7a]:[#6a:3]~*" in Sage 2.1.- seems to be handled by later t84; should t49 be deleted?
...
Code Block |
---|
# in terminal
git clone git@github.com:openforcefield/sage-2.1.0.git
cd sage-2.1.0/inputs-and-outputs/data-sets/
# in python
from openff.qcsubmit.results import TorsionDriveResultCollection
td_result_collection = TorsionDriveResultCollection.parse_file(
"td-set-for-fitting-2.1.0.json"
) |
Solution
t49 applies to any aromatic CN bond. t84 applies to any CN bond where the N has 2 substituents and the C has 3, or if the N has 3 substituents and one is O. t49 should apply to any molecule which has an aromatic right with a charged N+ in the ring, where the N has 3 substituents but none are O.
We can verify this by inspecting the training dataset to see which molecules need t49 (results are shown in this notebook
View file | ||
---|---|---|
|
...
Therefore, ost aromatic N atoms are treated by t84, but t49 is not redundant, as it applies to aromatic charged N.
Code Block | ||
---|---|---|
| ||
from openff.toolkit import Molecule, ForceField, Topology forcefield = ForceField("openff-2.1.0.offxml") import tqdm from qcportal.client import FractalClient from openff.qcsubmit.results import TorsionDriveResultCollection # Create a client which allows us to connect to the main QCArchive server. qcarchive_client = FractalClient() td_result_collection = TorsionDriveResultCollection.parse_file( "sage-2.1.0/inputs-and-outputs/data-sets/td-set-for-fitting-2.1.0.json" ) records_and_molecules = td_result_collection.to_records() use_t49 = [] # list of molecules that use t49 for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"): all_labels = forcefield.label_molecules(molecule.to_topology()) for mol_idx, mol_forces in enumerate(all_labels): for force_tag, force_dict in mol_forces.items(): for atom_indices, parameter in force_dict.items(): atomstr = "" for idx in atom_indices: atomstr += "%3s" % idx # for some reason this adds each molecule 4 times for each appearance of t49 if parameter.id == 't49': use_t49.append(molecule.to_smiles()) # Create unique list of molecules that use t49 set_list = set(use_t49) use_t49_unique = list(set_list) # Visualize the molecules Molecule.from_smiles(use_t49_unique[0]) Molecule.from_smiles(use_t49_unique[1]) Molecule.from_smiles(use_t49_unique[2]) |
t123
t123 "[*:1]~[#15:2]-[#6:3]-[*:4]" in Sage 2.1.0 seems entirely contained in t123a and t124 - should t123 be deleted? The V1 value is suspicious too.
...
A good way to tackle this would be the “coverage” approach above – seeing what kind of training data was used for this parameter might explain the relatively steep force constant.
...
Solution
t123 corresponds to a C that is single bonded to a P and at least one other group. t123a corresponds to a C that is single bonded to a P and one other substituent, and has 1 other substituents that can have any kind of bond. t124 corresponds to a C that is single bonded to a P and has 2 other substituents that can have any kind of bond.
It is hard to imagine a molecule that has a C with two single bonds that doesn't have either one or two other substituents (e.g. a double bond or two single bonds), though maybe it could happen in some exotic molecule.
We can verify by checking whether any molecules in the training set use this parameter (results in this notebook
View file | ||
---|---|---|
|
Code Block |
---|
from openff.toolkit import Molecule, ForceField, Topology
forcefield = ForceField("openff-2.1.0.offxml")
import tqdm
from qcportal.client import FractalClient
from openff.qcsubmit.results import TorsionDriveResultCollection
# Create a client which allows us to connect to the main QCArchive server.
qcarchive_client = FractalClient()
td_result_collection = TorsionDriveResultCollection.parse_file(
"sage-2.1.0/inputs-and-outputs/data-sets/td-set-for-fitting-2.1.0.json"
)
records_and_molecules = td_result_collection.to_records()
use_t123 = []
for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
all_labels = forcefield.label_molecules(molecule.to_topology())
for mol_idx, mol_forces in enumerate(all_labels):
for force_tag, force_dict in mol_forces.items():
for atom_indices, parameter in force_dict.items():
atomstr = ""
for idx in atom_indices:
atomstr += "%3s" % idx
# for some reason this adds each molecule 4 times for each appearance of t49
if parameter.id == 't123':
use_t123.append(molecule.to_smiles())
print(use_t123) # empty list |
It appears that no molecules in the training set use t123, which would make it redundant (at least based on this dataset).
It also appears that no molecules in the OpenFF Industry Benchmark Season 1 v1.1 or OpenFF-benchmark-ligand-fragments-v1.0 datasets use t123, as seen in this notebook
View file | ||
---|---|---|
|
t129
Covered 47 times by Sage 2.1.0/td-set-for-fitting.json and by my filtered version and by my filtered version with torsion multiplicity additions. These 47 torsions cover the 13 molecules here:
...
And shown with their SMILES here:
View file | ||
---|---|---|
|
The lingering question is whether or not the large force constant is justified by the torsion potential energy surface.
...
PESs
To answer this, below are the molecules for which the TorsionDriveRecord
dihedrals match the SMIRKS for t129 in the training data set:
...
Each pair of images shows the molecule on the left, and the corresponding torsion drive scan on the right.
View file | ||
---|---|---|
|
From these data the force constant value of -20 kcal/mol looks reasonable and could even be a bit low for these cases.