Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

Code Block
# in terminal
git clone git@github.com:openforcefield/sage-2.1.0.git
cd sage-2.1.0/inputs-and-outputs/data-sets/

# in python
from openff.qcsubmit.results import TorsionDriveResultCollection
td_result_collection = TorsionDriveResultCollection.parse_file(
    "td-set-for-fitting-2.1.0.json"
)

Solution

t49 applies to any aromatic CN bond. t84 applies to any CN bond where the N has 2 substituents and the C has 3, or if the N has 3 substituents and one is O. t49 should apply to any molecule which has an aromatic right with a charged N+ in the ring, where the N has 3 substituents but none are O.

We can verify this by inspecting the training dataset to see which molecules need t49:

Code Block
languagepy
from openff.toolkit import Molecule, ForceField, Topology
forcefield = ForceField("openff-2.1.0.offxml")
import tqdm
from qcportal.client import FractalClient
from openff.qcsubmit.results import TorsionDriveResultCollection

# Create a client which allows us to connect to the main QCArchive server.
qcarchive_client = FractalClient()

td_result_collection = TorsionDriveResultCollection.parse_file(
    "sage-2.1.0/inputs-and-outputs/data-sets/td-set-for-fitting-2.1.0.json"
)

records_and_molecules = td_result_collection.to_records()

use_t49 = [] # list of molecules that use t49

for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
    all_labels = forcefield.label_molecules(molecule.to_topology())
    for mol_idx, mol_forces in enumerate(all_labels):
        for force_tag, force_dict in mol_forces.items():
            for atom_indices, parameter in force_dict.items():
                atomstr = ""
                for idx in atom_indices:
                    atomstr += "%3s" % idx
                # for some reason this adds each molecule 4 times for each appearance of t49
                if parameter.id == 't49': 
                    use_t49.append(molecule.to_smiles())

# Create unique list of molecules that use t49
set_list = set(use_t49)
use_t49_unique = list(set_list)

# Visualize the molecules
Molecule.from_smiles(use_t49_unique[0])
Molecule.from_smiles(use_t49_unique[1])
Molecule.from_smiles(use_t49_unique[2])

t123

t123 "[*:1]~[#15:2]-[#6:3]-[*:4]" in Sage 2.1.0 seems entirely contained in t123a and t124 - should t123 be deleted?  The V1 value is suspicious too.

...

A good way to tackle this would be the “coverage” approach above – seeing what kind of training data was used for this parameter might explain the relatively steep force constant.

...

Solution

t123 corresponds to a C that is single bonded to a P and at least one other group. t123a corresponds to a C that is single bonded to a P and one other substituent, and has 1 other substituents that can have any kind of bond. t124 corresponds to a C that is single bonded to a P and has 2 other substituents that can have any kind of bond.

It is hard to imagine a molecule that has a C with two single bonds that doesn't have either one or two other substituents (e.g.

...

a double bond or two single bonds), though maybe it could happen in some exotic molecule.

We can verify by checking whether any molecules in the training set use this parameter:

Code Block
from openff.toolkit import Molecule, ForceField, Topology
forcefield = ForceField("openff-2.1.0.offxml")
import tqdm
from qcportal.client import FractalClient
from openff.qcsubmit.results import TorsionDriveResultCollection

# Create a client which allows us to connect to the main QCArchive server.
qcarchive_client = FractalClient()

td_result_collection = TorsionDriveResultCollection.parse_file(
    "sage-2.1.0/inputs-and-outputs/data-sets/td-set-for-fitting-2.1.0.json"
)
records_and_molecules = td_result_collection.to_records()

use_t123 = []

for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
    all_labels = forcefield.label_molecules(molecule.to_topology())
    for mol_idx, mol_forces in enumerate(all_labels):
        for force_tag, force_dict in mol_forces.items():
            for atom_indices, parameter in force_dict.items():
                atomstr = ""
                for idx in atom_indices:
                    atomstr += "%3s" % idx

                # for some reason this adds each molecule 4 times for each appearance of t49
                if parameter.id == 't123': 
                    use_t123.append(molecule.to_smiles())

print(use_t123) # empty list

It appears that no molecules in the training set use t123, which would make it redundant (at least based on this dataset).