Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 7 Next »

Several concerns were raised in a private email by Paul Labute (CCG) about Sage 2.1:

t49 vs t84

t49 "*~[#7a]:[#6a:3]~*" in Sage 2.1.- seems to be handled by later t84; should t49 be deleted?

t84 is the below:

"[*:1]~[#7X2,#7X3$(*~[#8X1]):2]:[#6X3:3]~[*:4]"

In SMIRNOFF, later parameters (i.e. t84) “override” earlier ones. If we can find an example molecule that t49 is applied to, we should keep it; if not, it’s taking up space.

An example of checking torsions is in the code snippet below:

from openff.toolkit import Molecule, ForceField
sage = ForceField("openff-2.1.0.offxml")
molecule = Molecule.from_smiles("c1cncnc1", allow_undefined_stereo=True)
all_labels = sage.label_molecules(molecule.to_topology())[0]
torsions = all_labels["ProperTorsions"]
for torsion in torsions.values():
    print(torsion.id)

From looking at the pattern, a charged aromatic nitrogen with a non-Oxygen substituent might be what t49 is applicable to.

A potential solution is iterate through the training datasets from QCArchive to iterate through all molecules to see if there are matches. Getting the coverage of the parameter would be generally interesting in seeing what it applies to vs. t84.

To download from QCArchive:

# Create a client which allows us to connect to the main QCArchive server.
qcarchive_client = FractalClient()

# Retrieve the data set containing the molecules of interest.
from openff.qcsubmit.results import TorsionDriveResultCollection

td_result_collection = TorsionDriveResultCollection.from_server(
    client=qcarchive_client,
    datasets=[
        "OpenFF Gen 2 Torsion Set 1 Roche 2",
        "OpenFF Gen 2 Torsion Set 2 Coverage 2",
        "OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2",
        "OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2",
        "OpenFF Gen 2 Torsion Set 5 Bayer 2",
        "OpenFF Gen 2 Torsion Set 6 supplemental 2",
    ],
    spec_name="default"
)

# tqdm is for progress bars -- very useful
records_and_molecules = td_result_collection.to_records()
for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
    all_labels = sage.label_molecules(molecule.to_topology())[0]

However, given that downloading is liable to take a very long time and that Pavan has already put up the records he uses online at , we can just download that instead. (This will be larger than the collection above, as Pavan added additional torsions for training).

# in terminal
git clone git@github.com:openforcefield/sage-2.1.0.git
cd sage-2.1.0/inputs-and-outputs/data-sets/

# in python
from openff.qcsubmit.results import TorsionDriveResultCollection
td_result_collection = TorsionDriveResultCollection.parse_file(
    "td-set-for-fitting-2.1.0.json"
)

Solution

t49 applies to any aromatic CN bond. t84 applies to any CN bond where the N has 2 substituents and the C has 3, or if the N has 3 substituents and one is O. t49 should apply to any molecule which has an aromatic right with a charged N+ in the ring, where the N has 3 substituents but none are O.

We can verify this by inspecting the training dataset to see which molecules need t49 (results are shown in this notebook ):

from openff.toolkit import Molecule, ForceField, Topology
forcefield = ForceField("openff-2.1.0.offxml")
import tqdm
from qcportal.client import FractalClient
from openff.qcsubmit.results import TorsionDriveResultCollection

# Create a client which allows us to connect to the main QCArchive server.
qcarchive_client = FractalClient()

td_result_collection = TorsionDriveResultCollection.parse_file(
    "sage-2.1.0/inputs-and-outputs/data-sets/td-set-for-fitting-2.1.0.json"
)

records_and_molecules = td_result_collection.to_records()

use_t49 = [] # list of molecules that use t49

for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
    all_labels = forcefield.label_molecules(molecule.to_topology())
    for mol_idx, mol_forces in enumerate(all_labels):
        for force_tag, force_dict in mol_forces.items():
            for atom_indices, parameter in force_dict.items():
                atomstr = ""
                for idx in atom_indices:
                    atomstr += "%3s" % idx
                # for some reason this adds each molecule 4 times for each appearance of t49
                if parameter.id == 't49': 
                    use_t49.append(molecule.to_smiles())

# Create unique list of molecules that use t49
set_list = set(use_t49)
use_t49_unique = list(set_list)

# Visualize the molecules
Molecule.from_smiles(use_t49_unique[0])
Molecule.from_smiles(use_t49_unique[1])
Molecule.from_smiles(use_t49_unique[2])

t123

t123 "[*:1]~[#15:2]-[#6:3]-[*:4]" in Sage 2.1.0 seems entirely contained in t123a and t124 - should t123 be deleted?  The V1 value is suspicious too.

Parameters:

( per is short for "periodicity", ph is short for “phase”. k values have been truncated to the 6th decimal place for conciseness. They’re in kcal/mol. The phase is in degrees. The torsional term has the functional form k*(1+cos(periodicity*theta-phase)))

ID

SMIRKS

per1

ph1

k1

per2

ph2

k2

t123

"[:1]~[#15:2]-[#6:3]-[:4]"

1

0

-10.84539

t123a

"[:1]~[#15:2]-[#6X4:3]-[:4]"

3

0

0.112496

t124

"[:1]~[#15:2]-[#6X3:3]~[:4]"

2

0

-2.188333

3

0

0.281732

A good way to tackle this would be the “coverage” approach above – seeing what kind of training data was used for this parameter might explain the relatively steep force constant.

Solution

t123 corresponds to a C that is single bonded to a P and at least one other group. t123a corresponds to a C that is single bonded to a P and one other substituent, and has 1 other substituents that can have any kind of bond. t124 corresponds to a C that is single bonded to a P and has 2 other substituents that can have any kind of bond.

It is hard to imagine a molecule that has a C with two single bonds that doesn't have either one or two other substituents (e.g. a double bond or two single bonds), though maybe it could happen in some exotic molecule.

We can verify by checking whether any molecules in the training set use this parameter (results in this notebook ):

from openff.toolkit import Molecule, ForceField, Topology
forcefield = ForceField("openff-2.1.0.offxml")
import tqdm
from qcportal.client import FractalClient
from openff.qcsubmit.results import TorsionDriveResultCollection

# Create a client which allows us to connect to the main QCArchive server.
qcarchive_client = FractalClient()

td_result_collection = TorsionDriveResultCollection.parse_file(
    "sage-2.1.0/inputs-and-outputs/data-sets/td-set-for-fitting-2.1.0.json"
)
records_and_molecules = td_result_collection.to_records()

use_t123 = []

for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
    all_labels = forcefield.label_molecules(molecule.to_topology())
    for mol_idx, mol_forces in enumerate(all_labels):
        for force_tag, force_dict in mol_forces.items():
            for atom_indices, parameter in force_dict.items():
                atomstr = ""
                for idx in atom_indices:
                    atomstr += "%3s" % idx

                # for some reason this adds each molecule 4 times for each appearance of t49
                if parameter.id == 't123': 
                    use_t123.append(molecule.to_smiles())

print(use_t123) # empty list

It appears that no molecules in the training set use t123, which would make it redundant (at least based on this dataset).

It also appears that no molecules in the OpenFF Industry Benchmark Season 1 v1.1 or OpenFF-benchmark-ligand-fragments-v1.0 datasets use t123, as seen in this notebook .

t129

Covered 47 times by Sage 2.1.0/td-set-for-fitting.json and by my filtered version and by my filtered version with torsion multiplicity additions. These 47 torsions cover the 13 molecules here:

And shown with their SMILES here: .

The only lingering question is whether or not the SMARTS pattern restricts the torsion to aromatic rings of this form, where the large force constant might be reasonable, or if it applies anywhere else.

Industry data

The pattern covers 295 molecules in the industry benchmarking data set, as shown in the series of images below:

TODO: are all of these aromatic rings?

TODO: plot PESs for these torsions to check magnitude of the force constant

  • No labels