Redundant parameters in Sage 2.1

Several concerns were raised in a private email by Paul Labute (CCG) about Sage 2.1:

t49 vs t84

t49 "*~[#7a]:[#6a:3]~*" in Sage 2.1.0 seems to be handled by the later t84; should t49 be deleted?

t84 is defined as:

"[*:1]~[#7X2,#7X3$(*~[#8X1]):2]:[#6X3:3]~[*:4]"

In SMIRNOFF, later parameters (here, t84) “override” earlier ones: when several patterns match the same torsion, the last matching one in the file is applied. If we can find an example molecule that t49 is applied to, we should keep it; if not, it’s just taking up space.
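To make the override rule concrete, here is a minimal pure-Python sketch of last-match-wins assignment. The predicates and the dictionary keys (`aromatic`, `connectivity`) are hypothetical stand-ins for real SMIRKS matching, which needs a cheminformatics toolkit; they are not meant to reproduce t49 or t84.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each toy "parameter" is an (id, predicate) pair, listed in file order.
# The predicates are hypothetical stand-ins for SMIRKS pattern matching.
Parameter = Tuple[str, Callable[[dict], bool]]


def assign_parameter(torsion: dict, parameters: Sequence[Parameter]) -> Optional[str]:
    """Return the id of the last parameter (in file order) that matches."""
    assigned = None
    for parameter_id, matches in parameters:
        if matches(torsion):
            assigned = parameter_id  # a later match replaces an earlier one
    return assigned


# Toy hierarchy: a general pattern followed by a more specific one.
toy_parameters = [
    ("general", lambda t: t["aromatic"]),
    ("specific", lambda t: t["aromatic"] and t["connectivity"] in (2, 3)),
]

# Two made-up torsion descriptions.
matched_by_both = {"aromatic": True, "connectivity": 2}
matched_by_general_only = {"aromatic": True, "connectivity": 4}

print(assign_parameter(matched_by_both, toy_parameters))
print(assign_parameter(matched_by_general_only, toy_parameters))
```

The earlier, general parameter is only ever assigned where the later, specific one fails to match. If no real molecule produces that situation, the general parameter is effectively redundant, which is the question being asked about t49.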

The snippet below shows how to check which torsion parameters are assigned to a molecule:

from openff.toolkit import Molecule, ForceField

# Load Sage 2.1.0 and build a test molecule (pyrimidine).
sage = ForceField("openff-2.1.0.offxml")
molecule = Molecule.from_smiles("c1cncnc1", allow_undefined_stereo=True)

# label_molecules returns one dictionary of labels per molecule in the topology.
all_labels = sage.label_molecules(molecule.to_topology())[0]
torsions = all_labels["ProperTorsions"]
for torsion in torsions.values():
    print(torsion.id)

Judging from the pattern, t49 might apply to a charged aromatic nitrogen with a non-oxygen substituent.

A potential solution is to iterate through all molecules in the training datasets from QCArchive and check for matches. Getting the coverage of the parameter would also be generally interesting, to see what it applies to vs. t84.
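As a rough sketch of what "coverage" could mean here, the helper below counts how many molecules each torsion parameter ID is applied to. The function name `tally_parameter_ids` is made up for illustration, and the demo data is fake: it only mimics the shape of a `ProperTorsions` label dictionary (atom-index tuples mapped to objects with an `id` attribute).

```python
from collections import Counter
from types import SimpleNamespace
from typing import Iterable, Mapping


def tally_parameter_ids(labelled_torsions: Iterable[Mapping]) -> Counter:
    """Count how many molecules each torsion parameter id is applied to."""
    molecules_per_id = Counter()
    for torsions in labelled_torsions:
        # Use a set so that each id is counted once per molecule.
        ids_in_molecule = {parameter.id for parameter in torsions.values()}
        molecules_per_id.update(ids_in_molecule)
    return molecules_per_id


# Fake label dictionaries standing in for two labelled molecules.
fake_molecule_a = {
    (0, 1, 2, 3): SimpleNamespace(id="t84"),
    (1, 2, 3, 4): SimpleNamespace(id="t84"),
}
fake_molecule_b = {
    (0, 1, 2, 3): SimpleNamespace(id="t84"),
    (1, 2, 3, 4): SimpleNamespace(id="t123"),
}

coverage = tally_parameter_ids([fake_molecule_a, fake_molecule_b])
for parameter_id in ("t49", "t84", "t123"):
    print(parameter_id, coverage[parameter_id])
```

In practice the input would be the `"ProperTorsions"` entry of `sage.label_molecules(molecule.to_topology())[0]` for each training molecule. A `Counter` returns 0 for missing keys, so a parameter that is never assigned (possibly t49) shows up as a zero count rather than an error.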

To download from QCArchive:

# Create a client which allows us to connect to the main QCArchive server.
from qcportal import FractalClient

qcarchive_client = FractalClient()

# Retrieve the data set containing the molecules of interest.
from openff.qcsubmit.results import TorsionDriveResultCollection

td_result_collection = TorsionDriveResultCollection.from_server(
    client=qcarchive_client,
    datasets=[
        "OpenFF Gen 2 Torsion Set 1 Roche 2",
        "OpenFF Gen 2 Torsion Set 2 Coverage 2",
        "OpenFF Gen 2 Torsion Set 3 Pfizer Discrepancy 2",
        "OpenFF Gen 2 Torsion Set 4 eMolecules Discrepancy 2",
        "OpenFF Gen 2 Torsion Set 5 Bayer 2",
        "OpenFF Gen 2 Torsion Set 6 supplemental 2",
    ],
    spec_name="default"
)

# Convert the collection to (record, molecule) pairs and label each molecule.
# tqdm is for progress bars -- very useful
import tqdm

records_and_molecules = td_result_collection.to_records()
for _, molecule in tqdm.tqdm(records_and_molecules, desc="checking"):
    all_labels = sage.label_molecules(molecule.to_topology())[0]

However, downloading is liable to take a very long time, and Pavan has already put the records he uses online in the sage-2.1.0 repository on GitHub, so we can just download those instead. (This set will be larger than the collection above, as Pavan added additional torsions for training.)

# in terminal
git clone git@github.com:openforcefield/sage-2.1.0.git
cd sage-2.1.0/inputs-and-outputs/data-sets/

# in python
from openff.qcsubmit.results import TorsionDriveResultCollection
td_result_collection = TorsionDriveResultCollection.parse_file(
    "td-set-for-fitting-2.1.0.json"
)

t123

t123 "[*:1]~[#15:2]-[#6:3]-[*:4]" in Sage 2.1.0 seems entirely contained in t123a and t124 - should t123 be deleted? The V1 value is suspicious too.

Parameters:

(per is short for “periodicity”; ph is short for “phase”. k values are in kcal/mol and have been truncated to the 6th decimal place for conciseness. Phases are in degrees. The torsional term has the functional form k*(1+cos(periodicity*theta-phase)).)

ID	SMIRKS	per1	k1	per2	ph2	k2
t123	`"[*:1]~[#15:2]-[#6:3]-[*:4]"`	1	-10.84539
t123a	`"[*:1]~[#15:2]-[#6X4:3]-[*:4]"`	3	0.112496
t124	`"[*:1]~[#15:2]-[#6X3:3]~[*:4]"`	2	-2.188333	3	0	0.281732
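To get a feel for why the t123 value looks suspicious, the functional form above can be evaluated directly. This is a minimal sketch: the table does not list a phase for t123, so a phase of 0 degrees is assumed here, and the helper name `torsion_energy` is made up for illustration.

```python
import math


def torsion_energy(theta_deg, k, periodicity, phase_deg=0.0):
    """Energy of one torsion term, k * (1 + cos(n * theta - phase)), in kcal/mol."""
    theta = math.radians(theta_deg)
    phase = math.radians(phase_deg)
    return k * (1.0 + math.cos(periodicity * theta - phase))


# t123 as tabulated: periodicity 1, k = -10.84539 kcal/mol.
# The phase is not given in the table; 0 degrees is an assumption.
K_T123 = -10.84539

# Scan the dihedral angle over a full rotation in 15 degree steps.
profile = [torsion_energy(angle, K_T123, 1) for angle in range(-180, 181, 15)]
swing = max(profile) - min(profile)
print(f"t123 energy swing over a full rotation: {swing:.2f} kcal/mol")
```

For this functional form the difference between the highest and lowest energy of a single term is always 2*|k|, whatever the phase, so t123 on its own contributes a swing of roughly 21.7 kcal/mol per matched torsion. For comparison, the same calculation for the t123a term (k = 0.112496) gives about 0.22 kcal/mol.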

A good way to tackle this would be the “coverage” approach above – seeing what kind of training data was used for this parameter might explain the relatively steep force constant.