2025-05-22 JAC/LW Check-In

Participants

  • @Lily Wang

  • @Jennifer Clark

Discussion topics

Notes

Multiple Specs

Background: Each record is a unique combination of a molecule entry and a specification. Everything about the specification has to be identical for a record to be recognized as the same record. Recall the issue we had with QCSubmit where new keywords were added and all the records were duplicated.
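A minimal sketch of why a new keyword creates a "new" record (illustrative field names and SMILES, not the real QCArchive schema): a record is keyed by (entry, specification), and comparing canonical JSON makes the mismatch explicit.

```python
import json

# Hypothetical minimal specifications; any field difference breaks identity
spec_old = {"program": "psi4", "method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}}
spec_new = {**spec_old, "keywords": {"maxiter": 200}}  # one newly added keyword

# A record is identified by (entry, specification); canonicalize the spec as
# sorted-key JSON so the comparison is order-independent
key_old = ("CC(=O)O", json.dumps(spec_old, sort_keys=True))
key_new = ("CC(=O)O", json.dumps(spec_new, sort_keys=True))
print(key_old == key_new)  # False: same molecule, but treated as a distinct record
```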

The Issue: There are slight variations in the specifications used across the datasets that make up a Sage force field. The number of unique specifications per dataset group is:

  • Opt 2.0.0: 1

  • TD 2.0.0: 1

  • Opt 2.1.0: 5

  • TD 2.1.0: 4

Code to reproduce for Opt 2.1.0:

import json
from collections import defaultdict

import requests
from deepdiff import DeepDiff
from qcportal import PortalClient
from qcportal.serialization import encode_to_json

ADDRESS = "https://api.qcarchive.molssi.org:443/"
client = PortalClient(ADDRESS, cache_dir=".")

file = requests.get(
    "https://raw.githubusercontent.com/openforcefield/sage-2.1.0/8d196aa104f83b8c901d922073ee68b875ae8c32/inputs-and-outputs/data-sets/opt-set-for-fitting-2.1.0.json"
)
data = json.loads(file.content)
entry_dicts = data["entries"][ADDRESS]
dataset_type = entry_dicts[0]["type"]

dataset_names = [
    "OpenFF Gen 2 Opt Set 1 Roche",
    "OpenFF Gen 2 Opt Set 2 Coverage",
    "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy",
    "OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy",
    "OpenFF Gen 2 Opt Set 5 Bayer",
    "OpenFF Gen2 Optimization Dataset Protomers v1.0",
    "OpenFF Iodine Chemistry Optimization Dataset v1.0",
    "OpenFF Optimization Set 1",
    "SMIRNOFF Coverage Set 1",
    "OpenFF Aniline Para Opt v1.0",
]
dataset_ids = [client.get_dataset(dataset_type, ds_name).id for ds_name in dataset_names]
print(f"We expect our records to come from the following datasets: {dataset_ids}")

record_ids = set(int(x["record_id"]) for x in entry_dicts)

# Flag any records that do not belong to one of the expected datasets
tmp_ds_ids1 = []
wrong_ds1 = defaultdict(list)
for rec_id in record_ids:
    response = client.query_dataset_records(record_id=[rec_id])
    ds_name = None
    for resp in response:
        if resp["dataset_name"] in dataset_names:
            tmp_ds_ids1.append(resp["dataset_name"])
            ds_name = resp["dataset_name"]
    if ds_name is None:
        wrong_ds1[rec_id] = [resp["dataset_name"] for resp in response]
tmp_ds_ids1 = set(tmp_ds_ids1)
print(f"There are {len(wrong_ds1)} records that aren't in the datasets that we expect.")

# __________ Check that all records share a single specification __________
records = client.get_records(list(record_ids))  # fetch the record objects themselves
specification_list = []
for rec in records:
    tmp = encode_to_json(rec.specification)
    # Keep only specifications that differ from every one collected so far
    if all(len(DeepDiff(tmp, x)) > 0 for x in specification_list) or not specification_list:
        specification_list.append(tmp)
print(f"These records have {len(specification_list)} unique specifications")
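The pairwise DeepDiff check above grows quadratically with the number of distinct specifications. An alternative sketch, assuming the JSON-encoded specifications are plain JSON-serializable dicts (field names below are made up for illustration), is to count unique specs via a canonical-JSON hash:

```python
import hashlib
import json

def spec_fingerprint(spec: dict) -> str:
    """SHA-256 of the sorted-key JSON serialization of a specification dict."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

# Hypothetical specifications: two identical, one with an extra keyword
specs = [
    {"method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}},
    {"method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}},
    {"method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {"scf_type": "df"}},
]
unique = {spec_fingerprint(s) for s in specs}
print(len(unique))  # 2
```

This makes the uniqueness check a single set insertion per record, at the cost of treating any byte-level difference as distinct (no DeepDiff-style tolerances).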


Action items

Decisions