2025-05-22 JAC/LW Check-In
Participants
@Lily Wang
@Jennifer Clark
Discussion topics
Notes |
---|
Multiple Specs Background: Each record is a unique combination of a molecule entry and a specification. Everything about the specification has to be identical for a record to be recognized as the same record. Recall the issue we had with QCSubmit, where new keywords were added and all the records were duplicated. The Issue: There are slight variations in the specifications used in the datasets that make up a Sage force field. The number of unique specifications is:
|
Code to reproduce for TD 2.0.0:
import json
from collections import defaultdict
import requests
from deepdiff import DeepDiff
from qcportal import PortalClient
from qcportal.serialization import encode_to_json
ADDRESS = "https://api.qcarchive.molssi.org:443/"
client = PortalClient(ADDRESS, cache_dir=".")
# Download the list of records used for fitting from the Sage 2.1.0 repository
file = requests.get(
    "https://raw.githubusercontent.com/openforcefield/sage-2.1.0/8d196aa104f83b8c901d922073ee68b875ae8c32/inputs-and-outputs/data-sets/opt-set-for-fitting-2.1.0.json"
)
data = json.loads(file.content)
entry_dicts = data["entries"][ADDRESS]
dataset_type = entry_dicts[0]["type"]
dataset_names = [
    "OpenFF Gen 2 Opt Set 1 Roche",
    "OpenFF Gen 2 Opt Set 2 Coverage",
    "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy",
    "OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy",
    "OpenFF Gen 2 Opt Set 5 Bayer",
    "OpenFF Gen2 Optimization Dataset Protomers v1.0",
    "OpenFF Iodine Chemistry Optimization Dataset v1.0",
    "OpenFF Optimization Set 1",
    "SMIRNOFF Coverage Set 1",
    "OpenFF Aniline Para Opt v1.0",
]
dataset_ids = [client.get_dataset(dataset_type, ds_name).id for ds_name in dataset_names]
print(f"We expect our records to come from the following datasets: {dataset_ids}")
record_ids = {int(x["record_id"]) for x in entry_dicts}
# __________ Check that every record belongs to one of the expected datasets __________
tmp_ds_ids1 = []  # names of the expected datasets that were actually found
wrong_ds1 = defaultdict(list)  # record ID -> names of the unexpected datasets it belongs to
for rec_id in record_ids:
    response = client.query_dataset_records(record_id=[rec_id])
    ds_name = None
    for resp in response:
        if resp["dataset_name"] in dataset_names:
            tmp_ds_ids1.append(resp["dataset_name"])
            ds_name = resp["dataset_name"]
    if ds_name is None:
        wrong_ds1[rec_id] = [resp["dataset_name"] for resp in response]
tmp_ds_ids1 = set(tmp_ds_ids1)
print(f"There are {len(wrong_ds1)} records that aren't in the datasets that we expect.")
# __________ Check that all records share a single specification __________
# Assumption: the snippet as recorded never defines `records`; here they are fetched by ID.
records = client.get_records(list(record_ids))
specification_list = []
for rec in records:
    tmp = encode_to_json(rec.specification)
    # Keep a specification only if it differs from every one collected so far
    if all(len(DeepDiff(tmp, x)) > 0 for x in specification_list) or not specification_list:
        specification_list.append(tmp)
print(f"These records have {len(specification_list)} unique specifications") |
|
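The duplication problem described in the background can be illustrated offline with a minimal sketch. It assumes specifications can be treated as plain nested dicts; the field names and values below are hypothetical and are not the QCArchive schema. It also swaps DeepDiff for a canonical JSON string comparison so that it runs with the standard library only.

```python
import json


def canonical(spec: dict) -> str:
    """Serialize a specification with sorted keys so equal content gives equal strings."""
    return json.dumps(spec, sort_keys=True)


def unique_specifications(specs):
    """Return the distinct specifications, preserving first-seen order."""
    seen = {}
    for spec in specs:
        seen.setdefault(canonical(spec), spec)
    return list(seen.values())


# Hypothetical specifications for illustration only
base = {
    "program": "psi4",
    "method": "b3lyp-d3bj",
    "basis": "dzvp",
    "keywords": {"maxiter": 200},
}
# Same content, different key order: should count as the same specification
reordered = {
    "keywords": {"maxiter": 200},
    "basis": "dzvp",
    "method": "b3lyp-d3bj",
    "program": "psi4",
}
# One added keyword: counts as a new specification, so its records would be duplicated
extra_keyword = {**base, "keywords": {"maxiter": 200, "scf_properties": ["dipole"]}}

specs = [base, reordered, extra_keyword]
unique = unique_specifications(specs)
print(f"{len(specs)} specifications, {len(unique)} unique")
```

Key order alone does not create a new specification here, but a single added keyword does, which is the same mechanism as the QCSubmit issue recalled above.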
Action items
Decisions