2025-05-22 JAC/LW Check-In

Participants

  • @Lily Wang

  • @Jennifer Clark

Discussion topics

Notes

Multiple Specs

Background: Each record is a unique combination of a molecule entry and a specification. Everything about the specification has to be identical for a record to be recognized as the same record. Recall the issue we had with QCSubmit where new keywords were added and all the records were duplicated.
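A minimal sketch of why a new keyword creates a "new" record (illustrative field names and SMILES, not the real QCArchive schema): a record is keyed by (entry, specification), and comparing canonical JSON makes the mismatch explicit.

```python
import json

# Hypothetical minimal specifications; any field difference breaks identity
spec_old = {"program": "psi4", "method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}}
spec_new = {**spec_old, "keywords": {"maxiter": 200}}  # one newly added keyword

# A record is identified by (entry, specification); canonicalize the spec as
# sorted-key JSON so the comparison is order-independent
key_old = ("CC(=O)O", json.dumps(spec_old, sort_keys=True))
key_new = ("CC(=O)O", json.dumps(spec_new, sort_keys=True))
print(key_old == key_new)  # False: same molecule, but treated as a distinct record
```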

The Issue: There are slight variations in the specifications used across the datasets that make up a Sage force field. The number of unique specifications per dataset group is:

  • Opt 2.0.0: 1

  • TD 2.0.0: 1

  • Opt 2.1.0: 5

  • TD 2.1.0: 4

Code to reproduce for Opt 2.1.0:

import json
from collections import defaultdict

import requests
from deepdiff import DeepDiff
from qcportal import PortalClient
from qcportal.serialization import encode_to_json

ADDRESS = "https://api.qcarchive.molssi.org:443/"
client = PortalClient(ADDRESS, cache_dir=".")

file = requests.get(
    "https://raw.githubusercontent.com/openforcefield/sage-2.1.0/8d196aa104f83b8c901d922073ee68b875ae8c32/inputs-and-outputs/data-sets/opt-set-for-fitting-2.1.0.json"
)
data = json.loads(file.content)
entry_dicts = data["entries"][ADDRESS]
dataset_type = entry_dicts[0]["type"]

dataset_names = [
    "OpenFF Gen 2 Opt Set 1 Roche",
    "OpenFF Gen 2 Opt Set 2 Coverage",
    "OpenFF Gen 2 Opt Set 3 Pfizer Discrepancy",
    "OpenFF Gen 2 Opt Set 4 eMolecules Discrepancy",
    "OpenFF Gen 2 Opt Set 5 Bayer",
    "OpenFF Gen2 Optimization Dataset Protomers v1.0",
    "OpenFF Iodine Chemistry Optimization Dataset v1.0",
    "OpenFF Optimization Set 1",
    "SMIRNOFF Coverage Set 1",
    "OpenFF Aniline Para Opt v1.0",
]
dataset_ids = [client.get_dataset(dataset_type, ds_name).id for ds_name in dataset_names]
print(f"We expect our records to come from the following datasets: {dataset_ids}")

record_ids = set(int(x["record_id"]) for x in entry_dicts)

# Flag any records that do not belong to one of the expected datasets
tmp_ds_ids1 = []
wrong_ds1 = defaultdict(list)
for rec_id in record_ids:
    response = client.query_dataset_records(record_id=[rec_id])
    ds_name = None
    for resp in response:
        if resp["dataset_name"] in dataset_names:
            tmp_ds_ids1.append(resp["dataset_name"])
            ds_name = resp["dataset_name"]
    if ds_name is None:
        wrong_ds1[rec_id] = [resp["dataset_name"] for resp in response]
tmp_ds_ids1 = set(tmp_ds_ids1)
print(f"There are {len(wrong_ds1)} records that aren't in the datasets that we expect.")

# __________ Check that all records share a single specification __________
records = client.get_records(list(record_ids))  # fetch the record objects themselves
specification_list = []
for rec in records:
    tmp = encode_to_json(rec.specification)
    # Keep only specifications that differ from every one collected so far
    if all(len(DeepDiff(tmp, x)) > 0 for x in specification_list) or not specification_list:
        specification_list.append(tmp)
print(f"These records have {len(specification_list)} unique specifications")
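The pairwise DeepDiff check above grows quadratically with the number of distinct specifications. An alternative sketch, assuming the JSON-encoded specifications are plain JSON-serializable dicts (field names below are made up for illustration), is to count unique specs via a canonical-JSON hash:

```python
import hashlib
import json

def spec_fingerprint(spec: dict) -> str:
    """SHA-256 of the sorted-key JSON serialization of a specification dict."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

# Hypothetical specifications: two identical, one with an extra keyword
specs = [
    {"method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}},
    {"method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}},
    {"method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {"scf_type": "df"}},
]
unique = {spec_fingerprint(s) for s in specs}
print(len(unique))  # 2
```

This makes the uniqueness check a single set insertion per record, at the cost of treating any byte-level difference as distinct (no DeepDiff-style tolerances).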


Action items

Decisions