To QCA-Dataset-Submission or Not to QCA-Dataset-Submission w/ TM

Motivation
We want to change our QCA-Dataset-Submission (QDS) repo to optionally use QCFractal datasets directly, bypassing QCSubmit, because QCSubmit relies on toolkits (RDKit and OpenEye) that don't support TM complexes reliably. Too much data that we want to keep ends up being discarded along this route.

What Do We Want
A way to export a QCFractal dataset that is "in progress" into a single file. This file could then be easily imported to restore the dataset for continued editing and eventual submission.

Why We Have QCA-Dataset-Submission and QCSubmit
Right now QCSubmit organizes the data into a QCSubmit dataset that is not tied to a QCPortal client. This allows us to make, manipulate, validate, and review a QCSubmit dataset in a QDS PR before committing to submitting it online to QCPortal. A dataset.json file is exported into a PR and is readily imported into QCSubmit to regenerate the QCSubmit dataset object, which is then submitted to QCPortal by the QDS CI after the PR is merged.

Why Not Just Avoid QDS?
We can and will if you don't think this is possible. We use QDS to:

  • Record our data and how it was produced
            We could just make a placeholder PR that doesn't function with the CI

  • The CI also does error cycling to restore records from "error" to "waiting", in case rerunning resolves transient issues.
            I can do this manually

  • The CI has MW based retagging features
            I can do this manually

  • The CI offers the ability for our collaborators to follow the progress of the dataset
            The collaborators can wait for me to update them.

Planning Next Steps

Tracking Transition Metal datasets with QCA-Dataset-Submission would be ideal. There are two major stages to consider: validation and lifecycle.

The discussion below offers two major solutions, in addition to some quick fixes. The major longer term solutions are:

  1. Update QDS to have conversion functions that make QCSubmit-style dataset*.json* files from QCFractal datasets without using QCSubmit, and the reverse.

    • Ben suggests this route but in more of a QCA-focused format. To resolve concerns about mixing these up with the dataset*.json files imported by QCSubmit in our CI, we can name these files scaffold*.json.

    • Ben suggests that we can add a module qcportal.external.scaffold with to_json and from_json to achieve this task, since these would be of general use but neither of us needs to commit to hosting them in our main repos.
      See PR

    • Note that these JSON files will re-form QCF objects, but they aren't quite the same as QCSchema JSON; the two deviate for QCF convenience purposes.

    • Note that I can also add from_hdf5 and to_hdf5 to qcportal.external.scaffold for handling Chris’s data/needs.

  2. Update QDS to allow sqlite file detection and handling in lieu of dataset*.json* files.
    At first glance, it seems like sqlite files would make the most sense. Here is a dump of what I've considered:

    1. 🚫 Views: Views require that the dataset be on a client to create and download. The resulting view then has several limitations on what you can do with it. Rightly so! Being able to manipulate it offline would be opening the door to a mess, but that means views aren't as useful here.

    2. 🚫 Cache: Still requires that a dataset be submitted so that an id is assigned, which is not desired. We would have to write some hacky code to abuse this feature.

    3. ⚠️ Have a Python file that creates a QCFractal dataset locally and then submits it to a Snowflake server; this can be easily overwritten. We can then make a dataset view there and "download" it into our PR.
      We can then update our QDS CI to detect the sqlite view, import it, and then copy the metadata, specs, and entry objects individually to form a new QCPortal dataset and submit it.
      Ben says this is an option.
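Whichever storage route wins, the to_json/from_json round trip proposed in option 1 reduces to something like the sketch below. A plain dict stands in for the eventual scaffold content, and the function names are hypothetical, not the actual qcportal.external.scaffold API:

```python
import bz2
import json


def export_dataset(dataset: dict, path: str) -> None:
    """Write an in-progress dataset (metadata, specs, entries) to a single file."""
    with bz2.open(path, "wt", encoding="utf-8") as f:
        json.dump(dataset, f)


def import_dataset(path: str) -> dict:
    """Restore the dataset for continued editing and eventual submission."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

The real implementation would walk the QCFractal dataset's metadata, specifications, and entries into that dict rather than accepting one directly, but the file-level contract (what goes out must come back unchanged) is the same.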

Validation

QCSubmit is deeply ingrained into this stage, and we previously decided that QCSubmit cannot be used with transition metal complexes. If we want to use this stage of QDS we have two options:

  1. ⚠️✅ Detect a validation-off label on the PR and, if present, skip the CI. This can be accomplished with:

    jobs:
      dataset_validation:
        if: ${{ ! contains(github.event.issue.labels.*.name, 'validation-off') }}
        ...
  2. 🚫 Alter the validation CI to either accept:

    1. Both OpenFF datasets (as dataset.json.bz2) and QCFractal datasets (as sqlite, or via a function to convert them to dataset.json.bz2, which I almost have complete)

    2. Only QCFractal datasets (as sqlite, or via a function to convert them to dataset.json.bz2, which I almost have complete)

      • More difficult, and since this stage uses toolkit-based validation it's unclear how the validation would be useful for these molecules. However, right now our validation for molecules checks that they make toolkit-appropriate OpenFF molecules; it doesn't actually test the QCElemental molecules that will be submitted, so there is a gap that assumes QCSubmit will do the right thing.

    3. A promising case would be to deserialize the dataset.json.bz2 (if we have a function to convert a QCF dataset to QCSubmit’s dataset.json.bz2 format).

      • We can perform some validation on the dataset.json instead of an OpenFF dataset, but would then also need a function to convert the dataset.json.bz2 to a QCF dataset without QCSubmit.
        If that was the case, our validation-off label would be used inside validation.py to denote that only the json validation should be done instead of importing into QCSubmit and performing toolkit validation.

It seems to me that we should do validation on QCFractal datasets anyway since we are trying to ensure that that object type meets our documentation standards.
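Such a JSON-level validation could start as a simple check that the documentation-standard fields exist before any toolkit is invoked. The field names below are illustrative, not the actual QDS standard:

```python
# Hypothetical metadata fields a QDS dataset would be required to document.
REQUIRED_METADATA = ("dataset_name", "dataset_type", "description", "submitter")


def validate_dataset_dict(data: dict) -> list:
    """Return a list of problems; an empty list means the dict passes this basic check."""
    problems = [f"missing field: {k}" for k in REQUIRED_METADATA if k not in data]
    if not data.get("entries"):
        problems.append("dataset has no entries")
    return problems
```

This would not replace toolkit validation for non-TM datasets, but it closes part of the gap noted above by checking the serialized form that will actually be submitted.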

Life Cycle

Backlog:

  • ✅ Does not use datasets, make issue

Queued Submit

Currently:

  • Creates an OpenFF QCSubmit dataset with dataset_class.parse_obj(dataset_data), then submits it to QCPortal via QCSubmit.

Alternatives:

  1. Make a function to import a dataset.json.bz2 file as a QCFractal dataset.
    So we would need:

    1. ⚠️✅ A function to "export" a QCFractal dataset to the dataset.json.bz2 format as defined in QCSubmit. This is almost done; there are some discrepancies with conformer naming.

    2. ⚠️ A function to "import" such a JSON file to form a QCFractal dataset.

      Need to test these file converters/handlers, which will require some debugging.

  2. ⚠️ Save QCFractal dataset as sqlite and have QDS look for either dataset*.json or dataset*.sqlite to import. Only need to add logic to QDS.
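The added QDS logic in alternative 2 could be a small dispatch on the file name. A sketch (the actual import functions would be the converters discussed above and are not shown):

```python
from pathlib import Path


def find_submittable(pr_dir: str):
    """Return ('json' | 'sqlite', path) for the first dataset file found in a PR directory."""
    root = Path(pr_dir)
    # Check the existing QCSubmit-style file first, then fall back to sqlite.
    for pattern, kind in (("dataset*.json*", "json"), ("dataset*.sqlite", "sqlite")):
        matches = sorted(root.glob(pattern))
        if matches:
            return kind, matches[0]
    raise FileNotFoundError(f"no dataset file found in {pr_dir}")
```

The CI would then route "json" files through the current QCSubmit path and "sqlite" files through the new QCFractal import path.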

Error cycling:

Currently:

  • The file, dataset.json.bz2, is deserialized to generate a dictionary, from which to retrieve the dataset_name and dataset_type.

    from openff.qcsubmit.serializers import deserialize
    spec = deserialize(self.submittable)

Alternatives:

  1. ⚠️✅ Generate a dataset.json.bz2 file either as:
    1. Easy: A file only containing dataset_name and dataset_type.
    2. Straightforward: Export QCFractal datasets as a dataset.json.bz2 file in the format defined in QCSubmit, without using QCSubmit. The information other than dataset_name and dataset_type wouldn't be used here, but this solution ties into a possible solution in the "Queued Submit" section. I almost have this export function complete; it needs some debugging with conformer naming.

    ⚠️✅ Optional, not needed at all: If we wanted to remove the QCSubmit dependency here, we can add a deserializer in QDS to get dataset_name and dataset_type from the dataset.json.bz2:

    import bz2, json

    def deserialize_bz2(filename):
        with bz2.open(filename, "rt", encoding="utf-8") as f:
            data = json.load(f)
        return data
  2. ⚠️ Offer detection of a QCFractal sqlite file as a QCFractal dataset and obtain the dataset name and type.
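For the "easy" variant of alternative 1 above, the minimal dataset.json.bz2 holding only the two fields error cycling reads could be written with just the stdlib. A sketch (field names match what the CI currently retrieves):

```python
import bz2
import json


def write_minimal_dataset_file(path: str, dataset_name: str, dataset_type: str) -> None:
    """Write a dataset.json.bz2 holding only the fields the error-cycling step needs."""
    with bz2.open(path, "wt", encoding="utf-8") as f:
        json.dump({"dataset_name": dataset_name, "dataset_type": dataset_type}, f)
```

This keeps the existing error-cycling CI working unmodified while carrying none of the QCSubmit dataset payload.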

Scientific Review:

  • ✅ Does not use datasets.

End of Life:

  • ✅ Does not use datasets.

Archived Complete:

  • ✅ Does not use datasets.
