Excerpt

Initial approach as suggested by stakeholder consensus.

👀 Overview

Summary

Create a QM dataset from existing chemical structure databases by running optimization, torsion-drive, and *new* electronic property calculation types. Datasets will be split into two sets: those with metal centers of primary interest (i.e., Pd, Fe, Zn, Mg, Cu, Li) and those of secondary interest (i.e., Rh, Ir, Pt, Ni, Cr, Ag). Those of secondary interest will only be run with OPT calculations at GFN2-XTB. Those of primary interest will be run with a target level of theory, to be determined in this work, and have the target electronic properties evaluated. These datasets will be accompanied by organic compound elements: C, H, P, S, O, N, F, Cl, Br.

To achieve this, the standard OpenFF QCA dataset submission pipeline must be adapted in multiple ways.

  • Address incompatibility of existing OpenFF infrastructure with organometallic CMILES

  • Add ability for conformers to be imported into qc-submit

  • Create new dataset type with relevant properties

Generate a standard operating procedure for dataset continuity (DS-Continuity-SOP) to copy key OpenFF datasets into a secondary storage platform (i.e., Zenodo) in a future-proof file format.

    GitHub link

    Status

    NOT STARTED / IN PROGRESS / COMPLETED / WON'T PROGRESS

    ...

    🚩 Milestones and metrics

    Datasets will be labeled as DS#-XXX-{1,2}, e.g., DS1-CCD-2, which denotes dataset 1, taken from the CCD database, with the metal centers of secondary interest. The first number and the middle three-letter code are always paired to avoid confusion between similar database abbreviations, e.g., CCD vs. COD vs. CSD. The last number denotes whether the dataset contains metal centers of primary or secondary interest:

    1: primary interest: {Pd, Fe, Zn, Mg, Cu, Li}
    2: secondary interest: {Rh, Ir, Pt, Ni, Cr, Ag}
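The labeling scheme above can be sketched in code. This is a minimal illustration only: the names `DatasetLabel`, `parse_label`, `METAL_SETS`, and `LABEL_PATTERN` are hypothetical and not part of any OpenFF package, and the pattern assumes the middle code is always three uppercase letters, as in the examples on this page.

```python
import re
from typing import NamedTuple

# Metal sets copied from the list above; the dictionary name is hypothetical.
METAL_SETS = {
    1: frozenset({"Pd", "Fe", "Zn", "Mg", "Cu", "Li"}),  # primary interest
    2: frozenset({"Rh", "Ir", "Pt", "Ni", "Cr", "Ag"}),  # secondary interest
}

# Assumption: DS<number>-<three uppercase letters>-<1 or 2>
LABEL_PATTERN = re.compile(r"^DS(\d+)-([A-Z]{3})-([12])$")


class DatasetLabel(NamedTuple):
    index: int     # dataset number, e.g. 1
    source: str    # three-letter database code, e.g. "CCD"
    interest: int  # 1 = primary metals, 2 = secondary metals

    @property
    def metals(self) -> frozenset:
        """Metal centers covered by this dataset."""
        return METAL_SETS[self.interest]

    def __str__(self) -> str:
        return f"DS{self.index}-{self.source}-{self.interest}"


def parse_label(label: str) -> DatasetLabel:
    """Split a label such as 'DS1-CCD-2' into its three parts."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a valid dataset label: {label!r}")
    return DatasetLabel(int(match[1]), match[2], int(match[3]))
```

For example, `parse_label("DS1-CCD-2")` yields dataset 1 from CCD with the secondary-interest metal set, and `str()` on the result reproduces the original label.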

    Stage

    Milestone/Benchmark

    Contributors

    Deadline

    Status

    Add ability for conformers to be imported into qc-submit

    Assess ability for conformers to be added into qc-submit

    Generate Standard Operating Procedure

    Determine qcportal capability to download datasets locally

    Jennifer A Clark

    Determine if RDKit functionality will perform adequately

    Status: Not started

    Resolve qc-submit CMILES incompatibility with organometallic complexes

    In progress

    Determine final future proof file format for datasets

    Jennifer A Clark

    Status: Not started

    If RDKit will not handle CMILES, skip for cif to qca interaction

    COMPLETED

    Determine file conversion strategy from output of qcportal to future proof file format

    Jennifer A Clark

    Status: Not started

    If RDKit will handle CMILES, assess workaround for OpenEye, or implement error handling

    Determine location to aggregate or reference Zenodo DOIs

    Jennifer A Clark

    Status: Not started

    Curate opt training dataset

    Filter PDB Chemical Component Dictionary (CCD) and submit DS1-CCD-1 and DS1-CCD-2 at BP86 / def2-TZVP

    Jennifer A Clark, Brent Westbrook

    Jan. 15, 2025

    Status: In progress

    Submit DS1-CCD-1 and DS1-CCD-2 at alternative model chemistries for assessment

    Combine pipeline elements into DS-continuity-SOP

    Jennifer A Clark

    Status: Not started

    Choose model chemistry based off of DS1-CCD-1 and DS1-CCD-2

    Create dataset collection on qcportal

    Debug QCA-Dataset-Submission issues, or establish record keeping mechanism for direct QCPortal use.

    Jennifer A Clark, Lily Wang

    Status: Not started

    Filter Crystallography Open Database (COD) and submit OPT DS2-COD-1 and DS2-COD-2 at GFN2-XTB

    In progress

    Sage 2.0.0

    Jennifer A Clark

    Status: Not started

    Filter Cambridge Structural Database (CSD) and submit OPT DS3-CSD-1 and DS3-CSD-2 at GFN2-XTB with structures neglected by tmQM

    In progress

    Sage 2.1.0

    Jennifer A Clark

    Status: Not started

    Filter Materials Project Trajectory Dataset (MPtrj) and submit OPT DS4-MPT-1 and DS4-MPT-2 at GFN2-XTB

    In progress

    Sage 2.2.0

    Jennifer A Clark

    Status: Not started

    Submit DS2-COD-1 OPT at target model chemistry

    Industry Benchmarking

    Jennifer A Clark

    Status: Not started

    Submit DS3-CSD-1 OPT at target model chemistry

    Publish OpenFF datasets

    Apply DS-continuity-SOP to Sage 2.0.0

    Jennifer A Clark

    Status: Not started

    Submit DS4-MPT-1 OPT at target model chemistry

    Apply DS-continuity-SOP to Sage 2.1.0

    Jennifer A Clark

    Status: Not started

    Curate electronic properties training dataset

    Define primary and secondary properties of interest

    Jennifer A Clark, Chris Iacovella

    Status: COMPLETED

    Determine output protocol of primary properties of interest and implement

    Jennifer A Clark

    Status: IN PROGRESS

    Determine output protocol of secondary properties of interest and implement

    Jennifer A Clark

    Status: IN PROGRESS

    Submit DS1-CCD-1 Electronic Property calculation at target model chemistry

    Apply DS-continuity-SOP to Sage 2.2.0

    Jennifer A Clark

    Status: Not started

    Submit DS2-COD-1 Electronic Property calculation at target model chemistry

    Apply DS-continuity-SOP to Industry Benchmarking dataset

    Jennifer A Clark

    Status: Not started

    Submit DS3-CSD-1 Electronic Property calculation at target model chemistry

    Determine other benchmarking datasets of interest

    Jennifer A Clark

    Status: Not started

    Submit DS4-MPT-1 Electronic Property calculation at target model chemistry

    Jennifer A Clark

    Status: Not started

    📊 Progress and findings

    Generate Standard Operating Procedure

    Although not supported now, within the timeframe of this project MolSSI is expected to have restored the qcportal capability of “dataset views” to allow downloading the files in some format. It should be trivial to export from there to qcschema molecules in hdf5 format. [QCA Users 2025-01-07]

    Create Dataset Collection on QCPortal

    This initiative was expected to be achieved by combining the record lists of several published datasets. However, after pushing to QCPortal, additional specification keywords are added and new records are spawned. This may be related to the recent QCPortal upgrade. Strongly considering a fallback to creating the collection directly with QCPortal, bypassing qc-submit… 2025-01-09 JCl/LW check-in
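The behavior described above can be illustrated with a toy model. This sketch assumes that a record is identified by its molecule together with its full specification. That is an assumption made for illustration, not a description of how QCPortal or qc-submit actually match records. The functions `record_key` and `find_unmatched`, the specification values, and the molecule IDs are all hypothetical.

```python
import hashlib
import json


def record_key(molecule_id: str, specification: dict) -> str:
    """Toy fingerprint: hash of the molecule ID plus the full specification."""
    payload = json.dumps(
        {"molecule": molecule_id, "spec": specification}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def find_unmatched(existing: set, molecule_ids: list, specification: dict) -> list:
    """Return molecule IDs whose (molecule, specification) key is not stored yet."""
    return [m for m in molecule_ids if record_key(m, specification) not in existing]


# Hypothetical specification and molecule IDs, for illustration only.
published_spec = {"program": "psi4", "method": "b3lyp-d3bj", "basis": "dzvp", "keywords": {}}
molecules = ["mol-1", "mol-2"]
existing_keys = {record_key(m, published_spec) for m in molecules}

# Resubmitting with an identical specification matches every stored record.
same = find_unmatched(existing_keys, molecules, published_spec)

# One extra keyword changes every key, so each molecule would spawn a new record.
altered_spec = {**published_spec, "keywords": {"maxiter": 200}}
spawned = find_unmatched(existing_keys, molecules, altered_spec)
```

Under this assumption, any keyword added on submission would be enough to prevent a combined collection from reusing the published records, which is consistent with the fallback of building the collection from existing record IDs directly.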

    Publish OpenFF Datasets