...
Excerpt |
---|
Initial approach as suggested by stakeholder consensus. |
👀 Overview
Summary | Create a QM dataset from existing chemical structure databases by running optimization, torsion-drive, and *new* electronic property calculation types. Each dataset will be split into two sets: those with metal centers of primary interest (Pd, Fe, Zn, Mg, Cu, Li) and those with metal centers of secondary interest (Rh, Ir, Pt, Ni, Cr, Ag). Sets of secondary interest will only be run as optimization (OPT) calculations with GFN2-xTB. Sets of primary interest will be run at a target level of theory, to be determined in this work, and will have the target electronic properties evaluated. In both sets the metal centers will be accompanied by the organic elements C, H, P, S, O, N, F, Cl, and Br. To achieve this, the standard OpenFF QCA dataset submission pipeline must be adapted in multiple ways.
---|---
GitHub link | |||||||||||||||||||||||
Status |
🚩 Milestones and metrics
Datasets will be labeled as DS#-XXX-{1,2}, e.g. DS1-CCD-2, which denotes dataset 1, taken from the CCD database, with the metal centers of secondary interest. The first number and the middle three-letter code are always paired to avoid confusion between similar database abbreviations, e.g. CCD vs. COD vs. CSD. The last number denotes whether the dataset contains metal centers of primary or secondary interest:
1: primary interest: {Pd, Fe, Zn, Mg, Cu, Li}
2: secondary interest: {Rh, Ir, Pt, Ni, Cr, Ag}
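The labeling scheme above can be checked mechanically. The following is a minimal sketch, not part of the OpenFF submission pipeline: the names `DatasetLabel`, `parse_label`, and `metals_for_label` are hypothetical, and the regular expression assumes the middle code is always exactly three uppercase letters, as in the examples given.

```python
import re
from typing import NamedTuple

# Metal sets copied from the list above.
PRIMARY_METALS = frozenset({"Pd", "Fe", "Zn", "Mg", "Cu", "Li"})
SECONDARY_METALS = frozenset({"Rh", "Ir", "Pt", "Ni", "Cr", "Ag"})

# Assumed shape of a label: DS<number>-<three uppercase letters>-<1 or 2>.
LABEL_PATTERN = re.compile(r"DS(?P<index>\d+)-(?P<source>[A-Z]{3})-(?P<tier>[12])")


class DatasetLabel(NamedTuple):
    """Hypothetical container for the three parts of a dataset label."""

    index: int   # dataset number, e.g. 1
    source: str  # three-letter database code, e.g. "CCD"
    tier: int    # 1 = primary interest, 2 = secondary interest


def parse_label(label: str) -> DatasetLabel:
    """Split a label such as 'DS1-CCD-2' into its parts, or raise ValueError."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not a valid dataset label: {label!r}")
    return DatasetLabel(int(match["index"]), match["source"], int(match["tier"]))


def metals_for_label(label: str) -> frozenset:
    """Return the metal set implied by the last number of the label."""
    return PRIMARY_METALS if parse_label(label).tier == 1 else SECONDARY_METALS


if __name__ == "__main__":
    print(parse_label("DS1-CCD-2"))
    print(sorted(metals_for_label("DS1-CCD-2")))
```

A check like this could run before submission so that a mistyped database code or tier number is rejected early rather than producing a mislabeled dataset.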
| Stage | Milestone/Benchmark | Contributors | Deadline | Status |
|---|---|---|---|---|
| Add ability for conformers to be imported into qc-submit | Assess ability for conformers to be added into qc-submit | | | |
| Resolve qc-submit CMILES incompatibility with organometallic complexes | Determine if RDKit functionality will perform adequately | | | |
| | If RDKit will not handle CMILES, skip for CIF to QCA interaction | | | |
| | If RDKit will handle CMILES, assess workaround for OpenEye, or implement error handling | | | |
| Curate opt training dataset | Filter PDB Chemical Component Dictionary (CCD) and submit DS1-CCD-1 and DS1-CCD-2 at BP86 / def2-TZVP | Jennifer A Clark, Brent Westbrook | January 15, 2025 | |
| | Submit DS1-CCD-1 and DS1-CCD-2 at alternative model chemistries for assessment | | | |
| | Choose model chemistry based on DS1-CCD-1 and DS1-CCD-2 | | | |
| | Filter Crystallography Open Database (COD) and submit OPT DS2-COD-1 and DS2-COD-2 at GFN2-xTB | | | |
| | Filter Cambridge Structural Database (CSD) and submit OPT DS3-CSD-1 and DS3-CSD-2 at GFN2-xTB with structures neglected by tmQM | | | |
| | Filter MPtrj (Materials Project Trajectory Dataset) and submit OPT DS4-MPT-1 and DS4-MPT-2 at GFN2-xTB | | | |
| | Submit DS2-COD-1 OPT at target model chemistry | | | |
| | Submit DS3-CSD-1 OPT at target model chemistry | | | |
| | Submit DS4-MPT-1 OPT at target model chemistry | | | |
| Curate electronic properties training dataset | Define primary and secondary properties of interest | Jennifer A Clark, Chris Iacovella | | |
| | Determine output protocol of primary properties of interest and implement | | | |
| | Determine output protocol of secondary properties of interest and implement | | | |
| | Submit DS1-CCD-1 Electronic Property calculation at target model chemistry | | | |
| | Submit DS2-COD-1 Electronic Property calculation at target model chemistry | | | |
| | Submit DS3-CSD-1 Electronic Property calculation at target model chemistry | | | |
| | Submit DS4-MPT-1 Electronic Property calculation at target model chemistry | | | |
| Curate training dataset | Work out best level of theory for the training dataset | | November 10, 2024 | |
| | Compute training dataset | | December 31, 2024 | |
| Curate testing dataset | Compile QM dataset | | November 30, 2024 | |
| | Compute QM dataset | | January 31, 2025 | |
| | Compile simulation test set (FreeSolv, maybe non-hydration solvation free energy sets that are harder to reproduce) | | April | |
| Determine best NN architecture | Implement attention-based GNN | | December 31, 2024 | |
| | Implement bond features in GraphSAGE (?) | | December 31, 2024 | |
| | Determine best architecture | | January 31, 2025 | |
| First pass at NN training | Train using just ESPs, dipoles, quadrupoles | | February 28, 2025 | |
| | Train directly to charge model if still having issues | | | |
| Benchmark 1: QM | Neural network charge model with low testing error on QM data (ESPs, dipoles) | | March 15, 2025 | |
| | Re-train VDW terms | | March 30, 2025 | |
| | Re-train valence terms | | April 15, 2025 | |
| Benchmark 2: Simulation | Neural network charge model with equivalent or better performance to NAGL in simulations | | April 30, 2025 | |