NIH QM Datasets Research
Goals
This page documents progress made towards the goals of the NIH OMSF subaward with regard to QM datasets:
OMSF will lead in the generation, curation, and management of quantum chemistry datasets via MolSSI QCArchive for use in OpenFF force field parameterization efforts and other ML force field efforts. This may include some contributions to QCArchive infrastructure development and maintenance.
Personnel
Primary personnel: @Marcus Wieder
Primary supervisor: @John Chodera
OMSF approver: @Lily Wang
Current overview
SPICE 2.0 quantum chemical dataset coordination:
Taking over coordination of new datasets going into SPICE 2.0 that will be useful for ML and MM potential construction and assessment
Curating and preparing datasets
Coordinating QCFractal generation of datasets with OpenFF-compatible levels of theory
Coordinating Exscientia/Prescient contributions to SPICE 2.0, which may include higher levels of theory as well
Metalloprotein quantum chemical dataset coordination:
Interactions with Genentech on generating data relevant to metalloproteins; this is still at the initial stage of exploration.
Developing an ML potential training and assessment framework suitable for producing next-generation potentials
We’ve produced some planning documents we can share once Chodera Notion is working again
Repo is at: https://github.com/choderalab/modelforge
Over the next few weeks, Marcus is also working on the following items, for which he can likely cite his OpenFF funding source and to which we might therefore be able to claim contributions in a progress report:
A LiveCoMS best practices paper for construction and assessment of ML potentials with CECAM folks
A manuscript with the Boresch and Exscientia groups assessing the ability of different force fields to provide MM -> ML/MM accuracy improvements in hydration free energy calculations
An open source release and corresponding manuscript for the Exscientia physics-ml package that includes both trained ML potentials and evaluation tools. Unclear how this will be harmonized with our modelforge, but this was just something we managed to persuade the Exscientia folks to do.
In future:
Lead the targeted generation of QC datasets needed for OpenFF nucleic acids, the PDB Chemical Component Dictionary, Enamine REAL Space, etc., as well as other needs identified by OpenFF
Lead the development of a JAX-based molecular modeling and simulation engine