2025-07-29 QCA Dataset Submission Meeting


Participants

  • @Jennifer Clark

  • @Jeffrey Wagner

  • @Lily Wang

Discussion topics

 


Update Dataset Tracking

Project Board; Slides

  • PR489: Lipid Torsiondrives

    • Just submitted

    • Shoutout to Jeff and Matt for the quick OpenFF-Toolkit release and review that made this timely submission possible

  • Running PR 440: Chodera tmQM

    • Still moving; just finished TMOS infrastructure to characterize and sort

  • PR449: TMBenchmark

    • In scientific review

    • Requires TMOS infrastructure to characterize, and new infrastructure to add solvent

 

  • JC – I’ve been trying to characterize tmQM with TMOS, but I’ve been having trouble parsing RDMols from CSD, where both the OpenBabel and MDAnalysis bond order determination are failing. TMOS tries OpenBabel first and falls back to MDAnalysis if that fails. Since the OpenBabel method especially relies on interatomic distances, I think the atoms are just too far apart.

    • LW – Are these coords pre- or post-optimization?

    • JC – From tmQM directly, I believe preserved from CSD

    • JW – Could it be a bohr vs. angstrom mix-up? (1 angstrom is about 1.89 bohr)

      • JC – It’s not bohr

    • JW – Is purpose to…

      • JC – TMOS does the following:

        • Loads structures from XYZ and returns Lewis structures.

        • Sanitizes ligands in all permutations of X-type (covalent) and L-type (dative) connections and chooses the permutation with the most neutral overall charge and the most neutral atoms.

        • Assembles the ligands onto the metal, resulting in a complex that RDKit can “handle” (with kid gloves), along with the oxidation state, the number of electrons on the metal, and the formal charge on the metal.

      • JW – Would falling back to these different reading methods potentially cause trouble if they have different default ways of representing bonds to metals?

        • JC – I separate the ligands from the metal before running them, so it shouldn’t be an issue

    • LW – Problem is that even when removing metals, the chemical graph of the organic part can’t be determined. Maybe related to use of iodine as substitute for metal bonds?

    • JC – I don’t think the presence of a metal would affect the process at that stage, but it’s worth a shot. I’m limited to monovalent atoms that RDKit can handle, which is why I chose iodine.

    • JC – The warning messages about failing to remove hydrogens may also be relevant; I’ll look into those. All in all, I wanted to convey that it’s not as smooth sailing as I wanted, but I haven’t given up hope yet.
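
A quick way to rule out the bohr vs. angstrom question raised above is to look at the shortest interatomic distance in a structure. The sketch below is only an illustration and is not part of TMOS: `guess_length_unit` and its thresholds are hypothetical, and the heuristic assumes the structure contains at least one bond to hydrogen (typically about 0.9 to 1.1 angstrom), which is common for organometallic complexes but not guaranteed.

```python
import math
from itertools import combinations

BOHR_TO_ANGSTROM = 0.529177210903  # CODATA 2018: length of 1 bohr in angstrom


def min_pair_distance(coords):
    """Return the shortest distance between any two atoms."""
    return min(math.dist(a, b) for a, b in combinations(coords, 2))


def guess_length_unit(coords, lo=0.6, hi=1.3):
    """Guess the coordinate unit from the shortest interatomic distance.

    lo and hi (in angstrom) bracket a plausible X-H bond length. Both are
    illustrative assumptions, not values taken from TMOS or tmQM.
    """
    d = min_pair_distance(coords)
    if lo <= d <= hi:
        return "angstrom"  # shortest contact already looks like an X-H bond
    if lo <= d * BOHR_TO_ANGSTROM <= hi:
        return "bohr"  # looks like an X-H bond only after conversion
    return "unclear"


# Water with O-H = 0.9572 angstrom, given as (x, y, z) tuples.
water = [(0.0, 0.0, 0.0), (0.9572, 0.0, 0.0), (-0.24, 0.9266, 0.0)]
water_bohr = [tuple(x / BOHR_TO_ANGSTROM for x in atom) for atom in water]

print(guess_length_unit(water))
print(guess_length_unit(water_bohr))
```

If a check like this reports "angstrom" for the failing tmQM entries, the unit explanation can be set aside and the distance cutoffs used for bond perception become the more likely culprit.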

Dataset archival project

Need characterization of Industry Dataset to expand Blog Post

  • Shoutout to Lily for scripts to use Checkmol for this purpose

MolSSI Info / Align Priorities on MolSSI Asks

No meeting last week

Have we thoroughly tested his changes to Torsiondrive record caching? Does LW have a recommendation?

  • Watch CI on this PR: https://github.com/openforcefield/openff-qcsubmit/pull/347

    • JW: I’ll kick off the CI on BespokeFit

    • LW: It doesn’t look like there would be issues, but maybe I’ll run into something and then just deal with it.

Requests:

  •  

Old Issue of the Week

Missing chemistry to (potentially) cover post-release-1

  • [#8]~[#35]: O-Br single bonds are covered in GAFF2, but no molecules in our current datasets contain this chemistry. We could port in a placeholder value from GAFF2 in the meantime.

    • Still not addressed

  • [#7X3]~[#7X3](~[#8])~[#8]: Nitroamines

  • [#6:1]~[#6:2]=[#15:3]~[#6:4]: C=P double bond (potentially with adjacent singles)

    • Still not addressed

  • Action: Labeled with reviewed-2025; created a Sage 2.4 project and connected this issue.

Bonus: “Add "informative" set (subset) from Ehrman”

  • DM suggests a dataset of molecules that tend to have very different energies across different force fields. This dataset appears to have been added in PR #47

  • Action: Close with comment “This issue was resolved in #47”


 

Action items

Decisions