2025-01-06 to 10 Clark/Wagner Check-in meeting notes

Discussion topics

Item	Notes

2025-01-06	JCl – HDF5 download from QCA - Can you join the science meeting today at 6 PM Eastern?
JW – Yes, I’ll be there.
JW – This is a good time to book for the Irvine in-person week (Feb 17-21). It’s easy to fly to Santa Ana “John Wayne” airport (code “SNA”) and stay at an airport hotel (I’m at the Hampton Inn across the street).
JCl – Should I plan to fly out Sun?
JW – Your choice, but try to be available onsite Monday at 9 AM, and fly back Fri night/Sat morning. (If you choose to stay extra days we’ll still cover your flight home, as long as the price is comparable to flying Fri night/Sat morning.)
JCl – Do I need to prepare anything?
JW – Not currently, we’ll let you know if that changes.
JW – Next Mon we’ll start formal iteration planning with the whole org (this is new for all of us).
JCl – I already have a big project plan for organometallics. Thinking of making one for dataset longevity. Should I do this in Confluence or hold off and do it on ZenHub?
JW – Do it on Confluence.
Tasks to pick from	Sage 2.0 dataset cleanup/redo (JCl – A possible follow-on from this would be adding an option to QCSubmit to answer whether a new submission would create new records on the server, as opposed to just reusing existing ones); Taking over NRP compute; Debugging organometallics; Toolkit implicit Hs PR (JW – This should be your top priority on the infrastructure side), and possibly further upstream changes; Dataset longevity planning (HDF5 downloads, Sage record consolidation, Zenodo upload/other cold storage)
NRP onboarding	https://github.com/openforcefield/qca-dataset-submission/tree/master/NRP
2025-01-07	NRP onboarding: JC got kubeconfig file. JC will work out onepass access for tomorrow. JC will go through NRP quick start. PR review (finished+approved).
2025-01-08	Get NRP workers running: JW sent JC credentials on protonmail.
2025-01-09	Get JC started running QC runners
JC – Infrastructure Q: The Chodera lab (e.g. SPICE) uses half optimized and half unoptimized geometries to fit their FF. (The forces are taken from the negative gradient instead of a minimized Hessian.) To generate these unoptimized geometries, Chris Iacovella is running tblite in ASE to get MD frames of TM complexes. He wants to put this functionality into our calculator pipeline.
JW – This sounds like a different way to get starting conformers. When we submit a QC dataset, we have two general ways of getting starting conformers:
1) Sometimes our starting conformers are external/magic (ex: make a submission where we just “have” some XYZ coordinates to start from, and kinda toss the script that made those XYZ coords into the submission directory on QDS).
2) Sometimes our starting conformers are from a previous QC dataset (like running an opt dataset, and starting single point calcs from the end points of those).
JC – CI is asking whether we can make a new black box for generating conformers, mostly aligning with providing something for the “external/magic” box in category 1.
JW – This seems like a cost/benefit question: “should we make a new product to generate confs using tblite?”. I’m generally the person who can estimate costs; LW/JE/DM will need to weigh in on benefit value.
This would be easy to implement - QCEngine can do most of the work, we just do some wrapping around it and file formatting (ex: getting QCEngine outputs back to SDF format). If there are missing features this may require coordinating with MolSSI/Psi4 devs, and we don’t have a contractual relationship with them (can just ask nicely for things).
This would be medium/hard to maintain: tblite is complex and isn’t in our existing dependency stack, so this is a complex upstream we’d be accepting (including all its upstreams). If we could somehow define this as not-a-product (but like a one-time artifact).
This would be hard to do “reproducibly” - Anything with nonzero temperature simulation is a mess.
JC – While we move ahead onboarding with OpenFF methods, the Chodera lab side is moving ahead fast and will do this whether or not we support it.
JW – I’d be fine with them creating a script to do this tblite-based conf gen and just putting it in the submission directory.
JC – Could add a tblite wrapper to QCEngine. Then could make a new kind of dataset like torsiondrive that generates many conformers per input.
Should we make a tblite wrapper for QCEngine?
JW – Maybe eventually, but if all CI needs for urgent goals is some properties, he’ll be better served by making a Python script to parse raw output and get the numbers of interest.
Should we make some sort of dataset like TorsionDrives to generate confs using tblite? Should it go in QCSubmit?
JW – I think this is all too young to put into a library.
JC – Could make a new dataset type for QCSubmit for tblite MD, but with dataset classes defined in a repo owned by the Chodera lab.
JW – To get this new type of dataset to meaningfully run as a QCF dataset will also require changes to QCF, which will be a big lift.
JC+JW: In summary, this looks like a longer turnaround time than continuing with his current running of tblite on their HPC, but it is something to keep an eye on and consider integrating if it proves valuable in the future.
2025-01-14	NRP worker handoff
JW – I’ve finished running workers for the LipidMAPS dataset tags that I know about, but there are 500ish more jobs and I don’t know their compute tags.
I ran workers with 8 cores and 64GB RAM to complete jobs on the `pyddx-large` compute tag.
I ran workers with 4 cores and 32GB RAM to complete jobs on the `pyddx-medium` compute tag.
What tag are the remaining LipidMAPS jobs on?
2025-01-16	Continuing LipidMAPS worker setup (JC got LipidMAPS workers going)
2025-01-21	Can we reproduce the connection error? Rerunning the submission to try to reproduce Lily’s error. Haven’t found that issue.
How do we get separate compute tags assigned?
JCl – I ran into an issue today where numpy ints didn’t play well with RDKit… maybe that happened here too. …It did.
Why do our workers have so many resets?
2025-01-22	Worker restarts - debugging
Lots of pod restarts. Lots of idle pods even though there should be tasks available.
Possible explanation: Something about these QC tasks causes managers to die. The pods restart, but the central QCArchive server considers those tasks to still be claimed by the dead manager for a long time, so there are no tasks for the new (restarted) managers to claim.
Is it a problem with the node hardware? Do restarted pods always go on the same hardware?
Is there a problem with the QM tasks?
JC will restart managers with a higher update frequency in case there’s a throttling thing going on.
JC will contact BP to ask about connection throttling.
Metal representation?
2025-01-23	Debugging disconnections: JC – Could set up our pods to run on a variety of nodes, have automation check the number of restarts after some period of time (10 mins?), and add nodes that get a bunch of resets to an exclude list. JW – This is a great idea.
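
A minimal sketch of the restart-based exclude list JC describes above, not an existing tool. It assumes pod state is captured twice, some minutes apart, as the parsed JSON from `kubectl get pods -o json`, and reads each pod's `spec.nodeName` and `status.containerStatuses[].restartCount`. The function names and the `max_new_restarts` threshold are hypothetical.

```python
from collections import defaultdict


def restarts_by_node(pod_list):
    """Sum container restart counts per node.

    `pod_list` is assumed to be the parsed output of
    `kubectl get pods -o json` (a dict with an "items" list).
    """
    totals = defaultdict(int)
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        if node is None:
            # Pod has not been scheduled onto a node yet.
            continue
        statuses = pod.get("status", {}).get("containerStatuses") or []
        totals[node] += sum(s.get("restartCount", 0) for s in statuses)
    return dict(totals)


def nodes_to_exclude(before, after, max_new_restarts=3):
    """Return nodes whose restart total grew by more than the threshold.

    `before` and `after` are per-node totals from two snapshots.
    Hypothetical heuristic: a replaced pod starts again at zero restarts,
    so this can undercount on nodes where pods were deleted in between.
    """
    return sorted(
        node
        for node, count in after.items()
        if count - before.get(node, 0) > max_new_restarts
    )
```

The returned node names could then be fed into whatever mechanism keeps workers off those nodes, for example a node anti-affinity rule in the worker deployment; that wiring is left out here.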
