2024-11-04 Westbrook/Wagner Check-in meeting notes

Participants

  • @Brent Westbrook

  • @Jeffrey Wagner

Discussion topics

Item

Notes

General updates

  • JW

    • How's QC worker management going? Utilization looked good over the weekend, and I saw opts getting completed.

      • BW – I did memory strangling for most of them. I looked into it, and it seems like all the ones running now are large mols.

      • JW – Maybe they’re being enriched. If they error, then error cycling sets them to waiting again, so maybe those are the first jobs sent to workers.

      • BW – That could be. My current configurations are pretty similar to what you were doing (20x 4 core, 20GB workers, and then 20x 4 core, 30GB, manager told 15).

      • JW – Feel free to scale up to 100 or 200 replicas of each - the current restart rate looks fine.

    • Would you (very optionally!) also want to try making a new image and running workers for the MLPepper set with Charlie Adams from the Cole lab?

      • Resources:

      • BW – I’d be game

      • JW – Could either update the current worker image (negligible chance that this has unintended conda env side effects that mess with provenance or something) or make a new image. I have no preference.

      • BW – I’ll do the former. If adding offtk changes deps, then that’s something we should know in general.

      • JW – Starting point for this is CAdams slack thread.

        • Also he uses (4 cores, 40GB) per worker (you might be able to squeeze down the memory once you can measure utilization)

      • BW – Should this action still be pointing at the use-ghcr branch?

      • JW – No, that’s my mistake, please switch it to master

      • BW – And should I request review for the PR?

        • JW – I think you should be fine to merge without review if you feel comfortable with it, but if you have questions or feel uncertain then DO tag me for review and mark it on Trello.

      • (some discussion over root cause, general confusion as to why adding openff-toolkit would change the behavior of qcel/qcf/psi4)

      • JW – Maybe try using one of our current workers BEFORE adding openff-toolkit as a dep to see if we get the same error? Maybe it’s a downstream env thing (e.g. a pint version or something).

        • BW – Will do

    • Live-review QCSubmit PR?

      • (JW thought it looks great, approved)

  • BW – I requested LM and LW review my YDS PR to add a standard industry benchmark. If that works I’ll be running a bunch of jobs.

    • JW – Sounds great. Also FYI - I’m getting a warning that one of my GH API tokens is about to expire. I can’t remember if this one is essential for AWS jobs, so to find out I’ll let it expire, and wait for you to tell me that something broke

    • BW – Cool, will do.

  • To have YDS upload to “Real” zenodo, we should get Brent API access to openforcefield zenodo acct.

    • (JW logged into Zenodo using the openff acct, went to “Applications” and gave BW a token with “Write” but not “Publish” permissions, so there will still be human reviews before artifacts go live)
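
For the worker-image question above (whether adding openff-toolkit shifts a downstream package such as pint), one way to check is to diff the package listings of the image before and after the change. The following is only a minimal sketch: it assumes each listing is in the `name=version=build` format printed by `conda list --export`, the helper names `parse_env` and `diff_envs` are hypothetical, and the package versions in the sample data are invented placeholders, not the real worker environment.

```python
def parse_env(listing: str) -> dict[str, str]:
    """Map package name -> version from `name=version=build` lines."""
    packages = {}
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header comments conda emits.
        if not line or line.startswith("#"):
            continue
        # Pad so a bare package name still unpacks to an empty version.
        name, version, *_ = line.split("=") + [""]
        packages[name] = version
    return packages


def diff_envs(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple]:
    """Return {name: (old_version, new_version)} for every package that changed.

    A value of None means the package is absent from that environment.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Placeholder listings (versions are made up for illustration only).
BEFORE = """# platform: linux-64
pint=0.23=pyhd8ed1ab_0
psi4=1.9.1=py311_0
"""
AFTER = """# platform: linux-64
openff-toolkit=0.16.4=pyhd8ed1ab_0
pint=0.24=pyhd8ed1ab_0
psi4=1.9.1=py311_0
"""

if __name__ == "__main__":
    for name, (old, new) in diff_envs(parse_env(BEFORE), parse_env(AFTER)).items():
        print(f"{name}: {old} -> {new}")
```

In practice the two listings would come from running `conda list --export` inside the current worker image and inside the rebuilt one; any package that appears in the diff besides openff-toolkit itself is a candidate for the behavior change.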

Trello

https://trello.com/b/dzvFZnv4/infrastructure

Action items

Decisions