2024-11-20 Mitchell/Cavender/Wagner/Wang Protein simulations on NRP meeting notes

Participants

  • @Chapin Cavender

  • @Josh Mitchell

  • @Jeffrey Wagner

  • @Lily Wang

Discussion topics

Item

Notes

Needed permissions

  • rclone config / special permissions needed?

    • JM – No, we’ll share these (see below).

  • To get config:

    • kubectl get secret jm-rclone-config -o json

    • Then take the long encoded value (it is base64, not a hash) and run

    • echo <big value> | base64 --decode

    • One liner: kubectl get secret jm-rclone-config -o jsonpath='{.data.rclone\.conf}' | base64 --decode > ~/.config/rclone/rclone.conf

  • rclone copy --progress nrp:proteinbenchmark-jm-bucket/results results

    • This will also be needed to push new umbrella starting points and FFs; like rsync/scp, it copies in either direction (swap the source and destination)

    • There’s also rclone ls to see what’s there and rclone delete to delete files
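
The decode step above can also be sketched in Python, for anyone who would rather script it than pipe through base64. This is a minimal sketch, assuming the JSON printed by kubectl get secret jm-rclone-config -o json is available as a string. The function name is hypothetical, and the example secret contents are made up for illustration.

```python
import base64
import json


def rclone_conf_from_secret(secret_json: str, key: str = "rclone.conf") -> str:
    """Return the decoded rclone config stored in a Kubernetes secret.

    secret_json is the text printed by
    `kubectl get secret jm-rclone-config -o json`.
    """
    secret = json.loads(secret_json)
    # Kubernetes stores secret values base64-encoded under `.data`.
    encoded = secret["data"][key]
    return base64.b64decode(encoded).decode("utf-8")


# Stand-in for real kubectl output; the config contents here are invented.
example_conf = "[nrp]\ntype = s3\n"
example_secret = json.dumps(
    {"data": {"rclone.conf": base64.b64encode(example_conf.encode()).decode()}}
)
decoded = rclone_conf_from_secret(example_secret)
```

Whichever route is used, writing the result to ~/.config/rclone/rclone.conf requires that the ~/.config/rclone directory already exists.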

Necessary steps

  • git clone git@github.com:Yoshanuikabundi/proteinbenchmark-nrp.git

  • In proteinbenchmark_jm_template.yaml, replace all instances of jm with cc, except those in jm-bucket and jm-rclone-config

  • In the python script, change jm to cc

  • JW needed to run mkdir -p results/gb3-null-0.0.3-pair-opc3/replica-1 to get a single replica working

    • And after runs, needed to run rm results/gb3-null-0.0.3-pair-opc3/replica-1/gb3-null-0.0.3-pair-opc3-1-00.yaml

  • python run-umbrella-windows.py

  • (After runs complete, shown by a “Completed” status, the pods are kept so that their logs remain available.)

  • kubectl delete pod <pod name>
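
The jm-to-cc replacement in the template step above is easy to get wrong by hand, since jm-bucket and jm-rclone-config must stay untouched. A minimal sketch of that substitution follows, assuming a plain text replacement is what is wanted. The function name and the sample lines are hypothetical and are not taken from the real template.

```python
import re

# Match "jm" unless it begins one of the two shared names that must be kept.
JM_PATTERN = re.compile(r"jm(?!-bucket|-rclone-config)")


def retarget(text: str, initials: str = "cc") -> str:
    """Replace every "jm" except those in jm-bucket and jm-rclone-config."""
    return JM_PATTERN.sub(initials, text)


# Hypothetical lines, only to illustrate which occurrences change.
sample = "\n".join(
    [
        "name: proteinbenchmark-jm-worker",
        "bucket: proteinbenchmark-jm-bucket",
        "secretName: jm-rclone-config",
    ]
)
updated = retarget(sample)
```

Note that the pattern matches the letters jm anywhere, including inside longer words, so it is worth reviewing the result with git diff before running anything.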

Questions

  • Which things should we get set up for ourselves?

    • JM – probably fine to use my secrets, would be redundant for everyone to make their own secrets that are visible to everyone else. Inclined to reuse my secrets until something goes wrong.

      • JW – Agree

    • JM – Current shared secret is just the S3 key. That’s fine.

    • JW – How long do persistent volumes last?

      • JM – We’re not using PVs, but those are kept until they have been inactive for 30 days. We’re using S3, which is permanent (though in reality admins will delete it at some point; I don’t recall a time limit)

    • JW – Does copying results back to our computers delete them from S3?

      • JM – No. We’ll need to clean those manually.

    • We’ll meet in this timeslot again next Weds and look at storage usage.

    • CC – Looks like one replica (31 windows of 500 ns each, starting from different points) is 14 GB, so 3 replicas is 42 GB.

  • How can we run into problems as fast as possible?

  • Who is responsible for running?

    • JW – Could have CC be principally responsible and reach out to us when needed, or have CC send JM the inputs and have JM be principally responsible for running.

    • CC – Prefer running myself and reaching out for help when needed. Can overlap with JM in evenings PT.

    • JM – That makes sense. Should we set up a dockerfile pipeline? Should we move repo to openforcefield org?

    • JW – Let’s move this repo to our org, use the dockerfile in it as authoritative, and have a workflow that is only ever manually triggered to update the docker image.

    • LW – Should we continue using ghcr? Or move to nrp container registry?

      • JM – nrp container registry isn’t always faster; sometimes things are stored physically far from where they’re run.

      • JW – I think ghcr is fine to continue using, doesn’t seem to be charging us for throughput.
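
For the storage check-in, CC's estimate above (14 GB per replica of 31 windows, so 42 GB for 3 replicas) can be reproduced and extended to other replica counts with a short sketch. The function and constant names are hypothetical, and the per-window figure assumes all windows are roughly the same size, which the notes do not confirm.

```python
GB_PER_REPLICA = 14  # CC's observed size of one replica
WINDOWS_PER_REPLICA = 31  # umbrella windows per replica


def estimate_storage_gb(replicas: int, gb_per_replica: float = GB_PER_REPLICA) -> float:
    """Estimate total S3 usage for a number of replicas."""
    return replicas * gb_per_replica


total_gb = estimate_storage_gb(3)
# Assumes windows are about equal in size.
gb_per_window = GB_PER_REPLICA / WINDOWS_PER_REPLICA

print(f"3 replicas: {total_gb} GB, about {gb_per_window:.2f} GB per window")
```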

Getting new umbrella starting points schlepped to NRP

 

To do items

  • JM will move repo to OFF org

  • JM will make sure both the proteinbenchmark repo (and the NRP-running scripts?) are always pulled from GitHub (done)

    • JM will contact CC at this point

  • CC will update scripts in repo for butane runs

  • CC will start butane validation runs on 10 GPUs

  • JM will change docker image builds to happen manually and pull from dockerfile in repo

  • Everyone will monitor GPU utilization and post in DM thread if something’s up

  • Next Weds at 4 PM Pacific we’ll discuss storage usage
