2024-11-25 Mitchell/Wagner/Wang Check-in meeting notes

Participants

  • @Josh Mitchell

  • @Chapin Cavender

  • @Lily Wang

  • @Jeffrey Wagner

Discussion topics


Previous to dos

  1. CC will update scripts in repo for butane runs

    1. CC – Everything I used to run over the weekend is reflected in the repo

  2. CC will start butane validation runs on 10 GPUs

    1. CC – Seem to have finished successfully. Ran 72 windows with 1 µs each. Aside from a few failures on Friday (storage issues), everything ran with no problem. When I had issues with those, I wiped the data from the pods and resubmitted, and everything went fine.

    2. JM – The issue probably emerged when we were assigned nodes with a lot of HDD space already used.

    3. JW – Did you set ephemeral storage for resubmission?

    4. CC – No, and everything ran with no problems.

  3. JM will change docker image builds to happen manually and pull from dockerfile in repo

    1. JM – All done. There’s a workflow (based on LW’s workflow) to rebuild the image by clicking a button, and it’s linked in the README. The Docker image now ONLY pulls dependencies at build time; the scripts from the git repo are pulled at runtime (so they’ll update as the repo changes, even if the image isn’t rebuilt). (See the sketch after this list.)

  4. Everyone will monitor GPU utilization and post in DM thread if something’s up

    1. JW – We’d seen an issue where (partially) completed runs hang around in an Error state; could that be fixed by switching to jobs?

    2. JM – Already switched to making them jobs

    3. CC – Submitting as jobs did indeed resolve this. And at least one got kicked off its node and auto-restarted and completed

    4. JW – Did this solve walltime limits?

    5. JM – Walltime limit only applied to pods submitted as pods, not pods submitted as jobs.

    6. CC – That’s what I saw: the jobs that were submitted each ran as a single pod for 12-24 hours (except the one that got kicked off and restarted).
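
    (For item 3 above: a hypothetical sketch of a manually triggered image-build workflow of the kind described, assuming GitHub Actions. The workflow name, registry, tags, and Dockerfile path are placeholders, not the actual workflow linked in the README.)

      name: build-image                 # placeholder workflow name
      on:
        workflow_dispatch:              # "click a button" in the Actions tab to rebuild
      jobs:
        build:
          runs-on: ubuntu-latest
          steps:
            - uses: actions/checkout@v4
            - uses: docker/login-action@v3
              with:
                registry: ghcr.io
                username: ${{ github.actor }}
                password: ${{ secrets.GITHUB_TOKEN }}
            - uses: docker/build-push-action@v5
              with:
                context: .
                file: ./docker/Dockerfile                    # placeholder path to the Dockerfile in the repo
                push: true
                tags: ghcr.io/example/umbrella-sim:latest    # placeholder image tag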
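
    (For the jobs-vs-pods discussion under item 4 above: a minimal, hypothetical sketch of what submitting one umbrella window as a Kubernetes Job rather than a bare pod could look like. The job name, image, command, and GPU count below are placeholders, not the actual template in the repo; only the Job-vs-pod structure is the point.)

      apiVersion: batch/v1
      kind: Job
      metadata:
        name: umbrella-window-example              # placeholder name
      spec:
        backoffLimit: 3                            # re-run the pod if it gets kicked off its node
        template:
          spec:
            restartPolicy: OnFailure               # Jobs require Never or OnFailure; lets a failed pod be rescheduled
            containers:
              - name: umbrella
                image: ghcr.io/example/umbrella-sim:latest        # placeholder image
                command: ["bash", "/opt/scripts/run_window.sh"]   # placeholder entrypoint
                resources:
                  limits:
                    nvidia.com/gpu: 1

    Because the top-level object is a Job, the pod it creates is managed by the Job controller, which is what gives the auto-restart behavior described above and, per the discussion, avoids the walltime limit applied to bare pods.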

NRP setup and debugging

  • CC update

    • I pulled down data from the bucket using rclone and they look superficially good: right number of frames and files. So I have some confidence that the bucket and checkpointing are working. Running numerical analysis now (some have finished and the outputs are roughly what I’d expect).

  • JM update

    • No update (I had time for some other work Friday - PR review for MT)

  • Switch from pods to jobs to avoid walltime limits

    • Already done, see above

  • Local storage issues

    • JM – I don’t think this issue is solved; we just haven’t encountered it again yet. Next steps would be:

      • Set ephemeral storage to have enough space for trajectory

      • Only copy over the files needed for, and produced by, the simulation for this window (instead of copying the WHOLE results folder)

    • CC – We haven’t seen this be severe yet since the results folders were copied before we did these runs. Now anything we submit would need to copy in ALL results, and this would strain resources.

    • JM – An alternative worth considering would be a persistent volume claim (PVC). We could have a single PVC with all our results that we mount whenever we want it. For reference: S3 storage is managed/accessed by rclone, whereas a PVC would be a volume in Kubernetes that we’d mount directly in pods. S3 is easier for external storage; a PVC is easier for accessing shared folders/files inside a pod. I recommend continuing to use S3 since we need to access our files externally. Also, if disk I/O in pods is slow, then PVCs will make this even slower.

      • JW – Agree with using S3. I don’t think disk I/O is super slow, since GPU utilization is so high I don’t think things are waiting on frames being written.

      • CC – Watching the utilization, I’d sometimes see pods drop to 0 utilization for a minute, probably as part of checkpoint writing. If frame writing were also taking a long time, I’d expect to see those drops more often.
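
      (For reference on the PVC alternative JM describes above, which we are not adopting: a hypothetical sketch of what a shared results PVC plus a direct mount in the pod template could look like. The claim name, size, access mode, and mount path are all placeholders.)

        apiVersion: v1
        kind: PersistentVolumeClaim
        metadata:
          name: umbrella-results              # placeholder name
        spec:
          accessModes: ["ReadWriteMany"]      # shared across pods, if the storage class supports it
          resources:
            requests:
              storage: 500Gi                  # placeholder size

        # The pod/job template would then mount the claim directly, roughly:
        #   volumes:
        #     - name: results
        #       persistentVolumeClaim:
        #         claimName: umbrella-results
        #   containers:
        #     - ...
        #       volumeMounts:
        #         - name: results
        #           mountPath: /results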

    • If avoiding copying the entire results folder doesn’t fix the local storage issue, I could break up trajs so they take even less space.

      • JW – I hope they’re not letting us use a GPU for a day and not supporting us writing 200MB…

      • JM – Breaking up trajs may be superior in general since it will allow for simpler checkpointing/resuming, but I think it’s a bad solution for this problem in particular.

      • JW + CC – Agree

    • JW – Also adding ephemeral storage would be good

      • JM – Wasn’t it 14GB per window of GB3? Will we simulate bigger proteins?

      • CC – No, 14 GB per replica, 500ish MB per window. And maybe we’ll go slightly larger than GB3.

      • (General) – We’ll request 20GB just so we don’t have to think about it.
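
      (A sketch of what the 20 GB ephemeral-storage request could look like in the container spec of the job template; the 20Gi figure comes from the decision above, and the surrounding fields follow standard Kubernetes resource syntax rather than the repo’s actual template.)

        containers:
          - name: umbrella
            # ...image/command as in the existing template...
            resources:
              requests:
                ephemeral-storage: 20Gi     # scratch space for trajectory + checkpoint files
                nvidia.com/gpu: 1
              limits:
                ephemeral-storage: 20Gi
                nvidia.com/gpu: 1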

  • Bucket storage update/concerns about storage?

    • JM – No evidence that there’s an issue currently, but we should generally remove files once we’re done using them.

    • CC – Sounds good. I’ll plan to use the Gilson Lab servers for long-term storage of this, and will remove results from S3 as they finish.

  • CC – Not urgent/medium term – Right now I’m focusing on umbrella sampling simulations. In the future we’ll want to do conventional long-timescale MD (something around the size of GB3, 10 µs, 3 replicas). At that point we’ll want to break trajs into smaller chunks, and I’ll implement that. I can also provide scripts in the umbrella/ directory to dispatch those, but we won’t want to organize by windows since they’ll be single sims.

    • JM – Agree and sounds good. It may also simplify things to use env vars to configure the scripts instead of command-line args; currently the env vars are being used to set the CLI arg values anyway. The provenance will still be recorded in the YAML file, which contains the run script, git info, env var values, and more.

    • JW – That sounds good.

    • CC – So, the suggestion is to stop using click as the driver for the Python script?

    • JM – Yes, so then you stop passing all the runtime args from env vars…

    • CC – Could have a shell script that just calls the …

    • …

    • CC – To recap: Currently the yaml file we use to launch the pod takes the run arguments and passes them to the script in the pod via env vars. So if instead…

    • JM – The big benefit of rearranging is that all the things we modify by hand are in one file. The downside is that this would require CC to change how the Slurm scripts work to mirror this pattern.

    • CC –

    • (discussed design, summarized in to-do item 3 below)

    • .

To do items

  • JM will implement limited rclone-ing and not copy over the entire results folder (sketch after this list).

  • JM will implement request for 20GB ephemeral storage.

  • JM will adapt YAML template and individual umbrella window script to pass FF, Window, Replica, Target etc directly through environment variables rather than through the intermediary of CLI arguments

  • JM will ping CC once these are done and ready to dispatch protein trajs.
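
  (Sketch for to-do item 1: one hypothetical way to do limited rclone-ing is an init container that copies only this window’s files instead of the whole results folder. The rclone remote name, bucket path, include pattern, and volume are placeholders, and the rclone remote configuration, e.g. via a mounted config file or RCLONE_* environment variables, is omitted.)

    initContainers:
      - name: fetch-window-inputs
        image: rclone/rclone:latest          # official rclone image; entrypoint is rclone
        args:
          - copy
          - --include
          - "window_0/**"                    # placeholder filter: only this window's files
          - s3remote:results-bucket/target   # placeholder remote:bucket/path
          - /scratch/results
        volumeMounts:
          - name: scratch                    # e.g. an emptyDir counted against the ephemeral-storage request
            mountPath: /scratch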

Trello

https://trello.com/b/dzvFZnv4/infrastructure

Action items

Decisions
