Jeff is in charge of getting workers up; he will teach us how to do it
Getting QCFractal workers on a Kubernetes cluster
Kubernetes is like SLURM, but instead of running on a machine somewhere it runs Docker containers--each one is kind of like a clean new machine, so you can install new things and can't mess anything up
Need deployment.yaml and manager.yaml, plus an authentication file for Nautilus
Nautilus gives preemptible time on supercomputers all over, but they are very strict about banning people
These files contain an access token to modify QCArchive--keep it safe and don’t accidentally modify QCArchive
README file has lots of commands for submitting jobs, installing software, etc.
Have a Kubernetes "secret" which allows us to store the QCArchive authentication stuff used by manager.yaml
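A minimal sketch of what that secret could look like, assuming the manager config (with its QCArchive credentials) is stored as a Kubernetes Secret; the name, key, and contents here are placeholders and must match whatever deployment.yaml mounts:

```yaml
# Hypothetical sketch -- the secret name and key are placeholders
apiVersion: v1
kind: Secret
metadata:
  name: qcfractal-manager-config
type: Opaque
stringData:
  manager.yaml: |
    # ...contents of manager.yaml, including the QCArchive username/password...
```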
"Pod" is an actively running instance of an image--one image can spawn many pods
A QCFractal worker gets started in each pod--it knows the number of processors and amount of RAM, the programs installed, and the URL to talk to QCArchive
manager.yaml has the QCArchive credentials; the worker needs those to be able to do jobs
executors controls the number of processors, etc.
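A rough sketch of what a manager.yaml along these lines might look like, assuming the newer QCFractal compute-manager config format; the URI, credentials, and numbers are made-up placeholders:

```yaml
# Hypothetical manager.yaml sketch -- all values are placeholders
cluster: nautilus                              # internal name for this manager
loglevel: INFO
server:
  fractal_uri: https://qcarchive.example.org   # URL for talking to QCArchive
  username: example-manager                    # QCArchive credentials -- keep these secret
  password: replace-me
executors:
  pod:
    type: local                                # a single local executor inside the pod
    max_workers: 1
    cores_per_worker: 8                        # processors -- must match resources in deployment.yaml
    memory_per_worker: 16                      # GiB -- roughly half the pod's memory limit
    queue_tags:
      - '*'
```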
deployment.yaml is like a batch script for talking to the cluster (a sketch follows these notes)
name for internal use
replicas is how many copies (jobs) run--start small, but once you get the hang of it you can increase it; change it to match the job size
selector, template, and spec are all cluster-specific--get them by talking to the sysadmin; prioritize getting access to more computers over not having jobs killed, since QCArchive is fault-tolerant
containers points at our Docker Hub image
resources must match between this file and manager.yaml; you may need to increase CPU or memory at times--make sure to change both
volumeMounts mounts the QCArchive authentication secret from earlier
env: only allow the worker to use about half of the allocated memory, because it tends to go over
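A minimal sketch of a deployment.yaml along these lines; the names, image, resource numbers, and the memory-capping environment variable are placeholders for illustration:

```yaml
# Hypothetical deployment.yaml sketch -- names, image, and numbers are placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qcfractal-workers                           # name: for internal use
spec:
  replicas: 1                                       # start small, increase once things look healthy
  selector:
    matchLabels:
      app: qcfractal-workers
  template:
    metadata:
      labels:
        app: qcfractal-workers
    spec:                                           # selector/template/spec details are cluster-specific
      containers:
        - name: worker
          image: exampleorg/qcfractal-worker:latest # our Docker Hub image (placeholder tag)
          resources:                                # must match the cores/memory in manager.yaml
            requests:
              cpu: "8"
              memory: 32Gi
            limits:
              cpu: "8"
              memory: 32Gi
          env:
            - name: WORKER_MEMORY_GB                # placeholder variable -- cap the worker at ~half the pod memory
              value: "16"
          volumeMounts:
            - name: manager-config                  # the QCArchive authentication secret from earlier
              mountPath: /etc/qcfractal
              readOnly: true
      volumes:
        - name: manager-config
          secret:
            secretName: qcfractal-manager-config
```

Scaling the pods up or down later is then just a matter of changing replicas and re-applying the file (or using kubectl's scale command).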
LW: What is the easiest way to get into trouble?
JW: Underutilizing (i.e., requesting more than you're actually using)--they'll yell at you and at Jeff
LW: Is there a particular utilization threshold that we have to stick to?
JW: Doesn't remember a specific threshold; other people seem to have a wide variety of utilizations, so we will need to keep an eye on it
LW: With CPU/memory--can we ask for more?
JW: Usually reach out to the admins if we go over 500 CPUs; usually they just say to do whatever you want--you can ask for whatever and they put you in a prioritization queue
JW: Make sure to check in periodically--when things get resubmitted they can (e.g.) do conda updates, which could cause an issue and make the pod crash over and over; make sure that doesn't happen
Once you see that the job is working normally, you can increase the number of replicas
LW: Do workers kill themselves once they’re out of jobs, or do you have to manually delete them?
JW: No, you have to manually delete them, so once most of the jobs are done you need to keep an eye on things to make sure utilization stays high
JW: You can scale it down as well as up, but scaling down doesn't always remove the pods that aren't running anything