Participants
Discussion topics
Item | Presenter | Notes
---|---|---
XFF dataset submission | Lily Wang | |
How to get workers running | Jeffrey Wagner | See notes below. |

How to get workers running (Jeffrey Wagner):
- Jeff is in charge of getting workers up; he will teach us how to do it.
- Topic: getting QCFractal workers running on a Kubernetes cluster.
- Kubernetes is like SLURM, but it runs Docker images rather than running directly on a machine somewhere; each image is like a clean new machine, so you can install new things and can't mess anything up.
- You need deployment.yaml, manager.yaml, and an authentication file for Nautilus.
- Nautilus is preemptible time on supercomputers all over, but it is very strict about banning people.
- These files contain an access token that can modify QCArchive; keep it safe and don't accidentally modify QCArchive (see the credential sketch after this list).
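The usual way to keep a token like this out of files you commit is a Kubernetes Secret that the deployment later mounts or injects as environment variables. A minimal sketch is below; the secret name (openff-qca-secret) and key names are assumptions for illustration, not the team's actual values.

```yaml
# Illustrative Secret for the QCArchive credentials; name and keys are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: openff-qca-secret
type: Opaque
stringData:
  username: qca-manager-user           # placeholder
  password: REPLACE-WITH-ACCESS-TOKEN  # placeholder; never commit the real token
```

In practice something like this would be created once in the Nautilus namespace (e.g. with kubectl create secret) and then referenced from deployment.yaml.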
- The README file has lots of commands to submit jobs, install software, etc.
- There is a "secret" in manager.yaml which allows us to store the QCArchive/authentication details.
- A "pod" is an actively running image; one image can make many pods.
- A QCFractal worker gets started in each pod: it knows how many processors and how much RAM it has, which programs are installed, and the URL to talk to QCArchive.
- manager.yaml has the QCArchive credentials it needs to be able to do jobs; the executors section controls the number of processors, etc. (rough sketch after this list).
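A minimal sketch of what such a manager config might contain, assuming a recent QCFractal compute-manager layout; the key names, executor name, URL, and numbers here are illustrative and should be checked against the actual file and QCFractal version.

```yaml
# Illustrative manager.yaml; exact keys vary between QCFractal versions.
cluster: openff-nautilus                         # assumed manager name
loglevel: INFO
server:
  fractal_uri: https://api.qcarchive.molssi.org  # URL the worker talks to (assumed)
  username: qca-manager-user                     # placeholder; supplied via the secret in practice
  password: REPLACE-WITH-ACCESS-TOKEN            # placeholder
executors:
  local:
    type: local
    max_workers: 1
    cores_per_worker: 8                          # keep in sync with deployment.yaml resources
    memory_per_worker: 32                        # GiB; keep in sync with deployment.yaml resources
```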
- deployment.yaml is like a batch script for talking to the cluster (a generic sketch follows this list):
  - name: for internal use.
  - replicas: how many jobs to run. Start small, but once you get the hang of it you can increase it; change it to suit the job size.
  - selector, template, spec: all computer-specific, obtained by talking to the sysadmin. Prioritize getting access to more computers over not having jobs killed, since QCArchive is fault-tolerant.
  - container: our Docker Hub image.
  - resources: must match between this file and manager.yaml. You may need to increase CPU or memory at times; make sure to change both.
  - volume mounts / env: hold the QCArchive authentication secret from earlier.
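For orientation, a generic Kubernetes Deployment with the fields just described might look roughly like the following; the image name, labels, and secret name are assumptions, and the resource numbers simply mirror the manager.yaml sketch above.

```yaml
# Illustrative deployment.yaml; image, labels, and secret name are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openff-qca-qm                  # deployment name used in the kubectl commands below
spec:
  replicas: 2                          # start small; scale up once workers look healthy
  selector:
    matchLabels:
      app: openff-qca-qm
  template:
    metadata:
      labels:
        app: openff-qca-qm
    spec:
      containers:
        - name: worker
          image: docker.io/openff/qcfractal-worker:latest  # assumed Docker Hub image
          resources:                   # keep requests/limits in sync with manager.yaml
            requests:
              cpu: "8"
              memory: 32Gi
            limits:
              cpu: "8"
              memory: 32Gi
          env:
            - name: QCA_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: openff-qca-secret   # the secret sketched earlier
                  key: password
          volumeMounts:
            - name: qca-auth
              mountPath: /etc/qca      # e.g. where credentials/config are mounted
              readOnly: true
      volumes:
        - name: qca-auth
          secret:
            secretName: openff-qca-secret
```

Scaling is then just a matter of editing replicas (or using the kubectl scale command noted below) once the first workers are confirmed healthy.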
- LW: What is the easiest way to get into trouble?
- LW: With CPU/memory, can we ask for more?
- JW: Usually we reach out to the admins if we go over 500 CPUs; usually they just say to do whatever you want. You can ask for whatever you need and they put you in a prioritization queue.
- JW: Make sure to check in periodically. When things get resubmitted they can (e.g.) do conda updates, which could cause an issue and make the worker crash over and over; make sure that doesn't happen.
- Once you see that the job is working normally, you can increase the number of replicas.
- Commands not in the README:
  - kubectl get deployment openff-qca-qm (check the deployment's status)
  - kubectl logs openff-qca-qm-xxxxxx…. (view a worker pod's logs)
  - kubectl scale --replicas=y -f deployment.yaml (change the number of replicas)
- To do: obtain Nautilus credentials?
- https://docs.nationalresearchplatform.org/userdocs/start/quickstart/
- https://grafana.nrp-nautilus.io/dashboards (dashboard to monitor CPU usage)
- LW: Is there a resubmission mechanism?
- Downside of PRP (and upside of Lilac): we don't get many large-memory workers on PRP, which is problematic for large molecules.
Action items
Decisions