2024-02-13 QCSubmit discussion meeting

Participants

  • @Lily Wang

  • @Alexandra McIsaac

  • @Jeffrey Wagner

  • @Brent Westbrook (Unlicensed)

Discussion topics

Item

Presenter

Notes


XFF dataset submission

@Lily Wang

  • Already has two errors somehow for TorsionDrives, but only provides a traceback for the Optimizations--look into that?


How to get workers running

@Jeffrey Wagner

  • Recording here, requires OpenFF login

  • Jeff is in charge of getting workers up; he will teach us how to do it

  • Getting QCFractal workers on a Kubernetes cluster

  • Kubernetes is like SLURM, but instead of running jobs directly on a machine somewhere it runs Docker images--each container is like a clean new machine, so you can install new things and can’t mess anything up

  • Need deployment.yaml and manager.yaml, plus an authentication file for Nautilus

    • Nautilus provides preemptible time on supercomputers all over, but its admins are very strict about banning people

    • These files contain an access token that can modify QCArchive--keep it safe and don’t accidentally modify QCArchive
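    • For reference, a minimal sketch of how such a token might be stored as a Kubernetes secret--the secret name and layout here are assumptions for illustration, not the actual setup:

        apiVersion: v1
        kind: Secret
        metadata:
          name: qca-credentials          # hypothetical name
        type: Opaque
        stringData:
          manager.yaml: |
            # full manager config goes here (see the manager.yaml sketch below)

      The same secret can also be created straight from a file with kubectl create secret generic qca-credentials --from-file=manager.yaml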

  • The README file has lots of commands to submit, install software, etc.

    • Have a “secret” in manager.yaml which allows us to store the QCArchive authentication details

    • A “pod” is an actively running instance of an image--one image can spawn many pods

    • The QCFractal worker then gets started--it knows the number of processors/amount of RAM, which programs are installed, and the URL for talking to QCArchive

      • manager.yaml has the QCA credentials; the worker needs those to be able to do jobs

      • executors controls the number of processors, etc.
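      • A minimal sketch of what this manager.yaml might look like--the field names are recalled from the QCFractal compute-manager config and the values are placeholders, so treat it as approximate rather than the file actually in use:

          cluster: nautilus                            # a name for this compute resource
          server:
            fractal_uri: https://<qcarchive-address>   # URL for talking to QCArchive
            username: <qca-username>                   # QCA credentials, needed to claim jobs
            password: <qca-password>
          executors:
            local:
              type: local
              max_workers: 1
              cores_per_worker: 8                      # number of processors per worker
              memory_per_worker: 16                    # RAM per worker, in GiB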

    • deployment.yaml is like a batch script for talking to the cluster (a sketch follows this list)

      • name is for internal use

      • replicas is how many jobs to run--start small, but once you get the hang of it you can increase it; adjust it to the right job size

      • selector, template, and spec are all cluster-specific, obtained by talking to the sysadmin; prioritize getting access to more computers over not killing jobs, since QCA is fault-tolerant

      • container points at our Docker Hub image

        • resources--must match between this file and manager.yaml; you may need to increase CPU or memory at times, so make sure to change both

        • volumeMounts has the QCA authentication secret from earlier

        • env:

          • only allow the worker to use half of the allocated memory, because jobs tend to go over
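    • Putting the pieces above together, a minimal sketch of such a deployment.yaml--the names, labels, and resource numbers are illustrative assumptions, not the real values:

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: openff-qca-qm                  # name, for internal use
        spec:
          replicas: 2                          # how many workers; start small
          selector:                            # selector/template/spec are cluster-specific
            matchLabels:
              app: openff-qca-qm
          template:
            metadata:
              labels:
                app: openff-qca-qm
            spec:
              containers:
                - name: qcfractal-worker
                  image: <our-dockerhub-image>   # container from our Docker Hub
                  resources:                     # must match manager.yaml
                    requests:
                      cpu: "8"
                      memory: 32Gi
                    limits:
                      cpu: "8"
                      memory: 32Gi
                  # env entries would go here; per the notes, cap the worker at
                  # half the allocated memory because jobs tend to go over
                  volumeMounts:
                    - name: qca-credentials      # the QCA secret from earlier
                      mountPath: /etc/qcfractal  # hypothetical mount path
                      readOnly: true
              volumes:
                - name: qca-credentials
                  secret:
                    secretName: qca-credentials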

  • LW: what is the easiest way to get into trouble?

    • JW: underutilizing (e.g. requesting more than you’re actually using)--they’ll yell at you and at Jeff

    • LW: Is there a particular utilization threshold that we have to stick to?

      • JW: doesn’t remember a specific threshold; other people seem to have a wide variety of utilizations, so we will need to keep an eye on it

  • LW: with CPU/memory--can we ask for more?

    • JW: we usually reach out to the admins if we go over 500 CPUs; they typically just say to do whatever you want--you can ask for whatever and they put you in a prioritization queue

    • JW: make sure to check in periodically--when things get resubmitted they can (e.g.) pull conda updates, which could cause them to crash over and over; make sure that doesn’t happen

  • Once you see that the job is working normally, you can increase the number of replicas

    • LW: Do workers kill themselves once they’re out of jobs, or do you have to manually delete them?

      • JW: No, you have to delete them manually, so once most jobs are done you need to keep an eye on things to make sure utilization stays high

      • JW: You can scale it down as well as up, but scaling down doesn’t always remove the workers that aren’t running jobs

  • Commands not in the README:

    • kubectl get deployment openff-qca-qm (check the deployment’s status)

    • kubectl logs openff-qca-qm-xxxxxx…. (view a pod’s logs)

    • kubectl scale --replicas=y -f deployment.yaml (set the number of replicas to y)
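    • A few related commands that may help when watching or winding down workers--the label and deployment name here match the sketch above and are assumptions:

        kubectl get pods -l app=openff-qca-qm                # list this deployment’s pods
        kubectl top pods                                     # CPU/memory use, via the cluster’s metrics server
        kubectl scale deployment openff-qca-qm --replicas=0  # scale to zero once the queue drains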

  • To do: obtain Nautilus credentials? https://docs.nationalresearchplatform.org/userdocs/start/quickstart/

    • Next meeting

  • Dashboard to monitor CPU usage: https://grafana.nrp-nautilus.io/dashboards

  • LW: is there a resubmission mechanism?

    • JW: the repo’s error cycling resubmits jobs that errored out, changing them back to waiting

  • A downside of PRP compared to Lilac is that we don’t get many large-memory workers, which is problematic for large molecules

    • Admins don’t like it when we request a large amount of memory for some molecules but don’t use it for the occasional normal-sized molecule

    • Can use the high-memory tag on GitHub if we need to

Action items

Decisions