2024-02-13 QCSubmit discussion meeting

Participants

  • @Lily Wang

  • @Alexandra McIsaac

  • @Jeffrey Wagner

  • @Brent Westbrook (Unlicensed)

Discussion topics

Item

Presenter

Notes


XFF dataset submission

@Lily Wang

  • Already has two errors somehow for TorsionDrives, but only provides a traceback for the Optimizations--look into that?


How to get workers running

@Jeffrey Wagner

  • Recording here, requires OpenFF login

  • Jeff is in charge of getting workers up; he will teach us how to do it

  • Getting QCFractal workers on a Kubernetes cluster

  • Kubernetes is like SLURM, but instead of running jobs directly on a machine somewhere it runs Docker images--each container is like a clean new machine, so you can install new things and can’t mess anything up

  • Need deployment.yaml and manager.yaml, plus an authentication file for Nautilus

    • Nautilus provides preemptible time on supercomputers all over, but its admins are very strict about banning people

    • These files contain an access token that can modify QCArchive--keep it safe and don’t accidentally modify QCArchive
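    • For reference, a minimal sketch of how such a token might be stored as a Kubernetes secret--the secret name and layout here are assumptions for illustration, not the actual setup:

        apiVersion: v1
        kind: Secret
        metadata:
          name: qca-credentials          # hypothetical name
        type: Opaque
        stringData:
          manager.yaml: |
            # full manager config goes here (see the manager.yaml sketch below)

      The same secret can also be created straight from a file with kubectl create secret generic qca-credentials --from-file=manager.yaml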

  • The README file has lots of commands to submit, install software, etc.

    • Have a “secret” in manager.yaml which allows us to store the QCArchive authentication details

    • A “pod” is an actively running instance of an image--one image can spawn many pods

    • The QCFractal worker then gets started--it knows the number of processors/amount of RAM, which programs are installed, and the URL for talking to QCArchive

      • manager.yaml has the QCA credentials; the worker needs those to be able to do jobs

      • executors controls the number of processors, etc.
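      • A minimal sketch of what this manager.yaml might look like--the field names are recalled from the QCFractal compute-manager config and the values are placeholders, so treat it as approximate rather than the file actually in use:

          cluster: nautilus                            # a name for this compute resource
          server:
            fractal_uri: https://<qcarchive-address>   # URL for talking to QCArchive
            username: <qca-username>                   # QCA credentials, needed to claim jobs
            password: <qca-password>
          executors:
            local:
              type: local
              max_workers: 1
              cores_per_worker: 8                      # number of processors per worker
              memory_per_worker: 16                    # RAM per worker, in GiB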

    • deployment.yaml is like a batch script for talking to the cluster (a sketch follows this list)

      • name is for internal use

      • replicas is how many jobs to run--start small, but once you get the hang of it you can increase it; adjust it to the right job size

      • selector, template, and spec are all cluster-specific, obtained by talking to the sysadmin; prioritize getting access to more computers over not killing jobs, since QCA is fault-tolerant

      • container points at our Docker Hub image

        • resources--must match between this file and manager.yaml; you may need to increase CPU or memory at times, so make sure to change both

        • volumeMounts has the QCA authentication secret from earlier

        • env:

          • only allow the worker to use half of the allocated memory, because jobs tend to go over
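    • Putting the pieces above together, a minimal sketch of such a deployment.yaml--the names, labels, and resource numbers are illustrative assumptions, not the real values:

        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: openff-qca-qm                  # name, for internal use
        spec:
          replicas: 2                          # how many workers; start small
          selector:                            # selector/template/spec are cluster-specific
            matchLabels:
              app: openff-qca-qm
          template:
            metadata:
              labels:
                app: openff-qca-qm
            spec:
              containers:
                - name: qcfractal-worker
                  image: <our-dockerhub-image>   # container from our Docker Hub
                  resources:                     # must match manager.yaml
                    requests:
                      cpu: "8"
                      memory: 32Gi
                    limits:
                      cpu: "8"
                      memory: 32Gi
                  # env entries would go here; per the notes, cap the worker at
                  # half the allocated memory because jobs tend to go over
                  volumeMounts:
                    - name: qca-credentials      # the QCA secret from earlier
                      mountPath: /etc/qcfractal  # hypothetical mount path
                      readOnly: true
              volumes:
                - name: qca-credentials
                  secret:
                    secretName: qca-credentials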

  • LW: what is the easiest way to get into trouble?

    • JW: underutilizing (e.g. requesting more than you’re actually using)--they’ll yell at you and at Jeff

    • LW: Is there a particular utilization threshold that we have to stick to?

      • JW: doesn’t remember a specific threshold; other people seem to have a wide variety of utilizations, so we will need to keep an eye on it

  • LW: with CPU/memory--can we ask for more?

    • JW: we usually reach out to the admins if we go over 500 CPUs; they typically just say to do whatever you want--you can ask for whatever and they put you in a prioritization queue

    • JW: make sure to check in periodically--when things get resubmitted they can (e.g.) pull conda updates, which could cause them to crash over and over; make sure that doesn’t happen

  • Once you see that the job is working normally, you can increase the number of replicas

    • LW: Do workers kill themselves once they’re out of jobs, or do you have to manually delete them?

      • JW: No, you have to delete them manually, so once most jobs are done you need to keep an eye on things to make sure utilization stays high

      • JW: You can scale it down as well as up, but scaling down doesn’t always remove the workers that aren’t running jobs

  • Commands not in the README:

    • kubectl get deployment openff-qca-qm (check the deployment’s status)

    • kubectl logs openff-qca-qm-xxxxxx…. (view a pod’s logs)

    • kubectl scale --replicas=y -f deployment.yaml (set the number of replicas to y)
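    • A few related commands that may help when watching or winding down workers--the label and deployment name here match the sketch above and are assumptions:

        kubectl get pods -l app=openff-qca-qm                # list this deployment’s pods
        kubectl top pods                                     # CPU/memory use, via the cluster’s metrics server
        kubectl scale deployment openff-qca-qm --replicas=0  # scale to zero once the queue drains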

  • To do: obtain Nautilus credentials? https://docs.nationalresearchplatform.org/userdocs/start/quickstart/

    • Next meeting

  • Dashboard to monitor CPU usage: https://grafana.nrp-nautilus.io/dashboards

  • LW: is there a resubmission mechanism?

    • JW: the repo’s error cycling resubmits jobs that errored out, changing them back to waiting

  • A downside of PRP compared to Lilac is that we don’t get many large-memory workers, which is problematic for large molecules

    • Admins don’t like it when we request a large amount of memory for some molecules but don’t use it for the occasional normal-sized molecule

    • Can use the high-memory tag on GitHub if we need to

Action items

Decisions