Evaluator on Kubernetes
Driver | |
---|---|
Approver | |
Contributors | |
Other stakeholders | |
Objective | |
Time frame | |
Key outcomes | |
Key metrics | |
Status | Not started / In progress / Stalled / Complete |
GitHub repo | |
Slack channel | |
Designated meeting | |
Released force field | |
Publication | |
Problem Statement and Objective
Get Evaluator running on Kubernetes smoothly, such that we can practically use it for a vdW fit.
Scope
Must have: | |
---|---|
Nice to have: | |
Not in scope: | |
Overview
Currently, an Evaluator fit is started on a local laptop with ForceBalance, which under the hood uses an EvaluatorClient to communicate with an EvaluatorServer. On an HPC the EvaluatorServer can also be local, but it appears intended to be run remotely. What is most important is that the EvaluatorServer has access to the same filesystem as the Dask workers. The Dask cluster is set up via the `calculation_backend`.
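For reference, the standard local pattern looks roughly like this (a minimal sketch following the usual Evaluator usage; the port and worker count are arbitrary):

```python
from openff.evaluator.backends.dask import DaskLocalCluster
from openff.evaluator.client import ConnectionOptions, EvaluatorClient
from openff.evaluator.server import EvaluatorServer

# The calculation backend owns the Dask cluster; the server schedules
# Evaluator protocols onto it.
calculation_backend = DaskLocalCluster(number_of_workers=1)
calculation_backend.start()

server = EvaluatorServer(calculation_backend=calculation_backend, port=8000)
server.start(asynchronous=True)

# ForceBalance drives an EvaluatorClient like this under the hood.
client = EvaluatorClient(ConnectionOptions(server_address="localhost", server_port=8000))
# request, error = client.request_estimate(data_set, force_field_source, options)
```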
To deploy on NRP:
As mentioned, the EvaluatorServer must have access to the same filesystem as the workers, which means it must also run remotely. It is a separate Kubernetes object from the DaskCluster; ideally it is a Deployment. I’m not sure what the consequences are if the connection between the EvaluatorServer and the local client is broken, but with progress saved both on the shared filesystem and locally there should be ways to keep going.
We don’t have permission to create a Dask cluster from within an existing Kubernetes pod, so the EvaluatorServer running in a remote deployment cannot set up the cluster itself. That means we have to create two Evaluator backends (see the sketch after this list):
The “real” one, which deploys the workers and scheduler on NRP and sets up adaptive scaling (e.g. https://github.com/lilyminium/openff-evaluator/blob/368341c3c465e5269508906e9c3ef8623d7fa9ae/openff/evaluator/backends/dask_kubernetes.py#L218)
The communication one for the EvaluatorServer, which simply connects the server to the existing DaskCluster on NRP through the scheduler port (e.g. https://github.com/lilyminium/openff-evaluator/blob/368341c3c465e5269508906e9c3ef8623d7fa9ae/openff/evaluator/backends/dask_kubernetes.py#L359)
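A rough sketch of the distinction in plain dask_kubernetes / distributed terms (not the actual classes in the branch above; the cluster name, namespace, and scheduler service address are placeholders):

```python
from dask_kubernetes.operator import KubeCluster
from distributed import Client

# (1) The "real" backend runs where we *do* have permission to create
# DaskCluster resources: it creates the scheduler and worker pods on NRP
# and can enable adaptive scaling.
cluster = KubeCluster(name="evaluator", namespace="openforcefield")  # placeholder names
cluster.adapt(minimum=0, maximum=10)

# (2) The "communication" backend runs inside the EvaluatorServer deployment,
# which cannot create pods; it only connects to the already-running scheduler
# through its service and port.
client = Client("tcp://evaluator-scheduler:8786")  # placeholder service address
```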
Step-by-step the process is:
Create a PVC on NRP to serve as the filesystem
Create and start a DaskCluster on NRP, with the PVC mounted
Create a deployment with an EvaluatorServer that connects to the DaskCluster scheduler
Start the EvaluatorServer
Port-forward the EvaluatorServer port so ForceBalance can connect to it via localhost (see the sketch after this list)
Run ForceBalance
(optional) port-forward the dashboard to monitor jobs
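Once the server port is forwarded, ForceBalance (or a bare client, for testing) talks to the EvaluatorServer as if it were local. A hedged sketch, assuming the server listens on port 8000 and the deployment/service names below (all placeholders):

```python
# kubectl port-forward deployment/evaluator-server 8000:8000      # Evaluator server
# kubectl port-forward service/evaluator-scheduler 8787:8787      # optional: Dask dashboard
from openff.evaluator.client import ConnectionOptions, EvaluatorClient

# ForceBalance's Evaluator target makes this connection internally;
# the same thing can be tested directly:
client = EvaluatorClient(ConnectionOptions(server_address="localhost", server_port=8000))
# request, error = client.request_estimate(data_set, force_field_source, options)
```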
Stopping:
Stop the EvaluatorServer and DaskCluster (order doesn’t matter)
Delete the PVC (see the teardown sketch below)
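A minimal teardown sketch, assuming the objects from the sketches above (the PVC name is a placeholder):

```python
server.stop()    # stop the EvaluatorServer
cluster.close()  # remove the DaskCluster (scheduler + worker pods)
# kubectl delete pvc evaluator-storage   # placeholder PVC name
```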
Resources
The current way I’ve been partitioning GPU/CPU jobs is with Dask resources (https://distributed.dask.org/en/latest/resources.html). I have been clumsily hardcoding `--resources GPU=1,notGPU=0` and `--resources GPU=0,notGPU=1` onto my GPU and CPU workers respectively, and specifying resources for individual tasks: https://github.com/lilyminium/openff-evaluator/blob/368341c3c465e5269508906e9c3ef8623d7fa9ae/openff/evaluator/backends/dask_kubernetes.py#L186-L203. Another way to do it, I believe, is just by specifying which workers are allowed to act on the task.
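For concreteness, a minimal sketch of the resources mechanism with plain distributed (the scheduler address and task function are placeholders):

```python
# Workers advertise abstract resources at start-up, e.g.:
#   dask worker tcp://scheduler:8786 --resources "GPU=1,notGPU=0"   # GPU worker
#   dask worker tcp://scheduler:8786 --resources "GPU=0,notGPU=1"   # CPU worker
from distributed import Client

client = Client("tcp://evaluator-scheduler:8786")  # placeholder address

def run_protocol(protocol_id):  # placeholder task
    ...

# A task is only scheduled on workers whose resources satisfy the constraint.
future = client.submit(run_protocol, "protocol-0", resources={"GPU": 1})
```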
Adaptability
Currently I’ve been creating one DaskCluster with the “default” GPU worker group and an additional “cpu” worker group. However, adaptive scaling can only apply to the default group. If we wanted adaptive scaling for the CPU worker group too, it may be cleaner and more elegant to have a separate cluster for each worker type. This may not be possible with how the tasks are linked, however. There is a short discussion of the merits at https://github.com/dask/dask-gateway/issues/285.
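For reference, the current single-cluster layout corresponds roughly to this (a sketch; names and sizes are placeholders), which is why only the default (GPU) group scales adaptively:

```python
from dask_kubernetes.operator import KubeCluster

cluster = KubeCluster(name="evaluator", namespace="openforcefield")  # placeholder names
cluster.add_worker_group(name="cpu", n_workers=4)  # fixed-size CPU group
cluster.adapt(minimum=0, maximum=10)  # only scales the default (GPU) worker group
```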
On GPU utilization (and adaptability)
On test runs with Evaluator, GPUs sit idle for some time and then bounce up and down in utilisation
(Figure: visualisation of full idleness around 9.45, and bouncing during regular execution.)
The bouncing should reduce on a proper run with longer simulation time
However, the initial idleness is frustrating and worrisome… to an extent. On test runs it’s ~15 min; I anticipate it going up to ~30 min before the first GPU jobs start to run.
From a preliminary look, I believe this is because all tasks get submitted at once (could be 100% wrong of course)
Evaluator creates a graph of protocol dependencies and submits them, in order, all at once
The presence of GPU-requiring tasks tells the DaskCluster to scale up the number of workers, without realising that it has to wait for the CPU tasks to finish executing first. This is why adaptive scaling appears not to be working.
I briefly looked into delaying submission of the later jobs, but it seemed too hard.
Project Approaches
References
I ran a proof of concept with `python run-job.py`:
It copies across `server-existing.py` for the EvaluatorServer
The spec of the DaskKubernetes cluster is almost fully documented in `cluster-spec.yaml` – the CPU workers are not present
Based off this branch: https://github.com/lilyminium/openff-evaluator/blob/add-kubernetes-backend (although, as seen in the script, I gave up on updating it and iterated on `DaskKubernetesBackendMod` in the script instead)