
Driver

Approver

Contributors

Other stakeholders

Objective

Time frame

Key outcomes

Key metrics

Status

NOT STARTED / IN PROGRESS / STALLED / COMPLETE

GitHub repo

Slack channel

Designated meeting

Released force field

Publication

🤔 Problem Statement and Objective

Get Evaluator running smoothly on Kubernetes, so that we can practically use it for a vdW fit.

🎯 Scope

Must have:

  • GPU utilisation > 40% (NRP requirement)

  • Works for fitting to densities and enthalpies of mixing on NRP

  • Adaptable GPU resources

  • Auto-shutdown on completion

  • A mechanism for routing GPU-required jobs to GPU workers and CPU-only jobs to CPU workers

Nice to have:

  • GPU utilisation near 100%

  • Generalisable code – less hardcoding than currently present in test runs

  • Adaptable CPU resources

Not in scope:

⚙️ Overview

[Image: Screenshot 2024-12-03 at 1.10.54 pm.png]

Currently, an Evaluator fit is started on a local laptop with ForceBalance, which under the hood uses an EvaluatorClient to communicate with an EvaluatorServer. On an HPC the EvaluatorServer can also be local, although it appears intended to be run remotely. What matters most is that the EvaluatorServer has access to the same filesystem as the Dask workers. The Dask cluster itself is set up via the calculation_backend.
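
For orientation, here is a minimal sketch of that layering using the documented local-cluster backend (the Kubernetes backend follows the same pattern); the port and per-worker resources are illustrative assumptions:

```python
from openff.evaluator.backends import ComputeResources
from openff.evaluator.backends.dask import DaskLocalCluster
from openff.evaluator.server import EvaluatorServer

# The calculation backend owns the Dask cluster that actually runs the tasks.
calculation_backend = DaskLocalCluster(
    number_of_workers=1,
    resources_per_worker=ComputeResources(
        number_of_threads=1,
        number_of_gpus=1,
        preferred_gpu_toolkit=ComputeResources.GPUToolkit.CUDA,
    ),
)
calculation_backend.start()

# The server receives requests from the EvaluatorClient (e.g. inside ForceBalance)
# and must share a filesystem with the Dask workers.
evaluator_server = EvaluatorServer(calculation_backend=calculation_backend, port=8000)
evaluator_server.start(asynchronous=False)
```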

[Image: Screenshot 2024-12-03 at 1.12.44 pm.png]

To deploy on NRP, the step-by-step process is as follows (a rough code sketch of the cluster and server setup follows the list):

  • Create a PVC on NRP to serve as the filesystem

  • Create and start a DaskCluster on NRP, with the PVC mounted

  • Create a deployment with an EvaluatorServer that connects to the DaskCluster scheduler

    • Start the EvaluatorServer

  • Port-forward the EvaluatorServer port so ForceBalance can connect to it via localhost

  • Run ForceBalance

  • (optional) port-forward the dashboard to monitor jobs

  • Stopping:

    • Stop the EvaluatorServer and DaskCluster (order doesn’t matter)

    • Delete the PVC
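
As a rough, non-authoritative sketch of the middle steps (creating the DaskCluster, pointing the EvaluatorServer at its scheduler, and port-forwarding), using the dask-kubernetes operator's Python API; the cluster name, namespace, image, and ports are placeholder assumptions, and in practice the full spec from `cluster-spec.yaml` (PVC mount, GPU/CPU worker groups) would be supplied instead of the simple arguments shown:

```python
from dask_kubernetes.operator import KubeCluster, make_cluster_spec

# Placeholder spec; the real one lives in cluster-spec.yaml and also mounts the PVC.
spec = make_cluster_spec(
    name="evaluator-cluster",
    image="ghcr.io/example/evaluator-worker:latest",  # hypothetical image
    n_workers=2,
    resources={"limits": {"nvidia.com/gpu": "1"}},
)

# Create and start the DaskCluster on NRP.
cluster = KubeCluster(custom_cluster_spec=spec, namespace="openforcefield")

# The EvaluatorServer deployment connects its Dask backend to this address
# (from inside the cluster it resolves to the scheduler service).
print(cluster.scheduler_address)

# Outside the cluster, ForceBalance talks to the server over a port-forward, e.g.:
#   kubectl port-forward deployment/evaluator-server 8000:8000
```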

Resources

The current way I’ve been partitioning GPU/CPU jobs is with Dask resources (https://distributed.dask.org/en/latest/resources.html). I have been clumsily hardcoding `--resources GPU=1,notGPU=0` and `--resources GPU=0,notGPU=1` onto my GPU and CPU workers respectively, and specifying resources for individual tasks: https://github.com/lilyminium/openff-evaluator/blob/368341c3c465e5269508906e9c3ef8623d7fa9ae/openff/evaluator/backends/dask_kubernetes.py#L186-L203. Another way to do it, I believe, is to restrict each task to the specific workers that are allowed to run it.
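
For reference, a minimal dask.distributed illustration of both options (resource constraints on the task, versus pinning the task to named workers); the scheduler address, worker name, and `run_gpu_task` function are hypothetical:

```python
from dask.distributed import Client

client = Client("tcp://localhost:8786")  # placeholder scheduler address

def run_gpu_task(x):
    # Stand-in for a GPU-requiring Evaluator protocol.
    return x

# Option 1: only workers started with `--resources GPU=1` may run this task.
future = client.submit(run_gpu_task, 1, resources={"GPU": 1})

# Option 2: restrict the task to explicitly named workers instead.
future = client.submit(
    run_gpu_task, 1, workers=["gpu-worker-0"], allow_other_workers=False
)

print(future.result())
```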

Adaptability

Currently I’ve been creating one DaskCluster with the “default” GPU worker group and an additional “cpu” worker group. However, adaptive scaling can only be applied to the default group. If we also want adaptive scaling for the CPU worker group, a cleaner and more elegant option may be to run a separate cluster for each worker type, although this may not be possible with how the tasks are linked. Here is a short discussion of the merits: https://github.com/dask/dask-gateway/issues/285
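
For context, adaptive scaling on a single cluster looks roughly like this (the cluster name and bounds are arbitrary assumptions); note that adapt() here only covers the default worker group, which is the limitation described above:

```python
from dask_kubernetes.operator import KubeCluster

# Connect to the already-created cluster and let the operator scale the
# default (GPU) worker group between 0 and 10 workers based on load.
# The additional "cpu" worker group is not affected by this.
cluster = KubeCluster.from_name("evaluator-cluster")
cluster.adapt(minimum=0, maximum=10)
```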

⚙️ Project Approaches

📖 References

I ran a proof of concept with `python run-job.py`:

  • It copies across `server-existing.py` for the EvaluatorServer

  • The spec of the DaskKubernetes cluster is almost fully documented in `cluster-spec.yaml`; the CPU workers are not present.

  • No labels