
So far I’ve been creating a single DaskCluster with the “default” GPU worker group plus an additional “cpu” worker group. However, adaptive scaling only applies to the default group. If we want adaptive scaling for the CPU worker group too, a cleaner option may be a separate cluster for each worker type, although that may not be possible given how the tasks are linked. There is a short discussion of the merits here: https://github.com/dask/dask-gateway/issues/285
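For reference, a minimal sketch of the one-cluster, two-group layout described above, written against the `KubeCluster` class of the dask-kubernetes operator. The cluster name, resource values and worker counts are placeholders I made up, not the values in the actual scripts, and the function is never called here because it needs a live Kubernetes cluster.

```python
# Placeholder resource specs (assumptions, not the real cluster-spec values).
GPU_RESOURCES = {"limits": {"nvidia.com/gpu": "1"}}
CPU_RESOURCES = {"requests": {"cpu": "1", "memory": "4Gi"}}


def make_cluster(name="evaluator", max_gpu_workers=4, n_cpu_workers=2):
    """Create one cluster with a default GPU group and a fixed-size "cpu" group."""
    # Imported lazily so this file can be read or imported without
    # dask-kubernetes installed or a cluster available.
    from dask_kubernetes.operator import KubeCluster

    # The default worker group gets the GPU resource spec.
    cluster = KubeCluster(name=name, n_workers=1, resources=GPU_RESOURCES)

    # Additional worker group for CPU-only tasks.
    cluster.add_worker_group(
        name="cpu", n_workers=n_cpu_workers, resources=CPU_RESOURCES
    )

    # Adaptive scaling acts on the default (GPU) group only, so the
    # "cpu" group stays at n_cpu_workers unless it is scaled by hand,
    # e.g. cluster.scale(4, worker_group="cpu").
    cluster.adapt(minimum=0, maximum=max_gpu_workers)
    return cluster
```

With separate clusters per worker type, each cluster could call `adapt` on its own default group, at the cost of tasks on one cluster not being able to depend directly on tasks on the other.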

On GPU utilization (and adaptability)

[Image: Screenshot 2024-12-03 at 1.50.17 pm.png]

  • On test runs with Evaluator, GPUs sit idle for some time and then bounce up and down in utilisation

    • The screenshot shows full idleness around 9.45, and bouncing during regular execution

    • The bouncing should reduce on a proper run with longer simulation time

  • However, the initial idleness is frustrating and, to an extent, worrisome. On test runs it’s ~15 min, and I anticipate it going up to 30 min before the first GPU jobs start to run.

    • From a preliminary look, I believe this is because all tasks get submitted at once (could be 100% wrong of course)

    • Evaluator creates a graph of protocol dependencies and submits them, in order, all at once

    • The presence of tasks that require a GPU tells the DaskCluster to scale up the number of workers, without realising that those tasks have to wait for the CPU tasks to finish executing first. This is why adaptive scaling appears not to be working

  • I briefly looked into delaying the submission of later jobs, but it seemed too hard
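To make the suspected cause concrete, here is a toy model in plain Python (no Dask involved). It assumes a cluster that requests GPU workers as soon as any GPU task is queued, and computes how long those workers would sit idle before the first GPU task's dependencies finish. The task names and durations are invented for illustration and are not taken from a real Evaluator graph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    needs_gpu: bool
    duration_min: int
    depends_on: tuple = ()


def finish_times(tasks):
    """Earliest finish time (minutes) of each task, assuming enough workers."""
    by_name = {t.name: t for t in tasks}
    done = {}

    def finish(name):
        if name not in done:
            task = by_name[name]
            start = max((finish(d) for d in task.depends_on), default=0)
            done[name] = start + task.duration_min
        return done[name]

    for t in tasks:
        finish(t.name)
    return done


def first_gpu_start(tasks):
    """Earliest time at which any GPU task has all of its dependencies done."""
    done = finish_times(tasks)
    return min(
        max((done[d] for d in t.depends_on), default=0)
        for t in tasks
        if t.needs_gpu
    )


# Hypothetical three-step graph, submitted all at once at t=0.
graph = [
    Task("build_coordinates", needs_gpu=False, duration_min=5),
    Task("assign_parameters", False, 10, ("build_coordinates",)),
    Task("run_simulation", True, 60, ("assign_parameters",)),
]

idle = first_gpu_start(graph)
print(f"GPU workers requested at t=0 would idle for {idle} min")
```

In this model the idle time equals the length of the CPU-only prefix of the graph, so it would only shrink if GPU workers were requested once GPU tasks become ready rather than when they are submitted.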

⚙️ Project Approaches

Child pages (Children Display)

...

  • Attached files: cluster-spec.yaml, run-job.py, server-existing.py

  • It copies across server-existing.py for the EvaluatorServer

  • The spec of the DaskKubernetes cluster is almost fully documented in `cluster-spec.yaml`; only the CPU workers are missing from it.

  • Based off this branch: https://github.com/lilyminium/openff-evaluator/blob/add-kubernetes-backend

    • (although as seen in the script I gave up on updating it and iterated on DaskKubernetesBackendMod in the script instead)