...
On GPU utilization (and adaptability)
On test runs with Evaluator, GPUs sit idle for a while and then bounce up and down in utilization
(Figure: GPUs fully idle until ~9:45, then bouncing utilization during regular execution)
The bouncing should lessen on a proper run with longer simulation times
However, the initial idleness is frustrating and somewhat worrisome: on test runs it's ~15 min, and I anticipate it growing to ~30 min on a full run before the first GPU jobs start
From a preliminary look, I believe this is because all tasks get submitted at once (could be 100% wrong, of course)
Evaluator builds a graph of protocol dependencies and submits all of them at once, in dependency order
The presence of GPU-requiring tasks in the queue tells the DaskCluster to scale up its GPU workers immediately, without accounting for the fact that those tasks must wait for the CPU tasks to finish first. Hence adaptive scaling appears not to work
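A toy model of the suspected behavior (purely illustrative; none of these names are Evaluator's or Dask's real API): if the adaptive target simply counts unfinished GPU tasks, it asks for GPU workers the moment the graph is submitted, even though no GPU task is actually runnable yet.

```python
# Toy sketch of the suspected eager-scaling behavior. All names here
# (Task, desired_gpu_workers, runnable, the stage names) are made up
# for illustration, not Evaluator or Dask internals.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    needs_gpu: bool
    deps: list = field(default_factory=list)
    done: bool = False

def desired_gpu_workers(tasks):
    """Naive adaptive target: one GPU worker per unfinished GPU task,
    regardless of whether its dependencies have completed."""
    return sum(1 for t in tasks if t.needs_gpu and not t.done)

def runnable(tasks):
    """Tasks whose dependencies have all completed."""
    return [t for t in tasks if not t.done and all(d.done for d in t.deps)]

cpu = Task("equilibration_setup", needs_gpu=False)
gpu = Task("production_md", needs_gpu=True, deps=[cpu])
tasks = [cpu, gpu]

# At submission time only the CPU task can run...
print([t.name for t in runnable(tasks)])  # ['equilibration_setup']
# ...but the naive target already requests a GPU worker -> idle GPU
print(desired_gpu_workers(tasks))         # 1
```

A dependency-aware target would instead count only GPU tasks in `runnable(tasks)`, which stays at zero until the CPU stage finishes.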
Briefly looked into deferring submission of downstream jobs until their dependencies complete, but it seemed too hard
...