Driver

The main driver(s) or executor(s) of the project

Approver

The person who gives final approval on the project

Contributors

People who contribute work or discussion to a project, e.g. would be credited on any released product or manuscript.

Other stakeholders

People who should be kept informed of project updates, e.g. should be invited to relevant meetings.

Objective

Summarise the objective in 1-2 sentences, e.g. "a force field with a set of virtual sites", or "a lipid force field"

Time frame

Expected time frame for the project to be finished, e.g. Q1 2025. We expect to revisit this as the project progresses, but aim for a realistic estimate that allows for iterative trials.

Key outcomes

The key outcomes or deliverables from this project. Be specific about the features and attributes that describe a successful outcome. Use dot points to be concise. e.g., a force field including:

Key metrics

The complete suite of benchmarks you will use to measure success, e.g. improved or equivalent performance on valence benchmarks against the Industry Benchmark Set, improved performance on a curated set of solvation free energies from FreeSolv and MNSol, ...

Status

GitHub repo

A link to a GitHub repo containing work on the project

Slack channel

The go-to Slack channel for discussion about this project

Designated meeting

The go-to meeting for discussion and updates about this project

Released force field

The first released force field this work appears in, or N/A if the project is ended due to poor results.

Publication

The publication on the project, if any.

Problem Statement and Objective

Get Evaluator running smoothly on Kubernetes, so that we can practically use it for a vdW fit.

Scope

Must have:

  • GPU utilisation > 40% (NRP requirement)

  • Works for fitting to densities and enthalpies of mixing on NRP

  • Adaptable GPU resources

  • Auto-shutdown on completion

  • Some way of shunting GPU-required jobs to GPU workers and CPU jobs to CPU workers

Nice to have:

  • GPU utilisation near 100%

  • Generalisable code – less hardcoding than currently present in test runs

  • Adaptable CPU resources

Not in scope:

  • Objectives and features we decide we will *not* consider in this project, e.g. because they will be targeted in a future phase of the project.

Overview

Screenshot 2024-12-03 at 1.10.54 pm.png

Currently, an Evaluator fit is started on a local laptop with ForceBalance, which under the hood uses an EvaluatorClient to communicate with an EvaluatorServer. On an HPC the EvaluatorServer can be local too, but it appears to be designed to run remotely. Most importantly, the EvaluatorServer must have access to the same filesystem as the Dask workers. The Dask cluster is set up via the calculation_backend.

Screenshot 2024-12-03 at 1.12.44 pm.png

To deploy on NRP:

Step-by-step the process is:

Resources

The current way I’ve been partitioning GPU/CPU jobs is with Dask worker resources (https://distributed.dask.org/en/latest/resources.html). I have been clumsily hardcoding --resources GPU=1,notGPU=0 onto my GPU workers and --resources GPU=0,notGPU=1 onto my CPU workers, and specifying resources for individual tasks: https://github.com/lilyminium/openff-evaluator/blob/368341c3c465e5269508906e9c3ef8623d7fa9ae/openff/evaluator/backends/dask_kubernetes.py#L186-L203 . Another way to do it, I believe, is to specify directly which workers are allowed to run each task.
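As a rough illustration of the two routing options above, here is a minimal sketch. The helper names (worker_resources_flag, task_resources, submit_with_resources, submit_to_workers) are hypothetical and are not part of Evaluator or Dask. The only Dask behaviour it relies on is that dask.distributed's Client.submit accepts the resources= and workers= keywords; the client is passed in so the sketch does not assume how the cluster was created.

```python
from typing import Any, Callable, Dict, List

# Resource labels copied from the --resources flags described above.
GPU_WORKER_RESOURCES = {"GPU": 1, "notGPU": 0}
CPU_WORKER_RESOURCES = {"GPU": 0, "notGPU": 1}


def worker_resources_flag(resources: Dict[str, int]) -> str:
    """Format a mapping as the value given to the worker's --resources option."""
    return ",".join(f"{name}={amount}" for name, amount in resources.items())


def task_resources(needs_gpu: bool) -> Dict[str, int]:
    """Resources one task must claim so it only fits on the matching worker type."""
    return {"GPU": 1} if needs_gpu else {"notGPU": 1}


def submit_with_resources(
    client: Any, func: Callable, *args: Any, needs_gpu: bool = False
) -> Any:
    """Option 1: let the scheduler route the task by abstract resources."""
    return client.submit(func, *args, resources=task_resources(needs_gpu))


def submit_to_workers(
    client: Any, func: Callable, *args: Any, worker_addresses: List[str]
) -> Any:
    """Option 2: restrict the task to an explicit list of worker addresses."""
    return client.submit(func, *args, workers=worker_addresses)
```

With a real cluster, client would be a dask.distributed.Client connected to the scheduler. Option 2 needs the GPU and CPU worker addresses to be known up front, which may be awkward when workers come and go under adaptive scaling.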

Adaptability

Currently I’ve been creating one DaskCluster with the “default” GPU worker group and an additional “cpu” worker group. However, adaptive scaling can only apply to the default group. If we wanted adaptive scaling for the CPU worker group too, a cleaner and more elegant option may be to have a separate cluster for each worker type. This may not be possible with how the tasks are linked, however. Here is a short discussion on the merits: https://github.com/dask/dask-gateway/issues/285
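The current single-cluster arrangement can be sketched roughly as below. This assumes the cluster object exposes adapt(minimum=..., maximum=...) and add_worker_group(name=..., n_workers=...) in the way the dask-kubernetes operator KubeCluster does; check this against the installed version. The function name and its parameters are hypothetical.

```python
from typing import Any


def configure_worker_groups(
    cluster: Any, max_gpu_workers: int, n_cpu_workers: int
) -> None:
    """Adaptive default (GPU) group plus a fixed-size "cpu" group.

    Assumes a dask-kubernetes operator KubeCluster-like interface.
    """
    # Adaptive scaling only acts on the default worker group, which
    # holds the GPU workers in the setup described above.
    cluster.adapt(minimum=0, maximum=max_gpu_workers)
    # The additional "cpu" group is not covered by adapt(), so its
    # size is fixed here and has to be changed manually if needed.
    cluster.add_worker_group(name="cpu", n_workers=n_cpu_workers)
```

Because the "cpu" group is sized by hand, CPU workers can sit idle or become a bottleneck while the GPU group scales, which is the motivation for considering separate clusters.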

Project Approaches

Use the "Science Project Phase Plan" template to create child pages under this one to document each phase of the project. They will be automatically listed below.

References

I ran a proof of concept with python run-job.py: