2024-10-31 BW/JW QC worker handoff

Participants

  • @Brent Westbrook

  • @Jeffrey Wagner

Discussion topics

Notes


openff-qca-qm-jw-ddx-d56fd8566-rs78p   1/1   Running                  0               3h30m
openff-qca-qm-jw-ddx-d56fd8566-rsqlp   1/1   Running                  0               4h20m
openff-qca-qm-jw-ddx-d56fd8566-s7gxp   0/1   Error                    0               3h31m
openff-qca-qm-jw-ddx-d56fd8566-s8698   0/1   ContainerStatusUnknown   1 (4h12m ago)   4h20m
openff-qca-qm-jw-ddx-d56fd8566-sjzqv   0/1   Error                    0               3h34m
openff-qca-qm-jw-ddx-d56fd8566-skfmq   0/1   ContainerStatusUnknown   2 (3h43m ago)   4h20m
openff-qca-qm-jw-ddx-d56fd8566-snptk   1/1   Running                  1               51m
openff-qca-qm-jw-ddx-d56fd8566-snwkr   1/1   Running                  0               21h

  • To see why a pod failed, run e.g. kubectl describe pod <bad pod name>

  • Sometimes failures look like

Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            28m   default-scheduler  Successfully assigned openforcefield/openff-qca-qm-jw-ddx-d56fd8566-shkmp to sdsmt-fiona.sdsmt.edu
  Normal   Pulled               28m   kubelet            Container image "ghcr.io/openforcefield/qca-dataset-submission:qcarchive-worker-openff-psi4-ddx-latest" already present on machine
  Normal   Created              28m   kubelet            Created container openff-pod
  Normal   Started              28m   kubelet            Started container openff-pod
  Warning  Evicted              22m   kubelet            Pod ephemeral local storage usage exceeds the total limit of containers 20Gi.
  Normal   Killing              22m   kubelet            Stopping container openff-pod
  Warning  ExceededGracePeriod  22m   kubelet            Container runtime did not kill the pod within specified grace period.

This one was an “ephemeral storage” error; other times the pod will show OOMKilled under the Last State: field. I suspect both have the same root cause (the only difference being whether the job hits the container’s RAM limit or fills up disk space through swap), but that’s not confirmed.
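As a rough aid for triage, here is a minimal Python sketch that picks the non-Running pods out of a pod listing like the one above and prints the matching kubectl describe pod command for each. The function name failed_pods is made up for illustration, and the sketch assumes the listing is plain whitespace-separated text with the pod name first and the status third.

```python
def failed_pods(listing: str) -> list[tuple[str, str]]:
    """Return (pod_name, status) for every pod whose status is not Running."""
    failed = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row, if one is present.
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, _ready, status = fields[:3]
        if status != "Running":
            failed.append((name, status))
    return failed


# Three rows copied from the listing in these notes.
sample = """\
openff-qca-qm-jw-ddx-d56fd8566-rs78p 1/1 Running 0 3h30m
openff-qca-qm-jw-ddx-d56fd8566-s7gxp 0/1 Error 0 3h31m
openff-qca-qm-jw-ddx-d56fd8566-s8698 0/1 ContainerStatusUnknown 1 (4h12m ago) 4h20m
"""

for name, status in failed_pods(sample):
    # Print the command to run by hand; nothing is executed here.
    print(f"kubectl describe pod {name}  # status: {status}")
```

In the describe output, the Events: section and the Last State: field are the places to look for the eviction or OOMKilled reasons described above.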

 

What I’ve tried most recently is lying to the MANAGER about how much memory the CONTAINER has. So I make manager.yaml look like

executors:
  local_executor:
    type: local
    max_workers: 1
    cores_per_worker: 4
    memory_per_worker: 15

 

but deployment.yaml looks like

(Important thing is that the container has 30GB but I only tell manager it has 15GB)

This has actually prevented the OOMKilled issues, but

  1. Things SEEM to be running slowly (20 workers ran overnight, and error cycling on the PR showed 6739 completed in the morning vs. 6701 completed the evening before). So the memory strangling technique may be very inefficient

  2. On the grafana dashboard, many workers seem to be flatlined at a certain memory limit, and this MAY be a symptom of the memory strangling (but it might also be a telemetry issue, hard to tell)
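To put the numbers from item 1 in perspective, the arithmetic can be sketched as follows. This uses only the figures quoted above; the exact length of "overnight" isn't recorded here, so no per-hour rate is computed.

```python
# Figures quoted in these notes.
completed_evening = 6701
completed_morning = 6739
n_workers = 20

# Records finished overnight, in total and per worker.
newly_completed = completed_morning - completed_evening
per_worker = newly_completed / n_workers

print(f"{newly_completed} records completed overnight")
print(f"{per_worker:.1f} records per worker")
```

That is 38 records in total, or just under 2 per worker for the whole night.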

Tools available to you are:

  • Starting the error cycling action to get dataset status (or you might know how to query QCA datasets directly)

  • Splitting up the dataset (would require a new PR)

  • Spinning up different sized workers that achieve our target average utilization.

 

 

Jeff’s manager.yaml:

 

 

 

Jeff’s deployment.yaml

 

NOTE that you’ll need to change all instances of “JW” to “BW”

And my secret-making commands are:

Any time you change manager.yaml, you’ll need to delete the deployment, delete the secret, recreate the secret from the new manager.yaml, and then restart the deployment.
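The four steps above can be sketched as a small Python helper that only builds and prints the kubectl commands in order, so they can be reviewed before being run by hand. This is not Jeff’s actual set of commands. The deployment name openff-qca-qm-bw-ddx and the secret name manager-config-bw are placeholders, and the sketch assumes the secret is a generic secret built from the manager.yaml file, that the deployment is recreated from deployment.yaml, and that the current kubectl context already points at the right namespace.

```python
import shlex


def recycle_commands(deployment, secret,
                     manager_yaml="manager.yaml",
                     deployment_yaml="deployment.yaml"):
    """Build the kubectl commands for picking up a changed manager.yaml."""
    return [
        # 1. Delete the running deployment.
        ["kubectl", "delete", "deployment", deployment],
        # 2. Delete the old secret holding the manager config.
        ["kubectl", "delete", "secret", secret],
        # 3. Recreate the secret from the edited manager.yaml.
        ["kubectl", "create", "secret", "generic", secret,
         f"--from-file={manager_yaml}"],
        # 4. Bring the deployment back up.
        ["kubectl", "apply", "-f", deployment_yaml],
    ]


# Placeholder names; substitute the ones from your own deployment.yaml.
for cmd in recycle_commands("openff-qca-qm-bw-ddx", "manager-config-bw"):
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch safe to run anywhere; each line could be passed to subprocess.run(cmd, check=True) once the names are confirmed.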

 

 

Action items

Decisions