2024-10-31 BW/JW QC worker handoff
Participants
@Brent Westbrook
@Jeffrey Wagner
Discussion topics
Notes |
---|
NAME READY STATUS RESTARTS AGE
openff-qca-qm-jw-ddx-d56fd8566-rs78p 1/1 Running 0 3h30m
openff-qca-qm-jw-ddx-d56fd8566-rsqlp 1/1 Running 0 4h20m
openff-qca-qm-jw-ddx-d56fd8566-s7gxp 0/1 Error 0 3h31m
openff-qca-qm-jw-ddx-d56fd8566-s8698 0/1 ContainerStatusUnknown 1 (4h12m ago) 4h20m
openff-qca-qm-jw-ddx-d56fd8566-sjzqv 0/1 Error 0 3h34m
openff-qca-qm-jw-ddx-d56fd8566-skfmq 0/1 ContainerStatusUnknown 2 (3h43m ago) 4h20m
openff-qca-qm-jw-ddx-d56fd8566-snptk 1/1 Running 1 51m
openff-qca-qm-jw-ddx-d56fd8566-snwkr 1/1 Running 0 21h
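To see at a glance how many workers are in each state, a listing like the one above can be tallied by its STATUS column. This is a minimal sketch: `tally_statuses` is a hypothetical helper name, and the sample rows are copied from the listing rather than read from the cluster.

```shell
# Count pods per STATUS (third column of `kubectl get pods` output).
# Header rows and short lines are skipped.
tally_statuses() {
  awk '$3 != "STATUS" && NF >= 3 { count[$3]++ }
       END { for (s in count) print s, count[s] }' | sort
}

# Sample rows copied from the listing above. Against a live cluster this
# would be something like:
#   kubectl get pods --no-headers | tally_statuses
tally_statuses <<'EOF'
openff-qca-qm-jw-ddx-d56fd8566-rs78p 1/1 Running 0 3h30m
openff-qca-qm-jw-ddx-d56fd8566-rsqlp 1/1 Running 0 4h20m
openff-qca-qm-jw-ddx-d56fd8566-s7gxp 0/1 Error 0 3h31m
openff-qca-qm-jw-ddx-d56fd8566-s8698 0/1 ContainerStatusUnknown 1 (4h12m ago) 4h20m
openff-qca-qm-jw-ddx-d56fd8566-sjzqv 0/1 Error 0 3h34m
openff-qca-qm-jw-ddx-d56fd8566-skfmq 0/1 ContainerStatusUnknown 2 (3h43m ago) 4h20m
openff-qca-qm-jw-ddx-d56fd8566-snptk 1/1 Running 1 51m
openff-qca-qm-jw-ddx-d56fd8566-snwkr 1/1 Running 0 21h
EOF
```

For these eight rows it reports 2 pods in ContainerStatusUnknown, 2 in Error, and 4 Running.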
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 28m default-scheduler Successfully assigned openforcefield/openff-qca-qm-jw-ddx-d56fd8566-shkmp to sdsmt-fiona.sdsmt.edu
Normal Pulled 28m kubelet Container image "ghcr.io/openforcefield/qca-dataset-submission:qcarchive-worker-openff-psi4-ddx-latest" already present on machine
Normal Created 28m kubelet Created container openff-pod
Normal Started 28m kubelet Started container openff-pod
Warning Evicted 22m kubelet Pod ephemeral local storage usage exceeds the total limit of containers 20Gi.
Normal Killing 22m kubelet Stopping container openff-pod
Warning ExceededGracePeriod 22m kubelet Container runtime did not kill the pod within specified grace period.
This was an “ephemeral storage” error (the pod was evicted for exceeding its 20Gi ephemeral local storage limit); other times they’ll have
What I’ve tried most recently is lying to the MANAGER about how much memory the CONTAINER has. So I make manager.yaml look like

    executors:
      local_executor:
        type: local
        max_workers: 1
        cores_per_worker: 4
        memory_per_worker: 15

but deployment.yaml looks like
(The important thing is that the container has 30GB but I only tell the manager it has 15GB.)
This has actually kept OOMKilled issues from happening, but
Tools available to you are:
|
Jeff’s manager.yaml:
Jeff’s deployment.yaml:
NOTE that you’ll need to change all instances of “JW” to “BW”.
And my secret-making commands are:
Any time you change manager.yaml, you’ll need to delete the deployment, delete the secret, recreate the secret from the new manager.yaml, and then restart the deployment.
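The delete-and-recreate cycle could be scripted roughly as follows. This is only a sketch under assumptions, not Jeff's actual commands (those are not recorded in these notes): the deployment name is guessed from the pod names with "jw" swapped for "bw", the secret name `manager-config-bw` is a hypothetical placeholder, and the secret is assumed to be a generic secret built from the manager.yaml file. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the redeploy cycle after editing manager.yaml.
# All names are ASSUMPTIONS; substitute the ones from your own deployment.yaml.
DEPLOYMENT="${DEPLOYMENT:-openff-qca-qm-bw-ddx}"  # guessed from the pod names
SECRET="${SECRET:-manager-config-bw}"            # hypothetical secret name
DRY_RUN="${DRY_RUN:-1}"                           # 1 = print only, 0 = execute

run() {
  # Always print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

redeploy() {
  # 1. Delete the deployment so no pod keeps using the old config.
  run kubectl delete deployment "$DEPLOYMENT"
  # 2. Delete the old secret.
  run kubectl delete secret "$SECRET"
  # 3. Recreate the secret from the edited manager.yaml.
  run kubectl create secret generic "$SECRET" --from-file=manager.yaml
  # 4. Bring the deployment back up.
  run kubectl apply -f deployment.yaml
}

redeploy
```

Run it once as-is to check the printed commands, then set `DRY_RUN=0` to actually apply them.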
|
Action items
Decisions