
Solving Kata Containers SandboxChanged Error


TL;DR

For those who just want the fix:

  • A container started on k8s/k3s with Kata Containers keeps restarting: the pod dies and is recreated after 1-2 minutes
  • kubectl describe shows SandboxChanged: Pod sandbox changed, it will be killed and re-created

Conclusion

A conflict between systemd's and the container runtime's cgroup management causes the pod to die suddenly.

The fix is to force cgroup v1 on the host by adding this to /etc/default/grub:

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

sudo update-grub
sudo reboot
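For reference, after the reboot the parameter should show up in /proc/cmdline and /sys/fs/cgroup should be back on the v1 (tmpfs) hierarchy:

$ grep -o 'systemd.unified_cgroup_hierarchy=0' /proc/cmdline
systemd.unified_cgroup_hierarchy=0
$ stat -fc %T /sys/fs/cgroup/
tmpfs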

Environment

  • k3s cluster, Ubuntu 24.04, Kata Containers 3.2.0
  • Master: i3-7100, 4GB * 3
  • Worker: Xeon 2699v4, 32GB * 1

Pods on plain runc ran fine for 10+ minutes, while Kata Containers pods disappeared and were recreated after 1-2 minutes, with no error logs.

Memory shortage?

$ kubectl exec -it ue5-gameserver-966c8c687-9z96b -n game -- /bin/sh
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           9.9Gi       287Mi       9.5Gi       5.0Mi       157Mi       9.5Gi
Swap:             0B          0B          0B
$ cat /proc/meminfo | grep MemTotal
MemTotal:       10416748 kB
$ ps aux | grep "qemu-system-x86_64" | grep "\-m"
root  563936  53.0  2.3  ... -m 8196M,slots=10,maxmem=33006M ...

I had allocated a ridiculous amount of memory just in case, but it made no difference, so this wasn't a memory shortage.
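For reference, the memory the pod actually requested can be checked like this (pod and namespace names are the ones above). As far as I understand, Kata sizes the guest VM from default_memory in its configuration.toml plus the container limits, which is why the qemu -m value doesn't match the pod spec directly:

$ kubectl get pod ue5-gameserver-966c8c687-9z96b -n game -o jsonpath='{.spec.containers[0].resources}'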

Testing with Lightweight Container

apiVersion: v1
kind: Pod
metadata:
  name: kata-test-nginx
spec:
  runtimeClassName: kata
  containers:
  - name: nginx
    image: nginx:alpine
    resources:
      limits:
        memory: "256Mi"
---
apiVersion: v1
kind: Pod
metadata:
  name: kata-test-stress
spec:
  runtimeClassName: kata
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--timeout", "3600s"]

Pod status after about 5 minutes:

NAME                              READY   STATUS    RESTARTS   AGE
kata-test-nginx                   1/1     Running   0          5m7s
kata-test-stress                  1/1     Running   2          5m7s    # ← 2 restarts
ue5-gameserver-966c8c687-7m85p    1/1     Running   5          13m     # ← 5 restarts
ue5-gameserver-966c8c687-twhxr    1/1     Running   3          13m     # ← 3 restarts

The lightweight pod didn't crash; the high-load ones did. The original workload is an Unreal Engine dedicated server, and I needed to figure out whether the problem was on the UE side or the k3s side. This test showed it was on the k3s side.
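For completeness, applying the test manifests and watching the restart counts looks something like this (the filename is arbitrary):

$ kubectl apply -f kata-test.yaml
$ kubectl get pods -w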

Logs

$ kubectl describe pod ue5-gameserver-966c8c687-qz4sn -n game | grep -A 10 "Events:"
Events:
  Type    Reason          Age                 From               Message
  ----    ------          ----                ----               -------
  Normal  Scheduled       2m3s                default-scheduler  Successfully assigned game/ue5-gameserver-966c8c687-qz4sn to uc-k8s4p
  Normal  Killing         49s                 kubelet            Stopping container gameserver
  Normal  SandboxChanged  48s                 kubelet            Pod sandbox changed, it will be killed and re-created.
  Normal  Pulled          47s (x2 over 2m1s)  kubelet            Container image "ue-server-env:latest" already present on machine
  Normal  Created         47s (x2 over 2m1s)  kubelet            Created container: gameserver
  Normal  Started         46s (x2 over 2m1s)  kubelet            Started container gameserver

Dying from SandboxChanged.
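The same pattern can also be pulled out across the whole namespace by filtering events on their reason:

$ kubectl get events -n game --field-selector reason=SandboxChanged --sort-by=.lastTimestamp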

Kernel logs showed no errors.
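For reference, a host-side check along these lines is one way to rule out OOM kills (the grep patterns are just my guesses at likely keywords):

$ sudo journalctl -k --since "30 minutes ago" | grep -iE 'oom|qemu|kata'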

Kata Logs

$ sudo journalctl -u k3s-agent --since "10 minutes ago" | grep "9f04f8b941e445c5" | tail -50
Oct 12 04:19:50 uc-k8s4p kata[556809]: time="2025-10-12T04:19:50.989892702Z" level=warning msg="Could not add /dev/mshv to the devices cgroup" name=containerd-shim-v2 pid=556809 sandbox=9f04f8b941e445c5...
Oct 12 04:21:06 uc-k8s4p k3s[556210]: I1012 04:21:06.672345  556210 pod_container_deletor.go:80] "Container not found in pod's containers" containerID="9f04f8b941e445c5..."

The container just disappears, with nothing in between.
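In case it helps, the sandbox ID used in the grep above can be looked up from the pod name with the crictl bundled in k3s:

$ sudo k3s crictl pods --name ue5-gameserver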

Internet to the Rescue

While I was getting frustrated, I found a saint who had run into the same issue and published a solution. Thank you so much.

https://edera.dev/stories/sandboxchanged-errors-when-systemd-and-container-runtimes-fight-over-kubernetes-pods

It took me a long time to find that article, so I'm writing my own. I hope it helps someone.

Checking cgroup Version

$ stat -fc %T /sys/fs/cgroup/
cgroup2fs

Confirmed cgroup2fs, so the host is on cgroup v2.
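Another way to check is to look at what is actually mounted there (cgroup2 means v2; with v1 it is a tmpfs containing per-controller directories):

$ findmnt /sys/fs/cgroup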

Add the kernel parameter to /etc/default/grub to force cgroup v1:

GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0"

sudo update-grub
sudo reboot

After the reboot:

$ stat -fc %T /sys/fs/cgroup/
tmpfs

Back on cgroup v1 (a tmpfs mount is what the v1 hierarchy looks like).

Victory
