Keep failed CronJob Pods around in Kubernetes
It seems like Pods created by a Kubernetes CronJob (or rather by the underlying Job) are deleted as soon as they fail. This makes it difficult to figure out what went wrong (unless you have Prometheus or some other log aggregator in use).
Is there a way to keep the Pods of failed Jobs around for a while to facilitate post-mortem analysis?
1 answer
There are a few fields that play into this, but the gist of it is to set the Pod's restartPolicy to Never. This doesn't mean that the Job only tries once. Rather, instead of restarting the container inside the Pod on failure, a new Pod gets scheduled until the Job's backoffLimit is reached (the default being 6 at the moment). Finally, the CronJob's failedJobsHistoryLimit decides whether the failed Job, together with its Pods, is kept around. It defaults to 1.
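If you do want to tune those limits explicitly rather than rely on the defaults, note that they sit at different levels of the manifest. A minimal sketch (the values here are arbitrary, just to show where each field goes):

spec:
  # CronJob level: how many failed Jobs (and their Pods) to retain.
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      # Job level: how many retries (i.e. replacement Pods, with restartPolicy: Never)
      # before the Job is marked as failed.
      backoffLimit: 2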
Here's an example of a debuggable CronJob, adapted from the official example (CC BY 4.0, The Kubernetes Authors):
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox:1.28
            imagePullPolicy: IfNotPresent
            command:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          # This is the key bit.
          # Defaults are okay for the other discussed fields.
          restartPolicy: Never
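With this in place, a failed run leaves its Job and Pod(s) behind, so the usual inspection commands work after the fact. Roughly (the Job and Pod names below are made up; Kubernetes labels a Job's Pods with job-name, which is what the selector relies on):

kubectl get jobs                                       # find the failed Job created by the CronJob
kubectl get pods --selector=job-name=hello-27946920    # list the Pods that Job created
kubectl logs hello-27946920-abcde                      # read the failed container's output
kubectl describe pod hello-27946920-abcde              # events, exit code, and other details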