Wait for Pods to finish before considering Failed in Job #113860
Conversation
/assign @liggitt
Force-pushed 5850384 to c4f4168 (compare)
/assign @mimowo
/priority important-background
@alculquicondor: The label(s) `priority/important-background` cannot be applied, because the repository doesn't have them. In response to this: /priority important-background
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority important-longterm
Limit behavior to feature gates PodDisruptionConditions and JobPodFailurePolicy and jobs with a podFailurePolicy. Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f
Force-pushed c4f4168 to 2ed15ef (compare)
Thanks for fixing. I have three main comments:
- I suggest adding an e2e test for the main issue. It could be similar to the already existing e2e test where we evict a running pod, but instead of checking for the DisruptionTarget condition we could check for the 137 exit code. On master such a podFailurePolicy would not work properly, as the pod would be running at the moment of checking (or is at least very likely to be still running).
- Is the change to look for never-scheduled terminating pods required? There is already code in PodGC to move unscheduled terminating pods to the Failed phase (the code path under `func (gcc *PodGCController) gcUnscheduledTerminating(ctx context.Context, pods []*v1.Pod)`; see the sketch after this list).
- Similarly, is the change to consider all pods as terminated when `finishedCond != nil` required?

If changes 2 and 3 are performance optimizations, I would suggest extracting them into a dedicated PR.
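For reference on point 2, here is a minimal, self-contained sketch of the filter that PodGC applies in that code path (a simplification for illustration, not the upstream implementation):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// isUnscheduledTerminating mirrors, in simplified form, the predicate behind
// gcUnscheduledTerminating: a pod that is terminating but was never bound to
// a node. PodGC moves such pods to the Failed phase and deletes them, which
// is why the Job controller may not need its own never-scheduled check.
func isUnscheduledTerminating(pod *v1.Pod) bool {
	return pod.DeletionTimestamp != nil && len(pod.Spec.NodeName) == 0
}

func main() {
	now := metav1.Now()
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{DeletionTimestamp: &now}}
	fmt.Println(isUnscheduledTerminating(pod)) // true: terminating, no node assigned
}
```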
// Terminating pods are counted as failed. This guarantees that orphan Pods
// count as failures.
// Active pods are terminated when the job has completed, thus they count as
// failures as well.
podTerminating := pod.DeletionTimestamp != nil || finishedCond != nil
if podFinished || podTerminating || job.DeletionTimestamp != nil {
considerTerminated := pod.DeletionTimestamp != nil || finishedCond != nil
I suggest simplifying the semantics of `considerTerminated`. In particular, in the main branch succeeded pods are excluded, while inside the `if` statement (when PodDisruptionConditions is enabled) succeeded pods are counted via the OR with `IsPodTerminal`. Similarly, in the main branch, `considerTerminated` would be false for a Failed pod (if `DeletionTimestamp=nil`), whereas it is true in the if-ed code.

I suggest computing it as `considerTerminated := podutil.IsPodTerminal(pod) || job.DeletionTimestamp != nil || finishedCond != nil` and only OR-ing in `isPodFailed(pod, job, true)` inside the `if`; or introducing `considerFailed := pod.Status.Phase == v1.PodFailed || finishedCond != nil`, additionally OR-ed inside the loop with `isPodFailed`.
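Spelled out, the two suggested formulations would look roughly like the fragment below. This is a sketch against the surrounding controller code, not the code as merged; `podFailurePolicyActive` is a hypothetical stand-in for the combined feature-gate and `job.Spec.PodFailurePolicy != nil` check:

```go
// Option 1: one base predicate computed up front; the feature-gated branch
// only ORs in the extra isPodFailed clause.
considerTerminated := podutil.IsPodTerminal(pod) ||
	job.DeletionTimestamp != nil ||
	finishedCond != nil
if podFailurePolicyActive { // hypothetical: both gates enabled and podFailurePolicy set
	considerTerminated = considerTerminated || isPodFailed(pod, job, true)
}

// Option 2: a narrower predicate, additionally OR-ed with isPodFailed
// inside the loop.
considerFailed := pod.Status.Phase == v1.PodFailed || finishedCond != nil
```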
I'm just preserving the existing logic (except for the variable name). The new logic (protected by the feature gates) is more similar to what you describe.
@@ -1054,7 +1065,7 @@ func (jm *Controller) trackJobStatusAndRemoveFinalizers(ctx context.Context, job
 	needsFlush = true
 	uncountedStatus.Succeeded = append(uncountedStatus.Succeeded, pod.UID)
 }
-} else if pod.Status.Phase == v1.PodFailed || podTerminating {
+} else if pod.Status.Phase == v1.PodFailed || considerTerminated {
If we include terminated pods in the `considerTerminated` variable consistently (see above), then we can drop the `pod.Status.Phase == v1.PodFailed` condition.
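That is, assuming `considerTerminated` consistently covers terminal pods as suggested above, the branch quoted in the diff could shrink to (sketch only):

```go
// The explicit phase check becomes redundant once terminal pods are folded
// into considerTerminated.
} else if considerTerminated {
```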
We can do it when the new logic is GA
pkg/controller/job/job_controller.go (Outdated)
// For now, we do so to avoid affecting all running Jobs without the
// ability to opt-out into the old behavior.
return p.Status.Phase == v1.PodFailed ||
	podutil.IsNeverScheduled(p) // The Pod will not be marked as Failed, because it was never scheduled.
Wondering if the logic here does not introduce unnecessary complication, given that unscheduled terminating pods will be marked failed by the GC: `gcc.gcUnscheduledTerminating(ctx, pods)`
No, I just forgot that we had that logic. Removing and updating comment.
pkg/controller/job/job_controller.go (Outdated)
// TODO(#113855): Stop limiting this behavior to Jobs with podFailurePolicy.
// For now, we do so to avoid affecting all running Jobs without the
// ability to opt-out into the old behavior.
return p.Status.Phase == v1.PodFailed ||
I think it would be good to add an e2e test to show that the PR fixes the issue. I think it could be a test similar to this one (kubernetes/test/e2e/apps/job.go, line 188 at 8e48df1):

ginkgo.It("should allow to use the pod failure policy to not count pod disruption towards the backoffLimit", func() {
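A rough sketch of what such a test could look like. Only the `batch/v1` types and the Ginkgo entry point are real APIs here; `f`, `createJobWithPolicy`, `evictOnePod`, and `waitForJobFailedCount` are hypothetical helpers standing in for the e2e framework plumbing:

```go
ginkgo.It("should not count an evicted pod against the backoffLimit when the policy ignores exit code 137", func() {
	// A podFailurePolicy rule that ignores containers killed by SIGKILL (exit code 137).
	rule := batchv1.PodFailurePolicyRule{
		Action: batchv1.PodFailurePolicyActionIgnore,
		OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
			Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
			Values:   []int32{137},
		},
	}
	job := createJobWithPolicy(f, rule) // hypothetical helper

	// Evict one running pod. With this PR, the Job controller waits for the
	// pod to reach a terminal phase, so the 137 exit code is visible when the
	// pod failure policy is evaluated; on master the pod would most likely
	// still be Running at that moment and the rule would not match.
	evictOnePod(f, job) // hypothetical helper

	// The ignored failure must not consume the backoffLimit.
	waitForJobFailedCount(f, job, 0) // hypothetical helper
})
```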
// We can also simplify the check to remove finalizers to:
// considerTerminated || job.DeletionTimestamp != nil
considerTerminated = podutil.IsPodTerminal(pod) ||
	finishedCond != nil || // The Job is terminating. Any running Pod is considered failed.
IIUC this change is only loosely related to the issue. Once we start to consider pods as terminated when `finishedCond != nil`, they will be checked against the pod failure policy. This seems to make this comment obsolete:

// TODO(#113855): Remove this check when it's guaranteed that the
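For context, `finishedCond` above refers to the Job's terminal condition. A minimal sketch of how it can be derived (not the controller's exact helper):

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	v1 "k8s.io/api/core/v1"
)

// finishedCondition returns the Job's Complete or Failed condition with
// status True, or nil while the Job is still running. Pods of a finished
// Job can then be treated as terminated for accounting purposes.
func finishedCondition(job *batchv1.Job) *batchv1.JobCondition {
	for i := range job.Status.Conditions {
		c := &job.Status.Conditions[i]
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) &&
			c.Status == v1.ConditionTrue {
			return c
		}
	}
	return nil
}
```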
It is related, as it is preserving the logic that was already there. But you are right, this implies that the comment in pod_failure_policy is not true. Removed.
I see. Still, inside the `if` we can skip the OR with `podutil.IsPodTerminal(pod)` to be more consistent with what is outside, although then the var name would not be very descriptive. I'm fine to keep it as is or remove the extra condition. It should not impact the behavior anyway.
Eventually, I would like this logic inside the conditional to be the logic for all jobs.
Keeping it like this seems like a good way to soak the logic.
Also note the comment in line 1046
Good idea. Will do.
Ah great, I'll revert that bit
When the job is terminated we don't care much anymore. We should clean up the running pods ASAP.
I see; still, the question remains whether it's better to do this in a separate PR.
Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
/assign @soltysh
generally lgtm, two comments:
I spoke with Aldo on Slack at length about it; let's land it as is.
/lgtm
/approve
/milestone v1.26
/triage accepted
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/priority important-soon
My suggestion for a tweaked changelog entry:

-The job controller does not consider a terminating Pod (A pod with a .metadata.deletionTimestamp) until it is terminal (not scheduled or .status.phase=Failed/Succeeded).
+When the feature gates `PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled, the Job controller now does not consider a terminating Pod (a pod that has a `.metadata.deletionTimestamp`) as a failure until that Pod is terminal (either the Pod was never scheduled to a node, or its `.status.phase` is `Failed` or `Succeeded`).
-However, a replacement Pod is immediately created. Once the pod terminates, the job controller evaluates .backoffLimit and .podFailurePolicy considering this Pod.
+However, the Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller evaluates `.backoffLimit` and `.podFailurePolicy` for the relevant Job, taking this now-terminated Pod into consideration.
-This behavior is limited to Jobs with .spec.podFailurePolicy and when the feature gates PodDisruptionConditions and JobPodFailurePolicy are enabled.
+This behavior is limited to Jobs with `.spec.podFailurePolicy` set, and only when those two feature gates are both enabled.
-If either of these requirements is not satisfied, the job controller counts a terminating Pod as an immediate failure, regardless if the Pod later terminates with phase=Succeeded.
+If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure, even if that Pod later terminates with `phase: Succeeded`.

The same text as Markdown:

When the feature gates `PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled, the Job controller now does not consider a terminating Pod (a pod that has a `.metadata.deletionTimestamp`) as a failure until that Pod is terminal (either the Pod was never scheduled to a node, or its `.status.phase` is `Failed` or `Succeeded`).

However, the Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller evaluates `.backoffLimit` and `.podFailurePolicy` for the relevant Job, taking this now-terminated Pod into consideration.

This behavior is limited to Jobs with `.spec.podFailurePolicy` set, and only when those two feature gates are both enabled.

If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure, even if that Pod later terminates with `phase: Succeeded`.

I have rewritten it based on my understanding of the intended meaning. I'm not sure I 100% understood what was meant. If making changes, please check that what I have put is right.

Is there a k/website PR to update the docs in light of this change?
It looks like it wasn't included as part of kubernetes/website#37242.
Release note updated, with a tweak |
Oops. One more change I recommend:
-`phase: Succeeded`
+`phase: "Succeeded"`
Here: kubernetes/website#38040, @alculquicondor, @sftim please review
Wait for Pods to finish before considering Failed in Job (#113860)

* Wait for Pods to finish before considering Failed

Limit behavior to feature gates PodDisruptionConditions and JobPodFailurePolicy and jobs with a podFailurePolicy.

Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f

* Remove check for unscheduled terminating pod

Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
What type of PR is this?
/kind bug
What this PR does / why we need it:
Wait for Pods to finish before considering them failed, provided that the feature gates `PodDisruptionConditions` and `JobPodFailurePolicy` are enabled and the Job has a podFailurePolicy.

The behavior shouldn't be limited to Jobs with a podFailurePolicy, but we do so to make the feature opt-in while PodDisruptionConditions graduates to stable.
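In rough terms, the opt-in gating described above looks like the sketch below (simplified for illustration, not the merged code; the merged `isPodFailed` also takes finalizer tracking into account):

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	v1 "k8s.io/api/core/v1"
	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/kubernetes/pkg/features"
)

// isPodFailed sketches the opt-in behavior: when both feature gates are on
// and the Job sets a podFailurePolicy, a terminating pod counts as failed
// only once it is actually terminal; otherwise the legacy behavior applies
// and a deletion timestamp alone marks it as failed.
func isPodFailed(p *v1.Pod, job *batchv1.Job) bool {
	if utilfeature.DefaultFeatureGate.Enabled(features.PodDisruptionConditions) &&
		utilfeature.DefaultFeatureGate.Enabled(features.JobPodFailurePolicy) &&
		job.Spec.PodFailurePolicy != nil {
		// New behavior: wait for the pod to finish before judging it.
		return p.Status.Phase == v1.PodFailed
	}
	// Legacy behavior: a terminating pod counts as failed immediately,
	// even if it later exits with Succeeded.
	return p.Status.Phase == v1.PodFailed || p.DeletionTimestamp != nil
}
```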
Which issue(s) this PR fixes:
Part of #113855
The next step would be to lift the restriction to Jobs with podFailurePolicy, possibly in v1.28.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: