
Wait for Pods to finish before considering Failed in Job #113860

Merged
merged 2 commits into kubernetes:master on Nov 15, 2022

Conversation

@alculquicondor (Member) commented Nov 11, 2022

What type of PR is this?

/kind bug

What this PR does / why we need it:

Wait for Pods to finish before considering them failed, provided that the feature gates PodDisruptionConditions and JobPodFailurePolicy are enabled and the Job has a podFailurePolicy.

The behavior shouldn't be limited to Jobs with a podFailurePolicy, but we do so for now to make the feature opt-in while PodDisruptionConditions graduates to stable.
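
Roughly, the opt-in condition amounts to the following check (a sketch, not the exact helper from this PR; the function and parameter names are illustrative):

```go
package jobsketch

import batchv1 "k8s.io/api/batch/v1"

// usesNewFailureHandling sketches when the Job controller waits for a
// terminating Pod to reach a terminal phase before counting it as failed:
// both feature gates are enabled and the Job defines a podFailurePolicy.
func usesNewFailureHandling(job *batchv1.Job, podDisruptionConditionsEnabled, jobPodFailurePolicyEnabled bool) bool {
	return podDisruptionConditionsEnabled &&
		jobPodFailurePolicyEnabled &&
		job.Spec.PodFailurePolicy != nil
}
```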

Wait for Pods to finish before considering Failed

Which issue(s) this PR fixes:

Part of #113855

The next step would be to lift the restriction to Jobs with podFailurePolicy, possibly in v1.28.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

When the feature gates `PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled,
the Job controller now does not consider a terminating Pod (a pod that has a `.metadata.deletionTimestamp`)
as a failure until that Pod is terminal (its `.status.phase` is `Failed` or `Succeeded`).

However, the Job controller creates a replacement Pod as soon as the termination becomes apparent.
Once the pod terminates, the Job controller evaluates `.backoffLimit` and `.podFailurePolicy` for the
relevant Job, taking this now-terminated Pod into consideration.

This behavior is limited to Jobs with `.spec.podFailurePolicy` set, and only when those two feature
gates are both enabled.
If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate
failure, even if that Pod later terminates with `phase: "Succeeded"`.
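
For context, a minimal sketch of a Job that opts into the new handling by setting `.spec.podFailurePolicy`. This is not part of the PR; the Job name, image, and rule values are placeholders, and the constant names assume the batch/v1 and core/v1 APIs as of v1.26:

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleJob builds a Job whose podFailurePolicy opts it into the new
// terminating-Pod handling (when PodDisruptionConditions and
// JobPodFailurePolicy are both enabled).
func exampleJob() *batchv1.Job {
	backoffLimit := int32(3)
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "example-job"}, // placeholder name
		Spec: batchv1.JobSpec{
			BackoffLimit: &backoffLimit,
			PodFailurePolicy: &batchv1.PodFailurePolicy{
				Rules: []batchv1.PodFailurePolicyRule{{
					// Do not count disruptions (e.g. evictions) towards backoffLimit.
					Action: batchv1.PodFailurePolicyActionIgnore,
					OnPodConditions: []batchv1.PodFailurePolicyOnPodConditionsPattern{{
						Type:   corev1.DisruptionTarget,
						Status: corev1.ConditionTrue,
					}},
				}},
			},
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "main",
						Image:   "busybox", // placeholder image
						Command: []string{"sleep", "30"},
					}},
				},
			},
		},
	}
}

func main() {
	fmt.Println(exampleJob().Name)
}
```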

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 11, 2022
@alculquicondor (Member Author)

/assign @liggitt

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 11, 2022
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 11, 2022
@alculquicondor changed the title from "WIP Wait for Pods to finish before considering Failed in Job" to "Wait for Pods to finish before considering Failed in Job" on Nov 11, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 11, 2022
@alculquicondor (Member Author)

/assign @mimowo

@alculquicondor (Member Author)

/priority important-background

@k8s-ci-robot (Contributor)

@alculquicondor: The label(s) priority/important-background cannot be applied, because the repository doesn't have them.

In response to this:

/priority important-background

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alculquicondor (Member Author)

/priority important-longterm

@k8s-ci-robot k8s-ci-robot added priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 11, 2022
Limit behavior to feature gates PodDisruptionConditions and
JobPodFailurePolicy and jobs with a podFailurePolicy.

Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f
@mimowo (Contributor) left a comment

Thanks for fixing. I have three main comments:

  1. I suggest adding an e2e test for the main issue, it could be similar to the already existing e2e test where we evict a running pod, but instead of checking for the DisruptionTarget condition we could check for the 137 exit code. On master such podFailurePolicy would not work properly as the pod would be running at the moment of checking (or at least is very likely to be still running).
  2. is the change to look for never scheduled terminating pods required? There is already code in PodGC to move unscheduled terminating pods to Failed phase (code path under:
    func (gcc *PodGCController) gcUnscheduledTerminating(ctx context.Context, pods []*v1.Pod) {
    )
  3. similarly, is the change to consider all pods as terminated when finishedCond != nil required?

If changes 2. and 3. are performance optimizations then I would suggest extracting them into a dedicated PR.

// Terminating pods are counted as failed. This guarantees that orphan Pods
// count as failures.
// Active pods are terminated when the job has completed, thus they count as
// failures as well.
podTerminating := pod.DeletionTimestamp != nil || finishedCond != nil
if podFinished || podTerminating || job.DeletionTimestamp != nil {
considerTerminated := pod.DeletionTimestamp != nil || finishedCond != nil
Contributor:

I suggest simplifying the semantics of considerTerminated. In particular, on the main branch succeeded pods are excluded, while inside the if statement (when PodDisruptionConditions is enabled) succeeded pods are counted via the OR with IsPodTerminal. Similarly, on the main branch considerTerminated would be false for a Failed pod (if DeletionTimestamp=nil), whereas it is true in the if-ed code.

I suggest computing it as
considerTerminated := podutil.IsPodTerminal(pod) || job.DeletionTimestamp != nil || finishedCond != nil
and only additionally OR-ing it inside the if with isPodFailed(pod, job, true),
or introducing considerFailed := pod.Phase == v1.PodFailed || finishedCond != nil and additionally OR-ing it inside the loop with isPodFailed.
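
Roughly, a standalone sketch of the suggested computation (my paraphrase of the suggestion, not code from this PR; `jobFinished` stands in for `finishedCond != nil`):

```go
package jobsketch

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isPodTerminal mirrors podutil.IsPodTerminal: the pod has reached a
// terminal phase (Failed or Succeeded).
func isPodTerminal(pod *corev1.Pod) bool {
	return pod.Status.Phase == corev1.PodFailed || pod.Status.Phase == corev1.PodSucceeded
}

// considerTerminated is the simplified form suggested above: a pod is treated
// as terminated when it is already terminal, when the Job is being deleted,
// or when the Job already has a Complete/Failed condition (jobFinished), in
// which case any still-running pod counts as failed.
func considerTerminated(pod *corev1.Pod, job *batchv1.Job, jobFinished bool) bool {
	return isPodTerminal(pod) || job.DeletionTimestamp != nil || jobFinished
}
```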

Member Author:

I'm just preserving the existing logic (except for the variable name). The new logic (protected by the feature gates) is more similar to what you describe.

@@ -1054,7 +1065,7 @@ func (jm *Controller) trackJobStatusAndRemoveFinalizers(ctx context.Context, job
needsFlush = true
uncountedStatus.Succeeded = append(uncountedStatus.Succeeded, pod.UID)
}
} else if pod.Status.Phase == v1.PodFailed || podTerminating {
} else if pod.Status.Phase == v1.PodFailed || considerTerminated {
Contributor:

If we include terminated pods in the considerTerminated variable consistently (see above), then we can drop the pod.Status.Phase == v1.PodFailed condition.

Member Author:

We can do it when the new logic is GA

// For now, we do so to avoid affecting all running Jobs without the
// availability to opt-out into the old behavior.
return p.Status.Phase == v1.PodFailed ||
podutil.IsNeverScheduled(p) // The Pod will not be marked as Failed, because it was never scheduled.
@mimowo (Contributor) Nov 14, 2022

Wondering if the logic here does not introduce unnecessary complication given that unscheduled terminating pods will be marked failed by the GC:

gcc.gcUnscheduledTerminating(ctx, pods). Maybe it is worth it as a performance optimization though; what is the main reason?

Member Author:

No, I just forgot that we had that logic. Removing and updating comment.

// TODO(#113855): Stop limiting this behavior to Jobs with podFailurePolicy.
// For now, we do so to avoid affecting all running Jobs without the
// availability to opt-out into the old behavior.
return p.Status.Phase == v1.PodFailed ||
Contributor:

I think it would be good to add an e2e test to show that the PR fixes the issue. I think it could be a similar test to this one:

ginkgo.It("should allow to use the pod failure policy to not count pod disruption towards the backoffLimit", func() {
Instead of looking for the DisruptionTarget condition, it would look for exit code 137. On master the test would fail, as the pod is matched against the policy before it terminates (and thus before exit code 137 is set). Such an e2e test would also prevent a similar regression in the future.
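
For illustration only, a sketch (not the actual test; constant names assumed from `k8s.io/api/batch/v1`) of the kind of podFailurePolicy rule such a test could match on, which only works once the evicted pod actually reaches the Failed phase and records exit code 137:

```go
package e2esketch

import batchv1 "k8s.io/api/batch/v1"

// ignoreEvictionExitCode is the kind of rule the suggested e2e test could use:
// ignore container failures with exit code 137 (SIGKILL during eviction) so
// they are not counted towards .spec.backoffLimit. Before this PR, the pod
// could be matched against the policy while still running, before the exit
// code is recorded.
var ignoreEvictionExitCode = batchv1.PodFailurePolicyRule{
	Action: batchv1.PodFailurePolicyActionIgnore,
	OnExitCodes: &batchv1.PodFailurePolicyOnExitCodesRequirement{
		Operator: batchv1.PodFailurePolicyOnExitCodesOpIn,
		Values:   []int32{137},
	},
}
```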

// We can also simplify the check to remove finalizers to:
// considerTerminated || job.DeletionTimestamp != nil
considerTerminated = podutil.IsPodTerminal(pod) ||
finishedCond != nil || // The Job is terminating. Any running Pod is considered failed.
Contributor:

IIUC this change is only loosely related to the issue. Once we start to consider pods as terminated when finishedCond!=nil then they will be checked against the pod failure policy. This seems to make this comment obsolete:

// TODO(#113855): Remove this check when it's guaranteed that the

Member Author:

It is related, as it is preserving the logic that was already there. But you are right, this implies that the comment in pod_failure_policy is not true. Removed.

Contributor:

I see. Still, inside the if we can skip the OR with podutil.IsPodTerminal(pod) to be more consistent with what is outside, although then the variable name would not be very descriptive. I'm fine to keep it as is or remove the extra condition. It should not impact the behavior anyway.

Member Author:

Eventually, I would like this logic inside the conditional to be the logic for all jobs.
Keeping it like this seems like a good way to soak the logic.

Member Author:

Also note the comment in line 1046

@alculquicondor (Member Author)

  1. I suggest adding an e2e test for the main issue, it could be similar to the already existing e2e test where we evict a running pod, but instead of checking for the DisruptionTarget condition we could check for the 137 exit code. On master such podFailurePolicy would not work properly as the pod would be running at the moment of checking (or at least is very likely to be still running).

Good idea. Will do.

  2. is the change to look for never scheduled terminating pods required? There is already code in PodGC to move unscheduled terminating pods to Failed phase.

Ah great, I'll revert that bit

  3. similarly, is the change to consider all pods as terminated when finishedCond != nil required?

If changes 2. and 3. are performance optimizations then I would suggest extracting them into a dedicated PR.

When the job is terminated we don't care much anymore. We should clean the running pods ASAP.

@mimowo (Contributor) commented Nov 14, 2022

When the job is terminated we don't care much anymore. We should clean the running pods ASAP.

I see, still the question remains whether it's better to do this in a separate PR.
I suggest making the PR minimal to fix the issue so that it gets cherry-picked for 1.26, and ticketing the other improvement independently, as it's not a recent regression.

Remove check for unscheduled terminating pod

Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec
@alculquicondor (Member Author)

/assign @soltysh
WDYT about including this in 1.26.0? We could also wait for 1.26.1.

@mimowo (Contributor) commented Nov 15, 2022

generally lgtm, two comments:

  • left one note about possibly dropping the unnecessary condition podutil.IsPodTerminal(pod) inside the if, please consider
  • please add an e2e test; alternatively I could take it over in a follow-up PR

@soltysh (Contributor) left a comment

I spoke with Aldo at length on Slack about it; let's land it as is.
/lgtm
/approve
/milestone v1.26

@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 15, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 15, 2022
@soltysh (Contributor) commented Nov 15, 2022

/triage accepted
/priority important-longterm
/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. labels Nov 15, 2022
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 15, 2022
@alculquicondor (Member Author)

/priority important-soon

@k8s-ci-robot k8s-ci-robot added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 15, 2022
@k8s-ci-robot k8s-ci-robot merged commit 7dc36bd into kubernetes:master Nov 15, 2022
@sftim (Contributor) commented Nov 23, 2022

My suggestion for a tweaked changelog entry:

-The job controller does not consider a terminating Pod (A pod with a .metadata.deletionTimestamp) until it is terminal (not scheduled or .status.phase=Failed/Succeeded).
+When the feature gates `PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled, the Job controller now does not consider a terminating Pod (a pod that has a `.metadata.deletionTimestamp`) as a failure until that Pod is terminal (either the Pod was never scheduled to a node, or its `.status.phase` is `Failed` or `Succeeded`).
-However, a replacement Pod is immediately created. Once the pod terminates, the job controller evaluates .backoffLimit and .podFailurePolicy considering this Pod.
+However, the Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller evaluates `.backoffLimit` and `.podFailurePolicy` for the relevant Job, taking this now-terminated Pod into consideration.
-This behavior is limited to Jobs with .spec.podFailurePolicy and when the feature gates PodDisruptionConditions and JobPodFailurePolicy are enabled.
+This behavior is limited to Jobs with `.spec.podFailurePolicy` set, and only when those two feature gates are both enabled.
-If either of these requirements is not satisfied, the job controller counts a terminating Pod as an immediate failure, regardless if the Pod later terminates with phase=Succeeded.
+If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure, even if that Pod later terminates with `phase: Succeeded`.

The same text as Markdown:

When the feature gates `PodDisruptionConditions` and `JobPodFailurePolicy` are both enabled, the Job controller now does not consider a terminating Pod (a pod that has a `.metadata.deletionTimestamp`) as a failure until that Pod is terminal (either the Pod was never scheduled to a node, or its `.status.phase` is `Failed` or `Succeeded`).
However, the Job controller creates a replacement Pod as soon as the termination becomes apparent. Once the pod terminates, the Job controller evaluates `.backoffLimit` and `.podFailurePolicy` for the relevant Job, taking this now-terminated Pod into consideration.
This behavior is limited to Jobs with `.spec.podFailurePolicy` set, and only when those two feature gates are both enabled.
If either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure, even if that Pod later terminates with `phase: Succeeded`.

I have rewritten it based on my understanding of the intended meaning. I'm not sure I 100% understood what was meant. If making changes, please check that what I have put is right.

Is there a k/website PR to update the docs in light of this change?

@alculquicondor (Member Author)

It looks like it wasn't included as part of kubernetes/website#37242.
@mimowo can you send a PR?

@alculquicondor (Member Author)

Release note updated, with a tweak

@sftim (Contributor) commented Nov 23, 2022

Oops. One more change I recommend

- `phase: Succeeded`
+ `phase: "Succeeded"`

@mimowo (Contributor) commented Nov 24, 2022

It looks like it wasn't included as part of kubernetes/website#37242. @mimowo can you send a PR?

Here: kubernetes/website#38040, @alculquicondor, @sftim please review

jaehnri pushed a commit to jaehnri/kubernetes that referenced this pull request Jan 3, 2023
…113860)

* Wait for Pods to finish before considering Failed

Limit behavior to feature gates PodDisruptionConditions and
JobPodFailurePolicy and jobs with a podFailurePolicy.

Change-Id: I926391cc2521b389c8e52962afb0d4a6a845ab8f

* Remove check for unscheduled terminating pod

Change-Id: I3dc05bb4ea3738604f01bf8cb5fc8cc0f6ea54ec