Adjustment of Pod’s Tolerance Time for Node Exceptions

1. Principle Explanation

When a node in a Kubernetes cluster enters an abnormal state, there is a waiting period before the Pods on that node are evicted. For critical services, can this waiting period be shortened so that Pods are evicted promptly and recreated on healthy nodes when a node fails?

To solve this problem, we must first understand how Kubernetes evicts Pods when nodes are abnormal.

In Kubernetes 1.13 and later versions, the TaintBasedEvictions and TaintNodesByCondition feature gates are enabled by default. Node and Pod lifecycle management is handled through node Conditions and Taints. Kubernetes continuously checks the status of all nodes, sets corresponding Conditions, applies Taints to nodes based on their Conditions, and then evicts Pods from nodes based on these Taints.

When a Pod is created, Kubernetes adds tolerations with a tolerationSeconds value by default. This value specifies how long the Pod may remain on an abnormal node (for example, one in the NotReady state) before it is evicted.

The time from a node becoming abnormal to its Pods being evicted is therefore determined by two parameters: 1. The time from the actual node failure to the node being judged unhealthy; 2. The Pod's tolerance time for unhealthy nodes.

By default, a Kubernetes cluster takes 40s to judge a failed node unhealthy, and a Pod tolerates a NotReady node for 5min. Pods on a failed node are therefore evicted about 5min40s (340s) after the actual failure.
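The arithmetic above can be sketched as a small shell function. The name eviction_delay_seconds is hypothetical and used only for illustration; the inputs are the two defaults quoted in the text.

```shell
# Hypothetical helper: approximate delay between a node failing and its Pods being evicted.
# $1 = time until the node is judged unhealthy (seconds)
# $2 = the Pod's tolerationSeconds
eviction_delay_seconds() {
  echo $(( $1 + $2 ))
}

eviction_delay_seconds 40 300   # prints 340
```

With the values used later in this article (20s and 100s), the same function gives 120.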

2. Adjust the Time the Node is Marked Unhealthy

The ControllerManager parameter --node-monitor-grace-period controls how long a node may remain unresponsive before it is marked unhealthy. The default value is 40s. The value must be N times Kubelet's nodeStatusUpdateFrequency parameter (the interval at which Kubelet reports node status to the master node), where N is the number of retries Kubelet is allowed when sending node status.
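As a rough sanity check of this relationship, the sketch below divides the grace period by the reporting interval. The function name is hypothetical, and the 10s interval is an assumption that the text does not state; read the real nodeStatusUpdateFrequency from your Kubelet configuration before relying on the result.

```shell
# Hypothetical helper: how many Kubelet status reports fit inside the grace period.
# $1 = node-monitor-grace-period in seconds
# $2 = nodeStatusUpdateFrequency in seconds
status_reports_within_grace() {
  echo $(( $1 / $2 ))
}

# 10 is an assumed nodeStatusUpdateFrequency, not a value taken from this article.
status_reports_within_grace 40 10   # prints 4
status_reports_within_grace 20 10   # prints 2
```

Under that assumption, lowering the grace period to 20s leaves room for only two missed reports before the node is marked unhealthy, which makes false positives on a briefly congested network more likely.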

To modify this parameter, perform the following operations on each of the three Master nodes:

Back up the ControllerManager configuration file /etc/kubernetes/controller-manager, then add the parameter --node-monitor-grace-period=20s to it so that a node is marked unhealthy after 20s;
Run systemctl restart kube-controller-manager to restart ControllerManager;
Run systemctl status kube-controller-manager to confirm that the ControllerManager status is active.
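The backup step can be sketched as below. backup_config is a hypothetical helper name. Adding the flag itself is left as a manual step, because the layout of the configuration file depends on how the cluster was installed.

```shell
# Hypothetical helper: copy a config file to a timestamped backup and print the backup path.
backup_config() {
  src="$1"
  backup="${src}.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$src" "$backup" && printf '%s\n' "$backup"
}

# Intended use on each Master node (path and unit name as given in the steps above):
#   backup_config /etc/kubernetes/controller-manager
#   ... add --node-monitor-grace-period=20s to the file by hand ...
#   systemctl restart kube-controller-manager
#   systemctl is-active kube-controller-manager   # expected to report "active"
```

Keeping a timestamped copy makes it easy to roll back if ControllerManager fails to start after the change.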

3. Adjust the Pod’s Tolerance Time for Unhealthy Nodes

When a Pod is created without explicit tolerations for these taints, Kubernetes (through the DefaultTolerationSeconds admission controller in the APIServer) adds the following tolerations to the Pod:


tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300

These automatically added tolerations mean that when one of the two conditions (NotReady or Unreachable) is detected, the Pod can stay on the current node for 5 minutes by default before it is evicted.

Note: When DaemonSet Pods are created, the NoExecute tolerations added automatically for the unreachable and not-ready taints do not specify tolerationSeconds, so DaemonSet Pods are never evicted when either condition occurs.
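To see which tolerations a running Pod actually carries, a sketch like the following can help. show_pod_tolerations is a hypothetical wrapper name and my-pod is a placeholder; the wrapper only calls kubectl get with a jsonpath expression.

```shell
# Hypothetical helper: print the tolerations recorded in a Pod's spec.
# $1 = Pod name, $2 = namespace (defaults to "default")
show_pod_tolerations() {
  kubectl get pod "$1" -n "${2:-default}" -o jsonpath='{.spec.tolerations}'
}

# Example (my-pod is a placeholder name):
#   show_pod_tolerations my-pod
```

For an ordinary Pod the output should include tolerationSeconds for both keys, while for a DaemonSet Pod that field should be absent.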

3.1 Adjust Default Tolerance Duration

The toleration duration that Kubernetes automatically adds to Pods for the unreachable and not-ready taints is controlled by parameters of the APIServer. To modify it, perform the following operations on each of the three Master nodes:

Back up the APIServer configuration file /etc/kubernetes/apiserver, then add the parameters --default-not-ready-toleration-seconds=100 and --default-unreachable-toleration-seconds=100 to it. This changes the toleration time (in seconds, 300 by default) for the NotReady:NoExecute and Unreachable:NoExecute taints to 100s;
Run systemctl restart kube-apiserver to restart APIServer;
Run systemctl status kube-apiserver to confirm that the APIServer status is active.

3.2 Adjust Existing Pod Tolerance Duration

Taking Pods created through a Deployment as an example, modify the tolerations field in the existing Deployment with the kubectl patch command.

First, create the patch file tolerationseconds.yaml, as shown in this example:


spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for Unreachable:NoExecute taint to 100s
        tolerationSeconds: 100
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for NotReady:NoExecute taint to 100s
        tolerationSeconds: 100

Then run the command kubectl patch deploy your-deployment --patch "$(cat tolerationseconds.yaml)" to modify the Deployment. After the modification, the Pods controlled by this Deployment carry the new toleration duration for the corresponding taints.

⚠️ This operation causes the Deployment to recreate all of its Pods, so perform it during off-peak hours.
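The patch-and-verify sequence can be sketched as follows. render_toleration_patch is a hypothetical helper that prints the same patch file with a configurable duration; your-deployment is the placeholder already used above.

```shell
# Hypothetical helper: print a Deployment patch that sets both tolerations to $1 seconds.
render_toleration_patch() {
  cat <<EOF
spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: $1
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        tolerationSeconds: $1
EOF
}

# Example:
#   render_toleration_patch 100 > tolerationseconds.yaml
#   kubectl patch deploy your-deployment --patch "$(cat tolerationseconds.yaml)"
#   kubectl rollout status deploy/your-deployment
#   kubectl get deploy your-deployment -o jsonpath='{.spec.template.spec.tolerations}'
```

Waiting on kubectl rollout status before reading the tolerations back confirms that the recreated Pods, and not only the Deployment template, have picked up the new value.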