Adjustment of Pod’s Tolerance Time for Node Exceptions
1. Principle Explanation
When a node in a Kubernetes cluster enters an abnormal state, there is a waiting period before Pods on the node are evicted. For critical services, can this waiting time be shortened so that Pods are evicted promptly and recreated on healthy nodes when a node fails?
To solve this problem, we must first understand how Kubernetes evicts Pods when nodes are abnormal.
In Kubernetes 1.13 and later versions, the TaintBasedEvictions and TaintNodesByCondition feature gates are enabled by default. Node and Pod lifecycle management is handled through node Conditions and Taints. Kubernetes continuously checks the status of all nodes, sets corresponding Conditions, applies Taints to nodes based on their Conditions, and then evicts Pods from nodes based on these Taints.
When a Pod is created, tolerations with a tolerationSeconds parameter are added to it by default. This parameter specifies how long the Pod stays bound to an abnormal node (e.g., one in the NotReady state) before it is evicted.
So, the time from a node becoming abnormal to its Pods being evicted is determined by two parameters: 1. The time from the actual node exception to the node being judged unhealthy; 2. The Pod’s tolerance time for unhealthy nodes.
In a Kubernetes cluster, the default time from the actual node exception to the node being judged unhealthy is 40s, and the Pod’s default tolerance time for NotReady nodes is 5min. This means that Pods on the node are evicted 5min40s (340s) after the node actually fails.
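The arithmetic above can be sketched as a small shell helper. This is only an illustration: the function name eviction_delay is hypothetical, both arguments are plain seconds, and the result is an approximation that ignores detection jitter.

```shell
# Approximate delay (seconds) between a node failing and its Pods being evicted.
# $1: node-monitor-grace-period in seconds
# $2: tolerationSeconds of the Pod
eviction_delay() {
  local grace_period="$1"
  local toleration="$2"
  echo $(( grace_period + toleration ))
}

eviction_delay 40 300   # defaults described above
eviction_delay 20 100   # example values used in sections 2 and 3
```

With the defaults this gives the 340s mentioned above; lowering both values as in the later sections brings the delay down to roughly two minutes.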
2. Adjust the Time the Node is Marked Unhealthy
The ControllerManager parameter --node-monitor-grace-period controls the maximum duration a node may be unresponsive before it is marked unhealthy. The default value of this parameter is 40s, and it must be N times Kubelet’s nodeStatusUpdateFrequency parameter (the interval at which Kubelet reports the node status to the master node), where N is the number of retries allowed for Kubelet to post node status.
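To see how a candidate grace period relates to the Kubelet reporting interval, a rough check such as the following may help. The function name status_retries is hypothetical, both arguments are plain seconds, and the 10s interval used in the examples is an assumption about your Kubelet configuration that should be verified on your own nodes.

```shell
# Number of Kubelet status reports (N) that fit into the grace period.
# $1: node-monitor-grace-period in seconds
# $2: Kubelet nodeStatusUpdateFrequency in seconds
status_retries() {
  local grace="$1"
  local frequency="$2"
  local retries=$(( grace / frequency ))
  if [ "$retries" -lt 1 ]; then
    # The grace period is shorter than a single reporting interval.
    echo "grace period ${grace}s is shorter than the ${frequency}s reporting interval" >&2
    return 1
  fi
  echo "$retries"
}

status_retries 40 10   # N for the default grace period, assuming a 10s interval
status_retries 20 10   # N after lowering the grace period to 20s
```

A small N leaves little room for a delayed or lost status report, so a shorter grace period trades faster detection for a higher chance of marking a healthy node unhealthy.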
If you need to modify this parameter, please perform the following operations on each of the three Master nodes:
- Back up the ControllerManager configuration file /etc/kubernetes/controller-manager, then add the parameter --node-monitor-grace-period=20s to it to reduce the time before a node is marked unhealthy to 20s;
- Run systemctl restart kube-controller-manager to restart ControllerManager;
- Run systemctl status kube-controller-manager to confirm that the ControllerManager status is active.
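The backup and restart steps can be wrapped in two small helpers, sketched below. The function names backup_config and restart_and_verify are hypothetical. Editing the file itself is left as a manual step, because the layout of the configuration file is not shown here.

```shell
# Copy a config file to a timestamped backup next to it and print the backup path.
backup_config() {
  local file="$1"
  local backup="${file}.bak.$(date +%Y%m%d%H%M%S)"
  cp -p -- "$file" "$backup" && echo "$backup"
}

# Restart a systemd unit and fail unless it reports "active" afterwards.
restart_and_verify() {
  local unit="$1"
  systemctl restart "$unit" && systemctl is-active --quiet "$unit"
}

# Intended usage on each Master node (run as root):
#   backup_config /etc/kubernetes/controller-manager
#   ... edit the file to add --node-monitor-grace-period=20s ...
#   restart_and_verify kube-controller-manager
```

Restarting one Master node at a time and checking the result before moving on keeps the other ControllerManager instances available if the new parameter prevents the service from starting.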
3. Adjust the Pod’s Tolerance Time for Unhealthy Nodes
When a Pod is created without these tolerations being specified explicitly, Kubernetes (through the DefaultTolerationSeconds admission controller) adds the following tolerations to the Pod:
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300
These automatically added tolerations mean that when one of these issues (NotReady / Unreachable) is detected, the Pod stays bound to the current node for 5 minutes by default before it is evicted.
Note: When Pods in DaemonSet are created, NoExecute tolerations added automatically for unreachable / not-ready taints won’t specify tolerationSeconds, ensuring that Pods in DaemonSet will never be evicted when the corresponding issue occurs.
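To check which tolerations a running Pod actually carries, something like the following sketch can be used. The function name show_tolerations is hypothetical, and my-pod in the usage comment is a placeholder for one of your own Pods.

```shell
# Print each toleration key of a Pod with its tolerationSeconds (empty when unset).
# $1: Pod name; $2: namespace (defaults to "default")
show_tolerations() {
  local pod="$1"
  local namespace="${2:-default}"
  kubectl get pod "$pod" -n "$namespace" \
    -o jsonpath='{range .spec.tolerations[*]}{.key}{"\t"}{.tolerationSeconds}{"\n"}{end}'
}

# Example: show_tolerations my-pod
```

For a Pod that belongs to a DaemonSet, the second column should be empty for the unreachable and not-ready keys, which matches the note above.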
3.1 Adjust Default Tolerance Duration
The tolerance duration for unreachable / not-ready taints that Kubernetes automatically adds to Pods is controlled by related parameters in the APIServer. If you need to modify it, please perform the following operations on each of the three Master nodes:
- Back up the APIServer configuration file /etc/kubernetes/apiserver, then add the parameters --default-not-ready-toleration-seconds=100 and --default-unreachable-toleration-seconds=100 to it to adjust the tolerance time (in seconds, 300 by default) for the NotReady:NoExecute and Unreachable:NoExecute taints to 100s;
- Run systemctl restart kube-apiserver to restart APIServer;
- Run systemctl status kube-apiserver to confirm that the APIServer status is active.
3.2 Adjust Existing Pod Tolerance Duration
Taking Pods created through a Deployment as an example, we need to modify the tolerations parameter in the existing Deployment using the kubectl patch command.
First, create the patch file tolerationseconds.yaml, as shown in the example:
spec:
  template:
    spec:
      tolerations:
      - key: "node.kubernetes.io/unreachable"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for Unreachable:NoExecute taint to 100s
        tolerationSeconds: 100
      - key: "node.kubernetes.io/not-ready"
        operator: "Exists"
        effect: "NoExecute"
        # Adjust the Pod's tolerance time for NotReady:NoExecute taint to 100s
        tolerationSeconds: 100
Then run the command kubectl patch deploy your-deployment --patch "$(cat tolerationseconds.yaml)" to modify the Deployment. After the modification, you will find that the tolerance duration of the corresponding taints in the Pods controlled by this Deployment has been modified.
⚠️ This operation causes the Deployment to rebuild all of its Pods, so please perform it during off-peak hours.
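The patch and a follow-up check can be combined as in the sketch below. The function name patch_and_verify is hypothetical, and your-deployment in the usage comment is the same placeholder used above. If the Deployment already defines other tolerations, check them afterwards, since a patch to the tolerations list may replace the list rather than merge into it.

```shell
# Apply the patch file, wait for the rollout, then print the resulting tolerations.
# $1: Deployment name; $2: patch file (for example tolerationseconds.yaml)
patch_and_verify() {
  local deployment="$1"
  local patch_file="$2"
  kubectl patch deployment "$deployment" --patch "$(cat "$patch_file")" &&
  kubectl rollout status deployment "$deployment" &&
  kubectl get deployment "$deployment" \
    -o jsonpath='{.spec.template.spec.tolerations}'
}

# Example: patch_and_verify your-deployment tolerationseconds.yaml
```

Waiting for kubectl rollout status to return before reading the tolerations back makes it easier to tell a failed rollout apart from a patch that was not applied.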