Incident: k8s nodes frequently going NotReady

From time to time we receive NotReady alerts for individual k8s nodes, roughly like this:

Component:kubelet
Count:1
CreatedTimestamp:2019-11-06 16:42:51
Subject:Node unavailable-default/nodenamexxx.1598ab13622ce31b
FirstTimestamp:2019-11-06 16:42:50
LastTimestamp:2019-11-06 16:42:50
Name:nodenamexxx.1598ab13622ce31b
Namespace:default
Reason:NodeNotReady

But when we then run get node, the node usually shows Ready, or returns to Ready shortly afterwards.

Grepping the kubelet logs on the affected node turns up the keyword "failed to patch status". Let's look at the code.

The kubelet heartbeat mechanism

v1.9 pkg/kubelet/kubelet.go

if kl.kubeClient != nil {
  // Start syncing node status immediately, this may set up things the runtime needs to run.
  go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
}

func (kl *Kubelet) syncNodeStatus() {
	if kl.kubeClient == nil || kl.heartbeatClient == nil {
		return
	}
	if kl.registerNode {
		kl.registerWithAPIServer()
	}
	if err := kl.updateNodeStatus(); err != nil {
		glog.Errorf("Unable to update node status: %v", err)
	}
}

// updateNodeStatus updates node status to master with retries.
func (kl *Kubelet) updateNodeStatus() error {
	for i := 0; i < nodeStatusUpdateRetry; i++ {
		if err := kl.tryUpdateNodeStatus(i); err != nil {
			if i > 0 && kl.onRepeatedHeartbeatFailure != nil {
				kl.onRepeatedHeartbeatFailure()
			}
			glog.Errorf("Error updating node status, will retry: %v", err)
		} else {
			return nil
		}
	}
	return fmt.Errorf("update node status exceeds retry count")
}

func (kl *Kubelet) tryUpdateNodeStatus(tryNumber int) error {
	opts := metav1.GetOptions{}
	if tryNumber == 0 {
		util.FromApiserverCache(&opts)
	}
	node, err := kl.heartbeatClient.Nodes().Get(string(kl.nodeName), opts)
	if err != nil {
		return fmt.Errorf("error getting node %q: %v", kl.nodeName, err)
	}

	originalNode := node.DeepCopy()
	if originalNode == nil {
		return fmt.Errorf("nil %q node object", kl.nodeName)
	}

	kl.updatePodCIDR(node.Spec.PodCIDR)

	kl.setNodeStatus(node)
	// Patch the current status on the API server
	updatedNode, err := nodeutil.PatchNodeStatus(kl.heartbeatClient, types.NodeName(kl.nodeName), originalNode, node)
	if err != nil {
		return err
	}
	kl.volumeManager.MarkVolumesAsReportedInUse(updatedNode.Status.VolumesInUse)
	return nil
}

By adding timestamped log lines we traced the slowness to tryUpdateNodeStatus, specifically to the following three calls, even though none of them should be slow: opts := metav1.GetOptions{}, kl.updatePodCIDR(node.Spec.PodCIDR), and kl.setNodeStatus(node). The instrumentation showed delays of a minute and a half or even longer at times.
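
To locate the stall, one option is to wrap each step in a small timing helper. Below is a minimal sketch, assuming we temporarily patch a local kubelet build; the helper name timeStep and the log format are ours, not kubelet's, and in a real kubelet patch you would likely use glog.Infof so the measurements land next to the "failed to patch status" errors in the kubelet log.

package main

import (
	"log"
	"time"
)

// timeStep runs fn and logs its duration. In a kubelet patch this would be
// wrapped around each step of tryUpdateNodeStatus (the Get call, updatePodCIDR,
// setNodeStatus, PatchNodeStatus) to show where the 1.5-minute stalls occur.
func timeStep(name string, fn func()) {
	start := time.Now()
	fn()
	log.Printf("heartbeat step %q took %v", name, time.Since(start))
}

func main() {
	// Hypothetical usage inside tryUpdateNodeStatus:
	//   timeStep("setNodeStatus", func() { kl.setNodeStatus(node) })
	timeStep("demo", func() { time.Sleep(50 * time.Millisecond) })
}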

Does this have anything to do with how the event is generated?

Under what circumstances is a NotReady event produced?

The monitorNodeStatus() method in pkg/controller/node_controller.go consumes the heartbeats that kubelet keeps sending and validates the node status; if anything looks wrong it sets the status to Unknown, and if the node does not recover within a certain period it even evicts all pods on the node.

// monitorNodeStatus verifies node status are constantly updated by kubelet, and if not,
// post "NodeReady==ConditionUnknown". It also evicts all pods if node is not ready or
// not reachable for a long period of time.
func (nc *Controller) monitorNodeStatus() error {
	// ...
	// Report node event.
	if currentReadyCondition.Status != v1.ConditionTrue && observedReadyCondition.Status == v1.ConditionTrue {
		util.RecordNodeStatusChange(nc.recorder, node, "NodeNotReady")
		if err = util.MarkAllPodsNotReady(nc.kubeClient, node); err != nil {
			utilruntime.HandleError(fmt.Errorf("Unable to mark all pods NotReady on node %v: %v", node.Name, err))
		}
	}
	// ...
}

The event is emitted when the node's previously observed status was Ready but its current status is not Ready, so the key is how currentReadyCondition.Status gets set:

gracePeriod, observedReadyCondition, currentReadyCondition, err = nc.tryUpdateNodeStatus(node)
// ...
gracePeriod = nc.nodeMonitorGracePeriod
// ...
if nc.now().After(savedNodeStatus.probeTimestamp.Add(gracePeriod)) {
	glog.V(4).Infof("node %v hasn't been updated for %+v. Last ready condition is: %+v",
		node.Name, nc.now().Time.Sub(savedNodeStatus.probeTimestamp.Time), observedReadyCondition)
	if observedReadyCondition.Status != v1.ConditionUnknown {
		currentReadyCondition.Status = v1.ConditionUnknown
		currentReadyCondition.Reason = "NodeStatusUnknown"
		currentReadyCondition.Message = "Kubelet stopped posting node status."
		// LastProbeTime is the last time we heard from kubelet.
		currentReadyCondition.LastHeartbeatTime = observedReadyCondition.LastHeartbeatTime
		currentReadyCondition.LastTransitionTime = nc.now()
	}
}

In other words, if the current time is later than the last heartbeat (probe) timestamp plus gracePeriod, the heartbeat arrived too late and something is wrong, so the Ready condition is set to Unknown and a log entry is written. nodeMonitorGracePeriod comes from the node-monitor-grace-period flag, whose default is 40s, and our situation matches this condition.

So if kubelet's tryUpdateNodeStatus is stuck for more than 40 seconds, this event is triggered.
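
To make the arithmetic concrete, here is a standalone Go illustration (not controller code) of the grace-period check: with the default 10s update frequency and 40s grace period, roughly four consecutive missed heartbeats are enough for the node controller to flip the Ready condition to Unknown.

package main

import (
	"fmt"
	"time"
)

// heartbeatExpired mirrors the check above: the node is considered stale once
// "now" is later than the last probe timestamp plus the grace period.
func heartbeatExpired(lastProbe time.Time, gracePeriod time.Duration) bool {
	return time.Now().After(lastProbe.Add(gracePeriod))
}

func main() {
	gracePeriod := 40 * time.Second // default --node-monitor-grace-period
	// Simulate a kubelet whose tryUpdateNodeStatus has been stuck for ~90s.
	lastProbe := time.Now().Add(-90 * time.Second)
	fmt.Println(heartbeatExpired(lastProbe, gracePeriod)) // true -> NodeNotReady event
}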

The reporting frequency is set with the --node-status-update-frequency flag; the default is one report every 10s, i.e. 6 heartbeats per node per minute. Using the k8s watch mechanism, we opened a watch on the Node resource and tracked UPDATE events (a counting sketch follows the list below), and found that some individual nodes had noticeably fewer heartbeats overall:

  • Find the node with the fewest heartbeats over the last hour:
  • The heartbeat frequency of that lowest-heartbeat node:
  • The host metrics of that node during the period when its heartbeat curve was abnormal: CPU iowait was clearly abnormal and positively correlated with the heartbeat anomaly:
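
A rough sketch of the watch-based counting is below; it is not the exact tooling we used. It assumes a client-go version contemporary with k8s v1.9 (Watch takes no context argument), the kubeconfig path is only a placeholder, and MODIFIED events include every Node update rather than heartbeats alone, which is close enough for spotting nodes with sparse heartbeats.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; adjust for your environment.
	config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	w, err := client.CoreV1().Nodes().Watch(metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Count MODIFIED events per node; with the default 10s update frequency a
	// healthy node should produce roughly six of these per minute.
	counts := map[string]int{}
	for ev := range w.ResultChan() {
		if ev.Type != watch.Modified {
			continue
		}
		node, ok := ev.Object.(*corev1.Node)
		if !ok {
			continue
		}
		counts[node.Name]++
		fmt.Printf("node %s: %d updates seen\n", node.Name, counts[node.Name])
	}
}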

Tracing high-I/O processes with pidstat, we found that an application on this node had debug-level logging enabled in production; the excessive log volume created heavy I/O pressure. After evicting that pod to another node, this node recovered, but the node that received the pod then started showing heartbeat anomalies.

In addition, starting with k8s 1.13 a Lease-based heartbeat was introduced to replace the Node status heartbeat; it avoids re-uploading all of the node's status data on every beat. It is not enabled by default (it sits behind the NodeLease feature gate), but enabling it reduces the amount of data in each update and lowers the latency.
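
For intuition, here is a conceptual sketch of what a lease-based heartbeat boils down to. This is not kubelet's implementation; it assumes a client-go version contemporary with k8s 1.13 (methods take no context argument) and the standard kube-node-lease namespace, where each node's Lease object shares the node's name.

package nodelease

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// renewNodeLease bumps only the RenewTime of the node's Lease, which is why a
// lease heartbeat carries far less data than a full Node status update.
func renewNodeLease(client kubernetes.Interface, nodeName string) error {
	lease, err := client.CoordinationV1beta1().Leases("kube-node-lease").Get(nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	now := metav1.NewMicroTime(time.Now())
	lease.Spec.RenewTime = &now
	_, err = client.CoordinationV1beta1().Leases("kube-node-lease").Update(lease)
	return err
}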

But why does kubelet's tryUpdateNodeStatus get stuck in the first place? To be continued…