How the kubelet handles redirects on an application's healthCheck endpoint

This afternoon I received an alert for an abnormal event. Its Reason was ProbeWarning, which I had never seen before.

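Earlier that afternoon I had added event monitoring to our new k8s version (1.14), and the alert above arrived shortly afterwards. We never saw it on the previous version (1.9). The service had been running normally the whole time, the pod had never restarted, and there were no Unhealthy events, so why was there a Warning?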

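I requested the application's healthCheck endpoint by hand and found that it redirected to the cas service: the URL changed from serviceA.xx.xx to cas.xx.xx/serviceA.xx.xx, which is a different hostname. Could that be the cause?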

Cause

I found an entry in the k8s 1.14 changelog that mentions generating the ProbeWarning reason. The original text:

kubelet
The deprecated --experimental-fail-swap-on flag has been removed (#69552, @Pingan2017)
Health check (liveness & readiness) probes using an HTTPGetAction will no longer follow redirects to different hostnames from the original probe request. Instead, these non-local redirects will be treated as a Success (the documented behavior). In this case an event with reason "ProbeWarning" will be generated, indicating that the redirect was ignored. If you were previously relying on the redirect to run health checks against different endpoints, you will need to perform the healthcheck logic outside the Kubelet, for instance by proxying the external endpoint rather than redirecting to it. (#75416, @tallclair)

How it works

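From there I found the corresponding PR and the issue behind it. The issue points out that when the kubelet ran a health check, it followed HTTP redirects and used the result of the redirected request as the verdict, which does not match the documented behavior: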

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.

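My reading is that the check was never meant to include redirects. Where a redirect does happen, one to a different path on the same host is acceptable, but following a redirect to another host is not something people agree with. Go's HTTP library (net/http), however, follows redirects by default. So the behavior was changed: only redirects to the same hostname are followed. A redirect to a different hostname is not followed, the probe is treated as healthy right away (without checking whether the target is actually healthy), and a warning event is recorded, which is the ProbeWarning seen here.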
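The rule described above can be sketched as a small function. This is an illustration rather than the kubelet's actual code: the Result type and classify are hypothetical names, and the sketch assumes the HTTP client has already stopped at any cross-host redirect, so a final 3xx status can only mean a redirect that was not followed.

```go
package main

import (
	"fmt"
	"net/http"
)

// Result is a hypothetical stand-in for the probe outcome.
type Result string

const (
	Success Result = "success"
	Warning Result = "warning"
	Failure Result = "failure"
)

// classify maps the final status code seen by the probe client to a result.
// Assumption: same-host redirects were already followed by the client, so a
// remaining 3xx means a cross-host redirect that was ignored.
func classify(statusCode int) Result {
	if statusCode >= http.StatusOK && statusCode < http.StatusBadRequest {
		if statusCode >= http.StatusMultipleChoices {
			return Warning // still counts as healthy, but worth an event
		}
		return Success
	}
	return Failure
}

func main() {
	for _, code := range []int{200, 302, 404, 500} {
		fmt.Printf("%d -> %s\n", code, classify(code))
	}
}
```

A warning is a kind of success here: the pod is not restarted and not marked unready, which matches what we observed.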

Source code

Blocking cross-host redirects

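In the implementation, the client is built with an explicit CheckRedirect, which both refuses cross-host redirects and guards against redirect loops.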

client := &http.Client{
	CheckRedirect: redirectChecker(false),
}

func redirectChecker(followNonLocalRedirects bool) func(*http.Request, []*http.Request) error {
	if followNonLocalRedirects {
		return nil // Use the default http client checker.
	}

	return func(req *http.Request, via []*http.Request) error {
		if req.URL.Hostname() != via[0].URL.Hostname() {
			return http.ErrUseLastResponse
		}
		// Default behavior: stop after 10 redirects.
		if len(via) >= 10 {
			return errors.New("stopped after 10 redirects")
		}
		return nil
	}
}

The key part is the following:

if req.URL.Hostname() != via[0].URL.Hostname() {
  fmt.Printf("currentHost:%s oriHost:%s error:%s", req.URL.Hostname(), via[0].URL.Hostname(), http.ErrUseLastResponse)
  return http.ErrUseLastResponse
}

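It compares the hostname of the upcoming request with the hostname of the original request (via[0]). If they differ, it returns http.ErrUseLastResponse instead of nil, so the client stops following the redirect and hands back the most recent response.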

Explanation

Take the following handler, for example:

http.HandleFunc("/redirect-to-null", func(w http.ResponseWriter, r *http.Request) {
  log.Printf("redirect-to-null")
  http.Redirect(w, r, "http://null", http.StatusFound)
})

If the checker returns http.ErrUseLastResponse, the request returns right away: currentHost:null oriHost:localhost error:net/http: use last response string:Found.

result:warning

If it returns nil instead, the redirect is followed: currentHost:null oriHost:localhost error:net/http: use last response string:Get http://null: dial tcp: lookup null: no such host result:failure

For reference, see net/http/client.go:


	// CheckRedirect specifies the policy for handling redirects.
	// If CheckRedirect is not nil, the client calls it before
	// following an HTTP redirect. The arguments req and via are
	// the upcoming request and the requests made already, oldest
	// first. If CheckRedirect returns an error, the Client's Get
	// method returns both the previous Response (with its Body
	// closed) and CheckRedirect's error (wrapped in a url.Error)
	// instead of issuing the Request req.
	// As a special case, if CheckRedirect returns ErrUseLastResponse,
	// then the most recent response is returned with its body
	// unclosed, along with a nil error.
	//
	// If CheckRedirect is nil, the Client uses its default policy,
	// which is to stop after 10 consecutive requests.
	CheckRedirect func(req *Request, via []*Request) error

Event source:
if result == probe.Warning {
  if ref, hasRef := pb.refManager.GetRef(containerID); hasRef {
    pb.recorder.Eventf(ref, v1.EventTypeWarning, events.ContainerProbeWarning, "%s probe warning: %s", probeType, output)
  }
  klog.V(3).Infof("%s probe for %q succeeded with a warning: %s", probeType, ctrName, output)
} else {
  klog.V(3).Infof("%s probe for %q succeeded", probeType, ctrName)
}

Conclusion

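A health check endpoint should be as lean as possible: it should reflect the basic state of the service without being long-winded. Endpoints that redirect are indeed unwelcome. In our case the redirect came from an authentication interceptor, and authenticating this endpoint has little value, so we accepted the new behavior and are pushing for the application to be changed.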