Notes on latency jitter and intermittent connection stalls caused by iptables-invalid-drop in K8s

Contents

1 Prerequisites

2 Possible causes & symptoms

3 Related

3.1 github issue

Prerequisites

The cluster runs kube-proxy, and kube-proxy works in iptables mode

Optional condition: calico CNI

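Possible causes & symptoms

Overlay pods communicating with services outside the cluster

Traffic between underlay and overlay networks (the outbound path goes through the overlay and the return path through the underlay, which results in asymmetric routing)

conntrack saturation?

Any of these can produce occasional latency spikes or intermittent connection stalls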
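
To make the conntrack saturation question above checkable, here is a minimal sketch that compares the kernel's current entry count with its limit. It assumes a Linux node with the nf_conntrack module loaded; the function names and the 80% threshold are illustrative assumptions, not values taken from these notes.

```shell
#!/bin/sh
# conntrack_usage COUNT MAX: print table usage as an integer percentage.
conntrack_usage() {
    count=$1
    max=$2
    if [ "$max" -le 0 ]; then
        echo "invalid max: $max" >&2
        return 1
    fi
    echo $(( count * 100 / max ))
}

# conntrack_check: read the live counters from the kernel (Linux only).
conntrack_check() {
    count=$(cat /proc/sys/net/netfilter/nf_conntrack_count) || return 1
    max=$(cat /proc/sys/net/netfilter/nf_conntrack_max) || return 1
    pct=$(conntrack_usage "$count" "$max") || return 1
    echo "conntrack: $count/$max (${pct}%)"
    # 80 is an arbitrary illustrative threshold, not a recommended value.
    if [ "$pct" -ge 80 ]; then
        echo "WARNING: conntrack table is close to saturation"
    fi
}
```

Run conntrack_check on a suspect node. A table that is already full usually also leaves "nf_conntrack: table full, dropping packet" messages in the kernel log.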

In the KUBE-FORWARD chain of the filter table, which kube-proxy maintains, there is a rule -A KUBE-FORWARD -m conntrack --ctstate INVALID -j DROP

[root@gzu-prd ~]# iptables -L KUBE-FORWARD --line -nv

Chain KUBE-FORWARD (1 references)

num pkts bytes target prot opt in out source destination

1 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 ctstate INVALID

2 4 240 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes forwarding rules */ mark match 0x4000/0x4000

3 11412 33M ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED

4 0 0 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED

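This rule drops any traffic that connection tracking marks as INVALID, and the behavior currently cannot be disabled through configuration (short of changing the code and recompiling)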

The connection tracking state of TCP flows can be inspected with conntrack -L or cat /proc/net/nf_conntrack (for example, flags such as [UNREPLIED])
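
As a rough illustration of reading that table, the sketch below counts TCP entries per state and how many carry the [UNREPLIED] flag. It assumes the usual /proc/net/nf_conntrack layout, where the third field is the protocol name and the sixth field is the TCP state; the function name is hypothetical.

```shell
#!/bin/sh
# conntrack_tcp_states: read nf_conntrack-formatted lines on stdin and
# print "STATE COUNT" per TCP state, plus the number of [UNREPLIED] entries.
conntrack_tcp_states() {
    awk '$3 == "tcp" {
             state[$6]++
             if (index($0, "[UNREPLIED]")) unreplied++
         }
         END {
             for (s in state) print s, state[s]
             print "UNREPLIED", unreplied + 0
         }' | sort
}
```

Typical usage on a node would be conntrack_tcp_states < /proc/net/nf_conntrack. Packets classified as INVALID are not listed as entries themselves; conntrack -S reports a per-CPU invalid counter that can be watched alongside this summary.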

kube-proxy bluntly flushes its iptables rules whenever endpoints change, so the problem cannot be avoided by simply inserting an ACCEPT rule into KUBE-FORWARD

Likewise, in the filter table chains maintained by calico, nearly every cali-fw-cali**** chain also contains the rule -m conntrack --ctstate INVALID -j DROP

[root@gzu-prd ~]# iptables-save -t filter|grep INVALID

-A cali-fw-cali02fca994756 -m comment --comment "cali:Zgj-5PhkyRyRGc5v" -m conntrack --ctstate INVALID -j DROP

-A cali-fw-cali091fd1acd82 -m comment --comment "cali:vySNraYuHVkcwzZC" -m conntrack --ctstate INVALID -j DROP

-A cali-fw-cali0945b5ec7e6 -m comment --comment "cali:YpO6T4K2fN2biMqp" -m conntrack --ctstate INVALID -j DROP

-A cali-fw-cali09725d6075c -m comment --comment "cali:3Q23jKsPGkXWWHjs" -m conntrack --ctstate INVALID -j DROP

In calico's case, however, this behavior can be turned off with the FELIX_DISABLECONNTRACKINVALIDCHECK environment variable
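
A minimal sketch of setting that variable, assuming a manifest-based install where Felix runs in a DaemonSet named calico-node in the kube-system namespace. Both names are assumptions, so adjust them to your cluster; an operator-managed install may revert manual changes to the DaemonSet.

```shell
#!/bin/sh
# disable_felix_invalid_check [NAMESPACE] [DAEMONSET]
# Sets FELIX_DISABLECONNTRACKINVALIDCHECK=true on the calico-node DaemonSet.
# The defaults below are assumptions about a typical manifest install.
disable_felix_invalid_check() {
    ns=${1:-kube-system}
    ds=${2:-calico-node}
    kubectl -n "$ns" set env "daemonset/$ds" FELIX_DISABLECONNTRACKINVALIDCHECK=true
}
```

Changing the environment triggers a rolling restart of the calico-node pods, so it is worth trying on a non-production cluster first.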

To tell whether a node is actually affected, the iptables hit counters are one means of observation

iptables -w 3 -L --line -nv|grep DROP|sort -rn -k 2|head -n 10

[root@gzu-prd ~]# iptables -w 3 -L --line -nv|grep DROP|sort -rn -k 2|head -n 10

2 19020 773K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:kRQn4VHUEHOpigCm */ ctstate INVALID

2 15617 937K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:DTf_pGZFWLZaqlg8 */ ctstate INVALID

2 7068 283K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:HGKygSKf4SfkbRyf */ ctstate INVALID

2 3845 154K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:t5nJs-UfMTVjRtBI */ ctstate INVALID

2 2312 139K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:h3VJGUlERuK34Tcz */ ctstate INVALID

2 2115 110K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:dTQ4mHZc378Z1e33 */ ctstate INVALID

2 1828 110K DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:kp1Tzme9aWaPgdKP */ ctstate INVALID

2 1556 62240 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:VaeGtNK_681jKlg9 */ ctstate INVALID

2 1330 69160 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:meQqPUz96UN62T8l */ ctstate INVALID

2 1025 53300 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:mIIn1Wh34t2SZwbR */ ctstate INVALID
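
The same counters can also be read per chain from iptables-save -c, which prefixes each rule with [packets:bytes]. The helper below is a sketch under that assumption, with a hypothetical function name; it lists the packet count and chain name of every INVALID DROP rule, largest first.

```shell
#!/bin/sh
# invalid_drop_counters: read `iptables-save -c` output on stdin and print
# "PACKETS CHAIN" for each "--ctstate INVALID -j DROP" rule, largest first.
invalid_drop_counters() {
    awk '/--ctstate INVALID/ && /-j DROP/ {
             split(substr($1, 2), c, ":")   # "[pkts:bytes]" -> c[1] = pkts
             print c[1] + 0, $3             # $3 is the chain name after -A
         }' | sort -rn
}
```

On a node this would be run as iptables-save -c -t filter | invalid_drop_counters. Running it twice a few minutes apart and comparing the numbers shows whether the drops are still happening or are only historical.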

If you want to avoid the problem without changing kube-proxy or calico-node parameters, a crude but simple option is to deploy a DaemonSet in the cluster

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: iptables-conntrack-hacker
  namespace: kube-system
  labels:
    app: iptables-conntrack
spec:
  selector:
    matchLabels:
      app: iptables-conntrack-hacker
  template:
    metadata:
      name: iptables-conntrack-hacker
      labels:
        app: iptables-conntrack-hacker
    spec:
      volumes:
        - name: lib-modules
          hostPath:
            path: /lib/modules
            type: ''
        # Share the host's xtables lock so we do not race with kube-proxy or calico
        - name: xtables-lock
          hostPath:
            path: /run/xtables.lock
            type: ''
      containers:
        - name: iptables-conntrack-hacker
          # Reuses the kube-proxy image only because it already ships iptables
          image: 'your-registry-address/kube-system/kube-proxy:v1.18.20'
          command:
            - /bin/sh
            - '-ce'
            - |
              export TZ=Asia/Shanghai;
              echo "$(date) Container started...";
              echo "Current iptables rule state:"
              iptables -w 10 -L --line -nv|grep INVALID || true
              # Every 60s: check whether the ACCEPT rule exists (-C), insert it at the top of FORWARD (-I) if missing
              while true
              do
                iptables -C FORWARD -w 15 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet or calico" --ctstate INVALID -j ACCEPT || \
                (iptables -I FORWARD -w 10 -m conntrack -m comment --comment "To avoid invalid tcp traffic dropped by kubelet or calico" --ctstate INVALID -j ACCEPT && echo "$(date) Adding iptables rules ...");
                sleep 60
              done
          resources:
            limits:
              cpu: 250m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 64Mi
          volumeMounts:
            - name: lib-modules
              mountPath: /lib/modules
            - name: xtables-lock
              mountPath: /run/xtables.lock
          imagePullPolicy: IfNotPresent
          # Privileged and root are required to modify the host's iptables
          securityContext:
            privileged: true
            runAsUser: 0
      restartPolicy: Always
      terminationGracePeriodSeconds: 5
      dnsPolicy: ClusterFirstWithHostNet
      # Run in the host network namespace so the rules land on the node itself
      hostNetwork: true
      securityContext: {}
      schedulerName: default-scheduler
      # Tolerate all taints so the pod runs on every node
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - operator: Exists
          effect: NoExecute
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 50%
  revisionHistoryLimit: 5

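This DaemonSet bluntly inserts an INVALID ACCEPT rule at the top of the host's FORWARD chain. Its loop checks every 60 seconds whether the ACCEPT rule still exists and inserts it again if it is missing.

If you can afford the extra checks, shorten the interval to 10 - 30 seconds so that the rule is restored sooner after something removes it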

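Note that this DaemonSet comes with one more constraint: if the overlay CNI is calico, make sure calico-node manipulates iptables in append mode

Set the FELIX_CHAININSERTMODE environment variable to Append; otherwise the jump to the cali-FORWARD chain is inserted at the very top of the FORWARD chain, and the INVALID ACCEPT rule never takes effect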
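
To confirm the ordering on a node, a small check along these lines may help. It is only a sketch: it assumes the output format of iptables -S FORWARD, and the function name is hypothetical. It succeeds only when an INVALID ACCEPT rule appears before the first jump to cali-FORWARD or KUBE-FORWARD.

```shell
#!/bin/sh
# invalid_accept_is_first: read `iptables -S FORWARD` output on stdin.
# Exit 0 if a "--ctstate INVALID ... -j ACCEPT" rule precedes the first
# jump to cali-FORWARD or KUBE-FORWARD, exit 1 otherwise.
invalid_accept_is_first() {
    awk '/^-A FORWARD / {
             if (/--ctstate INVALID/ && /-j ACCEPT/) { found = 1; exit }
             if (/-j (cali-FORWARD|KUBE-FORWARD)/)   { exit }
         }
         END { exit(found ? 0 : 1) }'
}
```

For example: iptables -w 10 -S FORWARD | invalid_accept_is_first && echo "order OK" || echo "ACCEPT rule is shadowed or missing".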

Related

kube-proxy(v1.18.20) code: https://github.com/kubernetes/kubernetes/blob/1f3e19b7beb1cc0110255668c4238ed63dadb7ad/pkg/proxy/iptables/proxier.go#L1503-L1511

calico v3.16 config(FELIX_DISABLECONNTRACKINVALIDCHECK): https://docs.tigera.io/archive/v3.16/reference/felix/configuration

github issue

https://github.com/kubernetes/kubernetes/issues/74839

https://github.com/kubernetes/kubernetes/issues/94861

https://technology.lastminute.com/chasing-k8s-connection-reset-issue/

calico: https://github.com/projectcalico/calico/issues/2609
