# Kubernetes Scheduling

# Workflow of creating a Pod

Kubernetes uses a controller architecture built on the list-watch mechanism, which decouples interactions between components.

(figure: pod creation workflow diagram)

# Diagram

  • kubectl run -> apiserver -> etcd -> scheduler -> kubelet -> docker -> container
  1. kubectl issues a request to create a pod
  2. The apiserver receives the request and writes the pod's configuration to etcd
  3. The scheduler picks up the pod's configuration via list/watch, selects a suitable node based on it, and reports the choice back to the apiserver
  4. The kubelet picks up the pods bound to its own node
  5. The kubelet calls the container engine's API to create the containers and reports the result back to the apiserver

# Main Pod attributes that affect scheduling

  • Resource requirements (the basis for scheduling)
  • Scheduling policies

# Impact of resource limits on Pod scheduling

  • Container resource limits (upper bound)

    • resources.limits.cpu
    • resources.limits.memory
  • The minimum resources a container needs, used as the basis for resource allocation at scheduling time

    • resources.requests.cpu
    • resources.requests.memory
  • Units

    • CPU units: either m (millicores) or a decimal number, e.g. 0.5 = 500m, 1 = 1000m

      m = millicore; 1000m equals 1 core
      0.5 = 500m
      1 = 1000m
      2 = 2000m

    • Memory units: Mi (mebibytes; 1Mi = 1024×1024 bytes, slightly more than 1MB)

  • Check a node's resource capacity and current allocations

kubectl describe node <node-name>
  • A simple yaml that limits cpu and memory

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        run: nginx
      name: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
        resources:
          requests:
            memory: "128Mi"
            cpu: "250m"
          limits:
            memory: "150Mi"
            cpu: "300m"
    
    • To schedule this pod, 128Mi of memory and 0.25 CPU cores must be allocatable on a node
    • While running, the pod may use at most 150Mi of memory and 0.3 CPU cores
  • A few caveats

    • requests must not exceed limits; otherwise the pod is rejected with an error at creation time
    • requests must not exceed a node's allocatable resources; otherwise the pod stays in the Pending state and cannot be scheduled
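
As an illustration of the first caveat, a manifest like the following (hypothetical pod name) is rejected by the apiserver at creation time because its requests exceed its limits:

```yaml
# Hypothetical sketch: requests greater than limits fail validation.
apiVersion: v1
kind: Pod
metadata:
  name: nginx-invalid-resources   # hypothetical name
spec:
  containers:
  - image: nginx
    name: nginx
    resources:
      requests:
        memory: "256Mi"   # greater than the 128Mi limit below -> rejected
        cpu: "500m"       # greater than the 250m limit below -> rejected
      limits:
        memory: "128Mi"
        cpu: "250m"
```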

# nodeSelector & nodeAffinity

# nodeSelector

Schedules a Pod onto Nodes whose labels match; if no node carries a matching label, scheduling fails and the Pod stays Pending.

  • Purpose

    • Exact match against node labels
    • Pin a Pod to specific nodes
  • Label a node

kubectl label nodes [node] key=value

Example: kubectl label nodes liuxiaolu-node env=dev
Explanation: labels the node liuxiaolu-node with key env and value dev

# Example: create a pod and schedule it to nodes with a given label

  • yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-node-selector
spec:
  nodeSelector:
    env: dev
  containers:
  - image: nginx
    name: nginx
  • yaml explanation

    • Creates a pod named nginx-node-selector
    • Schedules the pod to a node labeled env=dev
  • Run it

$ kubectl label node liuxiaolu-node env=dev

$ kubectl get node -l env=dev
NAME             STATUS   ROLES    AGE   VERSION
liuxiaolu-node   Ready    <none>   47h   v1.21.0

$ kubectl apply -f nginx-node-selector.yaml
pod/nginx-node-selector created

$ kubectl get pod -o wide
NAME                  READY   STATUS              RESTARTS   AGE   IP       NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector   0/1     ContainerCreating   0          24s   <none>   liuxiaolu-node   <none>           <none>
  • Command walkthrough
    • Label the node liuxiaolu-node with env=dev
    • Filter by label to confirm liuxiaolu-node now carries it
    • Apply nginx-node-selector.yaml to create the pod
    • Check whether the new pod was scheduled to liuxiaolu-node as intended
    • The result matches expectations

# nodeAffinity

Node affinity is similar to nodeSelector: it constrains which nodes a Pod can be scheduled to based on node labels.

  • Compared with nodeSelector

    • Richer matching logic: operators instead of exact string equality
    • Both soft and hard policies, not just a hard requirement
      • Hard (required): must be satisfied
      • Soft (preferred): best effort, not guaranteed
  • Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
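
The richer operators can be combined in a single matchExpressions list; a sketch with hypothetical label names (disktype, cpu-cores) follows:

```yaml
# Hypothetical sketch: require nodes that carry a "disktype" label (any value)
# and whose "cpu-cores" label value is numerically greater than 4.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: Exists        # Exists takes no values
        - key: cpu-cores
          operator: Gt            # Gt/Lt compare the label value as an integer
          values:
          - "4"
```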

# Hard policy (required)

  • yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-require-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - dev
            - test
            - prod
  containers:
  - image: nginx
    name: nginx
  • yaml explanation

    • Creates a pod named nginx-require-affinity
    • The pod will be scheduled to a node labeled env=dev, env=test, or env=prod
  • Run it (liuxiaolu-node was already labeled env=dev above, so that step is skipped here)

$ kubectl apply -f nginx-require-affinity.yaml
pod/nginx-require-affinity created

$ kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP              NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector      1/1     Running   0          15m   10.244.17.171   liuxiaolu-node   <none>           <none>
nginx-require-affinity   1/1     Running   0          22s   10.244.17.172   liuxiaolu-node   <none>           <none>
  • Command walkthrough

    • Apply nginx-require-affinity.yaml to create the pod
    • Check whether the new pod was scheduled to liuxiaolu-node as intended
    • The result matches expectations
  • Now modify the yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-require-affinity-pending
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: env
            operator: In
            values:
            - test
            - prod
  containers:
  - image: nginx
    name: nginx
  • yaml explanation

    • dev is removed from the values, so no node satisfies the requirement any more
    • Verify that, as expected, the pod cannot be scheduled
  • Run it

$ kubectl apply -f nginx-require-affinity-pending.yaml
pod/nginx-require-affinity-pending created

$ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector              1/1     Running   0          21m     10.244.17.171   liuxiaolu-node   <none>           <none>
nginx-require-affinity           1/1     Running   0          6m37s   10.244.17.172   liuxiaolu-node   <none>           <none>
nginx-require-affinity-pending   0/1     Pending   0          21s     <none>          <none>           <none>           <none>
  • Command walkthrough
    • As expected, the new pod is stuck in Pending: the hard policy cannot be satisfied, so it cannot be scheduled
    • The result matches expectations

# Soft policy (preferred)

  • yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-preferred-affinity
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: env
                operator: In
                values:
                  - dev
  containers:
    - image: nginx
      name: nginx
  • yaml explanation

    • Creates a pod named nginx-preferred-affinity
    • The pod is preferentially scheduled to nodes labeled env=dev
  • Run it

$ kubectl apply -f nginx-preferred-affinity.yaml
pod/nginx-preferred-affinity created

$ kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector              1/1     Running   0          27m     10.244.17.171   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity         1/1     Running   0          22s     10.244.17.173   liuxiaolu-node   <none>           <none>
nginx-require-affinity           1/1     Running   0          12m     10.244.17.172   liuxiaolu-node   <none>           <none>
nginx-require-affinity-pending   0/1     Pending   0          6m13s   <none>          <none>           <none>           <none>
  • Command walkthrough

    • nginx-preferred-affinity was again scheduled to liuxiaolu-node
    • The result matches expectations
    • nginx-require-affinity-pending is still Pending, confirming it truly cannot be scheduled rather than simply needing more time to start
  • As with the hard policy, modify the yaml again

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-preferred-affinity-running
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: env
                operator: In
                values:
                  - prod
  containers:
    - image: nginx
      name: nginx
  • yaml explanation

    • Change the value from dev to prod
    • No node in the environment carries the env=prod label
    • Verify that the pod can still be scheduled even though nothing matches
  • Run it

$ kubectl apply -f nginx-preferred-affinity-running.yaml
pod/nginx-preferred-affinity-running created

$ kubectl get pod -o wide
NAME                               READY   STATUS    RESTARTS   AGE     IP              NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector                1/1     Running   0          33m     10.244.17.171   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity           1/1     Running   0          5m43s   10.244.17.173   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity-running   1/1     Running   0          27s     10.244.17.174   liuxiaolu-node   <none>           <none>
nginx-require-affinity             1/1     Running   0          17m     10.244.17.172   liuxiaolu-node   <none>           <none>
nginx-require-affinity-pending     0/1     Pending   0          11m     <none>          <none>           <none>           <none>
  • Command walkthrough
    • Although no node matches the label, nginx-preferred-affinity-running was still scheduled and is Running
    • The result matches expectations

# After a pod fails to schedule, labeling a node to satisfy its policy lets it schedule immediately

  • Label liuxiaolu-node with env=prod, which satisfies nginx-require-affinity-pending's scheduling policy

  • Run it

$ kubectl label node liuxiaolu-node env=prod --overwrite
node/liuxiaolu-node labeled

$ kubectl get pod -o wide
NAME                               READY   STATUS              RESTARTS   AGE     IP              NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector                1/1     Running             0          40m     10.244.17.171   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity           1/1     Running             0          12m     10.244.17.173   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity-running   1/1     Running             0          7m31s   10.244.17.174   liuxiaolu-node   <none>           <none>
nginx-require-affinity             1/1     Running             0          24m     10.244.17.172   liuxiaolu-node   <none>           <none>
nginx-require-affinity-pending     0/1     ContainerCreating   0          18m     <none>          liuxiaolu-node   <none>           <none>
  • nginx-require-affinity-pending has now started scheduling successfully

# Taints & Tolerations

# Taints

Keeps Pods from being scheduled onto particular Nodes.

  • Use cases

    • Dedicated nodes, e.g. nodes with special hardware
    • Taint-based eviction
  • Set a taint

kubectl taint node [node] key=value:[effect]

where [effect] can be

  • NoSchedule: Pods are never scheduled here.

  • PreferNoSchedule: the scheduler tries to avoid this node.

  • NoExecute: new Pods are not scheduled, and existing Pods on the Node are evicted.

  • List taints on nodes

kubectl describe node | grep Taints
  • Remove a taint:
kubectl taint node [node] key:[effect]-
  • yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-taints
spec:
  containers:
    - image: nginx
      name: nginx
  • yaml explanation
    • The same yaml is reused throughout to test whether the pod can be scheduled

# Never schedule (NoSchedule)

  • Run it
$ kubectl taint node liuxiaolu-node forbid=okay:NoSchedule
node/liuxiaolu-node tainted

$ kubectl apply -f nginx-taints.yaml
pod/nginx-taints created

$ kubectl get pod -o wide
NAME                               READY   STATUS              RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
nginx-node-selector                1/1     Running             0          69m     10.244.17.171   liuxiaolu-node     <none>           <none>
nginx-preferred-affinity           1/1     Running             0          41m     10.244.17.173   liuxiaolu-node     <none>           <none>
nginx-preferred-affinity-running   1/1     Running             0          36m     10.244.17.174   liuxiaolu-node     <none>           <none>
nginx-require-affinity             1/1     Running             0          53m     10.244.17.172   liuxiaolu-node     <none>           <none>
nginx-require-affinity-pending     1/1     Running             0          47m     10.244.17.175   liuxiaolu-node     <none>           <none>
nginx-taints                       0/1     ContainerCreating   0          4m54s   <none>          liuxiaolu-master   <none>           <none>
  • Command walkthrough
    • Taint liuxiaolu-node with forbid=okay and the NoSchedule effect
    • Apply nginx-taints.yaml to create the pod
    • Check the scheduling result: the pod is no longer scheduled to liuxiaolu-node, but to another node
    • The result matches expectations

# Try not to schedule (PreferNoSchedule)

  • Run it
$ kubectl taint node liuxiaolu-master forbid=okay:NoSchedule 
node/liuxiaolu-master tainted

$ kubectl taint node liuxiaolu-node forbid=okay:PreferNoSchedule --overwrite 
node/liuxiaolu-node modified

$ kubectl delete pod nginx-taints
pod "nginx-taints" deleted

$ kubectl apply -f nginx-taints.yaml
pod/nginx-taints created

$ kubectl get pod -o wide
NAME                               READY   STATUS              RESTARTS   AGE   IP              NODE             NOMINATED NODE   READINESS GATES
nginx-node-selector                1/1     Running             0          75m   10.244.17.171   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity           1/1     Running             0          47m   10.244.17.173   liuxiaolu-node   <none>           <none>
nginx-preferred-affinity-running   1/1     Running             0          42m   10.244.17.174   liuxiaolu-node   <none>           <none>
nginx-require-affinity             1/1     Running             0          59m   10.244.17.172   liuxiaolu-node   <none>           <none>
nginx-require-affinity-pending     1/1     Running             0          53m   10.244.17.175   liuxiaolu-node   <none>           <none>
nginx-taints                       0/1     ContainerCreating   0          14s   <none>          liuxiaolu-node   <none>           <none>
  • Command walkthrough
    • Taint liuxiaolu-master with forbid=okay and the NoSchedule effect
    • Overwrite liuxiaolu-node's forbid=okay taint, changing the effect to PreferNoSchedule
    • Delete the already deployed pod nginx-taints
    • Apply nginx-taints.yaml to redeploy nginx-taints
    • Check the scheduling result
    • With the master node blocked and the node set to PreferNoSchedule, the pod is still scheduled to the node; as expected, PreferNoSchedule is only a preference

# No new scheduling, plus eviction of existing Pods on the Node (NoExecute)

  • Run it
$ kubectl taint node liuxiaolu-node forbid=okay:NoExecute --overwrite 
node/liuxiaolu-node modified

$ kubectl get pod -o wide
No resources found in default namespace.

$ kubectl apply -f nginx-taints.yaml
pod/nginx-taints created

$ kubectl get pod -o wide
NAME           READY   STATUS    RESTARTS   AGE   IP       NODE     NOMINATED NODE   READINESS GATES
nginx-taints   0/1     Pending   0          9s    <none>   <none>   <none>           <none>

  • Command walkthrough
    • Update liuxiaolu-node's taint, setting the effect to NoExecute
    • Check the result: all pods have been evicted
    • Applying nginx-taints.yaml again shows the new pod cannot be scheduled
    • It cannot be scheduled because the master is tainted NoSchedule while the node is tainted NoExecute, which blocks new scheduling and evicts all existing pods
    • The result matches expectations

# Tolerations

Allows Pods to be scheduled onto Nodes that carry matching Taints.

  • yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-taints-toleration
spec:
  containers:
    - image: nginx
      name: nginx
  tolerations:
    - key: forbid
      operator: "Equal"
      value: "okay"
      effect: "NoSchedule"
  • yaml explanation

    • Creates a pod named nginx-taints-toleration
    • Tolerates nodes tainted forbid=okay with the NoSchedule effect
  • Run it

$ kubectl apply -f nginx-taints-toleration.yaml
pod/nginx-taints-toleration created

$ kubectl get pod -o wide
NAME                      READY   STATUS              RESTARTS   AGE     IP       NODE               NOMINATED NODE   READINESS GATES
nginx-taints              0/1     Pending             0          4m17s   <none>   <none>             <none>           <none>
nginx-taints-toleration   0/1     ContainerCreating   0          16s     <none>   liuxiaolu-master   <none>           <none>
  • Command walkthrough
    • Apply nginx-taints-toleration.yaml to create the pod
    • Check the result: even though the master is tainted NoSchedule, the pod was scheduled onto it
    • The result matches expectations
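
Besides Equal, a toleration may use the Exists operator to match any value of a taint key; for NoExecute taints, tolerationSeconds bounds how long the pod may remain after the taint appears. A sketch, reusing the forbid key from above:

```yaml
tolerations:
  - key: forbid
    operator: "Exists"        # matches any value of the "forbid" taint key
    effect: "NoExecute"
    tolerationSeconds: 3600   # evicted 3600s after the taint is applied
```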

# nodeName

  • Preconditions

    • The master node is tainted with the NoSchedule effect
    • The node is tainted with the NoExecute effect (evicting its pods)
  • yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: nginx
  name: nginx-taints-node-name
spec:
  nodeName: liuxiaolu-master
  containers:
    - image: nginx
      name: nginx
  • Run it
$ kubectl apply -f nginx-taints-node-name.yaml
pod/nginx-taints-node-name created

$ kubectl get pod -o wide
NAME                      READY   STATUS              RESTARTS   AGE     IP              NODE               NOMINATED NODE   READINESS GATES
nginx-taints              0/1     Pending             0          9m17s   <none>          <none>             <none>           <none>
nginx-taints-node-name    0/1     ContainerCreating   0          2s      <none>          liuxiaolu-master   <none>           <none>
nginx-taints-toleration   1/1     Running             0          5m16s   10.244.10.202   liuxiaolu-master   <none>           <none>
  • Command walkthrough
    • Apply nginx-taints-node-name.yaml to create the pod
    • Check the result: although the master is tainted NoSchedule, the pod was still scheduled onto it
    • The result matches expectations
    • One more note: specifying nodeName bypasses the scheduler entirely and pins the pod directly to a node, which is why the taint is ignored; this approach is rarely used in practice

# DaemonSet控制器

  • Function

    • Runs one Pod on every Node
    • A newly joined Node automatically runs a Pod as well
  • Use cases

    • Network plugins
    • Monitoring agents
    • Log agents
  • Verify with the calico pods

    • Run it
    $ kubectl get pod -o wide -n kube-system | grep calico-node
    calico-node-bdkkr                          1/1     Running   0          37h    10.69.1.161     liuxiaolu-node     <none>           <none>
    calico-node-hd5mx                          1/1     Running   0          2d1h   10.69.1.160     liuxiaolu-master   <none>           <none>
    
    • Command walkthrough
      • There is one calico pod each on the master node and the worker node
  • Preconditions

    • All taints have been removed from all nodes
  • yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - image: nginx
          name: nginx
status: {}
  • Run it
$ kubectl apply -f nginx-daemon-set.yaml --validate=false
daemonset.apps/nginx created

$ kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP              NODE               NOMINATED NODE   READINESS GATES
nginx-taints              1/1     Running   0          24m   10.244.17.177   liuxiaolu-node     <none>           <none>
nginx-taints-node-name    1/1     Running   0          15m   10.244.10.203   liuxiaolu-master   <none>           <none>
nginx-taints-toleration   1/1     Running   0          20m   10.244.10.202   liuxiaolu-master   <none>           <none>
nginx-wv889               1/1     Running   0          39s   10.244.17.178   liuxiaolu-node     <none>           <none>
nginx-zmh5x               1/1     Running   0          39s   10.244.10.204   liuxiaolu-master   <none>           <none>
  • Command walkthrough
    • Apply nginx-daemon-set.yaml to create the pods, skipping validation
    • Check the result: one nginx pod with a random suffix was started on each of the master and the node
    • The result matches expectations
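
Agents deployed as a DaemonSet often need to run on tainted control-plane nodes as well; the usual pattern is a toleration in the pod template. A sketch, assuming the standard node-role.kubernetes.io/control-plane taint:

```yaml
# Fragment of a DaemonSet spec: tolerate the control-plane taint so the
# agent pod is scheduled onto tainted control-plane nodes too.
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists      # match the taint regardless of its value
          effect: NoSchedule
```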

# Diagnosing scheduling failures

  • Check the scheduling result
kubectl get pod <NAME> -o wide
  • Check why scheduling failed
kubectl describe pod <NAME>
  • Common causes
    • Insufficient CPU/memory on the nodes
    • A taint without a matching toleration
    • No node matches the label selectors
Last Updated: 6/26/2022, 4:44:06 PM