蓝盾 (BK-CI): Kubernetes Scheduling Optimization

2025-08-09

蓝鲸 (BlueKing)

蓝鲸 v7.1

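Background#

In the previous post, 蓝盾「Docker公共构建机」缓存清理 (cache cleanup for 蓝盾 Docker public build machines), we read the source and learned that the build Pods it launches mount their working directory through hostPath and use it as a cache. Here we go one step further and analyze how those Pods are created.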

Deployment Configuration#

dispatch-k8s-manager/resources/config.yaml

dispatch:
  # Label used for scheduling; it uniquely identifies a build machine
  label: bkci.dispatch.kubenetes/core
  # Observe build machine state through the k8s watch API
  watch:
    task:
      label: bkci.dispatch.kubenetes/watch-task
  builder:
    # Schedule build machines onto nodes carrying this label. If left empty, any node in the cluster is eligible.
    # Lower priority than private machines and special machines.
    nodeSelector:
      label:
      value:
    # List of node names the build machine has been scheduled on before
    nodesAnnotation: bkci.dispatch.kubenetes/builder-history-nodes
    # Historical container resource usage
    realResource:
      # Prometheus API address used to monitor build container resource usage; leave empty to disable the realResource optimization
      # Note: inside the cluster, use <service>.<namespace>.svc.cluster.local:<port>
      prometheusUrl:
      realResourceAnnotation: bkci.dispatch.kubenetes/builder-real-resources
  # Machines with specific properties, for example a distinct network policy
  specialMachine:
    label: bkci.dispatch.kubenetes/special-builder
  # Dedicated machines reserved for specific users
  privateMachine:
    label: bkci.dispatch.kubenetes/private-builder

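The configuration file of the dispatch-k8s-manager module shows that the scheduling policy can be tuned through nodeSelector, nodesAnnotation, realResource, and related settings.

Source Code Analysis#

Affinity and Taint Tolerations#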

dispatch-k8s-manager/pkg/apiserver/service/builder_start.go

func CreateBuilder(builder *Builder) (taskId string, err error) {

	volumes, volumeMounts := getBuilderVolumeAndMount(builder.Name, builder.NFSs)

	var replicas int32 = 1

	// Tolerations and node matches decide where the Pod may be scheduled
	tolers, nodeMatches := buildDedicatedBuilder(builder)

	// ...

	// Annotations carry the history-node and real-resource records
	annotations, err := getBuilderAnnotations(builder.Name)
	if err != nil {
		return "", err
	}

	// ...

	// Create the Deployment asynchronously; the caller tracks it by taskId
	go task.DoCreateBuilder(
		taskId,
		&kubeclient.Deployment{
			Name:        builder.Name,
			Labels:      labels,
			MatchLabels: matchlabels,
			Replicas:    &replicas,
			Pod: kubeclient.Pod{
				Labels:      labels,
				Annotations: annotations,
				Volumes:     volumes,
				Containers: []kubeclient.Container{
					{
						Image:        builder.Image,
						Resources:    *resources,
						Env:          getEnvs(builder.Env),
						Command:      builder.Command,
						VolumeMounts: volumeMounts,
					},
				},
				NodeMatches:     nodeMatches,
				Tolerations:     tolers,
				PullImageSecret: pullImageSecret,
			},
		},
	)

	return taskId, nil
}

// buildDedicatedBuilder returns the taint tolerations and node affinity matches
func buildDedicatedBuilder(builder *Builder) ([]corev1.Toleration, []kubeclient.NodeMatch) {
	// Private machine configuration takes precedence
	// ...
	// Then machines with special configuration
	// ...
	// Finally, use the node selector if one is set in the config file
	// ...
	return nil, nil
}

// getBuilderAnnotations returns the annotations for the build machine
func getBuilderAnnotations(builderName string) (map[string]string, error) {
	// ...
	// Load the node history, used to place the build machine on a node it has used before
	// ...
	// Load the RealResource record
	// ...
	return result, nil
}

dispatch-k8s-manager/pkg/kubeclient/deployment.go

func CreateDeployment(dep *Deployment) error {
	// ...
	// Convert NodeMatches into nodeAffinity
	var affinity *corev1.Affinity
	if len(dep.Pod.NodeMatches) > 0 {
		var matches []corev1.NodeSelectorRequirement
		for _, mat := range dep.Pod.NodeMatches {
			matches = append(matches, corev1.NodeSelectorRequirement{
				Key:      mat.Key,
				Operator: mat.Operator,
				Values:   mat.Values,
			})
		}
		// All expressions go into one term, so a node must satisfy every one of them
		affinity = &corev1.Affinity{
			NodeAffinity: &corev1.NodeAffinity{
				RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
					NodeSelectorTerms: []corev1.NodeSelectorTerm{
						{
							MatchExpressions: matches,
						},
					},
				},
			},
		}
	}
	// ...
	return nil
}

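In CreateBuilder, the two core scheduling parameters, tolers and nodeMatches, both come from buildDedicatedBuilder(builder). They are passed together to the kubeclient layer, where the CreateDeployment method handles them as follows:

NodeMatches is converted into affinity.nodeAffinity as a required (hard) rule, so the Pod can only land on matching nodes.
Tolerations is written directly into the Pod's spec.tolerations field, which lets the Pod tolerate the corresponding node taints.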
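The body of buildDedicatedBuilder is elided above, so the following is only a sketch of the precedence its comments describe: private machine first, then special machine, then the configured nodeSelector. The types Toleration, NodeMatch, Labels, and Request, the fields PrivateName and SpecialName, and the function pickPlacement are all hypothetical stand-ins introduced for illustration; they are not the identifiers used in dispatch-k8s-manager. The choice to emit a toleration only for private machines follows the summary at the end of this post, not the actual source.

```go
package main

import "fmt"

// Toleration and NodeMatch are simplified local stand-ins for
// corev1.Toleration and kubeclient.NodeMatch, so the sketch runs
// without any Kubernetes dependency.
type Toleration struct {
	Key, Operator, Value, Effect string
}

type NodeMatch struct {
	Key      string
	Operator string
	Values   []string
}

// Labels holds the label keys that config.yaml defines.
type Labels struct {
	PrivateMachine string // dispatch.privateMachine.label
	SpecialMachine string // dispatch.specialMachine.label
	SelectorLabel  string // dispatch.builder.nodeSelector.label
	SelectorValue  string // dispatch.builder.nodeSelector.value
}

// Request is a hypothetical, reduced view of the builder request.
type Request struct {
	PrivateName string // assumed: name of the private machine, if any
	SpecialName string // assumed: name of the special machine, if any
}

// pickPlacement applies the precedence private > special > nodeSelector.
func pickPlacement(cfg Labels, req Request) ([]Toleration, []NodeMatch) {
	switch {
	case req.PrivateName != "":
		// Assumed: private nodes are tainted with the same label key.
		return []Toleration{{Key: cfg.PrivateMachine, Operator: "Equal", Value: req.PrivateName, Effect: "NoSchedule"}},
			[]NodeMatch{{Key: cfg.PrivateMachine, Operator: "In", Values: []string{req.PrivateName}}}
	case req.SpecialName != "":
		return nil, []NodeMatch{{Key: cfg.SpecialMachine, Operator: "In", Values: []string{req.SpecialName}}}
	case cfg.SelectorLabel != "" && cfg.SelectorValue != "":
		return nil, []NodeMatch{{Key: cfg.SelectorLabel, Operator: "In", Values: []string{cfg.SelectorValue}}}
	}
	// Nothing configured: the Pod may be scheduled anywhere in the cluster.
	return nil, nil
}

func main() {
	cfg := Labels{
		PrivateMachine: "bkci.dispatch.kubenetes/private-builder",
		SpecialMachine: "bkci.dispatch.kubenetes/special-builder",
	}
	tolers, matches := pickPlacement(cfg, Request{PrivateName: "team-a"})
	fmt.Println(len(tolers), matches[0].Key, matches[0].Values)
}
```

Returning nil for both values in the default case matches the original function's final `return nil, nil`, which leaves the Deployment without any affinity or tolerations.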

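History-Node Scheduling#

The 蓝盾 source contains the implementation of affinity and taint tolerations, but for history-node scheduling all it does is set annotations on the Pod through getBuilderAnnotations. Nothing in that source explains how the annotations influence scheduling.

Further analysis shows that history-node scheduling relies on a scheduler plugin that 蓝盾 builds on top of the Kubernetes scheduling framework. A build Pod carries the annotation as shown below, followed by the relevant part of the plugin.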

apiVersion: v1
kind: Pod
metadata:
  annotations:
    bkci.dispatch.kubenetes/builder-history-nodes: '["10.x.x.1","10.x.x.2","10.x.x.3"]'
  labels:
    bkci.dispatch.kubenetes/core: build1753761077695-ivcpmoxg
    bkci.dispatch.kubenetes/watch-task: t-1753785688231121886-iInjpMUr-builder-start
  name: build1753761077695-ivcpmoxg-c9d8fc6c9-mqhkk
  ...
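How the manager maintains this list is not shown in the excerpts, so the following is a guess at one plausible approach rather than the 蓝盾 implementation: keep the most recent node first, drop duplicates, cap the list at three entries (the number of positions the plugin scores), and serialize it as a JSON array in the annotation. The names recordNode, historyAnnotation, and maxHistory are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const nodesAnnotation = "bkci.dispatch.kubenetes/builder-history-nodes"

// maxHistory is an assumption: the plugin only scores three positions.
const maxHistory = 3

// recordNode puts the latest node first, removes any older occurrence of it,
// and truncates the list to maxHistory entries.
func recordNode(history []string, latest string) []string {
	out := []string{latest}
	for _, n := range history {
		if n != latest && len(out) < maxHistory {
			out = append(out, n)
		}
	}
	return out
}

// historyAnnotation serializes the list into the JSON string that the
// annotation in the Pod manifest above carries.
func historyAnnotation(history []string) (map[string]string, error) {
	raw, err := json.Marshal(history)
	if err != nil {
		return nil, err
	}
	return map[string]string{nodesAnnotation: string(raw)}, nil
}

func main() {
	h := recordNode([]string{"10.0.0.1", "10.0.0.2", "10.0.0.3"}, "10.0.0.2")
	ann, err := historyAnnotation(h)
	if err != nil {
		panic(err)
	}
	fmt.Println(ann[nodesAnnotation]) // ["10.0.0.2","10.0.0.1","10.0.0.3"]
}
```

Ordering matters here because the plugin assigns scores by index, so whichever node sits at position 0 gets the strongest preference.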

package bkdevopsschedulerplugin

import (
	"context"
	"encoding/json"
	"k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const nodesAnnotation = "bkci.dispatch.kubenetes/builder-history-nodes"
const readResourceAnnotation = "bkci.dispatch.kubenetes/builder-real-resources"

type realResourceUsage struct {
	Cpu    string `json:"cpu"`
	Memory string `json:"memory"`
}

func (s *SchedulerPlugin) Score(_ context.Context, _ *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	// Read the history-node list from the Pod annotation
	var nodeHis []string
	if nodesS, ok := pod.ObjectMeta.Annotations[nodesAnnotation]; ok {
		_ = json.Unmarshal([]byte(nodesS), &nodeHis)
	}

	// Read the real resource usage from the Pod annotation
	var realResources []realResourceUsage
	if realS, ok := pod.ObjectMeta.Annotations[readResourceAnnotation]; ok {
		_ = json.Unmarshal([]byte(realS), &realResources)
	}

	// Score the node by its position in the history list
	nodeScore := calculateNodeHisScore(nodeHis, nodeName)

	// Score the node by resources
	// ... resource score calculation omitted ...
	realResourceScore := ... // computed from realResources and the node's resource state

	// Return the total score
	return nodeScore + realResourceScore, nil
}

// Score per history position: index 0 is the most recent node
var nodeHisScores = map[int]int64{0: 30, 1: 20, 2: 10}

// calculateNodeHisScore scores the three history nodes from most recent to
// least recent with 30, 20, and 10 points; any other node gets MinNodeScore
func calculateNodeHisScore(nodeHis []string, nodeName string) int64 {
	if len(nodeHis) == 0 {
		return framework.MinNodeScore
	}

	for index, name := range nodeHis {
		if name != nodeName {
			continue
		}

		score := framework.MinNodeScore
		if indexS, ok := nodeHisScores[index]; ok {
			score = indexS
		}

		return score
	}

	return framework.MinNodeScore
}

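In its Score phase, the plugin reads the Pod's bkci.dispatch.kubenetes/builder-history-nodes annotation and deserializes it into an array of node names, which gives it the history-node information.
Through calculateNodeHisScore, the plugin assigns a score based on whether the node being evaluated appears in the history list and at which position; the most recent history node scores highest.
That score is added to the resource score (computed from the bkci.dispatch.kubenetes/builder-real-resources annotation and the node's resource state), and the sum is the plugin's score for the node, which influences how the scheduler ranks candidate nodes.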
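To make the history scoring concrete, here is a small standalone sketch that reproduces only the history part of the calculation, without the scheduler framework. The function name historyScore and the node names are made up for the example; minNodeScore is a local constant with the same value (0) as framework.MinNodeScore. Unlike the plugin, which ignores the unmarshal error, this sketch returns the minimum score explicitly when the annotation cannot be parsed.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// minNodeScore has the same value as framework.MinNodeScore.
const minNodeScore int64 = 0

// Same table as the plugin: index 0 is the most recent node.
var nodeHisScores = map[int]int64{0: 30, 1: 20, 2: 10}

// historyScore parses the annotation value and scores nodeName by its
// position in the history list.
func historyScore(annotation, nodeName string) int64 {
	var nodeHis []string
	if err := json.Unmarshal([]byte(annotation), &nodeHis); err != nil {
		return minNodeScore
	}
	for i, name := range nodeHis {
		if name != nodeName {
			continue
		}
		if s, ok := nodeHisScores[i]; ok {
			return s
		}
		return minNodeScore
	}
	return minNodeScore
}

func main() {
	ann := `["node-a","node-b","node-c"]`
	for _, n := range []string{"node-a", "node-b", "node-c", "node-d"} {
		fmt.Printf("%s -> %d\n", n, historyScore(ann, n))
	}
}
```

Running it prints 30, 20, and 10 for node-a, node-b, and node-c, and 0 for node-d, which never hosted the build. Because this is a score rather than a filter, a node outside the history list remains schedulable; it is only ranked lower.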

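Summary#

蓝盾 pipelines optimize Kubernetes scheduling in the following ways:

History-node scheduling: an annotation records the nodes a build machine has run on, and the scheduler plugin scores those nodes higher, so the Pod tends to land where its hostPath cache already exists and initialization takes less time.
Affinity: the nodeSelector from the configuration file and the NodeMatches built in code are converted into nodeAffinity, which ensures the Pod is scheduled onto specific nodes.
Taint tolerations: tolerations are generated only when a private machine (privateMachine) is specified, which allows the Pod to be scheduled onto nodes carrying the corresponding taint.

Together, these mechanisms improve scheduling efficiency and resource utilization.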

References#

蓝盾 (BK-CI): Kubernetes Scheduling Optimization

https://hua-ri.cn/posts/蓝盾kubernetes-调度优化/

Author: 花日

Published: 2025-08-09

License: CC BY-NC-SA 4.0
