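Background

After running the "Docker public build machines" of 蓝盾 (BlueKing CI, bk-ci) for a while, we noticed that image builds occasionally timed out. Investigation showed that the disks on the cluster's Nodes had filled up. This post describes how to clean up the build cache.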
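The timeouts started after we switched image builds to Docker-in-Docker, and they became more frequent over time. Further investigation showed that each build Pod mounts its workspace and log directories through hostPath, and with a large number of build tasks these directories eventually filled the Node's disk.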
Investigation

Event analysis

The Pod events show that the Node's disk filled up, so the Pod was evicted and the build task failed.
    Events:
      Type     Reason                 Age                 From     Message
      ----     ------                 ----                ----     -------
      Warning  Evicted                20m                 kubelet  The node was low on resource: ephemeral-storage. Container build1753761077695-ivcpmoxg was using 1580320Ki, which exceeds its request of 0.
      Normal   NodeHasNoDiskPressure  3m (x32 over 6d5h)  kubelet  Node 10.10.32.2 status is now: NodeHasNoDiskPressure
Pod YAML:

    volumes:
    - hostPath:
        path: /data/landun/workspace/build1753761077695-ivcpmoxg
        type: ""
      name: data-volume
    - hostPath:
        path: /data/landun/logs/build1753761077695-ivcpmoxg
        type: ""
      name: logs-volume
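The cause is that the Pod mounts its workspace and log directories through hostPath. These directories are mounted this way so they can serve as a cache: when the same pipeline task runs again, the build is faster.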
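To see which build directories are taking up the Node's disk, a quick size check can help. The following is a minimal sketch and not part of the original setup: `largest_subdirs` is a hypothetical helper name, and it assumes GNU coreutils on the Node and the /data/landun/workspace path from the Pod spec above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from bk-ci): print the N largest immediate
# subdirectories of a directory, sizes in KiB, biggest last.
largest_subdirs() {
  local root=$1 count=${2:-10}
  # du -sk prints one total per subdirectory in KiB, so a numeric sort works.
  du -sk -- "$root"/*/ 2>/dev/null | sort -n | tail -n "$count"
}

# Example, run on a Node:
#   largest_subdirs /data/landun/workspace 5
#   df -h /data
```

Running it against both the workspace and logs directories shows whether a few large builds or many small ones account for the usage.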
Configuration file of the dispatch-k8s-manager module: dispatch-k8s-manager/resources/config.yaml

    dispatch:
      volume:
        builderConfigMap:
          name: dispatch-kubernetes-builder
          items:
            - key: initsh.properties
              path: init.sh
            - key: sleepsh.properties
              path: sleep.sh
        hostPath:
          dataHostDir: /data/landun/workspace
          logsHostDir: /data/landun/logs
        cfs:
          path: /data/cfs
      volumeMount:
        dataPath: /data/landun/workspace
        logPath: /data/logs
        builderConfigMapPath: /data/landun/config
        cfs:
          path: /data/bkdevops/apps
          readOnly: true
Source code analysis

dispatch-k8s-manager/pkg/apiserver/service/builder_start.go

    func getBuilderVolumeAndMount(
        workloadName string,
        nFSs []types.NFS,
    ) (volumes []corev1.Volume, volumeMounts []corev1.VolumeMount) {
        volumes = getBuilderPodVolume(workloadName)
        volumeMounts = getBuilderPodVolumeMount()

        ...

        return volumes, volumeMounts
    }

    func getBuilderPodVolume(workloadName string) []corev1.Volume {
        dataHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.DataHostDir, workloadName)
        logHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.LogsHostDir, workloadName)

        var items []corev1.KeyToPath
        for _, v := range config.Config.Dispatch.Volume.BuilderConfigMap.Items {
            items = append(items, corev1.KeyToPath{
                Key:  v.Key,
                Path: v.Path,
            })
        }

        return ...
    }
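The source shows that each hostPath is built by joining a directory configured in dispatch-k8s-manager/resources/config.yaml with workloadName, so the configuration file offers no way to turn hostPath off. We therefore clean up the cache with a scheduled task.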
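To make the path construction concrete, here is a small shell sketch that mirrors the two filepath.Join calls. It is only an illustration: `builder_host_paths` and the `DATA_HOST_DIR` / `LOGS_HOST_DIR` variables are hypothetical names, and the defaults are the hostPath values from config.yaml.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of how the hostPath directories are derived.
# Defaults mirror hostPath.dataHostDir / hostPath.logsHostDir in config.yaml.
builder_host_paths() {
  local workload=$1
  local data_host_dir=${DATA_HOST_DIR:-/data/landun/workspace}
  local logs_host_dir=${LOGS_HOST_DIR:-/data/landun/logs}
  # Strip one trailing slash before joining; Go's filepath.Join cleans paths more thoroughly.
  printf '%s/%s\n' "${data_host_dir%/}" "$workload"
  printf '%s/%s\n' "${logs_host_dir%/}" "$workload"
}

# With the defaults, this prints the two paths seen in the Pod YAML:
#   /data/landun/workspace/build1753761077695-ivcpmoxg
#   /data/landun/logs/build1753761077695-ivcpmoxg
builder_host_paths build1753761077695-ivcpmoxg
```

Each workload name yields its own pair of directories on the Node, which is what accumulates on disk over time.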
Solution

Following the log-cleanup approach used by bk-applog-bkapp-filebeat, we run a DaemonSet that periodically cleans up the directories 蓝盾 mounts on each Node.
    NAME                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
    bk-applog-bkapp-filebeat-ingress      18        18        18      18           18          <none>          424d
    bk-applog-bkapp-filebeat-json         18        18        18      18           18          <none>          424d
    bk-applog-bkapp-filebeat-log-cleaner  18        18        18      18           18          <none>          424d
    bk-applog-bkapp-filebeat-stdout       18        18        18      18           18          <none>          424d
    bk-ci-builder-cleaner                 18        18        18      18           18          <none>          13d
Write daemonSet.yaml. The container runs delete_files.sh in a loop, sleeping 21600 seconds (6 hours) between runs:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: bk-ci-builder-cleaner
      namespace: blueking
      labels:
        app: bk-ci-builder
    spec:
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: bk-ci-builder
      template:
        metadata:
          labels:
            app: bk-ci-builder
          name: bk-ci-builder-cleaner
        spec:
          hostPID: true
          restartPolicy: Always
          serviceAccountName: bk-applog-bkapp-filebeat
          containers:
            - name: batch-delete-files
              image: xxx.xxx.com/bk-ci-builder-cleaner:v1
              imagePullPolicy: IfNotPresent
              command:
                - bash
              args:
                - -c
                - while true ; do ./delete_files.sh; sleep 21600 ; done;
              resources:
                requests:
                  cpu: 25m
                  memory: 32Mi
                limits:
                  cpu: 2560m
                  memory: 256Mi
              volumeMounts:
                - mountPath: /data/devops/workspace
                  name: data-volume
                - mountPath: /data/devops/logs
                  name: logs-volume
          volumes:
            - name: data-volume
              hostPath:
                path: /data/landun/workspace
                type: DirectoryOrCreate
            - name: logs-volume
              hostPath:
                path: /data/landun/logs
                type: DirectoryOrCreate
Cache cleanup script delete_files.sh:

    #!/usr/bin/env bash
    # delete_files.sh: production version, deletes directories for real
    # Scans both /data/devops/workspace and /data/devops/logs

    set -euo pipefail

    # --------- Configurable parameters ---------
    ROOT_DIRS=("/data/devops/workspace" "/data/devops/logs")
    RETENTION_DAYS=7
    LOG_FILE="/tmp/delete_build_dirs.log"
    # -------------------------------------------

    log() {
      printf '%s [%s] %s\n' "$(date '+%F %T')" "$1" "$2" | tee -a "$LOG_FILE"
    }

    cutoff_date=$(date -d "$RETENTION_DAYS days ago" +%F)

    log INFO "==== Checking for and deleting build* directories not updated in $RETENTION_DAYS days ===="

    for root in "${ROOT_DIRS[@]}"; do
      [[ -d $root ]] || { log WARN "Directory does not exist: $root"; continue; }

      for dir in "$root"/build*; do
        [[ -d $dir ]] || continue

        # Delete only if the directory contains no file modified since the cutoff date.
        # Judge by find's output rather than a pipeline exit status: under pipefail,
        # a non-zero exit from find (for example a file vanishing mid-scan) would
        # otherwise be read as "nothing recent" and trigger deletion.
        if [[ -z $(find "$dir" -type f -newermt "$cutoff_date" -print -quit) ]]; then
          log DELETE "$dir"
          rm -rf "$dir"
        else
          log SKIP "$dir"
        fi
      done
    done

    log INFO "==== Cleanup finished, log: $LOG_FILE ===="
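Before letting a script like this delete anything, it can be useful to preview what it would remove. The sketch below is a dry-run variant under the same assumptions as delete_files.sh (GNU date and find, build* directory names). `is_stale` and `list_stale_build_dirs` are hypothetical helper names, not part of the original script.

```shell
#!/usr/bin/env bash
# Dry-run sketch: list the build* directories a cleaner would delete,
# without removing anything.

# Succeeds when no regular file under $1 was modified since
# midnight of the day that lies $2 days in the past.
is_stale() {
  local dir=$1 days=$2 cutoff
  cutoff=$(date -d "$days days ago" +%F)   # GNU date syntax
  [[ -z $(find "$dir" -type f -newermt "$cutoff" -print -quit) ]]
}

# Print every stale build* directory found directly under the given roots.
list_stale_build_dirs() {
  local days=$1; shift
  local root dir
  for root in "$@"; do
    [[ -d $root ]] || continue
    for dir in "$root"/build*; do
      [[ -d $dir ]] || continue
      if is_stale "$dir" "$days"; then
        printf '%s\n' "$dir"
      fi
    done
  done
}

# Example, run inside the cleaner container:
#   list_stale_build_dirs 7 /data/devops/workspace /data/devops/logs
```

Note that a build* directory containing no regular files at all counts as stale under this check, which matches the behavior of delete_files.sh.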
References

https://github.com/TencentBlueKing/bk-ci/blob/v2.0.0/
https://blazehu.com/2025/07/17/devops/landun_dind_cleaner/