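Background
After running the BlueKing CI (蓝盾) "Docker public build machines" for a while, we noticed that image builds occasionally timed out. The timeouts started after we adopted Docker-in-Docker for image builds and grew more frequent over time. Further investigation showed that each build Pod mounts its working directory and log directory through hostPath, and the large number of build tasks eventually filled the Node disks. This article describes how we clean up that build cache.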
Investigation
Event analysis
The Pod events show that the Node ran out of disk space, so the Pod was evicted and the build task failed.
Events:
  Type     Reason                 Age                 From     Message
  ----     ------                 ----                ----     -------
  Warning  Evicted                20m                 kubelet  The node was low on resource: ephemeral-storage. Container build1753761077695-ivcpmoxg was using 1580320Ki, which exceeds its request of 0.
  Normal   NodeHasNoDiskPressure  3m (x32 over 6d5h)  kubelet  Node 10.10.32.2 status is now: NodeHasNoDiskPressure

Pod YAML:
volumes:
- hostPath:
    path: /data/landun/workspace/build1753761077695-ivcpmoxg
    type: ""
  name: data-volume
- hostPath:
    path: /data/landun/logs/build1753761077695-ivcpmoxg
    type: ""
  name: logs-volume

The cause is that the Pod mounts its working directory and log directory through hostPath. These hostPath mounts act as a cache, so repeated runs of the same pipeline task finish faster.
Configuration file of the dispatch-k8s-manager module: dispatch-k8s-manager/resources/config.yaml
dispatch:
  volume:
    # Build machine scripts
    builderConfigMap:
      name: dispatch-kubernetes-builder
      items:
        # Initialization script
        - key: initsh.properties
          path: init.sh
        # Sleep script needed for login debugging
        - key: sleepsh.properties
          path: sleep.sh
    hostPath:
      # Data disk
      dataHostDir: /data/landun/workspace
      # Log disk
      logsHostDir: /data/landun/logs
    # Application data uses cfs
    cfs:
      path: /data/cfs
  volumeMount:
    dataPath: /data/landun/workspace
    logPath: /data/logs
    builderConfigMapPath: /data/landun/config
    cfs:
      path: /data/bkdevops/apps
      readOnly: true

Source code analysis
dispatch-k8s-manager/pkg/apiserver/service/builder_start.go
// getBuilderVolumeAndMount returns the standard volumes and volume mounts attached to a builder pod.
func getBuilderVolumeAndMount(
	workloadName string,
	nFSs []types.NFS,
) (volumes []corev1.Volume, volumeMounts []corev1.VolumeMount) {
	volumes = getBuilderPodVolume(workloadName)
	volumeMounts = getBuilderPodVolumeMount()

	...

	return volumes, volumeMounts
}

// getBuilderPodVolume returns the standard volumes attached to a builder pod,
// including the ConfigMap and the hostPath data directories.
func getBuilderPodVolume(workloadName string) []corev1.Volume {
	dataHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.DataHostDir, workloadName)
	logHostPath := filepath.Join(config.Config.Dispatch.Volume.HostPath.LogsHostDir, workloadName)

	var items []corev1.KeyToPath
	for _, v := range config.Config.Dispatch.Volume.BuilderConfigMap.Items {
		items = append(items, corev1.KeyToPath{
			Key:  v.Key,
			Path: v.Path,
		})
	}

	return ...
}

The source shows that each hostPath is built by joining a directory configured in dispatch-k8s-manager/resources/config.yaml with workloadName, so the configuration file offers no way to turn hostPath off. We therefore clean up this cache with a scheduled job.
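Because every workload gets its own directory under the configured host directories, the accumulation can be inspected per build on a node. The following is a minimal sketch for doing that, not part of bk-ci: `report_build_dirs` is a hypothetical helper name, and it assumes the `build*` directory naming seen in the Pod YAML.

```shell
# report_build_dirs ROOT
# Prints "<size-in-KiB><TAB><path>" for every build* directory under ROOT,
# largest first. Read-only: nothing is modified or deleted.
# (Hypothetical helper for illustration; not part of bk-ci.)
report_build_dirs() {
  root=$1
  [ -d "$root" ] || { echo "missing: $root" >&2; return 1; }
  for dir in "$root"/build*; do
    # Skip the literal pattern when nothing matches, and skip non-directories.
    [ -d "$dir" ] || continue
    du -sk "$dir"
  done | sort -rn
}
```

On a node, something like `report_build_dirs /data/landun/workspace | head -n 10` would list the ten largest build directories.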
Solution
Following the log cleanup approach used by bk-applog-bkapp-filebeat, we run a DaemonSet that periodically cleans up the working directories mounted by BlueKing CI.
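Before enabling deletion on every node, it can help to preview which directories a retention rule would select. The sketch below is one way to do that under stated assumptions: `list_stale_build_dirs` is a hypothetical helper, the `build*` naming is taken from the Pod YAML, and staleness is judged with `find -mtime`, which is a simplification and not necessarily the exact rule a production script should use.

```shell
# list_stale_build_dirs ROOT DAYS
# Prints each build* directory under ROOT that contains no regular file
# modified within the last DAYS days. Dry run only: nothing is deleted.
# Note: a directory with no regular files at all is also reported as stale.
# (Hypothetical helper for illustration; not part of bk-ci.)
list_stale_build_dirs() {
  root=$1
  days=$2
  [ -d "$root" ] || return 0
  for dir in "$root"/build*; do
    [ -d "$dir" ] || continue
    # -mtime -N matches files modified less than N days ago.
    recent=$(find "$dir" -type f -mtime "-$days" -print | head -n 1)
    if [ -z "$recent" ]; then
      printf '%s\n' "$dir"
    fi
  done
}
```

For example, `list_stale_build_dirs /data/landun/workspace 7` prints the candidates for a seven-day retention period without touching them.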
Current DaemonSets, including the new bk-ci-builder-cleaner:

NAME                                   DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
bk-applog-bkapp-filebeat-ingress       18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-json          18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-log-cleaner   18        18        18      18           18          <none>          424d
bk-applog-bkapp-filebeat-stdout        18        18        18      18           18          <none>          424d
bk-ci-builder-cleaner                  18        18        18      18           18          <none>          13d

Write daemonSet.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bk-ci-builder-cleaner
  namespace: blueking
  labels:
    app: bk-ci-builder
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: bk-ci-builder
  template:
    metadata:
      labels:
        app: bk-ci-builder
      name: bk-ci-builder-cleaner
    spec:
      hostPID: true
      restartPolicy: Always
      serviceAccountName: bk-applog-bkapp-filebeat
      containers:
        - name: batch-delete-files
          image: xxx.xxx.com/bk-ci-builder-cleaner:v1
          imagePullPolicy: IfNotPresent
          command:
            - bash
          args:
            - -c
            - while true; do ./delete_files.sh; sleep 21600; done;
          resources:
            requests:
              cpu: 25m
              memory: 32Mi
            limits:
              cpu: 2560m
              memory: 256Mi
          volumeMounts:
            - mountPath: /data/devops/workspace
              name: data-volume
            - mountPath: /data/devops/logs
              name: logs-volume
      volumes:
        - name: data-volume
          hostPath:
            path: /data/landun/workspace
            type: DirectoryOrCreate
        - name: logs-volume
          hostPath:
            path: /data/landun/logs
            type: DirectoryOrCreate

Cache cleanup script delete_files.sh:
#!/usr/bin/env bash
# delete_files.sh: live deletion version
# Scans both /data/devops/workspace and /data/devops/logs
set -euo pipefail

# --------- Configurable parameters ---------
ROOT_DIRS=("/data/devops/workspace" "/data/devops/logs")
RETENTION_DAYS=7
LOG_FILE="/tmp/delete_build_dirs.log"
# -------------------------------------------

log() {
  printf '%s [%s] %s\n' "$(date '+%F %T')" "$1" "$2" | tee -a "$LOG_FILE"
}

# Requires GNU date; the cutoff is midnight of the day RETENTION_DAYS days ago.
cutoff_date=$(date -d "$RETENTION_DAYS days ago" +%F)

log INFO "==== Checking for and deleting build* directories not updated in $RETENTION_DAYS days ===="

for root in "${ROOT_DIRS[@]}"; do
  [[ -d $root ]] || { log WARN "Directory does not exist: $root"; continue; }

  for dir in "$root"/build*; do
    [[ -d $dir ]] || continue

    # Delete only if the directory contains no file modified since the cutoff.
    # Note: a directory with no regular files at all is also deleted.
    if ! find "$dir" -type f -newermt "$cutoff_date" -print -quit | grep -q .; then
      log DELETE "$dir"
      rm -rf "$dir"
    else
      log SKIP "$dir"
    fi
  done
done

log INFO "==== Cleanup finished, log: $LOG_FILE ===="

References
https://github.com/TencentBlueKing/bk-ci/blob/v2.0.0/
https://blazehu.com/2025/07/17/devops/landun_dind_cleaner/