A Complete Hands-On Guide to Argo Workflows
1. Introduction to Argo Workflows
1.1 What Is Argo Workflows
Argo Workflows is an open-source, container-native workflow engine for orchestrating parallel jobs on Kubernetes. It is implemented as a set of Kubernetes CRDs (Custom Resource Definitions) and is a CNCF graduated project.
Official documentation: https://argo-workflows.readthedocs.io
Source code: https://github.com/argoproj/argo-workflows
1.2 Core Features
- Container-native: every workflow step runs in a container, without the overhead of traditional VM environments
- DAG support: directed acyclic graphs (DAGs) express complex task dependencies
- Parallel execution: independent tasks run in parallel automatically, shortening overall runtime
- Artifact management: files and data can be passed between steps
- Parameterization: workflows accept parameters, enabling reusable templates
- Conditional execution: branches can run or be skipped based on conditions
- Retry mechanism: built-in retry-on-failure and timeout control
- Web UI: an intuitive interface for inspecting workflow execution status
1.3 Use Cases
| Scenario | Description | Typical examples |
|---|---|---|
| CI/CD pipelines | Automated build, test, and deployment | Compilation, unit tests, image builds, releases |
| Data processing | Batch processing and ETL | Data cleaning, transformation, aggregation, import/export |
| Machine learning | ML training and inference | Data preprocessing, model training, hyperparameter tuning, model evaluation |
| Infrastructure automation | Cluster management and ops tasks | Backup/restore, resource cleanup, health checks, bulk operations |
1.4 Argo Workflows vs. Other Workflow Engines
| Feature | Argo Workflows | Jenkins | Airflow |
|---|---|---|---|
| Runtime environment | Kubernetes-native | Standalone server/container | Standalone server/container |
| Definition style | YAML (declarative) | Groovy/Pipeline (imperative) | Python (imperative) |
| Container support | Native; one container per step | Requires plugins | Requires KubernetesExecutor |
| Parallel execution | Automatic | Manual configuration | Supported, but configuration is involved |
| Scalability | Scales with Kubernetes | Nodes configured manually | Workers configured manually |
| Learning curve | Moderate (requires Kubernetes knowledge) | Low | Moderate (requires Python knowledge) |
2. Environment Setup
2.1 Creating a Kind Cluster
Before starting, we need a lab environment. If you do not have a Kubernetes cluster available, you can quickly create a test cluster with kind.
Configuration file:
```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  apiServerAddress: "10.10.151.201"
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 6443
    hostPort: 6443
    listenAddress: "10.10.151.201"
    protocol: tcp
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
```
Create the highly available cluster:
```bash
sudo kind create cluster --config=huari.yaml --name huari-test --image kindest/node:v1.34.0 --retain; sudo kind export logs --name huari-test
```
Switch the kubectl context:
```bash
sudo kubectl cluster-info --context kind-huari-test
```
Inspect the cluster:
```bash
# List the cluster nodes
sudo kubectl get nodes

# List all pods in the cluster
sudo kubectl get pods -A -owide
```
Delete the cluster when you are done:
```bash
sudo kind delete cluster --name huari-test
```
3. Installing Argo Workflows
3.1 Installing with Helm
```bash
# Add the Argo Helm repository
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Install Argo Workflows (test-environment settings)
helm upgrade --install argo-workflows argo/argo-workflows \
  --version 0.47.4 \
  --namespace argo \
  --create-namespace \
  --set server.secure=false \
  --set server.extraArgs[0]="--auth-mode=server" \
  --set workflow.serviceAccount.create=true \
  --set workflow.serviceAccount.name=argo-workflow \
  --set workflow.rbac.create=true \
  --set 'workflow.rbac.rules[0].apiGroups[0]=' \
  --set 'workflow.rbac.rules[0].apiGroups[1]=apps' \
  --set 'workflow.rbac.rules[0].resources[0]=*' \
  --set 'workflow.rbac.rules[0].verbs[0]=*'
```
Configuration notes (a values.yaml equivalent follows below):
- `server.secure=false`: serve plain HTTP instead of HTTPS (test environments only)
- `--auth-mode=server`: server-mode authentication, no login required
- `workflow.serviceAccount.create=true`: create the workflow ServiceAccount automatically
- `workflow.serviceAccount.name=argo-workflow`: the ServiceAccount's name
- `workflow.rbac.create=true`: create the RBAC rules automatically
- `workflow.rbac.rules[0]`: grant workflows full access to resources in the argo namespace (test environments only)
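If you would rather keep these settings in version control than on the command line, the same flags can be expressed as a values file; `--set` paths map one-to-one onto values YAML, so this sketch is equivalent to the flags above:
```yaml
# values.yaml — equivalent of the --set flags above
server:
  secure: false
  extraArgs:
  - --auth-mode=server
workflow:
  serviceAccount:
    create: true
    name: argo-workflow
  rbac:
    create: true
    rules:
    - apiGroups: ["", "apps"]  # core ("") and apps API groups
      resources: ["*"]
      verbs: ["*"]
```
Install with `helm upgrade --install argo-workflows argo/argo-workflows --version 0.47.4 -n argo --create-namespace -f values.yaml`.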
If the network is a problem, you can also install from a locally downloaded chart:
```bash
# Download the chart
helm pull argo/argo-workflows --version 0.47.4

# Install from the local chart
helm upgrade --install argo-workflows ./argo-workflows-0.47.4.tgz \
  --namespace argo \
  --create-namespace \
  --set server.secure=false \
  --set server.extraArgs[0]="--auth-mode=server" \
  --set workflow.serviceAccount.create=true \
  --set workflow.serviceAccount.name=argo-workflow \
  --set workflow.rbac.create=true \
  --set 'workflow.rbac.rules[0].apiGroups[0]=' \
  --set 'workflow.rbac.rules[0].apiGroups[1]=apps' \
  --set 'workflow.rbac.rules[0].resources[0]=*' \
  --set 'workflow.rbac.rules[0].verbs[0]=*'
```
3.2 Verifying the Installation
```bash
# Watch the pod status
kubectl get pods -n argo -w
```
Expected output:
```
NAME                                                  READY   STATUS    RESTARTS   AGE
argo-workflows-server-695449f55-jbqvt                 1/1     Running   0          76s
argo-workflows-workflow-controller-858ff4bbc7-pjl9h   1/1     Running   0          76s
```
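As an extra sanity check, you can confirm that the Argo CRDs were registered (the exact list depends on the chart version):
```bash
# Argo Workflows installs CRDs such as workflows, workflowtemplates, and cronworkflows
kubectl get crd | grep argoproj.io
```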
3.3 Configuring an Artifact Repository (Optional)
If you need to pass files (artifacts) between workflow steps, you must configure an artifact repository. S3, GCS, OSS, and other object stores are supported.
Using MinIO (local testing)
```bash
# Install MinIO (standalone mode, suitable for testing)
helm repo add minio https://charts.min.io/
helm install minio minio/minio \
  --namespace argo \
  --set mode=standalone \
  --set replicas=1 \
  --set persistence.enabled=false \
  --set rootUser=admin \
  --set rootPassword=password123 \
  --set resources.requests.memory=512Mi

# Create the credentials Secret
kubectl create secret generic minio-credentials -n argo \
  --from-literal=accesskey=admin \
  --from-literal=secretkey=password123

# Configure the artifact repository
kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  namespace: argo
data:
  default-v1: |
    s3:
      bucket: argo-artifacts
      endpoint: minio.argo.svc.cluster.local:9000
      insecure: true
      accessKeySecret:
        name: minio-credentials
        key: accesskey
      secretKeySecret:
        name: minio-credentials
        key: secretkey
EOF

# Create the bucket
kubectl apply -f - <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: minio-setup
  namespace: argo
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: mc
        image: minio/mc:latest
        command:
        - /bin/sh
        - -c
        - |
          mc alias set minio http://minio.argo.svc.cluster.local:9000 admin password123
          mc mb minio/argo-artifacts || true
          mc ls minio/
EOF

# Check the Job result
kubectl wait --for=condition=complete --timeout=60s job/minio-setup -n argo
kubectl logs job/minio-setup -n argo

# Clean up the Job
kubectl delete job minio-setup -n argo

# Point the Workflow Controller at the artifact repository
kubectl patch configmap argo-workflows-workflow-controller-configmap -n argo --type merge -p '{
  "data": {
    "config": "nodeEvents:\n  enabled: true\nworkflowEvents:\n  enabled: true\nartifactRepository:\n  archiveLogs: false\n  s3:\n    bucket: argo-artifacts\n    endpoint: minio.argo.svc.cluster.local:9000\n    insecure: true\n    accessKeySecret:\n      name: minio-credentials\n      key: accesskey\n    secretKeySecret:\n      name: minio-credentials\n      key: secretkey\n"
}}'

# Wait for the Workflow Controller to restart and pick up the new config
kubectl get pods -n argo -l app=workflow-controller -w
```
Important notes:
- The Workflow Controller's ConfigMap can contain only a single `config` key
- The `artifactRepository` configuration must live inside that `config` key
- After the ConfigMap is updated, the Workflow Controller restarts automatically
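To confirm the patch landed where the controller expects it, you can print the `config` key back out:
```bash
# The artifactRepository block should appear inside the single config key
kubectl get configmap argo-workflows-workflow-controller-configmap -n argo -o jsonpath='{.data.config}'
```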
Using Alibaba Cloud OSS
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: oss-credentials
  namespace: argo
type: Opaque
stringData:
  accessKey: YOUR_OSS_ACCESS_KEY_ID
  secretKey: YOUR_OSS_ACCESS_KEY_SECRET
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: artifact-repositories
  namespace: argo
data:
  default-v1: |
    s3:
      bucket: argo-artifacts
      endpoint: oss-cn-hangzhou.aliyuncs.com
      insecure: false
      accessKeySecret:
        name: oss-credentials
        key: accessKey
      secretKeySecret:
        name: oss-credentials
        key: secretKey
```
Notes:
- `endpoint`: set according to your OSS bucket's region, for example:
  - Hangzhou: `oss-cn-hangzhou.aliyuncs.com`
  - Beijing: `oss-cn-beijing.aliyuncs.com`
  - Shanghai: `oss-cn-shanghai.aliyuncs.com`
  - Shenzhen: `oss-cn-shenzhen.aliyuncs.com`
- `bucket`: replace with your own OSS bucket name
- OSS is S3-compatible, so the `s3` configuration block works as-is
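Entries in the `artifact-repositories` ConfigMap can also be selected per workflow via `artifactRepositoryRef`, which is handy when different workflows need different buckets. A minimal sketch:
```yaml
# Select the repository defined under the default-v1 key for this workflow only
spec:
  artifactRepositoryRef:
    configMap: artifact-repositories
    key: default-v1
```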
4. Installing the Argo CLI
macOS
```bash
# With Homebrew
brew install argo

# Or download manually
ARGO_WORKFLOWS_VERSION=v3.7.10
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/${ARGO_WORKFLOWS_VERSION}/argo-darwin-arm64.gz
gunzip argo-darwin-arm64.gz
chmod +x argo-darwin-arm64
sudo mv argo-darwin-arm64 /usr/local/bin/argo
```
Linux
```bash
# Set the version (note: this is the Argo Workflows application version, not the Helm chart version)
ARGO_WORKFLOWS_VERSION=v3.7.10

# Download and install
curl -sLO https://github.com/argoproj/argo-workflows/releases/download/${ARGO_WORKFLOWS_VERSION}/argo-linux-amd64.gz
gunzip argo-linux-amd64.gz
chmod +x argo-linux-amd64
sudo mv argo-linux-amd64 /usr/local/bin/argo
```
Verify the installation:
```bash
argo version
```
5. Accessing the Argo Server UI
Port Forward (test environments)
```bash
kubectl -n argo port-forward deployment/argo-workflows-server 2746:2746 --address 0.0.0.0
```
Open http://localhost:2746 or http://YOUR_HOST_IP:2746.
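For a longer-lived test setup you can expose the server via NodePort instead of keeping a port-forward session open. A sketch, assuming the chart created a Service named argo-workflows-server:
```bash
# Switch the server Service to NodePort and read back the assigned port
kubectl -n argo patch svc argo-workflows-server -p '{"spec":{"type":"NodePort"}}'
kubectl -n argo get svc argo-workflows-server
```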
6. Core Concepts
6.1 Workflow
The Workflow is the central resource in Argo Workflows and defines a workflow to execute. It has a dual role:
- Defining the workflow: it describes the workflow's structure and execution logic
- Storing state: it records the workflow's live status and results
Basic structure:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-  # generateName produces a unique name automatically
spec:
  serviceAccountName: argo-workflow
  entrypoint: main  # the entry template
  templates:
  - name: main
    container:
      image: busybox
      command: [echo]
      args: ["Hello World"]
```
6.2 Template Types
Argo Workflows provides nine template types, split into two broad categories:
Template Definitions
These define the actual work to execute:
| Type | Description | Use cases |
|---|---|---|
| Container | Run a container | Execute commands, run applications |
| Script | Run a script | Python/Bash script execution |
| Resource | Operate on K8s resources | Create/delete/update resources |
| Suspend | Pause execution | Manual approval, waiting on a condition |
| HTTP | Send an HTTP request | Call APIs, webhooks |
| Plugin | Run a plugin | Extended functionality |
| Container Set | Run multiple containers | Tasks needing several cooperating containers |
Template Invocators
These control the workflow's execution logic:
| Type | Description | Use cases |
|---|---|---|
| Steps | Run steps in sequence | Linear flows, multi-stage tasks |
| DAG | Directed acyclic graph | Complex dependencies, parallel tasks |
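Most of these types appear in the examples later in this guide; two that do not, Suspend and HTTP, look roughly like this (a minimal sketch based on the upstream template reference):
```yaml
templates:
# Suspend: pause here until `argo resume` is called (or a duration elapses, if set)
- name: wait-for-approval
  suspend: {}

# HTTP: call an endpoint without scheduling a pod (the URL here is a placeholder)
- name: health-check
  http:
    url: https://example.com/healthz
    method: GET
```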
6.3 Parameters
Parameters pass data between templates:
```yaml
spec:
  arguments:
    parameters:
    - name: message
      value: "Hello Argo"
  templates:
  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [echo]
      args: ["{{inputs.parameters.message}}"]
```
6.4 Artifacts
Artifacts pass files between steps:
```yaml
templates:
- name: generate-artifact
  container:
    image: busybox
    command: [sh, -c]
    args: ["echo 'hello world' > /tmp/hello.txt"]
  outputs:
    artifacts:
    - name: hello-art
      path: /tmp/hello.txt

- name: consume-artifact
  inputs:
    artifacts:
    - name: hello-art
      path: /tmp/hello.txt
  container:
    image: busybox
    command: [cat]
    args: ["/tmp/hello.txt"]
```
7. Hands-On Examples
7.1 Example 1: Hello World
The simplest possible workflow:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-world-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  templates:
  - name: main
    container:
      image: busybox
      command: [echo]
      args: ["Hello World"]
```
Submit the workflow:
```bash
# With kubectl
kubectl create -f hello-world.yaml -n argo

# With the argo CLI
argo submit hello-world.yaml -n argo --watch

# View the logs
argo logs @latest -n argo
```
7.2 Example 2: Sequential Steps
Use Steps to run a multi-step workflow in sequence. Each `- -` entry opens a new sequential stage; items within the same stage run in parallel:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: steps-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: step1
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Step 1"
    - - name: step2a
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Step 2a"
      - name: step2b
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Step 2b"
    - - name: step3
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Step 3"

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [echo]
      args: ["{{inputs.parameters.message}}"]
```
Execution order:
- Step 1 runs
- Step 2a and Step 2b run in parallel
- Step 3 runs
Submit the workflow:
```bash
argo submit steps-workflow.yaml -n argo --watch
```
The final result should look similar to:
```
Name:                steps-fsc4w
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 11:17:23 +0800 (56 seconds ago)
Started:             Wed Mar 04 11:17:23 +0800 (56 seconds ago)
Finished:            Wed Mar 04 11:18:19 +0800 (now)
Duration:            56 seconds
Progress:            4/4
ResourcesDuration:   0s*(1 cpu),20s*(100Mi memory)

STEP            TEMPLATE  PODNAME                      DURATION  MESSAGE
 ✔ steps-fsc4w  main
 ├───✔ step1    echo      steps-fsc4w-echo-3409550729  4s
 ├─┬─✔ step2a   echo      steps-fsc4w-echo-454416900   4s
 │ └─✔ step2b   echo      steps-fsc4w-echo-504749757   33s
 └───✔ step3    echo      steps-fsc4w-echo-3563388113  5s
```
7.3 Example 3: Parallel Execution with a DAG
Use a DAG to define complex dependencies:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: A
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Task A"

      - name: B
        dependencies: [A]
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Task B"

      - name: C
        dependencies: [A]
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Task C"

      - name: D
        dependencies: [B, C]
        template: echo
        arguments:
          parameters:
          - name: message
            value: "Task D"

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [echo]
      args: ["{{inputs.parameters.message}}"]
```
Execution graph:
```
  A
 / \
B   C
 \ /
  D
```
Submit the workflow:
```bash
argo submit dag-workflow.yaml -n argo --watch
```
The final result should look similar to:
```
Name:                dag-bhp6s
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 11:20:08 +0800 (30 seconds ago)
Started:             Wed Mar 04 11:20:08 +0800 (30 seconds ago)
Finished:            Wed Mar 04 11:20:38 +0800 (now)
Duration:            30 seconds
Progress:            4/4
ResourcesDuration:   19s*(100Mi memory),0s*(1 cpu)

STEP         TEMPLATE  PODNAME                    DURATION  MESSAGE
 ✔ dag-bhp6s  main
 ├─✔ A       echo      dag-bhp6s-echo-1335943234  4s
 ├─✔ B       echo      dag-bhp6s-echo-1319165615  4s
 ├─✔ C       echo      dag-bhp6s-echo-1302387996  5s
 └─✔ D       echo      dag-bhp6s-echo-1285610377  4s
```
7.4 Example 4: Running Scripts
Use a Script template to run a Python script. The script's standard output is captured and exposed as `{{steps.generate.outputs.result}}`:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: script-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: generate
        template: gen-random-int
    - - name: print
        template: print-message
        arguments:
          parameters:
          - name: message
            value: "{{steps.generate.outputs.result}}"

  - name: gen-random-int
    script:
      image: python:alpine3.23
      command: [python]
      source: |
        import random
        i = random.randint(1, 100)
        print(i)

  - name: print-message
    inputs:
      parameters:
      - name: message
    container:
      image: busybox
      command: [echo]
      args: ["Random number: {{inputs.parameters.message}}"]
```
Submit the workflow:
```bash
argo submit script-workflow.yaml -n argo --watch
```
7.5 Example 5: Passing Artifacts
Pass files between steps.
Prerequisite: artifact passing requires a configured artifact repository; complete the MinIO setup from section 3.3 first.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: generate
        template: generate-file
    - - name: consume
        template: consume-file
        arguments:
          artifacts:
          - name: input-file
            from: "{{steps.generate.outputs.artifacts.output-file}}"

  - name: generate-file
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo 'Hello from artifact' > /tmp/output.txt"]
    outputs:
      artifacts:
      - name: output-file
        path: /tmp/output.txt

  - name: consume-file
    inputs:
      artifacts:
      - name: input-file
        path: /tmp/input.txt
    container:
      image: busybox
      command: [cat]
      args: ["/tmp/input.txt"]
```
Submit the workflow:
```bash
argo submit artifact-workflow.yaml -n argo --watch
```
The final result should look similar to:
```
Name:                artifact-t69b6
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 11:35:08 +0800 (20 seconds ago)
Started:             Wed Mar 04 11:35:08 +0800 (20 seconds ago)
Finished:            Wed Mar 04 11:35:28 +0800 (now)
Duration:            20 seconds
Progress:            2/2
ResourcesDuration:   0s*(1 cpu),8s*(100Mi memory)

STEP               TEMPLATE       PODNAME                                 DURATION  MESSAGE
 ✔ artifact-t69b6  main
 ├───✔ generate    generate-file  artifact-t69b6-generate-file-149832164  4s
 └───✔ consume     consume-file   artifact-t69b6-consume-file-3273100530  4s
```
7.6 Example 6: Conditional Execution
Run different branches based on a condition:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: conditional-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  arguments:
    parameters:
    - name: environment
      value: "production"

  templates:
  - name: main
    steps:
    - - name: check-env
        template: check-environment

    - - name: prod-deploy
        template: deploy
        arguments:
          parameters:
          - name: env
            value: "production"
        when: "{{steps.check-env.outputs.result}} == production"

      - name: dev-deploy
        template: deploy
        arguments:
          parameters:
          - name: env
            value: "development"
        when: "{{steps.check-env.outputs.result}} != production"

  - name: check-environment
    script:
      image: python:alpine3.23
      command: [python]
      source: |
        print("{{workflow.parameters.environment}}")

  - name: deploy
    inputs:
      parameters:
      - name: env
    container:
      image: busybox
      command: [echo]
      args: ["Deploying to {{inputs.parameters.env}}"]
```
Submit the workflow:
```bash
# Production
argo submit conditional-workflow.yaml -n argo -p environment=production --watch

# Development
argo submit conditional-workflow.yaml -n argo -p environment=development --watch
```
The production run should end similar to:
```
Name:                conditional-ltr9x
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 11:42:20 +0800 (34 seconds ago)
Started:             Wed Mar 04 11:42:20 +0800 (34 seconds ago)
Finished:            Wed Mar 04 11:42:54 +0800 (now)
Duration:            34 seconds
Progress:            2/2
ResourcesDuration:   1s*(1 cpu),18s*(100Mi memory)
Parameters:
  environment:       production

STEP                  TEMPLATE           PODNAME                                        DURATION  MESSAGE
 ✔ conditional-ltr9x  main
 ├───✔ check-env      check-environment  conditional-ltr9x-check-environment-504116037  14s
 └─┬─○ dev-deploy     deploy                                                                      when 'production != production' evaluated false
   └─✔ prod-deploy    deploy             conditional-ltr9x-deploy-754292413             4s
```
The development run should end similar to:
```
Name:                conditional-px5t2
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 11:43:19 +0800 (30 seconds ago)
Started:             Wed Mar 04 11:43:19 +0800 (30 seconds ago)
Finished:            Wed Mar 04 11:43:49 +0800 (now)
Duration:            30 seconds
Progress:            2/2
ResourcesDuration:   1s*(1 cpu),19s*(100Mi memory)
Parameters:
  environment:       development

STEP                  TEMPLATE           PODNAME                                         DURATION  MESSAGE
 ✔ conditional-px5t2  main
 ├───✔ check-env      check-environment  conditional-px5t2-check-environment-1115641911  16s
 └─┬─✔ dev-deploy     deploy             conditional-px5t2-deploy-12024433               5s
   └─○ prod-deploy    deploy
```
7.7 Example 7: Retries
Configure retries on failure:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  templates:
  - name: main
    retryStrategy:
      limit: "3"
      retryPolicy: "Always"
      backoff:
        duration: "5s"
        factor: 2
        maxDuration: "1m"
    container:
      image: python:alpine3.23
      command: [python]
      args:
      - -c
      - |
        import random
        import sys
        # fail with 70% probability
        if random.random() < 0.7:
            print("Task failed!")
            sys.exit(1)
        else:
            print("Task succeeded!")
```
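`retryPolicy: "Always"` retries on both failures and errors; the other documented values can be swapped in depending on what should trigger a retry. A sketch:
```yaml
# Retry only on transient errors, up to 5 attempts;
# OnFailure (the usual default) and OnError are the other options
retryStrategy:
  retryPolicy: "OnTransientError"
  limit: "5"
```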
Submit the workflow:
```bash
argo submit retry-workflow.yaml -n argo --watch
```
The final result should look similar to:
```
Name:                retry-cf29c
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 11:44:21 +0800 (10 seconds ago)
Started:             Wed Mar 04 11:44:21 +0800 (10 seconds ago)
Finished:            Wed Mar 04 11:44:31 +0800 (now)
Duration:            10 seconds
Progress:            1/1
ResourcesDuration:   0s*(1 cpu),4s*(100Mi memory)

STEP              TEMPLATE  PODNAME                     DURATION  MESSAGE
 ✔ retry-cf29c(0)  main     retry-cf29c-main-439993472  4s
```
7.8 Example 8: A CI/CD Pipeline
This example demonstrates a CI/CD pipeline with a real open-source project: it checks out, lints, tests, and "deploys" the pallets/flask repository.
Note: for convenience, everything is deployed into the argo namespace.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cicd-pipeline-
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  arguments:
    parameters:
    - name: repo
      value: "https://github.com/pallets/flask.git"
    - name: branch
      value: "main"
    - name: app-name
      value: "flask-demo"

  volumeClaimTemplates:
  - metadata:
      name: workdir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 500Mi

  templates:
  - name: main
    steps:
    - - name: checkout
        template: git-clone

    - - name: lint
        template: run-lint

    - - name: test
        template: run-tests

    - - name: deploy
        template: deploy-app

  - name: git-clone
    container:
      image: alpine/git
      command: [sh, -c]
      args:
      - |
        git clone -b {{workflow.parameters.branch}} --depth 1 {{workflow.parameters.repo}} /work/repo
        cd /work/repo
        echo "Cloned repository: {{workflow.parameters.repo}}"
        echo "Branch: {{workflow.parameters.branch}}"
        echo "Commit: $(git rev-parse HEAD)"
      volumeMounts:
      - name: workdir
        mountPath: /work

  - name: run-lint
    container:
      image: python:3.11-slim
      command: [sh, -c]
      args:
      - |
        cd /work/repo
        pip install --quiet flake8
        echo "Running code linting..."
        flake8 src/flask --count --select=E9,F63,F7,F82 --show-source --statistics || true
        echo "Linting completed"
      volumeMounts:
      - name: workdir
        mountPath: /work

  - name: run-tests
    container:
      image: python:3.11-slim
      command: [sh, -c]
      args:
      - |
        cd /work/repo
        pip install --quiet -e .
        pip install --quiet pytest
        echo "Running tests..."
        pytest tests/ -v || echo "Tests completed with some failures (demo purpose)"
      volumeMounts:
      - name: workdir
        mountPath: /work

  - name: deploy-app
    resource:
      action: apply
      manifest: |
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: {{workflow.parameters.app-name}}-config
          namespace: argo
        data:
          app.info: |
            Application: {{workflow.parameters.app-name}}
            Repository: {{workflow.parameters.repo}}
            Branch: {{workflow.parameters.branch}}
            Deployed: $(date)
```
Submit the workflow:
```bash
# With the default parameters
argo submit cicd-pipeline.yaml -n argo --watch

# Or with custom parameters
argo submit cicd-pipeline.yaml -n argo \
  -p repo=https://github.com/pallets/flask.git \
  -p branch=main \
  -p app-name=my-flask-app \
  --watch
```
The final result should look similar to:
```
Name:                cicd-pipeline-pfg6f
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 12:35:31 +0800 (2 minutes ago)
Started:             Wed Mar 04 12:35:31 +0800 (2 minutes ago)
Finished:            Wed Mar 04 12:37:53 +0800 (now)
Duration:            2 minutes 22 seconds
Progress:            4/4
ResourcesDuration:   5s*(1 cpu),1m15s*(100Mi memory)
Parameters:
  repo:              https://github.com/pallets/flask.git
  branch:            main
  app-name:          flask-demo

STEP                    TEMPLATE    PODNAME                                    DURATION  MESSAGE
 ✔ cicd-pipeline-pfg6f  main
 ├───✔ checkout         git-clone   cicd-pipeline-pfg6f-git-clone-2106511117   47s
 ├───✔ lint             run-lint    cicd-pipeline-pfg6f-run-lint-717278403     26s
 ├───✔ test             run-tests   cicd-pipeline-pfg6f-run-tests-655980579    14s
 └───✔ deploy           deploy-app  cicd-pipeline-pfg6f-deploy-app-2393920939  26s
```
Verify the deployment:
```bash
# Inspect the ConfigMap that was created
kubectl get configmap flask-demo-config -n argo -o yaml
```
8. Reusable WorkflowTemplates
WorkflowTemplate lets you define reusable workflow templates so the same logic can be shared across many workflows. This example uses the gin-gonic/gin project to demonstrate creating and using a template.
8.1 Creating a WorkflowTemplate
```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: go-build-test
  namespace: argo
spec:
  serviceAccountName: argo-workflow
  entrypoint: main
  arguments:
    parameters:
    - name: repo
      value: "https://github.com/gin-gonic/gin.git"
    - name: branch
      value: "master"

  volumes:
  - name: workdir
    emptyDir: {}

  templates:
  - name: main
    steps:
    - - name: checkout
        template: git-clone

    - - name: build
        template: go-build

    - - name: test
        template: go-test

  - name: git-clone
    container:
      image: alpine/git
      command: [sh, -c]
      args:
      - |
        git clone -b {{workflow.parameters.branch}} --depth 1 {{workflow.parameters.repo}} /work/repo
        cd /work/repo
        echo "Repository: {{workflow.parameters.repo}}"
        echo "Branch: {{workflow.parameters.branch}}"
        echo "Commit: $(git rev-parse HEAD)"
      volumeMounts:
      - name: workdir
        mountPath: /work

  - name: go-build
    container:
      image: golang:1.21-alpine
      command: [sh, -c]
      args:
      - |
        cd /work/repo
        echo "Building Go project..."
        go build -v ./...
        echo "Build completed successfully"
      volumeMounts:
      - name: workdir
        mountPath: /work

  - name: go-test
    container:
      image: golang:1.21-alpine
      command: [sh, -c]
      args:
      - |
        cd /work/repo
        echo "Running Go tests..."
        go test -v ./... -short
        echo "Tests completed"
      volumeMounts:
      - name: workdir
        mountPath: /work
```
Create the template:
```bash
kubectl apply -f workflow-template.yaml
```
Verify it was created:
```bash
# List WorkflowTemplates
kubectl get workflowtemplate -n argo

# Show details
kubectl describe workflowtemplate go-build-test -n argo
```
8.2 Using a WorkflowTemplate
Run the template with its default parameters:
```bash
# Submit a workflow directly from the WorkflowTemplate
argo submit --from workflowtemplate/go-build-test -n argo --watch
```
Run it with custom parameters:
```bash
# Use a different branch
argo submit --from workflowtemplate/go-build-test -n argo \
  -p branch=v1.9.1 \
  --watch
```
Or reference the template from a YAML file:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: use-go-template-
spec:
  serviceAccountName: argo-workflow
  workflowTemplateRef:
    name: go-build-test
  arguments:
    parameters:
    - name: repo
      value: "https://github.com/gin-gonic/gin.git"
    - name: branch
      value: "master"
```
Submit the workflow:
```bash
argo submit use-workflow-template.yaml -n argo --watch
```
The final result should look similar to:
```
Name:                go-build-test-r89hh
Namespace:           argo
ServiceAccount:      argo-workflow
Status:              Succeeded
Conditions:
 PodRunning          False
 Completed           True
Created:             Wed Mar 04 12:52:22 +0800 (1 minute ago)
Started:             Wed Mar 04 12:52:22 +0800 (1 minute ago)
Finished:            Wed Mar 04 12:53:38 +0800 (now)
Duration:            1 minute 16 seconds
Progress:            3/3
ResourcesDuration:   56s*(100Mi memory),4s*(1 cpu)
Parameters:
  repo:              https://github.com/gin-gonic/gin.git
  branch:            master

STEP                    TEMPLATE   PODNAME                                    DURATION  MESSAGE
 ✔ go-build-test-r89hh  main
 ├───✔ checkout         git-clone  go-build-test-r89hh-git-clone-1757181677   5s
 ├───✔ build            go-build   go-build-test-r89hh-go-build-3721440394    47s
 └───✔ test             go-test    go-build-test-r89hh-go-test-2594344963     4s
```
9. Scheduled Tasks with CronWorkflow
9.1 Creating a CronWorkflow
```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: backup-daily
  namespace: argo
spec:
  schedule: "0 13 * * *"  # run every day at 13:00
  timezone: "Asia/Shanghai"
  concurrencyPolicy: "Replace"  # Replace, Allow, or Forbid
  startingDeadlineSeconds: 0
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1

  workflowSpec:
    entrypoint: backup
    templates:
    - name: backup
      steps:
      - - name: database-backup
          template: backup-db
      - - name: upload-to-s3
          template: upload-backup

    - name: backup-db
      container:
        image: postgres:15
        command: [sh, -c]
        args:
        - |
          pg_dump -h postgres-host -U postgres mydb > /backup/backup.sql
        volumeMounts:
        - name: backup-volume
          mountPath: /backup

    - name: upload-backup
      container:
        image: amazon/aws-cli
        command: [sh, -c]
        args:
        - |
          aws s3 cp /backup/backup.sql s3://my-bucket/backups/$(date +%Y%m%d).sql
        volumeMounts:
        - name: backup-volume
          mountPath: /backup

    volumeClaimTemplates:
    - metadata:
        name: backup-volume
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
```
Create the scheduled task:
```bash
kubectl apply -f cron-workflow.yaml
```
9.2 Managing CronWorkflows
```bash
# List all CronWorkflows
kubectl get cronworkflow -n argo

# Show details
kubectl describe cronworkflow backup-daily -n argo

# Suspend a CronWorkflow
kubectl patch cronworkflow backup-daily -n argo -p '{"spec":{"suspend":true}}'

# Resume a CronWorkflow
kubectl patch cronworkflow backup-daily -n argo -p '{"spec":{"suspend":false}}'

# Delete a CronWorkflow
kubectl delete cronworkflow backup-daily -n argo
```
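The argo CLI offers equivalent operations through its `cron` subcommand:
```bash
# List, suspend, and resume cron workflows with the argo CLI
argo cron list -n argo
argo cron suspend backup-daily -n argo
argo cron resume backup-daily -n argo
```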
10. Common Commands
10.1 Managing Workflows
```bash
# Submit a workflow
argo submit workflow.yaml -n argo

# Submit and watch
argo submit workflow.yaml -n argo --watch

# Submit with parameters
argo submit workflow.yaml -n argo -p param1=value1 -p param2=value2

# List workflows
argo list -n argo

# Show workflow details
argo get <workflow-name> -n argo

# View workflow logs
argo logs <workflow-name> -n argo

# View the latest workflow's logs
argo logs @latest -n argo

# Delete a workflow
argo delete <workflow-name> -n argo

# Delete all completed workflows
argo delete --completed -n argo

# Resubmit a workflow
argo resubmit <workflow-name> -n argo

# Retry a failed workflow
argo retry <workflow-name> -n argo

# Suspend a workflow
argo suspend <workflow-name> -n argo

# Resume a workflow
argo resume <workflow-name> -n argo

# Terminate a workflow
argo terminate <workflow-name> -n argo

# Stop a workflow
argo stop <workflow-name> -n argo
```
10.2 Managing with kubectl
```bash
# List workflows
kubectl get workflow -n argo

# Show workflow details
kubectl describe workflow <workflow-name> -n argo

# View a workflow's YAML
kubectl get workflow <workflow-name> -n argo -o yaml

# Delete a workflow
kubectl delete workflow <workflow-name> -n argo

# List WorkflowTemplates
kubectl get workflowtemplate -n argo

# List CronWorkflows
kubectl get cronworkflow -n argo
```
11. Troubleshooting
11.1 Workflow Stuck in Pending
Causes:
- Insufficient resources (CPU/memory)
- Image pull failures
- PVC cannot be mounted
Diagnosis:
```bash
# Show workflow details
argo get <workflow-name> -n argo

# Check pod status
kubectl get pods -n argo -l workflows.argoproj.io/workflow=<workflow-name>

# Describe the pod
kubectl describe pod <pod-name> -n argo

# Check pod events
kubectl get events -n argo --field-selector involvedObject.name=<pod-name>
```
Remedies:
- Add cluster capacity
- Check the image reference and pull credentials
- Check the StorageClass configuration
11.2 Workflow Execution Fails
Causes:
- Errors inside the container
- Timeouts
- Resource limits
Diagnosis:
```bash
# View workflow logs
argo logs <workflow-name> -n argo

# View logs for a specific container
argo logs <workflow-name> -n argo -c <container-name>

# Check the workflow phase
kubectl get workflow <workflow-name> -n argo -o jsonpath='{.status.phase}'
```
Remedies:
- Inspect the container logs and fix application errors
- Adjust timeouts
- Raise the resource limits
11.3 Artifact Passing Fails
Causes:
- No artifact repository configured
- Bad S3 credentials
- Network problems
Diagnosis:
```bash
# Inspect the artifact repository configuration
kubectl get configmap artifact-repositories -n argo -o yaml

# View workflow logs
argo logs <workflow-name> -n argo
```
Remedies:
- Configure a working artifact repository
- Verify the S3 credentials
- Check network connectivity
11.4 Permission Problems
Causes:
- The ServiceAccount lacks permissions
- Misconfigured RBAC
Diagnosis:
```bash
# List ServiceAccounts
kubectl get sa -n argo

# List RoleBindings
kubectl get rolebinding -n argo

# Check which ServiceAccount the workflow uses
kubectl get workflow <workflow-name> -n argo -o jsonpath='{.spec.serviceAccountName}'
```
Remedies:
- Create the right ServiceAccount and RBAC rules (see the sketch below)
- Reference that ServiceAccount in the workflow spec
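For reference, a minimal sketch of the RBAC a workflow ServiceAccount needs on recent Argo versions (v3.4+), where the executor reports step results via `workflowtaskresults`; adjust the names to your setup:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: workflow-executor
  namespace: argo
rules:
# The executor writes step results through this resource
- apiGroups: ["argoproj.io"]
  resources: ["workflowtaskresults"]
  verbs: ["create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: workflow-executor
  namespace: argo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: workflow-executor
subjects:
- kind: ServiceAccount
  name: argo-workflow
  namespace: argo
```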
12. Best Practices
12.1 Resource Management
Set resource requests and limits:
```yaml
templates:
- name: resource-limited
  container:
    image: busybox
    command: [sh, -c]
    args: ["echo 'Running with resource limits'"]
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
```
Use node selectors:
```yaml
templates:
- name: gpu-task
  container:
    image: tensorflow/tensorflow:latest-gpu
    command: [python, train.py]
  nodeSelector:
    gpu: "true"
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
12.2 Security Best Practices
Use a dedicated ServiceAccount:
```yaml
spec:
  serviceAccountName: workflow-sa
```
Avoid running as root:
```yaml
templates:
- name: non-root
  container:
    image: busybox
    command: [sh, -c]
    args: ["whoami"]
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
```
Manage sensitive values with Secrets:
```yaml
templates:
- name: use-secret
  container:
    image: busybox
    command: [sh, -c]
    args: ["echo $PASSWORD"]
    env:
    - name: PASSWORD
      valueFrom:
        secretKeyRef:
          name: my-secret
          key: password
```
12.3 Performance Optimization
Run independent tasks in parallel:
```yaml
# Prefer a DAG over Steps when tasks are independent
templates:
- name: parallel-tasks
  dag:
    tasks:
    - name: task1
      template: worker
    - name: task2
      template: worker
    - name: task3
      template: worker
```
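When fan-out gets large, you can cap how many pods run at once with the `parallelism` field, available at both the workflow and template level. A sketch:
```yaml
# At most 4 pods from this workflow run concurrently
spec:
  parallelism: 4
```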
Set sensible timeouts:
```yaml
spec:
  activeDeadlineSeconds: 3600  # 1-hour timeout for the whole workflow
  templates:
  - name: task-with-timeout
    activeDeadlineSeconds: 600  # 10-minute timeout for this template
    container:
      image: busybox
      command: [sleep, "300"]
```
Reuse a PVC across steps:
```yaml
spec:
  volumeClaimTemplates:
  - metadata:
      name: workdir
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
```
12.4 Maintainability
Improve reuse with WorkflowTemplates (see the templateRef sketch below):
```yaml
# A shared, general-purpose template
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: common-tasks
spec:
  templates:
  - name: notify
    inputs:
      parameters:
      - name: message
    container:
      image: curlimages/curl
      command: [sh, -c]
      args:
      - |
        curl -X POST https://hooks.slack.com/services/xxx \
          -d '{"text":"{{inputs.parameters.message}}"}'
```
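Other workflows can then call the shared template via `templateRef`. A sketch, reusing the `common-tasks` template above:
```yaml
steps:
- - name: notify-done
    templateRef:
      name: common-tasks
      template: notify
    arguments:
      parameters:
      - name: message
        value: "Pipeline finished"
```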
Add labels and annotations:
```yaml
metadata:
  generateName: my-workflow-
  labels:
    app: myapp
    env: production
    team: platform
  annotations:
    description: "Daily backup workflow"
    owner: "platform-team@example.com"
```
Use meaningful names:
```yaml
templates:
- name: checkout-source-code  # a clear, descriptive name
  container:
    image: alpine/git
    command: [git, clone, "{{workflow.parameters.repo}}"]
```
13. Summary
Argo Workflows is a powerful Kubernetes-native workflow engine that fits CI/CD, data processing, machine learning, and many other scenarios.
Key strengths:
- ✅ Container-native, deeply integrated with Kubernetes
- ✅ Complex DAGs and parallel execution
- ✅ Rich template types and strong reusability
- ✅ Polished web UI and CLI tooling
- ✅ Active community and ecosystem
Good fits:
- CI/CD pipeline automation
- Batch data processing and ETL
- Machine-learning model training
- Infrastructure automation and operations
Suggested learning path:
- Master the basic concepts (Workflow, Template, Parameters)
- Work through the simple examples (Hello World, Steps, DAG)
- Learn the advanced features (artifact passing, conditional execution, retries)
- Build a real project (CI/CD pipeline, data processing)
- Optimize and monitor (resource management, performance tuning, alerting)