[RFC] retry: route action — Phase 2 of error-routing (companion to #229) by PolyphonyRequiem · Pull Request #236 · microsoft/conductor · GitHub

[RFC] retry: route action — Phase 2 of error-routing (companion to #229)#236

Draft
PolyphonyRequiem wants to merge 1 commit into microsoft:main from PolyphonyRequiem:proposal/retry-route-action


@PolyphonyRequiem

Member

Summary

Phase 2 RFC for the error-routing work in #229: adds a retry: route action that re-executes the same node on failure before escalating to a fallback error route.

This is a design document only — no implementation code. Filed as a DRAFT PR so it shows up in the conductor PR list alongside #229 and #227.

Design document: docs/proposals/retry-route-action.md


Motivation

PR #229 adds on_error: routing (catch and route elsewhere). It explicitly defers retry: to Phase 2. The gap: 14 of 19 polyphony AB#3257 human-gate nodes need retry-before-escalation semantics for idempotent infrastructure operations (git push, PR open, merge poll). Without retry:, transient network failures abort entire workflow runs; operators must re-trigger manually.

What this PR contains

  • docs/proposals/retry-route-action.md: full design covering schema, engine behavior, sub-workflow and for_each interaction, test plan, and open questions.

Sequencing

Phase 1: PR #229 merges (blocked on context.py sentinel conflict)
  ↓
Phase 2: This RFC approved
  ↓
Phase 2 implementation PR
  ↓
Polyphony YAML retrofit: 14 idempotent-retry gates removed (AB#3257)

Key design decisions (details in doc)

  • retry: only valid on error routes (on_error: required alongside)
  • to: optional when retry: is set (lint warning if both present)
  • max counts re-runs only, not the first attempt: max: 3 allows up to 4 total executions
  • Backoff: fixed or exponential, with initial_seconds and jitter (default ±25%)
  • Exhaustion: falls through to next matching on_error: route in document order
  • CONDUCTOR_RETRY_ATTEMPT env var + {{ conductor.retry_attempt }} template context
  • Sub-workflow retry: re-runs entire child workflow (Phase 1 validator restriction lifted)
  • for_each retry: per-iteration only (failed item retried, rest unaffected)
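The max, backoff, and exhaustion rules above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: `RetryPolicy`, `backoff_delay`, and `run_with_retry` are hypothetical names, the doubling factor for exponential backoff is an assumption (the summary does not state the multiplier), and passing a 0-based attempt number for the first run is an assumption about how `conductor.retry_attempt` might be numbered.

```python
import random
import time
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class RetryPolicy:
    """Hypothetical stand-in for the proposed retry: block."""
    max: int = 3                  # re-runs only; the first attempt is not counted
    backoff: str = "exponential"  # "fixed" or "exponential"
    initial_seconds: float = 1.0
    jitter: float = 0.25          # +/-25%, the default named in the summary


def backoff_delay(policy: RetryPolicy, retry_number: int,
                  rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before re-run number `retry_number` (1-based)."""
    if policy.backoff == "fixed":
        base = policy.initial_seconds
    elif policy.backoff == "exponential":
        # Assumption: the delay doubles on each re-run.
        base = policy.initial_seconds * 2 ** (retry_number - 1)
    else:
        raise ValueError(f"unknown backoff: {policy.backoff!r}")
    # Scale by a factor drawn uniformly from [1 - jitter, 1 + jitter).
    factor = 1.0 + policy.jitter * (2.0 * rng() - 1.0)
    return base * factor


def run_with_retry(node: Callable[[int], Any], policy: RetryPolicy,
                   sleep: Callable[[float], None] = time.sleep,
                   rng: Callable[[], float] = random.random) -> Tuple[Any, int]:
    """Run `node(attempt)` at most `policy.max + 1` times.

    Returns (result, executions). On exhaustion the last error is re-raised,
    which is the point where an engine would fall through to the next
    matching on_error: route.
    """
    executions = 0
    for attempt in range(policy.max + 1):  # attempt 0 is the first run
        executions += 1
        try:
            return node(attempt), executions
        except Exception:
            if attempt == policy.max:
                raise  # retries exhausted
            sleep(backoff_delay(policy, attempt + 1, rng))
    raise AssertionError("unreachable")


def flaky(attempt: int) -> str:
    """Example node: fails on its first two executions, then succeeds."""
    if attempt < 2:
        raise ConnectionError("transient failure")
    return "ok"
```

With `RetryPolicy(max=3)`, a node that always fails is executed four times before the error propagates, matching the "max: 3 = 4 total executions" rule. Injecting `sleep` and `rng` keeps the sketch testable without real delays.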

cc @jasonrobertfox — companion to #227 (RFC) and #229 (Phase 1).

Filed by Mahler (Conductor Expert) on the polyphony squad.

