fix(alerts): recover reports stuck in WORKING state after worker crash by eschutho · Pull Request #39533 · apache/superset · GitHub

fix(alerts): recover reports stuck in WORKING state after worker crash#39533

Draft
eschutho wants to merge 3 commits into master from alerts-reports-working-timeout-sc-104379

Conversation

@eschutho
Member

@eschutho eschutho commented Apr 22, 2026

SUMMARY

Fixes a bug where alert/report schedules get permanently stuck in WORKING state when a Celery worker pod crashes mid-execution (SC-104379).

Root cause: When a pod crashes, the report stays in WORKING state. On broker requeue, the new worker enters ReportWorkingState.next() and raises ReportSchedulePreviousWorkingError — blocking re-execution for up to 1 hour until working_timeout expires, at which point it transitions to ERROR rather than retrying.

Fix: Adds a three-branch state machine in ReportWorkingState.next():

  1. elapsed >= working_timeout → ERROR (existing behavior — genuinely runaway job)
  2. elapsed >= ALERT_REPORTS_STALE_WORKING_TIMEOUT (new config, default 300s) → reset to NOOP and re-execute via ReportNotTriggeredErrorState
  3. elapsed < stale threshold → raise PreviousWorkingError (existing behavior — might be legitimately running)

Also refactors is_on_working_timeout() to accept an already-fetched log, eliminating a redundant DB query on every working-state evaluation.
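The three branches can be sketched in plain Python. This is a minimal sketch: the state names and the two timeout names come from the PR summary above, while the standalone function and its signature are illustrative assumptions, not the actual `ReportWorkingState.next()` implementation.

```python
# Sketch of the three-branch decision described above. Config and state names
# come from the PR summary; the standalone function itself is illustrative.
ALERT_REPORTS_STALE_WORKING_TIMEOUT = 300  # new config key, default 300s


def next_working_action(elapsed: float, working_timeout: float) -> str:
    """Decide what a requeued worker does with a report found in WORKING state."""
    if elapsed >= working_timeout:
        # Branch 1: genuinely runaway job -> existing behavior, transition to ERROR.
        return "ERROR"
    if elapsed >= ALERT_REPORTS_STALE_WORKING_TIMEOUT:
        # Branch 2: likely a crashed worker -> reset to NOOP and re-execute
        # via ReportNotTriggeredErrorState.
        return "RESET_AND_RE_EXECUTE"
    # Branch 3: might be legitimately running -> keep raising
    # ReportSchedulePreviousWorkingError.
    return "PREVIOUS_WORKING_ERROR"
```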

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A — backend-only change.

TESTING INSTRUCTIONS

  1. Unit tests cover all three branches — run pytest tests/unit_tests/commands/report/execute_test.py
  2. To manually verify: create an alert/report, start a Celery worker, kill the worker mid-execution (while report is in WORKING state), restart the worker, confirm the report re-executes rather than staying stuck until working_timeout expires.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration
  • Introduces new feature or API
  • Removes existing feature or API

New config key: ALERT_REPORTS_STALE_WORKING_TIMEOUT (seconds, default 300). Must be less than working_timeout on any schedule.
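In deployment terms this could look like the following superset_config.py fragment. The key name and 300s default come from this PR; the working_timeout figure in the comment is only an example value.

```python
# superset_config.py (sketch): key name and default are from this PR.
# Seconds before a WORKING state is considered stale and eligible for retry.
ALERT_REPORTS_STALE_WORKING_TIMEOUT = 300

# Keep this below the working_timeout configured on each schedule; e.g. a
# schedule with working_timeout=3600 then has a 300s..3600s retry window.
```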

When a Celery worker pod crashes mid-execution, the report stays in
WORKING state until working_timeout expires (up to 1 hour), then
transitions to ERROR instead of retrying.

This adds a stale detection threshold (ALERT_REPORTS_STALE_WORKING_TIMEOUT,
default 300s). When a requeued worker sees a WORKING state older than this
threshold, it resets to NOOP and re-executes rather than blocking with
PreviousWorkingError. The existing working_timeout path (ERROR) is
preserved for genuinely runaway jobs.

Also refactors is_on_working_timeout() to accept an already-fetched log,
eliminating a redundant DB query on every working-state evaluation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
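The is_on_working_timeout() refactor can be sketched as follows. This is a hedged sketch: the function name and the "accept an already-fetched log" change are from the PR description, while the parameter and attribute names are assumptions.

```python
from datetime import datetime, timezone


# Sketch of the refactored helper: it now receives the already-fetched
# execution log instead of re-querying it. Attribute names are assumptions.
def is_on_working_timeout(report_schedule, last_working_log) -> bool:
    if last_working_log is None:
        return False
    elapsed = (
        datetime.now(timezone.utc) - last_working_log.start_dttm
    ).total_seconds()
    return elapsed >= report_schedule.working_timeout
```

Passing the log in lets the caller reuse the row it already loaded for the stale-threshold check, which is what eliminates the redundant query.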
@netlify

netlify Bot commented Apr 22, 2026

Add actionable guidance to the working_timeout field so users know to
set it relative to their report's typical execution time. Previously
the description only said the field resets a stalled alert to error,
with no guidance on what value to choose.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Comment thread superset/config.py Outdated
ALERT_REPORTS_DEFAULT_CRON_VALUE = "0 0 * * *" # every day
# Minimum elapsed time (seconds) before a WORKING state is considered stale
# (e.g. due to a crashed Celery worker) and eligible for reset + retry.
# Must be less than working_timeout on any schedule.
Member Author


Suggested change
# Must be less than working_timeout on any schedule.
# It would ideally be less than working_timeout on any schedule.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.56%. Comparing base (05fc5bb) to head (3fe7b64).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
superset/commands/report/execute.py 35.71% 7 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #39533      +/-   ##
==========================================
+ Coverage   64.54%   64.56%   +0.01%     
==========================================
  Files        2559     2561       +2     
  Lines      133496   133480      -16     
  Branches    31028    31018      -10     
==========================================
+ Hits        86168    86183      +15     
+ Misses      45836    45803      -33     
- Partials     1492     1494       +2     
Flag Coverage Δ
hive 39.86% <13.33%> (+0.03%) ⬆️
javascript 66.52% <ø> (ø)
mysql 60.43% <40.00%> (+0.04%) ⬆️
postgres 60.51% <40.00%> (+0.04%) ⬆️
presto 41.64% <13.33%> (+0.03%) ⬆️
python 62.08% <40.00%> (+0.04%) ⬆️
sqlite 60.14% <40.00%> (+0.04%) ⬆️
unit 100.00% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


