fix(alerts): recover reports stuck in WORKING state after worker crash #39533
Draft
Conversation
When a Celery worker pod crashes mid-execution, the report stays in WORKING state until working_timeout expires (up to 1 hour), then transitions to ERROR instead of retrying.

This adds a stale detection threshold (ALERT_REPORTS_STALE_WORKING_TIMEOUT, default 300s). When a requeued worker sees a WORKING state older than this threshold, it resets to NOOP and re-executes rather than blocking with PreviousWorkingError. The existing working_timeout path (ERROR) is preserved for genuinely runaway jobs.

Also refactors is_on_working_timeout() to accept an already-fetched log, eliminating a redundant DB query on every working-state evaluation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add actionable guidance to the working_timeout field so users know to set it relative to their report's typical execution time. Previously the description only said the field resets a stalled alert to error, with no guidance on what value to choose.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eschutho
commented
Apr 22, 2026
ALERT_REPORTS_DEFAULT_CRON_VALUE = "0 0 * * *"  # every day
# Minimum elapsed time (seconds) before a WORKING state is considered stale
# (e.g. due to a crashed Celery worker) and eligible for reset + retry.
# Must be less than working_timeout on any schedule.
Member
Author
Suggested change
- # Must be less than working_timeout on any schedule.
+ # It would ideally be less than working_timeout on any schedule.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
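For reference, the hunk above only shows the comment lines. Assuming the 300-second default stated in the PR description, the resulting entry in superset/config.py would read roughly as follows once the suggested wording is applied (a sketch; exact placement may differ):

```python
# Sketch only; the 300s default is taken from the PR description and the exact
# placement in superset/config.py may differ.
# Minimum elapsed time (seconds) before a WORKING state is considered stale
# (e.g. due to a crashed Celery worker) and eligible for reset + retry.
# It would ideally be less than working_timeout on any schedule.
ALERT_REPORTS_STALE_WORKING_TIMEOUT = 300
```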
Codecov Report

❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## master #39533 +/- ##
==========================================
+ Coverage 64.54% 64.56% +0.01%
==========================================
Files 2559 2561 +2
Lines 133496 133480 -16
Branches 31028 31018 -10
==========================================
+ Hits 86168 86183 +15
+ Misses 45836 45803 -33
- Partials 1492 1494 +2

SUMMARY
Fixes a bug where alert/report schedules get permanently stuck in `WORKING` state when a Celery worker pod crashes mid-execution (SC-104379).

Root cause: When a pod crashes, the report stays in `WORKING` state. On broker requeue, the new worker enters `ReportWorkingState.next()` and raises `ReportSchedulePreviousWorkingError`, blocking re-execution for up to 1 hour until `working_timeout` expires, at which point it transitions to `ERROR` rather than retrying.

Fix: Adds a three-branch state machine in `ReportWorkingState.next()` (see the sketch after this list):

- `elapsed >= working_timeout` → ERROR (existing behavior, a genuinely runaway job)
- `elapsed >= ALERT_REPORTS_STALE_WORKING_TIMEOUT` (new config, default 300s) → reset to NOOP and re-execute via `ReportNotTriggeredErrorState`
- `elapsed < stale threshold` → `PreviousWorkingError` (existing behavior, might be legitimately running)
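For concreteness, here is a minimal sketch of how the branch ordering fits together. It is an illustration rather than the real Superset code: the attribute names on the schedule and log objects, the config access, and the NOOP hand-off are assumptions for the example; only the thresholds, state names, and the `ReportWorkingState` / `ReportSchedulePreviousWorkingError` / `is_on_working_timeout` names come from this PR's description.

```python
# Hedged sketch of the three-branch logic; NOT the actual Superset implementation.
from datetime import datetime, timezone


class ReportSchedulePreviousWorkingError(Exception):
    """Stand-in for the real Superset exception of the same name."""


def is_on_working_timeout(report_schedule, last_working_log) -> bool:
    # Refactored per this PR to take the already-fetched log instead of
    # issuing another DB query.
    elapsed = (datetime.now(timezone.utc) - last_working_log.start_dttm).total_seconds()
    return elapsed >= report_schedule.working_timeout


class ReportWorkingState:
    def __init__(self, report_schedule, last_working_log, config):
        self._report_schedule = report_schedule
        self._last_working_log = last_working_log  # fetched once by the caller
        self._config = config

    def next(self) -> None:
        elapsed = (
            datetime.now(timezone.utc) - self._last_working_log.start_dttm
        ).total_seconds()

        # Branch 1 (existing): genuinely runaway job -> ERROR.
        if is_on_working_timeout(self._report_schedule, self._last_working_log):
            self._report_schedule.last_state = "ERROR"
            return

        # Branch 2 (new): stale WORKING state left by a crashed worker ->
        # reset to NOOP so the requeued task re-executes (the PR routes this
        # through ReportNotTriggeredErrorState).
        if elapsed >= self._config["ALERT_REPORTS_STALE_WORKING_TIMEOUT"]:
            self._report_schedule.last_state = "NOOP"
            return

        # Branch 3 (existing): recent WORKING state, probably still running.
        raise ReportSchedulePreviousWorkingError()
```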
Also refactors `is_on_working_timeout()` to accept an already-fetched log, eliminating a redundant DB query on every working-state evaluation.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
N/A — backend-only change.
TESTING INSTRUCTIONS
- Unit tests: `pytest tests/unit_tests/commands/report/execute_test.py`
- Manually: crash a Celery worker mid-execution (report left in `WORKING` state), restart the worker, and confirm the report re-executes rather than staying stuck until `working_timeout` expires.

ADDITIONAL INFORMATION
New config key: `ALERT_REPORTS_STALE_WORKING_TIMEOUT` (seconds, default 300). Must be less than `working_timeout` on any schedule.
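Deployments whose reports legitimately run longer than five minutes can raise the threshold in their superset_config.py. A minimal sketch, assuming the key name and default above (the 600 here is just an example value):

```python
# superset_config.py -- example override; keep this below the working_timeout
# configured on your schedules so stale detection fires before the ERROR path.
ALERT_REPORTS_STALE_WORKING_TIMEOUT = 600  # seconds
```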