[inductor] Bound AsyncCompile._wait_futures with compile_worker_wait_timeout (#181293)
Inductor currently calls future.result() on compile-worker futures with no
timeout. When a worker's Triton compile stalls, the test subprocess waits
until the outer CI 30-min limit fires and gets SIGKILL'd — no signal about
which kernel was stuck.
Concrete motivating case: test_sort_dynamic_shape_with_check_cuda hits a
super-linear register-allocator complexity pathology in Triton's compile
backend on one specific autotune config of a fused sort-with-index kernel.
On MI300 the compile completes in 420s; on MI200 it completes in 865s.
Either way the shard's 30-min CI budget is consumed by that one compile
and the rest of the shard is lost.
With this bound, the test shard continues past the stuck kernel with a
RuntimeError naming it, which is immediately actionable:
```
RuntimeError: Inductor compile-worker future for 'triton_per_fused_sort_0'
did not complete within 300s. Override with
TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT=<seconds>.
```
Default 300s is chosen from empirical CI data: the slowest legitimate
single test observed end-to-end is ~130s, so 300s gives ~2.3x margin.
Env override is available if future workloads legitimately exceed this.
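The bounded wait described above can be sketched roughly like this. This is a minimal illustration, not the actual PR code: `wait_bounded` is a hypothetical helper name, and only the env-var name and error-message wording are taken from the PR description.

```python
import concurrent.futures
import os


def wait_bounded(key, future, default_timeout=300.0):
    # Hypothetical helper (not the PR's actual code): bound a
    # cross-process compile-worker future so a stuck Triton compile
    # surfaces as a RuntimeError naming the kernel, instead of
    # hanging until the outer CI limit SIGKILLs the process.
    timeout = float(
        os.environ.get("TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT", default_timeout)
    )
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(
            f"Inductor compile-worker future for '{key}' did not complete "
            f"within {timeout:.0f}s. Override with "
            "TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT=<seconds>."
        ) from None
```

A future that completes in time returns its result unchanged, so the fast path is unaffected.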
CodeCacheFuture subclasses without an underlying concurrent.futures.Future
attribute are synchronous and complete on .result(); the bound only wraps
real cross-process futures.
Authored with Claude.
```python
for key, result in kernels.items():
    if config.verbose_progress and not isinstance(pbar, _Faketqdm):
        pbar.set_postfix_str(key)
    # Bound cross-process futures with compile_worker_wait_timeout so
```
Rather than doing this manual unpack and waiting again on L773, can we clean this up and have a single code path? E.g. don't reach inside objects with `getattr()` -- change the `.result()` API to take a timeout arg.
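The suggested single code path might look roughly like this. An illustrative sketch only: the subclass names below are made up, and this is not the actual PyTorch `CodeCacheFuture` hierarchy.

```python
import concurrent.futures


class CodeCacheFuture:
    # Sketch of the reviewer's suggestion: push the timeout into
    # result() so callers never getattr() into the object to find
    # an underlying concurrent.futures.Future.
    def result(self, timeout=None):
        raise NotImplementedError


class SyncResult(CodeCacheFuture):
    # Hypothetical synchronous subclass: the value already exists,
    # so the timeout is accepted but irrelevant.
    def __init__(self, value):
        self._value = value

    def result(self, timeout=None):
        return self._value


class WorkerResult(CodeCacheFuture):
    # Hypothetical cross-process subclass: only here does the
    # timeout actually bound a real future.
    def __init__(self, future):
        self._future = future

    def result(self, timeout=None):
        return self._future.result(timeout=timeout)
```

With this shape, the wait loop calls `result(timeout=...)` uniformly and the per-subclass behavior decides whether the bound applies.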
Looks good to me, just add a test.
If the intention is just CI, there is an `is_ci` env var you can use to set a specific timeout in CI.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo