[inductor] Bound AsyncCompile._wait_futures with compile_worker_wait_timeout by jeffdaily · Pull Request #181293 · pytorch/pytorch · GitHub

[inductor] Bound AsyncCompile._wait_futures with compile_worker_wait_timeout#181293

Open
jeffdaily wants to merge 1 commit into main from jeffdaily/inductor_compile_worker_wait_timeout

Conversation

jeffdaily (Collaborator) commented Apr 23, 2026

Inductor currently calls future.result() on compile-worker futures with no timeout. When a worker's Triton compile stalls, the test subprocess waits until the outer CI 30-min limit fires and gets SIGKILL'd — no signal about which kernel was stuck.

Concrete motivating case: test_sort_dynamic_shape_with_check_cuda hits a super-linear register-allocator complexity pathology in Triton's compile backend on one specific autotune config of a fused sort-with-index kernel. On MI300 the compile completes in 420s; on MI200 it completes in 865s. Either way the shard's 30-min CI budget is consumed by that one compile and the rest of the shard is lost.

With this bound, the test shard continues past the stuck kernel with a RuntimeError naming it, which is immediately actionable:

RuntimeError: Inductor compile-worker future for 'triton_per_fused_sort_0'
did not complete within 300s. Override with
TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT=<seconds>.

Default 300s is chosen from empirical CI data: the slowest legitimate single test observed end-to-end is ~130s, so 300s gives ~2.3x margin. Env override is available if future workloads legitimately exceed this.

CodeCacheFuture subclasses without an underlying concurrent.futures.Future attribute are synchronous and complete on .result(); the bound only wraps real cross-process futures.
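The mechanism described above can be sketched roughly as follows. This is a hypothetical illustration based on the PR text, not the actual diff: the helper name `bounded_result` and the `future` attribute lookup are assumptions, while the env var name, the 300s default, and the error message follow the description.

```python
import concurrent.futures
import os

DEFAULT_TIMEOUT = 300.0  # default bound described in the PR


def _wait_timeout() -> float:
    # Env override, as the RuntimeError message suggests.
    return float(
        os.environ.get("TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT", DEFAULT_TIMEOUT)
    )


def bounded_result(key, code_cache_future):
    # Only real cross-process futures carry a concurrent.futures.Future;
    # synchronous CodeCacheFuture subclasses just complete on .result().
    inner = getattr(code_cache_future, "future", None)
    if isinstance(inner, concurrent.futures.Future):
        try:
            inner.result(timeout=_wait_timeout())
        except concurrent.futures.TimeoutError:
            raise RuntimeError(
                f"Inductor compile-worker future for {key!r} did not complete "
                f"within {_wait_timeout():.0f}s. Override with "
                "TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT=<seconds>."
            ) from None
    return code_cache_future.result()
```

A stuck worker now surfaces as a RuntimeError naming the kernel after the bound elapses, instead of hanging until the outer CI limit SIGKILLs the process.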

Authored with Claude.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo

pytorch-bot added the ciflow/inductor and ciflow/torchtitan (Run TorchTitan integration tests) labels Apr 23, 2026
pytorch-bot commented Apr 23, 2026

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


jansel (Contributor) left a comment:


Add a test.

    for key, result in kernels.items():
        if config.verbose_progress and not isinstance(pbar, _Faketqdm):
            pbar.set_postfix_str(key)
        # Bound cross-process futures with compile_worker_wait_timeout so
        # a stuck compile raises instead of hanging indefinitely.

Rather than doing this manual unpack and waiting again on L773, can we clean this up and have a single code path? E.g., don't reach inside objects with getattr() -- change the .result() API to take a timeout arg.
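The reviewer's suggestion could look roughly like this. A hypothetical sketch only: the class names `LambdaFuture` and `WorkerFuture` are illustrative, not the actual PyTorch subclasses; the point is that every `CodeCacheFuture` accepts `timeout` in `.result()`, so callers keep one code path and never `getattr()` into the object.

```python
import concurrent.futures
from typing import Any, Callable, Optional


class CodeCacheFuture:
    def result(self, timeout: Optional[float] = None) -> Any:
        raise NotImplementedError


class LambdaFuture(CodeCacheFuture):
    """Synchronous work: ignores the timeout, completes immediately."""

    def __init__(self, fn: Callable[[], Any]) -> None:
        self._fn = fn

    def result(self, timeout: Optional[float] = None) -> Any:
        return self._fn()


class WorkerFuture(CodeCacheFuture):
    """Cross-process work: forwards the timeout to the real future."""

    def __init__(self, future: concurrent.futures.Future) -> None:
        self._future = future

    def result(self, timeout: Optional[float] = None) -> Any:
        return self._future.result(timeout=timeout)
```

With this shape, the wait loop reduces to `result.result(timeout=...)` for every kernel, synchronous or cross-process alike.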

umechand-amd (Contributor) commented:
Looks good to me, just add a test.

Comment thread torch/_inductor/config.py
Comment on lines +1509 to +1511
If the intention is just CI, there is an is_ci env var you can use to set a specific timeout in CI

5 participants