[inductor] Bound AsyncCompile._wait_futures with compile_worker_wait_timeout (#181293)
Inductor currently calls future.result() on compile-worker futures with no
timeout. When a worker's Triton compile stalls, the test subprocess waits
until the outer CI 30-min limit fires and gets SIGKILL'd — no signal about
which kernel was stuck.
Concrete motivating case: test_sort_dynamic_shape_with_check_cuda hits a
super-linear register-allocator complexity pathology in Triton's compile
backend on one specific autotune config of a fused sort-with-index kernel.
On MI300 the compile completes in 420s; on MI200 it completes in 865s.
Either way the shard's 30-min CI budget is consumed by that one compile
and the rest of the shard is lost.
With this bound, the test shard continues past the stuck kernel with a
RuntimeError naming it, which is immediately actionable:
```
RuntimeError: Inductor compile-worker future for 'triton_per_fused_sort_0'
did not complete within 300s. Override with
TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT=<seconds>.
```
Default 300s is chosen from empirical CI data: the slowest legitimate
single test observed end-to-end is ~130s, so 300s gives ~2.3x margin.
Env override is available if future workloads legitimately exceed this.
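The bounded wait described above can be sketched roughly like this. This is a minimal illustration, not the actual PR code: `wait_bounded` is a hypothetical helper name, and only the env-var name and error-message wording are taken from the PR description.

```python
import concurrent.futures
import os


def wait_bounded(key, future, default_timeout=300.0):
    # Hypothetical helper (not the PR's actual code): bound a
    # cross-process compile-worker future so a stuck Triton compile
    # surfaces as a RuntimeError naming the kernel, instead of
    # hanging until the outer CI limit SIGKILLs the process.
    timeout = float(
        os.environ.get("TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT", default_timeout)
    )
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(
            f"Inductor compile-worker future for '{key}' did not complete "
            f"within {timeout:.0f}s. Override with "
            "TORCHINDUCTOR_COMPILE_WORKER_WAIT_TIMEOUT=<seconds>."
        ) from None
```

A future that completes in time returns its result unchanged, so the fast path is unaffected.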
CodeCacheFuture subclasses without an underlying concurrent.futures.Future
attribute are synchronous and complete on .result(); the bound only wraps
real cross-process futures.
Authored with Claude.
```python
for key, result in kernels.items():
    if config.verbose_progress and not isinstance(pbar, _Faketqdm):
        pbar.set_postfix_str(key)
    # Bound cross-process futures with compile_worker_wait_timeout so
```
Rather than doing this manual unpack and waiting again on L773, can we clean this up and have a single code path? E.g. don't reach inside objects with `getattr()` -- change the `.result()` API to take a timeout arg.
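The suggested single code path might look roughly like this. An illustrative sketch only: the subclass names below are made up, and this is not the actual PyTorch `CodeCacheFuture` hierarchy.

```python
import concurrent.futures


class CodeCacheFuture:
    # Sketch of the reviewer's suggestion: push the timeout into
    # result() so callers never getattr() into the object to find
    # an underlying concurrent.futures.Future.
    def result(self, timeout=None):
        raise NotImplementedError


class SyncResult(CodeCacheFuture):
    # Hypothetical synchronous subclass: the value already exists,
    # so the timeout is accepted but irrelevant.
    def __init__(self, value):
        self._value = value

    def result(self, timeout=None):
        return self._value


class WorkerResult(CodeCacheFuture):
    # Hypothetical cross-process subclass: only here does the
    # timeout actually bound a real future.
    def __init__(self, future):
        self._future = future

    def result(self, timeout=None):
        return self._future.result(timeout=timeout)
```

With this shape, the wait loop calls `result(timeout=...)` uniformly and the per-subclass behavior decides whether the bound applies.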
Looks good to me, just add a test.
If the intention is just CI, there is an `is_ci` env var you can use to set a specific timeout in CI.

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @jataylo