iframe-proxy

ryzhyk · 2026-04-08T22:03:41Z

The balancer used to trigger a rebalancing on a restart from a checkpoint. This was a bad idea in hindsight. Not only it slowed down failover, it also caused the initial step executed before the pipeline transitioned to the RUNNING state potentially take a very long time. During this time none of the monitoring features are available, making this a potentially very long and awkward silence.

This commit disables this behavior. It also introduces a new DBSP-level API that disables automatic rebalancing altogether. The plan is to expose this API externally, allowing the user to control when rebalancings are allowed to happen, e.g., a latency-sensitive workload may disable rebalancing after initial backfill. This will be combined with an API to request a rebalancing on demand.

For now this API is only used to disable rebalancing during the initial step, making sure that a regular (non-forced) rebalancing doesn't trigger accidentally at that point.

This commit additionally skips the initial step when the pipeline is bootstrapping. The reason is similar to the above: bootstrapping steps can be expensive, so we don't want to perform them while the pipeline is initializing. This does mean that the pipeline won't initialize output snapshots until bootstrapping completes, but that was already the case previously.

The above fix revealed a bug in the implementation of adaptive joins during bootstrapping. The PR includes bug fix commits.

Describe Manual Test Plan

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

OpenAPI / REST HTTP API / feldera-types / manager (What is a breaking change?)
Feldera SQL (Syntax, Semantics)
feldera-sqllib (incl. dependencies fxp, etc.) (What is a breaking change?)
Python SDK (What is a breaking change?)
fda (CLI arguments)
Adapters (including configuration)
Storage Format / Checkpoints
Others (specify)

Describe Incompatible Changes

mythical-fred

LGTM

blp

Can not rebalancing on the initial step make that step take much longer?

ryzhyk · 2026-04-09T16:34:30Z

ryzhyk · 2026-04-09T16:35:28Z

PS. I need to add an integration test before marking this PR ready.

blp · 2026-04-09T16:44:20Z

Can not rebalancing on the initial step make that step take much longer?

The initial step doesn't get any input and should normally complete instantaneously. We've identified two cases where this was not true: (1) rebalancing kicks in, which can be as bad as complete backfill in the worst case, (2) bootstrapping kicks -- less dramatic, but can still be bad.

Oh, I get it now, I think: without rebalancing, there's no work to do because there's no input; with rebalancing, the entire integral up to this point ends up being processed. Is that right?

ryzhyk · 2026-04-09T17:04:40Z

Can not rebalancing on the initial step make that step take much longer?

The initial step doesn't get any input and should normally complete instantaneously. We've identified two cases where this was not true: (1) rebalancing kicks in, which can be as bad as complete backfill in the worst case, (2) bootstrapping kicks -- less dramatic, but can still be bad.

Oh, I get it now, I think: without rebalancing, there's no work to do because there's no input; with rebalancing, the entire integral up to this point ends up being processed. Is that right?

correct!

mythical-fred

LGTM

The balancer used to trigger a rebalancing on a restart from a checkpoint. This was a bad idea in hindsight. Not only it slowed down failover, it also caused the initial step executed before the pipeline transitioned to the RUNNING state potentially take a very long time. During this time none of the monitoring features are available, making this a very long an awkward silence. This commit disables this behavior. It also introduces a new DBSP-level API that disables automatic rebalancing altogether. The plan is to expose this API externally, allowing the user to control when rebalancings are allowed to happen, e.g., a latency-sensitive workload may disable rebalancing after initial backfill. This will be combined with an API to request a rebalancing on demand. For now this API is only used to disable rebalancing during the initial step, making sure that a regular (non-forced) rebalancing doesn't trigger accidentally at that point. This commit additionally skips the initial step when the pipeline is bootstrapping. The reason is similar to the above: bootstrapping steps can be expensive, so we don't want to perform them while the pipeline is initializing. This does mean that the pipeline won't initialize output snapshots until bootstrapping completes, but that was already the case previously. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

Dedup a repetitive expression. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

Improve debug output in replay tests. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

Add a test that combines rebalancing, checkpointing, and bootstrapping. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

This fixes a bug in adaptive joins that can manifest during bootstrapping. We did not correctly handle streams that did not participate in bootstrapping during bootstrapping. Such streams cannot change their balancing policy until bootstrapping completes; however we allowed the balancer to assign new policies to such streams. These policies were ignored leading to incorrect output and inconsistent internal state. As part of this fix, we also distinguish streams that haven't been assigned a policy yet instead of assigning a default policy (Shard) to them. Such default policies can be inconsistent during bootstrapping. This also improved debuggability as we can distinguish cases when the balancer when the balancer hasn't assigned a policy yet. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

In replay mode, Z1Trace and AccumulateZ1Trace operators could continue producing outputs after returning `true` from `is_flush_complete`, which violates the transaction commit protocol. This commit fixes this behavior. It also addresses a related issue where strict operators (Z1) did not distinguish between `flush` being called on the input vs output half of the operator. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

During bootstrapping, disabled operators do not participate in transactions and we shouldn't call start_transaction on such operators. This could cause a livelock in situations where some of the streams in a join cluster were inactive. Calling start_transaction on such streams set their flush_state to TransactionStarted. As a result RabalancingExchangeSender::ready_to_commit always returned false for all other streams in the cluster. With this commit, we no longer call `start_transaction` on disabled operator. Additionally, we fix the `ready_to_commit` implementation to ignore disabled operators when deciding whether to commit a transaction. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

Make the test more complicated to trigger the bug fixed in the previous commit. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

mihaibudiu · 2026-04-23T23:22:31Z

            //                 None
            //             };
-            //             Some(crate::utils::DotNodeAttributes::new().with_color(color))
+            //             Some(crate::utils::DotNodeAttributes::new().with_color(color).with_label(&format!("{}:{}", node.name(), node.local_id())))


This is still alive?

yes, I often uncomment it for debugging

ryzhyk requested a review from blp April 8, 2026 22:03

ryzhyk added the DBSP core Related to the core DBSP library label Apr 8, 2026

mythical-fred approved these changes Apr 9, 2026

View reviewed changes

blp approved these changes Apr 9, 2026

View reviewed changes

ryzhyk force-pushed the init_step branch from 74e8623 to adb56a2 Compare April 12, 2026 17:49

ryzhyk marked this pull request as ready for review April 12, 2026 17:50

mythical-fred approved these changes Apr 12, 2026

View reviewed changes

ryzhyk temporarily deployed to ci April 16, 2026 14:51 — with GitHub Actions Inactive

ryzhyk added 8 commits April 23, 2026 11:30

[dbsp] Small cleanup.

29a0f94

Dedup a repetitive expression. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

[dbsp] Debug output.

bda51e6

Improve debug output in replay tests. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

[py] Integration test for adaptive joins.

58da169

Add a test that combines rebalancing, checkpointing, and bootstrapping. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

[dbsp] Improved bootstrapping test.

d4f39a1

Make the test more complicated to trigger the bug fixed in the previous commit. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>

ryzhyk force-pushed the init_step branch from 5be2775 to d4f39a1 Compare April 23, 2026 23:13

ryzhyk enabled auto-merge April 23, 2026 23:16

ryzhyk added this pull request to the merge queue Apr 23, 2026

mihaibudiu reviewed Apr 23, 2026

View reviewed changes

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 23, 2026

ryzhyk added this pull request to the merge queue Apr 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 24, 2026

ryzhyk added this pull request to the merge queue Apr 24, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 24, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dbsp] Don't rebalance during initial step.#6013

[dbsp] Don't rebalance during initial step.#6013
ryzhyk wants to merge 8 commits intomainfrom
init_step

ryzhyk commented Apr 8, 2026 •

edited

Loading

Uh oh!

mythical-fred left a comment

Uh oh!

blp left a comment

Uh oh!

ryzhyk commented Apr 9, 2026

Uh oh!

ryzhyk commented Apr 9, 2026

Uh oh!

blp commented Apr 9, 2026

Uh oh!

ryzhyk commented Apr 9, 2026

Uh oh!

mythical-fred left a comment

Uh oh!

mihaibudiu Apr 23, 2026

Uh oh!

ryzhyk Apr 23, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Conversation

ryzhyk commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe Manual Test Plan

Checklist

Breaking Changes?

Describe Incompatible Changes

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

blp left a comment

Choose a reason for hiding this comment

Uh oh!

ryzhyk commented Apr 9, 2026

Uh oh!

ryzhyk commented Apr 9, 2026

Uh oh!

blp commented Apr 9, 2026

Uh oh!

ryzhyk commented Apr 9, 2026

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

mihaibudiu Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

ryzhyk Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ryzhyk commented Apr 8, 2026 •

edited

Loading