[dbsp] Use `Cell` in place of `RefCell` where we can. by blp · Pull Request #5956 · feldera/feldera · GitHub
Skip to content

[dbsp] Use Cell in place of RefCell where we can.#5956

Merged
blp merged 2 commits intomainfrom
cell
Apr 3, 2026
Merged

[dbsp] Use Cell in place of RefCell where we can.#5956
blp merged 2 commits intomainfrom
cell

Conversation

@blp
Copy link
Copy Markdown
Member

@blp blp commented Mar 30, 2026

RefCell is relatively expensive because it maintains a borrow counter and associated lock-like logic. Cell is just a transparent wrapper around the inner type, so that it uses less time and space.

Describe Manual Test Plan

I ran the unit tests.

@blp blp requested a review from ryzhyk March 30, 2026 19:18
@blp blp self-assigned this Mar 30, 2026
@blp blp added DBSP core Related to the core DBSP library performance rust Pull requests that update Rust code labels Mar 30, 2026
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a no-brainer. Thank you!

@blp blp added this pull request to the merge queue Apr 1, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 2, 2026
@blp
Copy link
Copy Markdown
Member Author

blp commented Apr 2, 2026

@blp
Copy link
Copy Markdown
Member Author

blp commented Apr 2, 2026

The panic comes inside Drop for ImmutableFileRef, which atempts to evict the file being dropped from the buffer cache. But ImmutableFileRef doesn't have a reference to the BufferCache, it only has a fn() -> Arc<BufferCache> that it can call to get a buffer cache. This fn is actually Runtime::buffer_cache, which is what's panicking.

I guess what's happening is that a merger thread is opening a Reader and taking an Arc to it, and that occasionally the last reference to that Arc is somehow dropped outside the merger task but still in a thread created by the merger (and therefore marked as ThreadType::Background). That would naturally produce the panic in question.

@gz @ryzhyk Any thoughts?

@blp
Copy link
Copy Markdown
Member Author

blp commented Apr 2, 2026

One option would be to drop the attempt to evict data in the ImmutableFileRef drop implementation. I added that because I guessed it would help, but I do not have any measurements demonstrating it.

@blp
Copy link
Copy Markdown
Member Author

blp commented Apr 2, 2026

Another solution would be to give ImmutableFileRef a direct handle to its buffer cache rather than a function to call to get one. The status quo is there because buffer caches were (and often are) worker- or thread-specific. But then, at least for the thread-specific case, we'd have to figure out how or when to switch buffer caches.

If we had only a single buffer cache for the runtime, there would not be a problem.

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented Apr 2, 2026

I guess the question is how can this happen outside of one of the tasks we spawn? We terminate the tokio runtime when killing the DBSP runtime like this:

        // Terminate the tokio merger runtime.
        if let Some(tokio_merger_runtime) = self
            .runtime
            .inner()
            .tokio_merger_runtime
            .lock()
            .unwrap()
            .take()
        {
            // Block until all running merger tasks have yielded. At this point, the tokio runtime will have shut down automatically.
            drop(tokio_merger_runtime);
        }

This is probably when this happens. But why?

@gz
Copy link
Copy Markdown
Contributor

gz commented Apr 2, 2026

One option would be to drop the attempt to evict data in the ImmutableFileRef drop implementation. I added that because I guessed it would help, but I do not have any measurements demonstrating it.

Since we don't know if it helps we could just drop this 'feature'.

I was questioning this choice when we worked on FBuf allocation.
I agree it could be helpful but I think most caches don't do it and it could also be harmful causing somewhat unnecessary churn on the cache.

If we trust the eviction policy to be "good enough" it should be fine to just skip the step.
It might also happen because of our Runtime/NoRuntime mess in testing.

@ryzhyk
Copy link
Copy Markdown
Contributor

ryzhyk commented Apr 2, 2026

Apparently this is to be expected:
image

@blp
Copy link
Copy Markdown
Member Author

blp commented Apr 2, 2026

Destructors are always trouble.

@blp
Copy link
Copy Markdown
Member Author

blp commented Apr 2, 2026

Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix makes sense.

`RefCell` is relatively expensive because it maintains a borrow counter
and associated lock-like logic.  `Cell` is just a transparent wrapper
around the inner type, so that it uses less time and space.

Signed-off-by: Ben Pfaff <blp@feldera.com>
@blp blp enabled auto-merge April 3, 2026 14:50
@blp blp added this pull request to the merge queue Apr 3, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 3, 2026
Sometimes, running the unit tests failed with:

```
panicked at crates/dbsp/src/circuit/runtime.rs:972:64:
cannot access a task-local storage value without setting it first
thread panicked while processing panic. aborting.
Aborted (core dumped)
```

The panic comes inside Drop for ImmutableFileRef, which attempts to
evict the file being dropped from the buffer cache.  But
ImmutableFileRef doesn't have a reference to the BufferCache, it only
has a fn() -> Arc<BufferCache> that it can call to get a buffer
cache.  This fn is actually Runtime::buffer_cache, which is what's
panicking.

The problem is that tokio task shutdown does not provide values for
task-local variables, which is what stores the buffer cache.  This
commit avoids the problem by allowing the buffer cache to be
unavailable for Drop in ImmutableFileRef.  This requires some code
refactoring so that the cache fn, and Runtime::buffer_cache(),
return an Option.

Signed-off-by: Ben Pfaff <blp@feldera.com>
@blp blp enabled auto-merge April 3, 2026 15:32
@blp blp added this pull request to the merge queue Apr 3, 2026
Merged via the queue into main with commit 7d27972 Apr 3, 2026
1 check passed
@blp blp deleted the cell branch April 3, 2026 16:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

DBSP core Related to the core DBSP library performance rust Pull requests that update Rust code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants