Fix deferred index performance and contention failures by emmanuel-keller · Pull Request #7274 · surrealdb/surrealdb · GitHub

Fix deferred index performance and contention failures#7274

Open
emmanuel-keller wants to merge 5 commits into releases/2.x from
emmanuel/fix-defer-index-performance

Conversation

Collaborator

@emmanuel-keller emmanuel-keller commented Apr 17, 2026

Thank you for submitting this pull request. We really appreciate you spending the time to work on SurrealDB. 🚀 🎉

Before proceeding, please add any of the following labels that apply:

  • breaking-change if this PR includes a breaking change. This label is not needed for bug fixes, which change output in line with a user's original expectations.
  • needs-documentation if documentation is needed to explain the change made in the PR.
  • Modifies env vars or commands if any changes are made to environment variables or command flags.

What is the motivation?

A user reported two issues with deferred SEARCH index building:

  1. Exponential slowdown at scale — indexing speed per 1,000 records degrades as the table grows, making large indexes impractical. The rest of the database also slows down (likely from disk pressure).

  2. Contention failures / hanging — when records are modified during indexing, the deferred daemon either fails with "Failed to commit transaction due to a read or write conflict" or hangs indefinitely with unprocessed pending records.

Root causes identified

Slowdown: A new FtIndex was created and torn down for every single document during batch indexing (both initial build and appending phases). Each cycle loaded 4 BTree structures (DocIds, DocLengths, Postings, Terms) from storage and serialized them all back — with cost growing as the index grew. Additionally, set_status() was called per-record, acquiring status.write() + queue.read() RwLocks ~250 times per batch.

Contention: The deferred daemon was spawned before initial_build_complete was set to true. During this race window, maybe_consume would still write !ip keys (seeing stale false via Ordering::Relaxed) while the daemon was deleting the same !ip keys — causing write-write conflicts in RocksDB's optimistic transactions. The catch! macro then propagated the TxRetryable error through index_appending_loop to the daemon loop, which permanently killed the index with an Error status instead of retrying.

What does this change do?

Batch FtIndex work across records

  • Add IndexOperation::create_ft_index() / compute_search_with_ft() so an FtIndex can be reused across multiple documents within a batch.
  • In index_initial_batch and index_appending_range, create the FtIndex once before the record loop and call finish() once after, instead of per-document. This removes the repeated load/serialize of the DocIds, DocLengths, Postings and Terms BTrees that was causing the exponential slowdown.
  • Move set_status() calls from per-record to per-batch, eliminating ~250 status.write() + queue.read() RwLock acquisitions per batch.
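The per-batch restructuring above can be sketched as follows. This is a hedged illustration, not SurrealDB's actual API: `FtIndex::open`, `index_record`, and the load counter are hypothetical stand-ins for the real `create_ft_index()` / `compute_search_with_ft()` flow, used only to show why hoisting the open/finish cycle out of the record loop removes the per-document load/serialize cost.

```rust
// Hypothetical sketch: the expensive index handle is opened once per
// batch instead of once per record. Names mirror the PR description,
// not the real SurrealDB types.
struct FtIndex {
    loads: u32, // counts how many times the BTree state was loaded
}

impl FtIndex {
    fn open() -> Self {
        // In the real code this loads the DocIds, DocLengths, Postings
        // and Terms BTrees from storage -- the costly part.
        FtIndex { loads: 1 }
    }
    fn index_record(&mut self, _doc: &str) {}
    fn finish(self) -> u32 {
        // In the real code this serializes the BTrees back to storage.
        self.loads
    }
}

// Old behaviour: open + finish per document => one full
// load/serialize cycle for every record in the batch.
fn index_batch_before(docs: &[&str]) -> u32 {
    docs.iter()
        .map(|d| {
            let mut ft = FtIndex::open();
            ft.index_record(d);
            ft.finish()
        })
        .sum()
}

// New behaviour: one open and one finish() for the whole batch.
fn index_batch_after(docs: &[&str]) -> u32 {
    let mut ft = FtIndex::open();
    for d in docs {
        ft.index_record(d);
    }
    ft.finish()
}

fn main() {
    let docs = ["a", "b", "c", "d"];
    assert_eq!(index_batch_before(&docs), 4); // one cycle per record
    assert_eq!(index_batch_after(&docs), 1); // one cycle per batch
}
```

Because each load/serialize cycle grows with the index size, collapsing N cycles into one is what turns the superlinear slowdown back into roughly linear batch cost.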

Fix the !ip key race between the daemon and maybe_consume

  • Set deferred_daemon_running and drop InitialBuildGuard (which sets initial_build_complete) before spawning the daemon, matching the already-correct pattern in start_deferred_index. This closes the window where the daemon and maybe_consume could both write to the same !ip key.
  • Upgrade the initial_build_complete load in maybe_consume from Ordering::Relaxed to Ordering::Acquire, pairing with the Release store in InitialBuildGuard::drop.
  • Remove the now-redundant deferred_daemon_running store inside spawn_deferred_daemon (the flag is already set by the callers before spawning).
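The Acquire/Release pairing can be shown with a minimal sketch. The flag and comments are illustrative stand-ins for `initial_build_complete`, `InitialBuildGuard::drop`, and `maybe_consume`; the point is that the Release store must happen before the daemon thread is spawned, and the Acquire load in the consumer then observes it.

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

fn main() {
    let initial_build_complete = Arc::new(AtomicBool::new(false));

    // "InitialBuildGuard::drop": publish completion with Release BEFORE
    // the daemon is spawned, closing the window where both sides could
    // write the same !ip keys.
    initial_build_complete.store(true, Ordering::Release);

    let flag = Arc::clone(&initial_build_complete);
    let daemon = thread::spawn(move || {
        // "maybe_consume": the Acquire load pairs with the Release store
        // above, so all writes made before the store are visible here.
        flag.load(Ordering::Acquire)
    });

    assert!(daemon.join().unwrap());
}
```

With `Ordering::Relaxed`, the spawned daemon could in principle observe a stale `false` even after the store, which is exactly the race window the PR closes.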

Retry transient conflicts instead of killing the index

  • Replace the catch! in index_appending_loop with explicit error handling that retries TxRetryable (cancel tx, sleep 100 ms, re-read keys, retry batch) instead of propagating the error up to the daemon.
  • The daemon loop also treats TxRetryable as retryable rather than fatal, as a safety net.
  • Cancel the transaction on TxRetryable commit failure for proper cleanup.
  • On TxRetryable retry, snapshot and restore updates_count around each batch attempt so the same !ib keys being re-scanned and re-processed after a cancelled transaction don't double-count the updated field reported in BuildingStatus::Indexing / BuildingStatus::Ready.
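The retry shape replacing the `catch!` macro might look like the sketch below. The `Error` enum, closure-based batch, and loop structure are assumptions for illustration; the real code also cancels the transaction, sleeps ~100 ms, and re-reads the pending keys before retrying.

```rust
// Hedged sketch of retrying transient conflicts instead of
// propagating them up to the daemon loop.
#[allow(dead_code)]
#[derive(Debug)]
enum Error {
    TxRetryable,
    Fatal(String),
}

fn index_appending_loop<F>(mut attempt_batch: F) -> Result<(), Error>
where
    F: FnMut() -> Result<(), Error>,
{
    loop {
        match attempt_batch() {
            Ok(()) => return Ok(()),
            Err(Error::TxRetryable) => {
                // Real code: cancel the tx, sleep 100 ms, re-read the
                // pending !ib keys, then retry the batch.
                continue;
            }
            // Only genuinely fatal errors still reach the daemon,
            // which previously killed the index on ANY error.
            Err(e) => return Err(e),
        }
    }
}

fn main() {
    let mut conflicts = 2; // simulate two transient write-write conflicts
    let result = index_appending_loop(|| {
        if conflicts > 0 {
            conflicts -= 1;
            Err(Error::TxRetryable)
        } else {
            Ok(())
        }
    });
    assert!(result.is_ok());
    assert_eq!(conflicts, 0); // both conflicts were absorbed by retries
}
```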

Propagate FtIndex finish errors on abort

  • In the abort branch of index_appending_range, propagate ft.finish() errors with ? (matching index_initial_batch) so the caller cancels the transaction. The earlier best-effort (log-and-continue) behaviour could persist the staged !ip/!ib deletions for already-processed records without the corresponding FtIndex BTree state, permanently dequeuing those records with missing search data.
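The behavioural difference between the two abort styles can be sketched with hypothetical helpers (`finish`, `abort_best_effort`, `abort_propagating` are illustrative names, not the real functions). The caller commits on `Ok` and cancels on `Err`, so swallowing the `finish()` error is what let inconsistent state be committed.

```rust
#[derive(Debug)]
struct FinishError;

// Simulate a failing FtIndex::finish() during an abort.
fn finish() -> Result<(), FinishError> {
    Err(FinishError)
}

// Old best-effort behaviour: log the failure and return Ok, so the
// caller commits the staged !ip/!ib deletions anyway.
fn abort_best_effort() -> Result<(), FinishError> {
    if let Err(e) = finish() {
        eprintln!("finish failed during abort: {e:?}"); // swallowed
    }
    Ok(())
}

// Fixed behaviour: propagate with `?` so the caller cancels the
// transaction, matching index_initial_batch.
fn abort_propagating() -> Result<(), FinishError> {
    finish()?;
    Ok(())
}

fn main() {
    assert!(abort_best_effort().is_ok()); // caller commits: records dequeued without index data
    assert!(abort_propagating().is_err()); // caller cancels: queue state stays consistent
}
```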

What is your testing strategy?

  • All 6 existing deferindex integration tests pass (including deferred_index_survives_restart, insert_parallel_full_text, multi_index_concurrent_test_create_update_delete).
  • All 40 idx::ft:: unit tests pass.
  • Clippy clean, no warnings.
  • User's reproduction test (defer-test/defer-indexing-performance.ts) can be used for end-to-end validation at scale.

Security Considerations

No security implications — these are performance and correctness fixes to background index building. No changes to authentication, authorization, data isolation, user input handling, or error reporting.

Is this related to any issues?

  • No related issues

Have you read the Contributing Guidelines?

Previously, a new FtIndex was created and torn down for every single
document during batch indexing (both initial build and appending phases).
Each cycle loaded 4 BTree structures from storage and serialized them
back, with cost growing as the index grew -- causing exponential slowdown
at scale.

Now the FtIndex is created once per batch and reused across all records,
with a single finish() call at the end. Status updates are also reduced
from per-record to per-batch, eliminating ~250 RwLock acquisitions per
batch.
…dexing

The deferred daemon was spawned before initial_build_complete was set to
true. During this window, maybe_consume would still write !ip keys while
the daemon was deleting them, causing write-write conflicts on the same
key in RocksDB's optimistic transactions.

Fix the race by setting deferred_daemon_running and dropping the
InitialBuildGuard (which sets initial_build_complete) BEFORE spawning
the daemon. Also upgrade the Relaxed load in maybe_consume to Acquire
to properly synchronize with the Release store.

Additionally, replace the catch! macro in index_appending_loop with
explicit error handling that retries on TxRetryable instead of
propagating the error. The daemon loop also retries transient errors
as a safety net, rather than permanently killing the index.
@emmanuel-keller emmanuel-keller changed the title Batch FtIndex across records during deferred index building Fix deferred index performance and contention failures Apr 17, 2026
Remove redundant deferred_daemon_running store inside spawn_deferred_daemon
(already set by callers before spawning). Cancel transaction on TxRetryable
commit failure for proper cleanup. Make FtIndex finish best-effort on abort
in index_appending_range to avoid masking the abort.
@emmanuel-keller emmanuel-keller marked this pull request as ready for review April 17, 2026 19:32
@emmanuel-keller emmanuel-keller requested review from a team as code owners April 17, 2026 19:32

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d3ec9b4dc0


Comment thread crates/core/src/kvs/index.rs Outdated

@claude claude Bot left a comment


This PR addresses real correctness and performance issues, but it's non-trivial — touching atomic memory ordering, transaction retry semantics, and a race condition in deferred index initialization. Worth a human look before merging.


### Overview

The PR modifies two files in the core indexing subsystem. Changes span three commits: (1) batching FtIndex creation across records within a batch instead of per-document, (2) fixing a race between deferred daemon spawn and initial_build_complete flag visibility, and (3) cleanup/retry handling.

### Security Risks

No security-sensitive code is touched — no auth, crypto, or permissions. Risk is limited to correctness of background index building.

### Level of Scrutiny

This warrants careful human review. The changes involve subtle atomic ordering decisions (upgrading Ordering::Relaxed to Acquire/Release at key synchronization points), transaction retry loops with manual cancel/sleep/retry semantics, and a race-condition fix that depends on correct ordering of flag stores relative to daemon spawning. The bug reports also flag an asymmetry between index_initial_batch and index_appending_range in how ft.finish() errors are handled on abort, and a potential (narrow but real) scenario where a failed ft.finish() on abort could allow partial data to be committed without the corresponding FtIndex state.

### Other Factors

The PR is well-documented with clear commit messages and a thorough description. All existing integration and unit tests are reported passing. The bugs found by the hunting system are real issues — the abort-path inconsistency between the two batch functions is a genuine correctness concern that should be addressed before merge.

Comment thread crates/core/src/kvs/index.rs
Comment thread crates/core/src/kvs/index.rs
Comment thread crates/core/src/kvs/index.rs
Previously the abort branch called ft.finish() best-effort (logging a
warning and returning Ok on failure). Because the caller commits the
transaction on Ok, a failed finish could persist staged !ip/!ib
deletions for already-processed records without the corresponding
FtIndex BTree state, permanently dequeuing those records with missing
search data.

Propagate the error with `?` so the caller cancels the transaction,
matching the existing behaviour in index_initial_batch.
When a batch failed with TxRetryable (either from index_appending_range
itself or from tx.commit()), the transaction was cancelled but
updates_count kept the increments applied during the failed attempt.
Because tx.cancel() also rolled back the tx.del(ib) calls, the same
!ib keys were re-read and re-processed on the next iteration, so
updates_count was incremented a second time for the same records,
inflating the updated field reported in BuildingStatus::Indexing and
BuildingStatus::Ready.

Snapshot updates_count before each batch attempt and restore it in
both TxRetryable arms. Index data and queue cleanup were unaffected;
this is a reporting-only fix.
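The snapshot/restore described in this commit can be sketched with a minimal, hypothetical `process_batch` (the name and the simulated-conflict flag are illustrative, not the real API). The first attempt increments the counter and then fails with a retryable conflict; restoring the snapshot prevents the retried records from being counted twice.

```rust
// Reporting-only sketch: snapshot updates_count before each attempt
// and restore it when the attempt fails with a retryable conflict.
fn process_batch(updates_count: &mut u32, conflict: bool) -> Result<(), &'static str> {
    *updates_count += 5; // increments applied during the attempt
    if conflict { Err("TxRetryable") } else { Ok(()) }
}

fn main() {
    let mut updates_count = 0u32;
    let mut conflict = true; // first attempt hits a write-write conflict
    loop {
        let snapshot = updates_count; // taken before each batch attempt
        match process_batch(&mut updates_count, conflict) {
            Ok(()) => break,
            Err("TxRetryable") => {
                // tx.cancel() rolled back the tx.del(ib) calls, so the
                // same records will be re-read and re-processed; restore
                // the counter so they aren't counted twice.
                updates_count = snapshot;
                conflict = false;
            }
            Err(_) => unreachable!(),
        }
    }
    assert_eq!(updates_count, 5); // not 10: the failed attempt is undone
}
```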
