Fix deferred index performance and contention failures #7274
Open
emmanuel-keller wants to merge 5 commits into releases/2.x from
Conversation
Previously, a new FtIndex was created and torn down for every single document during batch indexing (both initial build and appending phases). Each cycle loaded 4 BTree structures from storage and serialized them back, with cost growing as the index grew -- causing exponential slowdown at scale. Now the FtIndex is created once per batch and reused across all records, with a single finish() call at the end. Status updates are also reduced from per-record to per-batch, eliminating ~250 RwLock acquisitions per batch.
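The cost difference can be sketched with a toy model (the `FtIndex`, `open()` and `finish()` below are illustrative stand-ins, not SurrealDB's actual API):

```rust
// Toy model: each open() loads 4 BTrees, each finish() serializes 4 back.
struct FtIndex;

impl FtIndex {
    fn open(io_ops: &mut usize) -> Self {
        *io_ops += 4; // DocIds, DocLengths, Postings, Terms loaded
        FtIndex
    }
    fn index_document(&mut self, _doc: &str) {}
    fn finish(self, io_ops: &mut usize) {
        *io_ops += 4; // the four BTrees serialized back to storage
    }
}

// Before: a fresh FtIndex per document (8 storage round-trips each).
fn index_batch_per_document(docs: &[&str]) -> usize {
    let mut io_ops = 0;
    for doc in docs {
        let mut ft = FtIndex::open(&mut io_ops);
        ft.index_document(doc);
        ft.finish(&mut io_ops);
    }
    io_ops
}

// After: one FtIndex per batch, a single finish() at the end.
fn index_batch_reused(docs: &[&str]) -> usize {
    let mut io_ops = 0;
    let mut ft = FtIndex::open(&mut io_ops);
    for doc in docs {
        ft.index_document(doc);
    }
    ft.finish(&mut io_ops);
    io_ops
}

fn main() {
    let docs = vec!["doc"; 250];
    println!("per-document: {}", index_batch_per_document(&docs)); // 2000
    println!("per-batch:    {}", index_batch_reused(&docs)); // 8
}
```

In this toy model, a 250-record batch drops from 2000 storage operations to 8, and the per-document cost no longer grows with index size.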
…dexing

The deferred daemon was spawned before initial_build_complete was set to true. During this window, maybe_consume would still write !ip keys while the daemon was deleting them, causing write-write conflicts on the same key in RocksDB's optimistic transactions. Fix the race by setting deferred_daemon_running and dropping the InitialBuildGuard (which sets initial_build_complete) BEFORE spawning the daemon. Also upgrade the Relaxed load in maybe_consume to Acquire to properly synchronize with the Release store. Additionally, replace the catch! macro in index_appending_loop with explicit error handling that retries on TxRetryable instead of propagating the error. The daemon loop also retries transient errors as a safety net, rather than permanently killing the index.
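The ordering fix can be illustrated with std atomics (the field and function names below are illustrative, not the actual SurrealDB symbols):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Publish initial-build completion with Release BEFORE the daemon is
// spawned, so a reader pairing with Acquire is guaranteed to observe
// `true` along with everything written before the store.
fn spawn_after_publish() -> bool {
    let initial_build_complete = Arc::new(AtomicBool::new(false));

    // Equivalent of dropping the InitialBuildGuard before spawning:
    initial_build_complete.store(true, Ordering::Release);

    let flag = Arc::clone(&initial_build_complete);
    let daemon = thread::spawn(move || {
        // maybe_consume's check: Acquire pairs with the Release store above.
        // In the real code the reader is already running, and a Relaxed load
        // there could observe a stale `false`, so it would keep writing !ip
        // keys while the daemon deletes them.
        flag.load(Ordering::Acquire)
    });
    daemon.join().unwrap()
}

fn main() {
    assert!(spawn_after_publish());
    println!("daemon observed initial_build_complete = true");
}
```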
Remove redundant deferred_daemon_running store inside spawn_deferred_daemon (already set by callers before spawning). Cancel transaction on TxRetryable commit failure for proper cleanup. Make FtIndex finish best-effort on abort in index_appending_range to avoid masking the abort.
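The retry-instead-of-fail pattern looks roughly like this (the error variants and the simulated conflict are illustrative):

```rust
use std::thread::sleep;
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum TxError {
    Retryable, // transient read/write conflict
    Fatal,
}

// Simulate a batch that hits a write-write conflict on its first attempt.
fn try_batch(attempt: u32) -> Result<(), TxError> {
    if attempt == 0 { Err(TxError::Retryable) } else { Ok(()) }
}

// Retry TxRetryable (cancel tx, back off, re-read keys, retry the batch)
// instead of propagating it to the daemon and killing the index.
fn index_appending_loop() -> Result<u32, TxError> {
    let mut attempt = 0;
    loop {
        match try_batch(attempt) {
            Ok(()) => return Ok(attempt),
            Err(TxError::Retryable) => {
                // In the real code: cancel the transaction here, then
                // re-read the pending keys before the next attempt.
                sleep(Duration::from_millis(100));
                attempt += 1;
            }
            Err(e) => return Err(e), // genuine failures still propagate
        }
    }
}

fn main() {
    assert_eq!(index_appending_loop(), Ok(1));
    println!("batch succeeded after 1 retry");
}
```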
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d3ec9b4dc0
This PR addresses real correctness and performance issues, but it's non-trivial — touching atomic memory ordering, transaction retry semantics, and a race condition in deferred index initialization. Worth a human look before merging.
Extended reasoning

### Overview
The PR modifies two files in the core indexing subsystem: and . Changes span three commits: (1) batching FtIndex creation across records within a batch instead of per-document, (2) fixing a race between the deferred daemon spawn and initial_build_complete flag visibility, and (3) cleanup/retry handling.

### Security Risks
No security-sensitive code is touched — no auth, crypto, or permissions. Risk is limited to correctness of background index building.

### Level of Scrutiny
This warrants careful human review. The changes involve subtle atomic ordering decisions (upgrading Ordering::Relaxed to Acquire/Release at key synchronization points), transaction retry loops with manual cancel/sleep/retry semantics, and a race-condition fix that depends on correct ordering of flag stores relative to daemon spawning. The bug reports also flag an asymmetry between index_initial_batch and index_appending_range in how ft.finish() errors are handled on abort, and a potential (narrow but real) scenario where a failed ft.finish() on abort could allow partial data to be committed without the corresponding FtIndex state.

### Other Factors
The PR is well-documented with clear commit messages and a thorough description. All existing integration and unit tests are reported passing. The bugs found by the hunting system are real issues — the abort-path inconsistency between the two batch functions is a genuine correctness concern that should be addressed before merge.
Previously the abort branch called ft.finish() best-effort (logging a warning and returning Ok on failure). Because the caller commits the transaction on Ok, a failed finish could persist staged !ip/!ib deletions for already-processed records without the corresponding FtIndex BTree state, permanently dequeuing those records with missing search data. Propagate the error with `?` so the caller cancels the transaction, matching the existing behaviour in index_initial_batch.
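A minimal sketch of the behavioural difference (the error type and function names are illustrative):

```rust
#[derive(Debug, PartialEq)]
struct FinishError;

fn ft_finish(fails: bool) -> Result<(), FinishError> {
    if fails { Err(FinishError) } else { Ok(()) }
}

// Before: best-effort. A failed finish() was logged and swallowed, so the
// caller saw Ok and committed the staged !ip/!ib deletions anyway.
fn abort_best_effort(finish_fails: bool) -> Result<(), FinishError> {
    if let Err(e) = ft_finish(finish_fails) {
        eprintln!("finish failed on abort (ignored): {e:?}");
    }
    Ok(())
}

// After: propagate with `?` so the caller cancels the transaction,
// matching index_initial_batch.
fn abort_propagating(finish_fails: bool) -> Result<(), FinishError> {
    ft_finish(finish_fails)?;
    Ok(())
}

fn main() {
    assert_eq!(abort_best_effort(true), Ok(())); // commit would proceed
    assert_eq!(abort_propagating(true), Err(FinishError)); // tx cancelled
}
```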
When a batch failed with TxRetryable (either from index_appending_range itself or from tx.commit()), the transaction was cancelled but updates_count kept the increments applied during the failed attempt. Because tx.cancel() also rolled back the tx.del(ib) calls, the same !ib keys were re-read and re-processed on the next iteration, so updates_count was incremented a second time for the same records - inflating the updated field reported in BuildingStatus::Indexing and BuildingStatus::Ready. Snapshot updates_count before each batch attempt and restore it in both TxRetryable arms. Index data and queue cleanup were unaffected; this is a reporting-only fix.
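The counter fix follows a snapshot/restore pattern around each attempt (the batch simulation below is illustrative):

```rust
#[derive(Debug)]
enum TxError { Retryable }

// Simulate a batch of `records` that fails once with TxRetryable; the
// cancelled transaction rolls back the !ib deletions, so the same records
// are re-read and re-processed on the next attempt.
fn run_batch(records: usize, counter: &mut usize, fail_once: &mut bool) -> Result<(), TxError> {
    *counter += records; // increments applied while processing the batch
    if *fail_once {
        *fail_once = false;
        return Err(TxError::Retryable);
    }
    Ok(())
}

fn index_with_retry(records: usize) -> usize {
    let mut updates_count = 0;
    let mut fail_once = true;
    loop {
        let snapshot = updates_count; // taken before each attempt
        match run_batch(records, &mut updates_count, &mut fail_once) {
            Ok(()) => return updates_count,
            Err(TxError::Retryable) => {
                // Restore: the attempt rolled back, so its increments
                // must not survive into the reported `updated` count.
                updates_count = snapshot;
            }
        }
    }
}

fn main() {
    // Without the restore this would report 500 for 250 records.
    assert_eq!(index_with_retry(250), 250);
}
```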

Thank you for submitting this pull request. We really appreciate you spending the time to work on SurrealDB. 🚀 🎉
Before proceeding, please add any of the following labels that apply:
- breaking-change if this PR includes a breaking change. This label is not needed for bug fixes, which change output in line with a user's original expectations.
- needs-documentation if documentation is needed to explain the change made in the PR.
- Modifies env vars or commands if any changes are made to environment variables or command flags.

What is the motivation?
A user reported two issues with deferred SEARCH index building:

- Exponential slowdown at scale — indexing speed per 1,000 records degrades as the table grows, making large indexes impractical. The rest of the database also slows down (likely from disk pressure).
- Contention failures / hanging — when records are modified during indexing, the deferred daemon either fails with "Failed to commit transaction due to a read or write conflict" or hangs indefinitely with unprocessed pending records.
Root causes identified

- Slowdown: A new FtIndex was created and torn down for every single document during batch indexing (both initial build and appending phases). Each cycle loaded 4 BTree structures (DocIds, DocLengths, Postings, Terms) from storage and serialized them all back — with cost growing as the index grew. Additionally, set_status() was called per-record, acquiring status.write() + queue.read() RwLocks ~250 times per batch.
- Contention: The deferred daemon was spawned before initial_build_complete was set to true. During this race window, maybe_consume would still write !ip keys (seeing stale false via Ordering::Relaxed) while the daemon was deleting the same !ip keys — causing write-write conflicts in RocksDB's optimistic transactions. The catch! macro then propagated the TxRetryable error through index_appending_loop to the daemon loop, which permanently killed the index with an Error status instead of retrying.

What does this change do?
Batch FtIndex work across records

- Added IndexOperation::create_ft_index() / compute_search_with_ft() so an FtIndex can be reused across multiple documents within a batch.
- In index_initial_batch and index_appending_range, create the FtIndex once before the record loop and call finish() once after, instead of per-document. This removes the repeated load/serialize of the DocIds, DocLengths, Postings and Terms BTrees that was causing the exponential slowdown.
- Reduced set_status() calls from per-record to per-batch, eliminating ~250 status.write() + queue.read() RwLock acquisitions per batch.

Fix the !ip key race between the daemon and maybe_consume

- Set deferred_daemon_running and drop InitialBuildGuard (which sets initial_build_complete) before spawning the daemon, matching the already-correct pattern in start_deferred_index. This closes the window where the daemon and maybe_consume could both write to the same !ip key.
- Upgraded the initial_build_complete load in maybe_consume from Ordering::Relaxed to Ordering::Acquire, pairing with the Release store in InitialBuildGuard::drop.
- Removed the redundant deferred_daemon_running store inside spawn_deferred_daemon (the flag is already set by the callers before spawning).

Retry transient conflicts instead of killing the index

- Replaced catch! in index_appending_loop with explicit error handling that retries TxRetryable (cancel tx, sleep 100 ms, re-read keys, retry batch) instead of propagating the error up to the daemon.
- The daemon loop now treats TxRetryable as retryable rather than fatal, as a safety net.
- Cancel the transaction on TxRetryable commit failure for proper cleanup.
- On TxRetryable retry, snapshot and restore updates_count around each batch attempt so the same !ib keys being re-scanned and re-processed after a cancelled transaction don't double-count the updated field reported in BuildingStatus::Indexing / BuildingStatus::Ready.

Propagate FtIndex finish errors on abort

- In index_appending_range, propagate ft.finish() errors with ? (matching index_initial_batch) so the caller cancels the transaction. The earlier best-effort (log-and-continue) behaviour could persist the staged !ip/!ib deletions for already-processed records without the corresponding FtIndex BTree state, permanently dequeuing those records with missing search data.

What is your testing strategy?
- The deferindex integration tests pass (deferred_index_survives_restart, insert_parallel_full_text, multi_index_concurrent_test_create_update_delete).
- The idx::ft:: unit tests pass.
- A reproduction script (defer-test/defer-indexing-performance.ts) can be used for end-to-end validation at scale.
No security implications — these are performance and correctness fixes to background index building. No changes to authentication, authorization, data isolation, user input handling, or error reporting.
Is this related to any issues?
Have you read the Contributing Guidelines?