{{ message }}
Fix counter underflow race between semaphore fast-path CAS and slow-path fetch_sub#769
Open
Mattbusel wants to merge 1 commit into
Open
Fix counter underflow race between semaphore fast-path CAS and slow-path fetch_sub#769Mattbusel wants to merge 1 commit into
Mattbusel wants to merge 1 commit into
Conversation
…t counter underflow
While stress-testing the semaphore under high contention, I identified a
race between the lock-free fast path and the mutex-protected slow path in
_try_acquire_or_wait that can silently corrupt the token counter.
The race
When _cur_value is 1, two threads can both observe count > 0 before
either has decremented it:
Thread A (fast path): loads cur = 1, begins CAS loop
Thread B (slow path): fast-path loop exits with cur = 0 (spurious),
acquires _mtx, loads cur = 1 with relaxed ordering,
calls fetch_sub(1, relaxed)
If Thread A's CAS and Thread B's fetch_sub both succeed on the same
count-1 slot, the counter wraps to SIZE_MAX (unsigned underflow). Both
threads return true from _try_acquire_or_wait. Both tasks proceed past
the semaphore simultaneously. The concurrency limit is violated.
A secondary issue: the slow-path reload used memory_order_relaxed, so it
could read a stale value that did not reflect a concurrent fast-path
decrement that had already happened. This made the window for the above
race larger than it needed to be.
The fix
Replace fetch_sub(1, relaxed) in the slow path with a compare_exchange_weak
loop, matching the structure of the fast path. The CAS atomically validates
that the value has not changed since the load. If a concurrent fast-path CAS
already claimed the slot, the slow-path CAS fails, reloads the current count,
and either retries (if still > 0) or parks in _waiters (if now 0). Underflow
is structurally impossible.
Change the slow-path reload to memory_order_acquire so it synchronizes with
any fast-path decrement or _release increment that completed before the mutex
was taken.
Change the fast-path CAS success ordering from memory_order_acquire to
memory_order_acq_rel. The release half is required so that the decrement
store is visible to acquire-loads on other threads. Without it, a releaser
that does an acquire-load on the counter could read a stale value and
incorrectly conclude no token was taken.
Move the _release increment inside the mutex. The previous design incremented
the counter with a lock-free CAS before acquiring _mtx to drain _waiters.
This creates a window where a fast-path acquire steals the new token, and
_release then also wakes all parked waiters. Both the fast-path acquirer and
the rescheduled waiters hold the semaphore simultaneously. Incrementing inside
the lock closes this window: a fast-path acquire cannot see the new count
until _release has finished draining _waiters, so the token goes to exactly
one side.
The acquire fast path remains fully lock-free. The mutex is taken only when
the count is exhausted, preserving the intent of the original design. The
_release path now matches the semantics: one mutex acquisition handles both
the increment and the drain atomically.
Testing
All 34 existing semaphore tests pass (80388 assertions), including the
high-thread-count critical section and linear chain tests that expose the
contention patterns most likely to trigger this race.
Member
Contributor
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

While stress-testing the semaphore under high contention, I identified a race between the lock-free fast path and the mutex-protected slow path in
_try_acquire_or_waitthat silently corrupts the token counter.The race
When
_cur_valueis 1, two threads can both observe count > 0 before either has decremented it:If Thread A's CAS and Thread B's
fetch_subboth succeed on the same count-1 slot, the counter wraps toSIZE_MAX(unsigned underflow). Both threads returntrue. Both tasks run past the semaphore simultaneously. The concurrency limit is violated. In the test suite this shows up ascounter > Nin the critical section test under 2+ workers.A secondary issue: the slow-path reload used
memory_order_relaxed, so it could read a value that did not reflect a concurrent fast-path decrement that had already committed. This widened the race window unnecessarily.The fix
Replace
fetch_sub(1, relaxed)in the slow path with acompare_exchange_weakloop, matching the fast path. The CAS atomically validates that the value has not changed since the load. If a concurrent fast-path CAS already claimed the slot, the slow-path CAS fails, reloads, and either retries (count still > 0) or parks in_waiters(count now 0). Underflow is structurally impossible with this approach.The slow-path reload is changed to
memory_order_acquireto synchronize with any fast-path decrement or_releaseincrement that completed before the mutex was taken.The fast-path CAS success ordering is changed from
memory_order_acquiretomemory_order_acq_rel. The release half is necessary so that the decrement store is visible to acquire-loads on other threads. Without it a releaser reading the counter with an acquire-load could observe a stale value.The
_releaseincrement is moved inside the mutex. Previously the counter was incremented with a lock-free CAS before_mtxwas acquired to drain_waiters. This opened a window where a fast-path acquire stole the new token, then_releasealso woke all parked waiters, giving both sides a concurrent hold on the semaphore. Incrementing inside the lock closes that window: no fast-path acquire can see the new count until the drain completes.The acquire fast path remains lock-free. The mutex is only taken when the count is exhausted.
Testing
All 34 existing semaphore tests pass (80388 assertions), including the high-thread-count critical section tests and the linear chain tests that accumulate large waiter lists and expose the contention patterns most likely to trigger this race.