perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering; by wicyn · Pull Request #795 · taskflow/taskflow · GitHub
Skip to content

perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering;#795

Closed
wicyn wants to merge 1 commit into
taskflow:masterfrom
wicyn:master
Closed

perf(wsq): remove UnboundedWSQ::pop() and relax steal() ordering;#795
wicyn wants to merge 1 commit into
taskflow:masterfrom
wicyn:master

Conversation

@wicyn

@wicyn wicyn commented Jun 15, 2026

Copy link
Copy Markdown
Contributor
       downgrade BoundedWSQ::steal() top load and pop() CAS to relaxed/acq_rel

UnboundedWSQ is used only as the executor's overflow region: the owner only pushes and thieves only steal — the owner never pops. With pop() gone as dead code, the cross-variable Dekker mutual-exclusion structure between top and bottom disappears, and that structure was the sole reason steal() needed a seq_cst fence and a seq_cst CAS. The downgrades follow from this:

UnboundedWSQ::steal()

  • Remove the seq_cst fence: it existed only to pair with pop()'s fence in the single total order S and arbitrate the "last element" race. With no pop(), the fence has nothing to pair with.
  • CAS success order seq_cst -> acq_rel: the only remaining contention is thief-vs-thief on the single variable top. CAS RMW atomicity guarantees exactly one thief wins, and release/acquire is sufficient to chain top's modification order; no cross-variable total order is required.
  • _top.load acquire -> relaxed: top increases monotonically, so relaxed can at worst read a smaller, stale value, making the t<b check conservative (one missed steal, never out of bounds); the CAS then re-validates against the latest top and returns empty on failure. Slot-data visibility is carried by _bottom.load(acquire) paired with push()'s release, independent of how top is read.

BoundedWSQ::steal()

  • Only _top.load acquire -> relaxed. BoundedWSQ still has pop(), so the Dekker structure remains and both the seq_cst fence and the seq_cst CAS are kept unchanged. Relaxing the top load is safe: the seq_cst fence immediately following it is a full barrier that already prevents the top load from being reordered with the subsequent bottom load and slot load — the LoadLoad reordering that acquire was meant to block is already covered by the fence. Slot visibility likewise depends on bottom's acquire chain, not on how top is read.

BoundedWSQ::pop()

  • CAS success order seq_cst -> acq_rel. The seq_cst fence in pop() is retained, and so is steal()'s seq_cst fence and steal()'s seq_cst CAS. The argument that pop()'s CAS need not enter S: * The Dekker / SB exclusion that prevents fast-path double-take is carried entirely by the two seq_cst fences plus steal()'s seq_cst CAS anchor. pop()'s own side is cleanly SB-shaped (store bottom; seq_cst fence; load top), so its fence does the SB duty on its own — pop()'s CAS is not the anchor for either fence. * pop()'s CAS executes only on the t==b (last-element) path, i.e. never on the fast path. That path is single-variable contention: owner-CAS vs thief-CAS both move top from t to t+1, and RMW atomicity guarantees exactly one winner regardless of memory order. acq_rel provides the acquire needed to observe the loser/winner state and the release to publish the claim; entry into the global total order S is not required. Net effect: pop() keeps its seq_cst fence (load-bearing) but drops one seq_cst RMW on the cold last-element path.

Tests (wsq_test.cpp)

  • UnboundedWSQ tests converted to steal-only; all UnboundedWSQ::pop() calls removed. BoundedWSQ tests are untouched (BoundedWSQ still has pop()).
  • UnboundedWSQ.Resize: drain loop pop() -> steal().
  • UnboundedWSQ.Owner / ValueType: dropped the LIFO pop() verification; the pre-existing steal() FIFO check now covers the single-thread path. Removed the now-duplicated second push round.
  • UnboundedWSQ.*Consumers / *.BulkPush: dropped the "master/bulk push and pop" LIFO blocks; in the concurrent phase the owner only pushes and waits (while(consumed != N) yield()) for thieves to drain, since a steal-only owner cannot pop. items is now collected solely from stolens, and the trailing pop()==nullptr assertion is removed (steal()==nullptr kept).

           downgrade BoundedWSQ::steal() top load and pop() CAS to relaxed/acq_rel

UnboundedWSQ is used only as the executor's overflow region: the owner only
pushes and thieves only steal — the owner never pops. With pop() gone as dead
code, the cross-variable Dekker mutual-exclusion structure between top and
bottom disappears, and that structure was the sole reason steal() needed a
seq_cst fence and a seq_cst CAS. The downgrades follow from this:

UnboundedWSQ::steal()
  - Remove the seq_cst fence: it existed only to pair with pop()'s fence in the
    single total order S and arbitrate the "last element" race. With no pop(),
    the fence has nothing to pair with.
  - CAS success order seq_cst -> acq_rel: the only remaining contention is
    thief-vs-thief on the single variable top. CAS RMW atomicity guarantees
    exactly one thief wins, and release/acquire is sufficient to chain top's
    modification order; no cross-variable total order is required.
  - _top.load acquire -> relaxed: top increases monotonically, so relaxed can
    at worst read a smaller, stale value, making the t<b check conservative
    (one missed steal, never out of bounds); the CAS then re-validates against
    the latest top and returns empty on failure. Slot-data visibility is
    carried by _bottom.load(acquire) paired with push()'s release, independent
    of how top is read.

BoundedWSQ::steal()
  - Only _top.load acquire -> relaxed. BoundedWSQ still has pop(), so the Dekker
    structure remains and both the seq_cst fence and the seq_cst CAS are kept
    unchanged. Relaxing the top load is safe: the seq_cst fence immediately
    following it is a full barrier that already prevents the top load from being
    reordered with the subsequent bottom load and slot load — the LoadLoad
    reordering that acquire was meant to block is already covered by the fence.
    Slot visibility likewise depends on bottom's acquire chain, not on how top
    is read.

BoundedWSQ::pop()
  - CAS success order seq_cst -> acq_rel. The seq_cst fence in pop() is retained,
    and so is steal()'s seq_cst fence and steal()'s seq_cst CAS. The argument
    that pop()'s CAS need not enter S:
      * The Dekker / SB exclusion that prevents fast-path double-take is carried
        entirely by the two seq_cst fences plus steal()'s seq_cst CAS anchor.
        pop()'s own side is cleanly SB-shaped (store bottom; seq_cst fence;
        load top), so its fence does the SB duty on its own — pop()'s CAS is not
        the anchor for either fence.
      * pop()'s CAS executes only on the t==b (last-element) path, i.e. never on
        the fast path. That path is single-variable contention: owner-CAS vs
        thief-CAS both move top from t to t+1, and RMW atomicity guarantees
        exactly one winner regardless of memory order. acq_rel provides the
        acquire needed to observe the loser/winner state and the release to
        publish the claim; entry into the global total order S is not required.
    Net effect: pop() keeps its seq_cst fence (load-bearing) but drops one
    seq_cst RMW on the cold last-element path.

Tests (wsq_test.cpp)
  - UnboundedWSQ tests converted to steal-only; all UnboundedWSQ::pop() calls
    removed. BoundedWSQ tests are untouched (BoundedWSQ still has pop()).
  - UnboundedWSQ.Resize: drain loop pop() -> steal().
  - UnboundedWSQ.Owner / ValueType: dropped the LIFO pop() verification; the
    pre-existing steal() FIFO check now covers the single-thread path. Removed
    the now-duplicated second push round.
  - UnboundedWSQ.*Consumers / *.BulkPush: dropped the "master/bulk push and pop"
    LIFO blocks; in the concurrent phase the owner only pushes and waits
    (while(consumed != N) yield()) for thieves to drain, since a steal-only
    owner cannot pop. `items` is now collected solely from `stolens`, and the
    trailing pop()==nullptr assertion is removed (steal()==nullptr kept).
@wicyn wicyn closed this Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant