Netdata fixes part 25 by stelfrag · Pull Request #22958 · netdata/netdata · GitHub
Skip to content

Netdata fixes part 25#22958

Draft
stelfrag wants to merge 4 commits into
netdata:masterfrom
stelfrag:netdata-fixes-25
Draft

Netdata fixes part 25#22958
stelfrag wants to merge 4 commits into
netdata:masterfrom
stelfrag:netdata-fixes-25

Conversation

@stelfrag

@stelfrag stelfrag commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator
Summary

Summary by cubic

Fixes several data races and a Windows sleep issue across ACLK MQTT, capabilities, and WebRTC. Improves thread safety and prevents high-CPU spins, with no functional changes.

  • Bug Fixes
    • ACLK capabilities: make the capability array thread‑local to avoid concurrent writes; API and output unchanged.
    • MQTT: keep PINGREQ fragment per client instead of a global; add a stable fragment pointer so buffer compaction/growth doesn’t invalidate it; preserve stats and latency accounting.
    • Windows: make tinysleep() use Sleep(1) instead of Sleep(0) to actually sleep; update comment to match behavior.
    • WebRTC: store open = true with a relaxed atomic to match readers/writers and remove the data race.

Written for commit 8819ec0. Summary will update on new commits.

Review in cubic

ktsaou added 4 commits July 1, 2026 14:56
aclk_get_agent_capas() returned a borrowed pointer to a process-static mutable capability array and rewrote dynamic entries on every call. Concurrent ACLK serializers could interleave those writes with reads of the same array, creating a current source-level data race in capability publication.

Use thread-local storage for the existing static array. This preserves the borrowed-pointer API and keeps the same capability order, sentinel, dynamic value assignments, and protobuf serialization while giving each calling thread its own mutable copy.

Functional impact: intended output is unchanged. Callers still receive a borrowed pointer, serializers still consume it synchronously, and no ownership contract changes.

Performance impact: replaces process-static access with TLS access and adds no heap allocation, lock, loop, or serialization work.

Security impact: removes an unsynchronized shared-memory race during ACLK capability serialization.

Validation: git diff --check -- src/aclk/aclk_capas.c; cmake --build /tmp/netdata-mem-corruptions-pr-comments-debugfs-build --target src/aclk/aclk_capas.o -j 4.

Reviewer gate: required reviewers approved the final one-file diff.

(cherry picked from commit 1a20d01)
On Windows, tinysleep used Sleep(0), which only yields the remainder of the current time slice and leaves the waiter runnable. Current impact is that object_state_deactivate() and refcount deletion waits can spin at high CPU while waiting for holder threads; potential impact is the same scheduler starvation in any tinysleep wait loop.

Use Sleep(1) for Windows tinysleep so waiters become non-runnable briefly, matching the helper's sleep semantics and the POSIX nanosleep branch. Leave yield_the_processor() as Sleep(0) for code that explicitly wants a yield.

Functional impact: Windows tinysleep callers now wait for a real short sleep instead of a zero-duration yield; no API, config, metric, protocol, or command surface changes. POSIX behavior is unchanged.

Performance impact: Windows wait/backoff loops can add about 1 ms per retry, bounded by the process timer resolution, while avoiding CPU-bound spins that delay holder or releaser work.

Security impact: reduces denial-of-service risk from a high-CPU wait loop and introduces no new input, memory ownership, credentials, or privilege boundary.

Validation: git diff --check; source search for Sleep(0), Sleep(1), SwitchToThread, and tinysleep; required independent source review approved the change.
(cherry picked from commit b675c10)
WEBRTC_DC.open is documented and read as an atomic flag. The close callback already stores false with __atomic_store_n(), and all active readers go through webrtc_dc_is_open(), but myOpenCallback() stored true with a plain assignment.

This is a current bug in HAVE_LIBDATACHANNEL builds: libdatachannel callbacks and web request handling can observe WEBRTC_DC.open concurrently, so mixing one plain writer with atomic readers/writers creates a C memory-model data race.

Store the open=true transition with __atomic_store_n(..., __ATOMIC_RELAXED), matching the existing close=false store and the relaxed load helper. The callback still writes the same value at the same point after the same checks and logging.

Functional impact: no intended behavior change. The data-channel open contract, callback registration, return contract, request handling, and cleanup flow are unchanged; only the storage primitive becomes consistent with the field contract.

Performance impact: negligible. This adds one relaxed atomic store on a data-channel lifecycle callback, not on a hot request loop.

(cherry picked from commit a894baa)
The MQTT client kept PINGREQ send-progress state in a file-scope ping_frag while mqtt_ng_init() can create independent mqtt_ng_client instances. That made fragment fields such as sent and sent_monotonic_ut shared by every client, even though the state is logically per connection.

This is a potential data race and state-aliasing bug. Current production appears to use a single ACLK MQTT client, but the library boundary and tests allow multiple client instances, so the global mutable fragment was the wrong lifetime for the data it stored.

Move the mutable ping fragment into struct mqtt_ng_client and initialize it with the same PINGREQ bytes and flags for each client. Add a transaction-buffer stable_frag pointer so buffer compaction and growth still preserve the external ping fragment while invalidating ordinary header-buffer fragments as before.

Functional impact: single-client protocol behavior is unchanged. PINGREQ ordering, sent accounting, queued-message accounting, and PINGRESP latency accounting keep the same semantics; they now use the owning client state instead of process-global mutable state.

Performance impact: negligible. The change adds one buffer_fragment in each client and one pointer in the private transaction buffer, with no new allocation, lock, atomic operation, loop, or network-path work.

(cherry picked from commit e154e1c)

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 5 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Thread as Worker Threads
    participant Capas as Agent Capabilities
    participant ClientA as MQTT Client A
    participant ClientB as MQTT Client B
    participant BufA as Transaction Buffer A
    participant BufB as Transaction Buffer B
    participant PingFragA as A's Ping Fragment
    participant PingFragB as B's Ping Fragment
    participant WebRTC as WebRTC Data Channel
    participant Sleep as tinysleep()
    participant OS as OS Scheduler

    Note over Thread,Capas: NEW: Thread‑local capability array
    Thread->>Capas: aclk_get_agent_capas()
    Capas-->>Thread: Pointer to __thread array

    Note over ClientA,PingFragA: NEW: Per‑client ping fragment (was global)
    ClientA->>ClientA: mqtt_ng_init()
    ClientA->>PingFragA: Initialize frag with pingreq data
    ClientA->>BufA: stable_frag = &client->ping_frag

    ClientB->>ClientB: mqtt_ng_init()
    ClientB->>PingFragB: Initialize frag
    ClientB->>BufB: stable_frag = &client->ping_frag

    loop Send cycle (Client A)
        ClientA->>ClientA: mqtt_ng_next_to_send()
        alt Ping pending
            ClientA->>PingFragA: Reset sent, sent_monotonic_ut
            ClientA->>BufA: sending_frag = &ping_frag
        end
        ClientA->>BufA: send_fragment()
        BufA->>BufA: Check if frag != stable_frag
        alt Is ping frag? (frag == stable_frag)
            BufA->>BufA: Keep frag pointer (no invalidation)
        else Other frag
            BufA->>BufA: Invalidate sending_frag on compaction/growth
        end
    end

    Note over WebRTC: CHANGED: Atomic store for open flag
    WebRTC->>WebRTC: myOpenCallback()
    WebRTC->>WebRTC: __atomic_store_n(&chan->open, true, RELAXED)
    Note right of WebRTC: Readers use __atomic_load_n

    Note over Sleep,OS: CHANGED: Windows sleep 1ms (was busy‑wait)
    Sleep->>Sleep: tinysleep()
    Sleep->>OS: Sleep(1) → yield for ~1ms
    OS-->>Sleep: resume after sleep

    Note over ClientB,PingFragB: Same per‑client pattern for Client B
Loading

Re-trigger cubic

@sonarqubecloud

sonarqubecloud Bot commented Jul 3, 2026

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants