{{ message }}
Netdata fixes part 25#22958
Draft
stelfrag wants to merge 4 commits into
Draft
Conversation
aclk_get_agent_capas() returned a borrowed pointer to a process-static mutable capability array and rewrote dynamic entries on every call. Concurrent ACLK serializers could interleave those writes with reads of the same array, creating a current source-level data race in capability publication. Use thread-local storage for the existing static array. This preserves the borrowed-pointer API and keeps the same capability order, sentinel, dynamic value assignments, and protobuf serialization while giving each calling thread its own mutable copy. Functional impact: intended output is unchanged. Callers still receive a borrowed pointer, serializers still consume it synchronously, and no ownership contract changes. Performance impact: replaces process-static access with TLS access and adds no heap allocation, lock, loop, or serialization work. Security impact: removes an unsynchronized shared-memory race during ACLK capability serialization. Validation: git diff --check -- src/aclk/aclk_capas.c; cmake --build /tmp/netdata-mem-corruptions-pr-comments-debugfs-build --target src/aclk/aclk_capas.o -j 4. Reviewer gate: required reviewers approved the final one-file diff. (cherry picked from commit 1a20d01)
On Windows, tinysleep used Sleep(0), which only yields the remainder of the current time slice and leaves the waiter runnable. Current impact is that object_state_deactivate() and refcount deletion waits can spin at high CPU while waiting for holder threads; potential impact is the same scheduler starvation in any tinysleep wait loop. Use Sleep(1) for Windows tinysleep so waiters become non-runnable briefly, matching the helper's sleep semantics and the POSIX nanosleep branch. Leave yield_the_processor() as Sleep(0) for code that explicitly wants a yield. Functional impact: Windows tinysleep callers now wait for a real short sleep instead of a zero-duration yield; no API, config, metric, protocol, or command surface changes. POSIX behavior is unchanged. Performance impact: Windows wait/backoff loops can add about 1 ms per retry, bounded by the process timer resolution, while avoiding CPU-bound spins that delay holder or releaser work. Security impact: reduces denial-of-service risk from a high-CPU wait loop and introduces no new input, memory ownership, credentials, or privilege boundary. Validation: git diff --check; source search for Sleep(0), Sleep(1), SwitchToThread, and tinysleep; required independent source review approved the change. (cherry picked from commit b675c10)
WEBRTC_DC.open is documented and read as an atomic flag. The close callback already stores false with __atomic_store_n(), and all active readers go through webrtc_dc_is_open(), but myOpenCallback() stored true with a plain assignment. This is a current bug in HAVE_LIBDATACHANNEL builds: libdatachannel callbacks and web request handling can observe WEBRTC_DC.open concurrently, so mixing one plain writer with atomic readers/writers creates a C memory-model data race. Store the open=true transition with __atomic_store_n(..., __ATOMIC_RELAXED), matching the existing close=false store and the relaxed load helper. The callback still writes the same value at the same point after the same checks and logging. Functional impact: no intended behavior change. The data-channel open contract, callback registration, return contract, request handling, and cleanup flow are unchanged; only the storage primitive becomes consistent with the field contract. Performance impact: negligible. This adds one relaxed atomic store on a data-channel lifecycle callback, not on a hot request loop. (cherry picked from commit a894baa)
The MQTT client kept PINGREQ send-progress state in a file-scope ping_frag while mqtt_ng_init() can create independent mqtt_ng_client instances. That made fragment fields such as sent and sent_monotonic_ut shared by every client, even though the state is logically per connection. This is a potential data race and state-aliasing bug. Current production appears to use a single ACLK MQTT client, but the library boundary and tests allow multiple client instances, so the global mutable fragment was the wrong lifetime for the data it stored. Move the mutable ping fragment into struct mqtt_ng_client and initialize it with the same PINGREQ bytes and flags for each client. Add a transaction-buffer stable_frag pointer so buffer compaction and growth still preserve the external ping fragment while invalidating ordinary header-buffer fragments as before. Functional impact: single-client protocol behavior is unchanged. PINGREQ ordering, sent accounting, queued-message accounting, and PINGRESP latency accounting keep the same semantics; they now use the owning client state instead of process-global mutable state. Performance impact: negligible. The change adds one buffer_fragment in each client and one pointer in the private transaction buffer, with no new allocation, lock, atomic operation, loop, or network-path work. (cherry picked from commit e154e1c)
Contributor
There was a problem hiding this comment.
No issues found across 5 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Thread as Worker Threads
participant Capas as Agent Capabilities
participant ClientA as MQTT Client A
participant ClientB as MQTT Client B
participant BufA as Transaction Buffer A
participant BufB as Transaction Buffer B
participant PingFragA as A's Ping Fragment
participant PingFragB as B's Ping Fragment
participant WebRTC as WebRTC Data Channel
participant Sleep as tinysleep()
participant OS as OS Scheduler
Note over Thread,Capas: NEW: Thread‑local capability array
Thread->>Capas: aclk_get_agent_capas()
Capas-->>Thread: Pointer to __thread array
Note over ClientA,PingFragA: NEW: Per‑client ping fragment (was global)
ClientA->>ClientA: mqtt_ng_init()
ClientA->>PingFragA: Initialize frag with pingreq data
ClientA->>BufA: stable_frag = &client->ping_frag
ClientB->>ClientB: mqtt_ng_init()
ClientB->>PingFragB: Initialize frag
ClientB->>BufB: stable_frag = &client->ping_frag
loop Send cycle (Client A)
ClientA->>ClientA: mqtt_ng_next_to_send()
alt Ping pending
ClientA->>PingFragA: Reset sent, sent_monotonic_ut
ClientA->>BufA: sending_frag = &ping_frag
end
ClientA->>BufA: send_fragment()
BufA->>BufA: Check if frag != stable_frag
alt Is ping frag? (frag == stable_frag)
BufA->>BufA: Keep frag pointer (no invalidation)
else Other frag
BufA->>BufA: Invalidate sending_frag on compaction/growth
end
end
Note over WebRTC: CHANGED: Atomic store for open flag
WebRTC->>WebRTC: myOpenCallback()
WebRTC->>WebRTC: __atomic_store_n(&chan->open, true, RELAXED)
Note right of WebRTC: Readers use __atomic_load_n
Note over Sleep,OS: CHANGED: Windows sleep 1ms (was busy‑wait)
Sleep->>Sleep: tinysleep()
Sleep->>OS: Sleep(1) → yield for ~1ms
OS-->>Sleep: resume after sleep
Note over ClientB,PingFragB: Same per‑client pattern for Client B
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.




Summary
Summary by cubic
Fixes several data races and a Windows sleep issue across ACLK MQTT, capabilities, and WebRTC. Improves thread safety and prevents high-CPU spins, with no functional changes.
tinysleep()useSleep(1)instead ofSleep(0)to actually sleep; update comment to match behavior.open = truewith a relaxed atomic to match readers/writers and remove the data race.Written for commit 8819ec0. Summary will update on new commits.