Netdata fixes part 23 by stelfrag · Pull Request #22955 · netdata/netdata · GitHub
Skip to content

Netdata fixes part 23#22955

Draft
stelfrag wants to merge 5 commits into
netdata:masterfrom
stelfrag:netdata-fixes-23
Draft

Netdata fixes part 23#22955
stelfrag wants to merge 5 commits into
netdata:masterfrom
stelfrag:netdata-fixes-23

Conversation

@stelfrag

@stelfrag stelfrag commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator
Summary

Summary by cubic

Strengthens trace/debug builds: fixes rwlock trace races and API mismatches, hardens the traced allocator with a reentrancy-safe dlsym bootstrap and pointer-ownership checks, and adds Gorilla fuzz coverage for disk buffers. No production behavior changes.

  • Bug Fixes

    • NETDATA_TRACE_RWLOCKS: hold lockers_mutex while looking up/updating/adding lockers to avoid races; switch to nd_thread_tag and *_ITEM_UNSAFE list macros.
    • Align debug wrappers: make destroy/lock/unlock wrappers void (trylocks stay int) and drop impossible return checks, including in avl_destroy_lock.
    • Trace allocator: compute per-allocation header magic to validate ownership and avoid freeing foreign pointers; add a dlsym bootstrap allocator to prevent recursion during first-use symbol resolution (covers malloc/calloc/realloc/reallocarray/free/strdup/strndup/usable_size).
  • New Features

    • Fuzzing: add Gorilla disk-page patch/read path (gorilla_buffer_patchgorilla_reader_init) with bounded reads to improve coverage; production is unaffected.

Written for commit 6e1fb43. Summary will update on new commits.

Review in cubic

ktsaou added 5 commits July 1, 2026 14:56
NETDATA_TRACE_RWLOCKS had a potential debug-build use-after-free: add_rwlock_locker() looked up an existing locker under lockers_mutex, released the mutex, then updated the returned locker pointer.

Normal builds are unaffected because NETDATA_TRACE_RWLOCKS is only enabled through explicit trace CFLAGS. In trace builds, the race could corrupt debug bookkeeping if destroy cleanup removed the same locker while add_rwlock_locker() updated it.

Keep the Judy lookup and locker refcount/type update inside lockers_mutex, and insert new lockers under the same critical section. This adds no default-build functional impact and only extends the existing trace-mode bookkeeping mutex over first-entry allocation.

(cherry picked from commit 58ccab4)
Potential issue: NETDATA_TRACE_ALLOCATIONS used a fixed 0x0BADCAFE header magic to decide whether freez, reallocz, and mallocz_usable_size owned a pointer. A foreign libc pointer with that word at the expected offset could be misclassified and sent through the trace-accounting path.

Functional impact: normal builds do not enable this trace allocator path. Traced builds keep the same fallback behavior for foreign pointers and valid Netdata trace allocations.

Performance impact: ownership validation remains O(1) and keeps the same header size. The check adds only a pointer xor/fold against a static salt.

Store an address-derived magic in traced allocation headers and verify the computed value before using trace metadata or freeing the computed header.

(cherry picked from commit aaab1fe)
Potential: trace allocation builds resolve libc allocator pointers with dlsym(RTLD_NEXT) during first use. If dlsym re-enters an interposed allocation wrapper before the real pointer is installed, the resolver can recurse instead of reaching libc.

Functional impact: default and non-trace builds are unchanged because the change is inside NETDATA_TRACE_ALLOCATIONS plus dlsym guards. Trace+dlsym builds now service allocator calls made during dlsym from a fixed bootstrap buffer and avoid passing retained bootstrap pointers into traced or libc realloc paths.

Performance impact: default builds have no runtime cost. Trace+dlsym builds add only unlikely wrapper checks and a one-time thread-local depth guard around first-run dlsym resolution.

(cherry picked from commit f0b465c)
NETDATA_TRACE_RWLOCKS debug wrappers still modeled destroy, lock, and unlock helpers as returning int, even though the underlying __netdata_* helpers and libuv operations are void. That stale contract also left the trace path using the removed netdata_thread_tag() name and old linked-list macro names.

This is a current opt-in build bug: compiling with NETDATA_TRACE_RWLOCKS reaches these wrappers today and fails on void return-value use and stale APIs. It is potential runtime risk only for trace builds because the affected code is behind the trace/debug configuration.

Align the trace debug wrapper signatures with the existing base helper contracts: destroy, lock, and unlock wrappers are void; init and trylock wrappers keep their int return values. Replace the stale thread-tag and linked-list helper names with current project APIs.

Remove the directly coupled avl_destroy_lock() status check for netdata_rwlock_destroy(). The rwlock destroy operation cannot report failure in either trace or non-trace mode, so the check was an impossible contract and would remain a compile error once the trace wrapper is corrected.

Functional impact: default builds are unchanged. Trace/debug builds preserve the same lock operations, timing samples, locker bookkeeping, fatal unlock checks, and trylock failure behavior; they only stop exposing fabricated status values for operations that cannot return status.

Performance impact: none for default builds. Trace/debug builds keep the same instrumentation work, minus storing and formatting fabricated zero return values for void operations.

(cherry picked from commit f48449b)
Current issue: the existing ENABLE_FUZZER entry point exercised writer-produced gorilla buffers and reader roundtrips, but it did not feed arbitrary serialized disk-shaped buffers through gorilla_buffer_patch() and the disk-reader initialization path. This is a current coverage gap, not a production behavior bug.

Root cause: disk page repair and reader setup are normally reached from the dbengine disk-page load path, while the fuzzer only generated values through gorilla_writer_write(). Arbitrary corrupt disk bytes therefore did not exercise gorilla_buffer_patch(), gorilla_reader_init(), or malformed-reader termination.

Solution: add a fuzzer-only helper that copies the libFuzzer input into zero-padded gorilla-sized buffers, calls gorilla_buffer_patch(), and, on successful patching, initializes a gorilla reader and performs a bounded number of reads. The existing writer/reader roundtrip path remains unchanged and still runs for inputs with at least one uint32_t.

Functional impact: none for production builds because the change is entirely inside ENABLE_FUZZER. The libFuzzer entry point still returns 0 and still tolerates arbitrary input; malformed disk buffers are accepted as normal fuzzer inputs by returning from patch failure or breaking on reader failure.

Performance impact: none for production. Fuzzer-only work adds one input-sized allocation/copy rounded to gorilla buffer size and at most 16384 reader attempts after a successful patch, preventing attacker-controlled entry counts from creating long fuzz cases.

Caller and ownership impact: LLVMFuzzerTestOneInput is the only caller of the new helper. The helper mutates only a private copy of the input, owns its local Storage allocations, frees them before returning, adds no shared state, locks, or concurrency, and does not change dbengine page-loading callers.

(cherry picked from commit 6c1686f)
@sonarqubecloud

sonarqubecloud Bot commented Jul 3, 2026

Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 5 files

Confidence score: 2/5

  • In src/libnetdata/gorilla/gorilla.cc, the entries count can stay 0 if the single-argument gorilla_buffer_patch(disk_buffer) return value is not captured, which risks dropping valid data/state during patching and causing incorrect downstream behavior. Update this call path to assign and use the returned entry count (and verify with a focused test) before merging.
Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/libnetdata/gorilla/gorilla.cc">

<violation number="1" location="src/libnetdata/gorilla/gorilla.cc:508">
P0: `entries` is set by `gorilla_buffer_patch(disk_buffer)` (return value) but remains 0 when patch is called with wrong args — after the fix, the return value of the single-arg `gorilla_buffer_patch` must be captured. The current call assigns `entries` via the nonexistent third argument.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0: entries is set by gorilla_buffer_patch(disk_buffer) (return value) but remains 0 when patch is called with wrong args — after the fix, the return value of the single-arg gorilla_buffer_patch must be captured. The current call assigns entries via the nonexistent third argument.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/libnetdata/gorilla/gorilla.cc, line 508:

<comment>`entries` is set by `gorilla_buffer_patch(disk_buffer)` (return value) but remains 0 when patch is called with wrong args — after the fix, the return value of the single-arg `gorilla_buffer_patch` must be captured. The current call assigns `entries` via the nonexistent third argument.</comment>

<file context>
@@ -489,7 +489,44 @@ class Storage {
+    gorilla_buffer_t *disk_buffer = S.alloc_buffer(words);
+    memcpy(disk_buffer, data, size);
+
+    uint32_t entries = 0;
+    if (gorilla_buffer_patch(disk_buffer, nbuffers, &entries)) {
+        gorilla_reader_t gr = gorilla_reader_init(disk_buffer);
</file context>

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants