Netdata fixes part 26 by stelfrag · Pull Request #22959 · netdata/netdata · GitHub
Skip to content

Netdata fixes part 26#22959

Draft
stelfrag wants to merge 9 commits into
netdata:masterfrom
stelfrag:netdata-fixes-26
Draft

Netdata fixes part 26#22959
stelfrag wants to merge 9 commits into
netdata:masterfrom
stelfrag:netdata-fixes-26

Conversation

@stelfrag

@stelfrag stelfrag commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator
Summary

Summary by cubic

Hardened core paths and streaming by fixing memory growth, integer/buffer edge cases, and unsafe signal/stacktrace code. Improves correctness on 32-bit builds and large streams without changing normal behavior.

  • Bug Fixes
    • Stacktrace: none backend now returns 0 frames to stop unbounded cache growth.
    • Signals: avoid strlen in the crash path; write() uses known length; SIGNAL_CODE_2str_h() uses strcatz() to get length safely.
    • Date/time: RFC3339 formatter reserves space for the terminator; exact-fit buffers now produce a correct, terminated prefix; added unit test.
    • Durations: eliminate signed overflow and INT64_MIN negate; format via unsigned math (uses __int128 when available) and guard fallback; added unit tests.
    • Streaming: widen circular buffer stat fields to size_t and switch logs to %zu; take a stats snapshot before unlocking in overflow logs; updated sender/receiver messages.
    • Concurrency/alignment: enforce 8-byte alignment for 64-bit single-writer counters; clarify macro contract.
    • ARAL: atomically increment mmap page file numbers to avoid filename races; include allocator name in unlink error logs.
    • RRD functions: replace runtime-sized stack buffers with exact-size heap allocations for keys/sources; bound-check function key sizing; pass sanitized source without extra strdup.

Written for commit c818126. Summary will update on new commits.

Review in cubic

ktsaou added 9 commits July 1, 2026 14:56
The USE_NOTRACE backend reports stacktrace_available() as false, but impl_stacktrace_get_frames() still manufactured one synthetic frame from a monotonically increasing atomic counter. stacktrace_get() interns every non-empty frame set in the global Judy cache, and that cache is not normally reclaimed. Because the synthetic frame was unique on every call, none-backend stacktrace tracking defeated cache deduplication and grew memory without bound.

This is a current bug in exercised none-backend diagnostic builds whenever stacktrace_get() or stacktrace_array_add() is called, such as the existing FSANITIZE_ADDRESS dictionary and string tracking paths. Builds with real stacktrace backends are not affected by this backend-specific path.

Return zero frames from the none backend and consume the unused parameters explicitly. stacktrace_get() already maps zero frames to NULL, stacktrace_array_add() already maps NULL to false, and the existing callers tolerate the no-stacktrace path. Direct stacktrace_capture() output remains unchanged.

Functional impact: no change for real backends or direct no-backend capture output; in none-backend tracking paths, useless fake per-call stacktrace IDs disappear and the already-supported no-stacktrace path is used.

Performance impact: positive for none-backend tracking paths because each attempted stacktrace avoids an atomic increment, hash, spinlock, Judy lookup/insert, and allocation; other backends are unchanged.

(cherry picked from commit c8e779b)
ARAL mmap page creation builds MAP_SHARED backing filenames from ar->aral_lock.file_number. Page creation happens before the page-list lock is reacquired, and lockless ARALs skip those locks entirely. The previous non-atomic increment plus later read could race between concurrent mmap page creators; two distinct ARAL pages could choose the same filename and alias the same backing storage.

Use a relaxed atomic add-fetch and format the filename from the returned value. The counter only needs uniqueness, not memory ordering, so the relaxed operation is sufficient and matches the existing counter style in this file.

Current vs potential bug: the race exists in the ARAL allocator contract today, but current registry mmap users appear externally serialized and mmap is disabled by default, so it is latent for the default production configuration. The fix hardens the allocator before a future mmap-backed concurrent caller can corrupt memory.

Functional impact: no intended behavior change. Filename format and monotonic numbering are preserved; the only changed behavior is that concurrent page creation can no longer reuse a filename.

Performance impact: one relaxed atomic increment is added only to the mmap page-creation path. This is cold compared with normal ARAL allocation and negligible beside allocation, filename creation, and mmap work.

(cherry picked from commit db380da)
MEM-parsers-002 is a current-path duration formatter bug with potential extreme-input faults: duration_snprintf() negated signed int64_t values directly, which is undefined for INT64_MIN, and multiplied signed values by unit nanosecond multipliers, which can overflow int64_t for large values and larger units.

Current vs potential: duration_snprintf() is used by current runtime paths for streaming, ACLK, retention, config, buffers, SQLite metadata, and collectors. The faulty path is current, but the dangerous inputs are extreme int64_t values rather than normal duration values.

Functional impact: normal duration formatting is unchanged for well-defined inputs. Extreme values that previously invoked undefined or wrapped signed arithmetic now format correctly on targets with unsigned __int128, or return the existing -3 error on fallback targets when the value cannot fit in uint64 nanoseconds.

Performance impact: negligible for current callers because this formats human-readable durations, not per-sample hot-path data. The fix adds bounded integer arithmetic and branch checks only.

Solution: compute negative magnitudes without negating INT64_MIN, use unsigned formatter arithmetic, guard fallback multiplication without __int128, round with quotient/remainder arithmetic, and add boundary tests for the formatter.
(cherry picked from commit c3a29f6)
The deadly-signal log path and its signal-code helper were re-scanning stack buffers with strlen() after already constructing bounded, null-terminated strings. This is a current low-severity crash-path defect because these paths run from the signal handler, where strlen() is not async-signal-safe.

Use the maintained byte length for the stderr write, and use strcatz() in SIGNAL_CODE_2str_h() so the helper gets the copied length directly while preserving the same bounded copy and terminator behavior.

Functional impact: none; the emitted crash message, formatted signal-code string, signal chaining, Sentry reporting, status-file handling, reset-to-default, and re-raise behavior are unchanged.

Performance impact: negligible improvement on rare crash/report paths by removing redundant stack-buffer scans.

(cherry picked from commit 4fa81f8)
The ARAL mmap page deletion path logged unlink failures as a bare filename, while the sibling leftover-file cleanup path includes both the ARAL prefix and the allocator name. The code already has the ARAL object in scope at the failure site, so operators could lose the allocator context needed to identify which ARAL instance failed to delete its backing file.

This is a current diagnostics defect on the existing unlink failure path, not a memory-lifetime or control-flow defect. The solution keeps the unlink call, branch condition, mmap/free order, locking, and caller behavior unchanged, and only formats the existing unlikely error log with ar->config.name to match the local ARAL diagnostic pattern.

Functional impact: no success-path or failure-control-flow change; only the text of the existing unlink-failure log becomes more specific.

Performance impact: effectively zero; one already-live string argument is formatted only when unlink() fails.

(cherry picked from commit aaadd71)
STREAM_CIRCULAR_BUFFER_STATS stored four byte-size snapshots as uint32_t even though the circular-buffer implementation uses size_t for the corresponding sizes. Buffers above UINT32_MAX would truncate these snapshots and corrupt buffer_ratio and diagnostic log values.

This is a current latent implementation defect with potential runtime manifestation. It requires a configured or autoscaled stream buffer above 4 GiB, but the truncation is deterministic once that size is reached.

Widen the stats snapshot fields to size_t and update the coupled log format specifiers to %zu. In the sender overflow path, copy the stats while still holding the sender lock before logging after unlock, preserving the existing safe snapshot pattern.

Functional impact: none intended except preserving the full size range in internal stats and logs. No streaming protocol, metric schema, disk format, API, config parser, or documentation surface changes in this commit.

Performance impact: negligible. On 64-bit builds the stats struct grows by 16 bytes per stream circular buffer; the change adds no allocations, loops, atomics, syscalls, or runtime complexity.

(cherry picked from commit e2d4b56)
The stream receiver byte counters bytes_in and bytes_out are 64-bit single-writer counters read lock-free by the pulse traversal. On 32-bit layouts, source review and reduced layout checks showed these fields could be placed at offsets not divisible by 8 while GCC can inline relaxed 64-bit access when it proves the target lock-free.

This is a current source-level alignment bug with potential runtime manifestation on affected 32-bit builds. The writer and reader call paths exist today through stream-receiver.c and pulse-parents.c.

Force the two current 64-bit single-writer counter fields to 8-byte alignment and update the helper contract comment so future 64-bit users carry the same requirement. The single-writer macro bodies, increments, reads, locks, lifetimes, chart names, dimensions, and algorithms are unchanged.

Functional impact: none intended except making the existing relaxed atomic counter contract valid on affected layouts. The affected fields are internal in-memory host status, not serialized or public ABI state.

Performance impact: no hot-path code generation change from this diff. On 64-bit platforms this alignment should be a no-op; on affected 32-bit layouts the internal status struct may gain padding.

(cherry picked from commit 757d386)
rfc3339_datetime_ut() checked only payload capacity before each append. When a caller supplied an exact-fit buffer for the visible payload, for example 20 bytes for 'YYYY-MM-DDTHH:MM:SSZ', the formatter could write the final payload byte and then the finish block would overwrite that byte with NUL while returning the pre-clobber length.

Current vs potential: this is current for direct callers of the public formatter API with undersized exact-fit buffers. Current in-tree production callers use RFC3339_MAX_LENGTH or larger, so they do not exercise the bad exact-fit case today.

Solution: change each append guard to require room for the payload plus a trailing terminator, then terminate at buffer[pos] unconditionally. This preserves full-size output and makes undersized exact-fit output report the visible prefix length that actually remains in the buffer.

Functional impact: none for current production callers with adequate buffers. The intended behavior change is limited to undersized buffers, which now get a coherent NUL-terminated prefix instead of a clobbered final byte and mismatched length.

Performance impact: none expected. The change is only constant-time capacity predicates and a simpler finish block.

(cherry picked from commit 20d40e9)
Several RRD functions paths built stack buffers from runtime string lengths: function registration and deletion sanitized the function name into char arrays sized by strlen(name), host_functions_to_dict() built versioned dictionary keys from strlen(t_dfe.name), and rrd_function_run() sanitized the caller source into a runtime-sized stack buffer. These are potential stack-exhaustion defects when function/source strings are large, even though ordinary plugin inputs are much smaller.

Replace the runtime-sized stack buffers with exact-size temporary heap buffers using the existing CLEAN_CHAR_P cleanup convention. In rrd_function_run(), transfer the sanitized source buffer directly into the inflight request and null the cleanup pointer, avoiding an extra strdupz() on the success path. Dictionary keys, sanitized bytes, stream function-delete names, callback behavior, and JSON/API output remain unchanged.

Current vs potential bug: potential in normal deployments, source-confirmed for these code paths because the allocation size is derived from runtime strings. Functional impact: no intended behavior change except avoiding unbounded stack allocation. Performance impact: negligible on non-hot registration/export/run paths; add/delete/export use one temporary heap allocation, while rrd_function_run() keeps the success-path allocation count effectively unchanged by removing the separate source strdup.

(cherry picked from commit 5d3e096)

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 17 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.
Architecture diagram
sequenceDiagram
    participant Sender as Sender Thread
    participant State as Sender State
    participant SCB as Circular Buffer
    participant Receiver as Receiver Thread
    participant Host as RRDHOST
    participant Log as Log System

    Note over Sender, Log: Streaming send path – happy & overflow

    Sender->>State: sender_buffer_commit(wb)
    State->>SCB: adapt_max_size()
    Note over SCB: CHANGED: stats fields are now size_t (was uint32_t)
    State->>SCB: update stats (bytes_size, bytes_max_size…)
    alt Normal commit
        State->>SCB: add data
    else Overflow detected
        Note over State: CHANGED: snapshot stats BEFORE unlock
        State->>SCB: copy stats snapshot
        State->>Log: log with snapshot values, %zu format
        Note over State: CHANGED: log uses snapshot, avoids race
        State->>Receiver: send RESTART opcode
    end

    Receiver->>SCB: stream_receiver_handle_op()
    SCB-->>Receiver: stats (size_t fields)
    Receiver->>Log: log buffer full, %zu format
    Note over Receiver: CHANGED: new format specifiers

    Note over Host: CHANGED: bytes_in/bytes_out aligned to 8 bytes<br/>for safe 32-bit relaxed atomic access
    Receiver->>Host: update bytes_in/bytes_out (single-writer atomic)
    Host-->>Receiver: (via pulse traversal)
Loading

Re-trigger cubic

@sonarqubecloud

sonarqubecloud Bot commented Jul 3, 2026

Copy link
Copy Markdown

char sanitized_source[(source ? strlen(source) : 0) + 1];
rrd_functions_sanitize(sanitized_source, source ? source : "", sizeof(sanitized_source));
const char *source_to_sanitize = source ? source : "";
size_t sanitized_source_size = strlen(source_to_sanitize) + 1;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants