{{ message }}
Netdata fixes part 26#22959
Draft
stelfrag wants to merge 9 commits into
Draft
Conversation
The USE_NOTRACE backend reports stacktrace_available() as false, but impl_stacktrace_get_frames() still manufactured one synthetic frame from a monotonically increasing atomic counter. stacktrace_get() interns every non-empty frame set in the global Judy cache, and that cache is not normally reclaimed. Because the synthetic frame was unique on every call, none-backend stacktrace tracking defeated cache deduplication and grew memory without bound. This is a current bug in exercised none-backend diagnostic builds whenever stacktrace_get() or stacktrace_array_add() is called, such as the existing FSANITIZE_ADDRESS dictionary and string tracking paths. Builds with real stacktrace backends are not affected by this backend-specific path. Return zero frames from the none backend and consume the unused parameters explicitly. stacktrace_get() already maps zero frames to NULL, stacktrace_array_add() already maps NULL to false, and the existing callers tolerate the no-stacktrace path. Direct stacktrace_capture() output remains unchanged. Functional impact: no change for real backends or direct no-backend capture output; in none-backend tracking paths, useless fake per-call stacktrace IDs disappear and the already-supported no-stacktrace path is used. Performance impact: positive for none-backend tracking paths because each attempted stacktrace avoids an atomic increment, hash, spinlock, Judy lookup/insert, and allocation; other backends are unchanged. (cherry picked from commit c8e779b)
ARAL mmap page creation builds MAP_SHARED backing filenames from ar->aral_lock.file_number. Page creation happens before the page-list lock is reacquired, and lockless ARALs skip those locks entirely. The previous non-atomic increment plus later read could race between concurrent mmap page creators; two distinct ARAL pages could choose the same filename and alias the same backing storage. Use a relaxed atomic add-fetch and format the filename from the returned value. The counter only needs uniqueness, not memory ordering, so the relaxed operation is sufficient and matches the existing counter style in this file. Current vs potential bug: the race exists in the ARAL allocator contract today, but current registry mmap users appear externally serialized and mmap is disabled by default, so it is latent for the default production configuration. The fix hardens the allocator before a future mmap-backed concurrent caller can corrupt memory. Functional impact: no intended behavior change. Filename format and monotonic numbering are preserved; the only changed behavior is that concurrent page creation can no longer reuse a filename. Performance impact: one relaxed atomic increment is added only to the mmap page-creation path. This is cold compared with normal ARAL allocation and negligible beside allocation, filename creation, and mmap work. (cherry picked from commit db380da)
MEM-parsers-002 is a current-path duration formatter bug with potential extreme-input faults: duration_snprintf() negated signed int64_t values directly, which is undefined for INT64_MIN, and multiplied signed values by unit nanosecond multipliers, which can overflow int64_t for large values and larger units. Current vs potential: duration_snprintf() is used by current runtime paths for streaming, ACLK, retention, config, buffers, SQLite metadata, and collectors. The faulty path is current, but the dangerous inputs are extreme int64_t values rather than normal duration values. Functional impact: normal duration formatting is unchanged for well-defined inputs. Extreme values that previously invoked undefined or wrapped signed arithmetic now format correctly on targets with unsigned __int128, or return the existing -3 error on fallback targets when the value cannot fit in uint64 nanoseconds. Performance impact: negligible for current callers because this formats human-readable durations, not per-sample hot-path data. The fix adds bounded integer arithmetic and branch checks only. Solution: compute negative magnitudes without negating INT64_MIN, use unsigned formatter arithmetic, guard fallback multiplication without __int128, round with quotient/remainder arithmetic, and add boundary tests for the formatter. (cherry picked from commit c3a29f6)
The deadly-signal log path and its signal-code helper were re-scanning stack buffers with strlen() after already constructing bounded, null-terminated strings. This is a current low-severity crash-path defect because these paths run from the signal handler, where strlen() is not async-signal-safe. Use the maintained byte length for the stderr write, and use strcatz() in SIGNAL_CODE_2str_h() so the helper gets the copied length directly while preserving the same bounded copy and terminator behavior. Functional impact: none; the emitted crash message, formatted signal-code string, signal chaining, Sentry reporting, status-file handling, reset-to-default, and re-raise behavior are unchanged. Performance impact: negligible improvement on rare crash/report paths by removing redundant stack-buffer scans. (cherry picked from commit 4fa81f8)
The ARAL mmap page deletion path logged unlink failures as a bare filename, while the sibling leftover-file cleanup path includes both the ARAL prefix and the allocator name. The code already has the ARAL object in scope at the failure site, so operators could lose the allocator context needed to identify which ARAL instance failed to delete its backing file. This is a current diagnostics defect on the existing unlink failure path, not a memory-lifetime or control-flow defect. The solution keeps the unlink call, branch condition, mmap/free order, locking, and caller behavior unchanged, and only formats the existing unlikely error log with ar->config.name to match the local ARAL diagnostic pattern. Functional impact: no success-path or failure-control-flow change; only the text of the existing unlink-failure log becomes more specific. Performance impact: effectively zero; one already-live string argument is formatted only when unlink() fails. (cherry picked from commit aaadd71)
STREAM_CIRCULAR_BUFFER_STATS stored four byte-size snapshots as uint32_t even though the circular-buffer implementation uses size_t for the corresponding sizes. Buffers above UINT32_MAX would truncate these snapshots and corrupt buffer_ratio and diagnostic log values. This is a current latent implementation defect with potential runtime manifestation. It requires a configured or autoscaled stream buffer above 4 GiB, but the truncation is deterministic once that size is reached. Widen the stats snapshot fields to size_t and update the coupled log format specifiers to %zu. In the sender overflow path, copy the stats while still holding the sender lock before logging after unlock, preserving the existing safe snapshot pattern. Functional impact: none intended except preserving the full size range in internal stats and logs. No streaming protocol, metric schema, disk format, API, config parser, or documentation surface changes in this commit. Performance impact: negligible. On 64-bit builds the stats struct grows by 16 bytes per stream circular buffer; the change adds no allocations, loops, atomics, syscalls, or runtime complexity. (cherry picked from commit e2d4b56)
The stream receiver byte counters bytes_in and bytes_out are 64-bit single-writer counters read lock-free by the pulse traversal. On 32-bit layouts, source review and reduced layout checks showed these fields could be placed at offsets not divisible by 8 while GCC can inline relaxed 64-bit access when it proves the target lock-free. This is a current source-level alignment bug with potential runtime manifestation on affected 32-bit builds. The writer and reader call paths exist today through stream-receiver.c and pulse-parents.c. Force the two current 64-bit single-writer counter fields to 8-byte alignment and update the helper contract comment so future 64-bit users carry the same requirement. The single-writer macro bodies, increments, reads, locks, lifetimes, chart names, dimensions, and algorithms are unchanged. Functional impact: none intended except making the existing relaxed atomic counter contract valid on affected layouts. The affected fields are internal in-memory host status, not serialized or public ABI state. Performance impact: no hot-path code generation change from this diff. On 64-bit platforms this alignment should be a no-op; on affected 32-bit layouts the internal status struct may gain padding. (cherry picked from commit 757d386)
rfc3339_datetime_ut() checked only payload capacity before each append. When a caller supplied an exact-fit buffer for the visible payload, for example 20 bytes for 'YYYY-MM-DDTHH:MM:SSZ', the formatter could write the final payload byte and then the finish block would overwrite that byte with NUL while returning the pre-clobber length. Current vs potential: this is current for direct callers of the public formatter API with undersized exact-fit buffers. Current in-tree production callers use RFC3339_MAX_LENGTH or larger, so they do not exercise the bad exact-fit case today. Solution: change each append guard to require room for the payload plus a trailing terminator, then terminate at buffer[pos] unconditionally. This preserves full-size output and makes undersized exact-fit output report the visible prefix length that actually remains in the buffer. Functional impact: none for current production callers with adequate buffers. The intended behavior change is limited to undersized buffers, which now get a coherent NUL-terminated prefix instead of a clobbered final byte and mismatched length. Performance impact: none expected. The change is only constant-time capacity predicates and a simpler finish block. (cherry picked from commit 20d40e9)
Several RRD functions paths built stack buffers from runtime string lengths: function registration and deletion sanitized the function name into char arrays sized by strlen(name), host_functions_to_dict() built versioned dictionary keys from strlen(t_dfe.name), and rrd_function_run() sanitized the caller source into a runtime-sized stack buffer. These are potential stack-exhaustion defects when function/source strings are large, even though ordinary plugin inputs are much smaller. Replace the runtime-sized stack buffers with exact-size temporary heap buffers using the existing CLEAN_CHAR_P cleanup convention. In rrd_function_run(), transfer the sanitized source buffer directly into the inflight request and null the cleanup pointer, avoiding an extra strdupz() on the success path. Dictionary keys, sanitized bytes, stream function-delete names, callback behavior, and JSON/API output remain unchanged. Current vs potential bug: potential in normal deployments, source-confirmed for these code paths because the allocation size is derived from runtime strings. Functional impact: no intended behavior change except avoiding unbounded stack allocation. Performance impact: negligible on non-hot registration/export/run paths; add/delete/export use one temporary heap allocation, while rrd_function_run() keeps the success-path allocation count effectively unchanged by removing the separate source strdup. (cherry picked from commit 5d3e096)
Contributor
There was a problem hiding this comment.
No issues found across 17 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Sender as Sender Thread
participant State as Sender State
participant SCB as Circular Buffer
participant Receiver as Receiver Thread
participant Host as RRDHOST
participant Log as Log System
Note over Sender, Log: Streaming send path – happy & overflow
Sender->>State: sender_buffer_commit(wb)
State->>SCB: adapt_max_size()
Note over SCB: CHANGED: stats fields are now size_t (was uint32_t)
State->>SCB: update stats (bytes_size, bytes_max_size…)
alt Normal commit
State->>SCB: add data
else Overflow detected
Note over State: CHANGED: snapshot stats BEFORE unlock
State->>SCB: copy stats snapshot
State->>Log: log with snapshot values, %zu format
Note over State: CHANGED: log uses snapshot, avoids race
State->>Receiver: send RESTART opcode
end
Receiver->>SCB: stream_receiver_handle_op()
SCB-->>Receiver: stats (size_t fields)
Receiver->>Log: log buffer full, %zu format
Note over Receiver: CHANGED: new format specifiers
Note over Host: CHANGED: bytes_in/bytes_out aligned to 8 bytes<br/>for safe 32-bit relaxed atomic access
Receiver->>Host: update bytes_in/bytes_out (single-writer atomic)
Host-->>Receiver: (via pulse traversal)
| char sanitized_source[(source ? strlen(source) : 0) + 1]; | ||
| rrd_functions_sanitize(sanitized_source, source ? source : "", sizeof(sanitized_source)); | ||
| const char *source_to_sanitize = source ? source : ""; | ||
| size_t sanitized_source_size = strlen(source_to_sanitize) + 1; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.




Summary
Summary by cubic
Hardened core paths and streaming by fixing memory growth, integer/buffer edge cases, and unsafe signal/stacktrace code. Improves correctness on 32-bit builds and large streams without changing normal behavior.
SIGNAL_CODE_2str_h()usesstrcatz()to get length safely.__int128when available) and guard fallback; added unit tests.size_tand switch logs to%zu; take a stats snapshot before unlocking in overflow logs; updated sender/receiver messages.Written for commit c818126. Summary will update on new commits.