fix: harden netdata memory safety edges#22881
Conversation
There was a problem hiding this comment.
No issues found across 15 files
Confidence score: 5/5
- Automated review surfaced no issues in the provided summaries.
- No files require special attention.
Architecture diagram
sequenceDiagram
participant Buffer as Buffer Module
participant Str as String Module
participant Parser as Procfile Parser
participant Socket as Socket/SSL Module
participant Stack as Stacktrace Module
participant Storage as Storage Number Module
participant HTTP as HTTP Access Module
Note over Buffer,HTTP: Memory Safety Edges (current state)
Buffer->>Buffer: buffer_strcat_htmlescape()
Buffer->>Buffer: NUL-terminate after HTML escape
Buffer->>Buffer: buffer_overflow_check()
Buffer->>Buffer: buffer_jsdate()
Buffer->>Buffer: memcpy "Date" const uint32_t
Buffer->>Buffer: memcpy ")\0" const uint16_t
Buffer->>Buffer: Calculate len via pointer arithmetic
Buffer->>Buffer: buffer_print_netdata_double_hex/base64()
Buffer->>Buffer: memcpy value to uint64_t representation
Buffer->>Buffer: Print hex/base64 from representation
Buffer->>Str: string_strdupz/string_strndupz(str)
alt length > STRING_MAX_LENGTH
Str->>Str: fatal() - cannot index
else valid length
Str->>Str: string_index_search/lookup
end
Str->>Str: string_2way_merge(a, b)
Str->>Str: Bound suffix scan with prefix_end pointers
Str->>Str: Compare s1[-1] vs s2[-1]
Parser->>Parser: procfile_set_separators/quotes/open_close()
Parser->>Parser: Index ffs[] with unsigned char
Socket->>Socket: connect_to_one_of_callback(entry)
Socket->>Socket: strncpyz with connected_to_size - 1
Socket->>Socket: NUL-terminate at connected_to_size - 1
Socket->>Socket: poll_events()
alt ndpl creation failed
Socket->>Socket: log error, return early
end
Socket->>Socket: security_test_certificate(ssl)
Socket->>Socket: X509_free(cert) after use
Stack->>Stack: stacktrace_array_add(array, skip_frames)
alt invalid skip_frames
Stack->>Stack: return false
end
Stack->>Stack: stacktrace_get(skip_frames)
alt invalid skip_frames
Stack->>Stack: return NULL
end
Stack->>Stack: impl_stacktrace_get_frames()
Stack->>Stack: bt_collect_frames_error_handler callback
Stack->>Stack: impl_stacktrace_to_buffer(trace)
Stack->>Stack: dladdr with real Dl_info struct
Stack->>Stack: access info.dli_sname
Storage->>Storage: is_system_ieee754_double()
Storage->>Storage: memcpy for type-punning
Storage->>Storage: Compare representation vs expected
HTTP->>HTTP: http_access2txt(buf, size, access)
alt buf is NULL or size is 0
HTTP->>HTTP: return early
end
Note over Buffer,HTTP: All paths avoid undefined behavior and type-punning violations
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 1 file (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
72899f2 to
7bc8110
Compare
|
@cubic-dev-ai please review again |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found across 2 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/libnetdata/sanitizers/utf8-sanitizer.c">
<violation number="1" location="src/libnetdata/sanitizers/utf8-sanitizer.c:159">
P1: Memory leak: early empty-input return does not free `src_to_free`</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
948e2d3 to
02e8b04
Compare
|
@cubic-dev-ai please review this PR |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
1 issue found across 7 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/daemon/status-file.c">
<violation number="1" location="src/daemon/status-file.c:1068">
P2: `daemon_status_dedup_load(false)` return value is discarded, but the signal-handler path in `daemon_status_file_deadly_signal_received()` now relies on the dedup table having been preloaded into memory via `DEDUP_USE_LOADED`. If the disk file exists but fails to load (corrupt, permission error, etc.), duplicate Sentry crash reports can be sent because the handler skips the on-demand reload.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
There was a problem hiding this comment.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/libnetdata/socket/nd-poll.c">
<violation number="1" location="src/libnetdata/socket/nd-poll.c:298">
P1: The Linux epoll backend does not subscribe to EPOLLRDHUP while the non-Linux poll backend now does (via the added POLLRDHUP flag). Since the kernel masks epoll events by the subscribed mask (`ep_item_poll` returns `vfs_poll(...) & epi->event.events`), EPOLLRDHUP will not be delivered on Linux unless explicitly requested. This leaves the primary Linux path inconsistent with the poll fallback for detecting peer half-close events, particularly on write-only sockets.</violation>
</file>
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
stacktrace_to_buffer() lazily publishes st->text while holding stacktrace_lock, but the cache-hit path read the pointer without the same synchronization. Concurrent conversion of the same interned stacktrace could therefore read a pointer while another thread writes it. This is a potential runtime data race with deterministic source evidence: the current code has a write-under-lock/read-without-lock pattern on the cached text pointer. The cached string is published once and source review found no free or reassignment path after publication. Load st->text under stacktrace_lock, release the lock, and append the copied pointer afterward. The lock is intentionally not held across buffer_strcat(), backend symbol resolution, caller-buffer growth, or formatting. Functional impact: no intended output change. Cache hits still append the same cached text, and cache misses still use the existing backend conversion and lazy publish path. Performance impact: one short stacktrace_lock acquire/release per stacktrace_to_buffer() call. The lock hold is pointer-only and avoids allocation or buffer work under the spinlock. Validation: git diff --check passed; the worker configured and built libnetdata with low priority. Full netdata/unit-test execution was blocked by an unrelated missing ACLK proto source.
stacktrace_cache_init() used double-checked locking with a plain cache_initialized bool. The flag was written under stacktrace_lock but read without synchronization on the fast path, so another thread could race with the publication of the Judy cache initialization state. This is a potential runtime data race with deterministic source evidence. stacktrace_init() and stacktrace_get() both rely on lazy idempotent cache initialization, and no source path resets the flag after initialization. Use acquire loads for both checks and a release store after STACKTRACE_INIT(). The existing spinlock and lazy initialization contract stay in place; no public stacktrace API or backend initialization sequence changes. Functional impact: no intended behavior change. The cache is still initialized once before use, and callers keep the same lazy/idempotent semantics. Performance impact: negligible; the post-initialization fast path replaces one plain bool load with one acquire atomic load and adds no lock acquisition. Validation: git diff --check passed; source scans confirmed no plain active cache_initialized reads or writes remain, and the worker built libnetdata with low priority. Full netdata/unit-test execution was blocked by an unrelated missing ACLK proto source.
stacktrace_get() handled a hash mismatch by probing at hash + i for at most ten slots. If those probed slots were occupied by other traces, the loop could exit with trace still pointing at a non-matching cached stacktrace and return the wrong frames. This is a current implementation defect in the cache logic, even though practical collision probability is low with the current hash. The function contract is exact stacktrace identity, not best-effort hash identity. Store same-key collisions as a linked chain under the original JudyL key and compare frame arrays before returning a cached trace. New traces are appended to the key chain only after exact mismatch, and Judy insertion failure returns NULL instead of publishing a partial cache entry. Functional impact: same frame sequences still return the same cached pointer; different frame sequences with the same hash now return distinct stacktrace objects instead of being conflated. Performance impact: normal operation remains one JudyL lookup and one frame comparison. Only true hash collisions walk an O(k) same-key chain; this also removes synthetic hash+i table pollution. Each cached stacktrace gains one next pointer. Validation: git diff --check passed; the worker configured with low priority and compiled the touched stacktrace objects. A unit test now exercises same-hash different-frame cache entries. Full netdata execution was blocked by an unrelated missing ACLK proto source.
simple_ring_buffer_make_room() compared b->write_pos + size against b->size using unchecked size_t arithmetic, then doubled or added to b->size without guarding overflow. A wrapped capacity calculation could make the allocation decision smaller than the bytes later copied into the buffer. The decompressor read path also trusted the backend-visible bytes-in-buffer state before memcpy(). If that state became inconsistent, stream_decompressor_get() could copy more bytes than the ring buffer currently held. This is a potential bug for current callers: source review found streaming compression inputs currently bounded by compression chunk/message limits, but the helpers accepted arbitrary size_t positions and sizes and performed unsafe arithmetic or insufficient local state validation. Compute the required capacity with checked addition, guard doubling at SIZE_MAX / 2, and ensure the selected new size is at least write_pos + size before reallocz(). Validate decompressor buffer/data invariants before reporting available bytes, and check available bytes before memcpy(). Impossible capacity or inconsistent-buffer requests now fail before allocation or copy. Functional impact: no intended change for representable normal streaming compression or decompression traffic. Valid requests still grow the same buffer, expose the same decompressed bytes, and preserve the existing reallocz ownership model. Performance impact: negligible. The helpers add scalar checks around buffer growth and decompressor copy boundaries, with no per-byte work and no normal-path allocation policy change for bounded callers. Validation: git diff --check passed; the worker configured with low priority and direct syntax/format checks passed for streaming compression translation units. The generated object build was blocked by an unrelated missing ACLK proto source.
stream_receiver_check_all_nodes_from_poll() probes each receiver with nd_sock_peek_nowait(). A signal-interrupted probe can return -1 with errno set to EINTR, but the previous code treated every negative result other than EAGAIN/EWOULDBLOCK as a socket error and disconnected the receiver. This is a potential runtime bug with deterministic source evidence: EINTR means the probe was interrupted, not that the child connection failed. The only caller is the stream thread's periodic receiver check path. Treat EINTR like the existing non-fatal no-data cases and leave EOF, ECONNRESET, timeout, and real socket-error handling unchanged. Functional impact: no intended change except replacing unsafe/error behavior with retry behavior for interrupted probes. Receiver scheduling, teardown, circular-buffer checks, and removal decisions are otherwise unchanged. Performance impact: negligible; this adds one integer comparison in a periodic per-receiver probe and adds no syscalls, allocations, locks, or loop iterations. Validation: git diff --check passed; the worker configured with low priority and directly compiled stream-receiver.c from compile_commands.json. The Ninja object target was blocked by an unrelated missing ACLK proto source.
stream_sender_execute_commands() appended deferred FUNCTION_PAYLOAD/JSON lines directly into the payload buffer while waiting for an end keyword. A sender could keep the deferred state open across reads, grow the buffer without enforcing the existing deferred-size limit, and rely on strcmp() over reconstructed lines for delimiter handling. This is a potential robustness and memory-growth bug in the current streaming sender parser. Valid deferred payloads are bounded by protocol expectations, but malformed or oversized input could keep expanding the payload until allocation failure or connection teardown elsewhere. Add bounded helper functions for exact end-keyword comparison, split end-keyword prefix preservation, and deferred payload append with PLUGINSD_MAX_DEFERRED_SIZE enforcement. Return parser failure to the only caller so the existing sender socket-error disconnect/reconnect path handles malformed oversized deferred payloads. Functional impact: valid FUNCTION_PAYLOAD and JSON deferred messages remain unchanged, including split terminator lines that are now preserved until the next read completes them. Oversized malformed payloads now fail deterministically instead of continuing unbounded growth. Performance impact: negligible; only the deferred parser path adds short strlen/memcmp checks around delimiter candidates and a length check before payload append. Non-deferred command parsing is unchanged. Validation: git diff --check passed; the worker configured with low priority and directly compiled the affected C files. The full netdata target was blocked before compilation by an unrelated missing ACLK proto source.
string_statistics() aggregated per-partition counters by reading string_base entries directly, while insert, delete, and destroy paths update or reset the same counters under the partition write lock. That leaves the pulse statistics path with unsynchronized reads of non-atomic size_t and long int fields. This is a potential bug: the pulse and unittest callers exercise the statistics reader today, and concurrent string insert/delete writers can run at the same time, but no specific runtime corruption was reproduced for this change. Take the matching per-partition read lock long enough to copy the scalar counters, then release it before updating caller-provided output pointers. The NETDATA_INTERNAL_CHECKS counters are loaded with the same relaxed atomic discipline used by their writers. Functional impact: none intended. The string_statistics() signature, output meanings, pulse chart dimensions, and unittest deltas stay unchanged; the snapshot now follows the existing partition locking contract. Performance impact: one short read lock per string partition on the pulse statistics path. The critical section copies scalar fields only, so insert/delete writers may wait briefly on the same partition during a statistics snapshot, while unrelated partitions are unaffected. Validation: git diff --cached --check passed; worker validation compiled string.c, built the libnetdata target, and confirmed the full netdata target is blocked by missing checkout content under src/aclk/aclk-schemas/proto/aclk/v1/lib.proto before this change is reached. Local reviewers verified all callers and approved the source-level contract.
The stream compression signature stores payload length in two 7-bit fields, so it can encode at most 14 bits. The sender already rejects compressed output at COMPRESSION_MAX_CHUNK, but that capacity relationship was implicit and a future constant increase or direct helper misuse could silently truncate length bits in the wire signature. This is a potential bug with current constants: normal sender output is bounded today because COMPRESSION_MAX_CHUNK is exactly one greater than the maximum encodable payload, but the encoder itself did not enforce that contract. Make the 14-bit signature capacity explicit, reject impossible encoder input before truncating it, and add a compile-time guard that keeps COMPRESSION_MAX_CHUNK within the signature encoding range. The decode math uses the same bit layout through named masks, and the existing stream compression unittest now checks the boundary values. Functional impact: none for valid current streaming flows. Encoded bytes, decoded lengths, sender retry behavior, receiver signature handling, and stream wire format remain unchanged for all lengths currently accepted by stream_compress(). Performance impact: negligible. The production path adds one scalar comparison after compression and one unlikely invariant branch in the inline encoder; there are no new allocations, copies, locks, or per-byte work. Validation: git diff --cached --check passed after cherry-pick onto the current branch. Worker validation passed low-priority CMake configure, direct object compiles for the affected streaming units, and a standalone signature boundary harness; the full Ninja object build was blocked by unrelated missing ACLK proto checkout content. Local source-review gate verified all callers and the unchanged wire contract.
string_2way_merge() lazily initialized the file-local [x] sentinel on first use by reading and writing string_2way_merge_X without synchronization. Two concurrent first callers could race on that global pointer and acquire extra interned-string references while cleanup releases only the single owned sentinel reference. This is a potential bug in current runtime paths: context merge callers can exercise string_2way_merge() after startup, but no deterministic reproduction was produced. The source race and possible leaked reference are clear from the unsynchronized lazy initialization. Initialize the sentinel once from string_init(), after all string partition locks are initialized, and remove the runtime lazy assignment from string_2way_merge(). string_destroy() continues to release the single owned sentinel reference and clear the pointer during sanitizer cleanup. Functional impact: none for valid initialized use. string_2way_merge() keeps the same signature, returned ownership, NULL handling, sentinel content, and merge output for all current callers. Performance impact: neutral to slightly positive. Runtime merge calls lose one unlikely branch and first-call allocation path; startup gains one small interned string allocation. Validation: git diff --cached --check passed after cherry-pick onto the current branch. Worker validation built libnetdata with low priority and ran a temporary string_init/string_unittest/string_destroy harness with 0 errors; full netdata build was blocked by unrelated missing ACLK proto checkout content. Local source-review gate verified startup, caller, and shutdown ordering.
http_access2id_one() lazily populated a hash field inside the process-global http_accesses[] table while parsing access names. That write had no lock, atomic operation, or startup-only guarantee, so concurrent first lookups could race on the non-atomic table slot. This is a potential bug: source review proves the helper is reachable from bearer/auth access parsing paths, but no production trace proving concurrent first lookup was part of this finding's evidence. The written value was deterministic, so the risk is the C data race rather than an intended permission mapping change. Remove the mutable cache field, make the table static const, and compute each candidate name hash on the stack during lookup. The strcmp confirmation, table ordering, permission names, and HTTP_ACCESS values stay unchanged. Functional impact: none intended. http_access2id_one(), http_access2id(), bearer JSON parsing, JSON array output, text output, unknown-token behavior, and bearer-token formats preserve the same caller-visible results. Performance impact: bounded extra CPU per parsed access token: at most the fixed access-name table is hashed with short strings. This avoids locks and atomics and is on auth/access parsing paths, not per-sample metric collection or storage hot paths. Validation: git diff --cached --check passed. Worker validation passed a low-priority minimal CMake configure, low-priority libnetdata build, and ctest discovery in the worker build tree. Local source-review gate verified all callers and same-file table consumers.
get_release_channel() lazily cached the release channel in a function-static int with plain reads and writes. Concurrent /api/v1/charts requests, analytics collection, or ACLK node-info generation could overlap on first use and race on that cache. This is a potential bug: source review proves concurrent web reachability through the static-threaded web server and /api/v1/charts, but no runtime crash or corruption reproducer was part of the finding evidence. Protect first initialization with a function-local spinlock and publish the cache with release/acquire atomics. The slow path still reads the same .environment file and falls back to the same NETDATA_VERSION heuristic; the fast path reads the initialized value without taking the lock. Functional impact: none intended. The function still returns the same stable/nightly string literals, the JSON release_channel field, analytics value, and ACLK node-info value keep the same semantics, and callers still copy or immediately serialize the returned string. Performance impact: first initialization serializes the existing one-time file parse. Steady state changes from a plain cached load to one atomic acquire load, with no lock, allocation, or file I/O after initialization. Validation: git diff --cached --check passed. Worker validation passed low-priority CMake configure and a targeted charts2json object compile. Local source-review gate verified all callers, callee lifetime/ownership, and the local spinlock/atomic pattern.
prometheus_server_last_access() cached a borrowed RRDHOST pointer in a process-global list that is only cleaned during exporter shutdown. Hosts can be unlinked and freed independently, leaving a dangling host address in the Prometheus last-access cache. This is a current bug as a stale stored address: the cached pointer is not dereferenced in this helper, but future lookups can depend on allocator address reuse and match the wrong host key after deletion/recreation. Store a copy of the host machine_guid in the cache entry and compare server plus machine_guid instead of server plus heap address. The cache still owns only the copied server string and the entry struct; cleanup remains unchanged because the GUID is an inline buffer. Functional impact: normal live-host Prometheus scrape behavior and public output are unchanged. The intentional correction is that host identity is keyed by the stable machine GUID rather than by a transient heap address after host lifetime ends. Performance impact: lookup keeps the same linear list scan and existing mutex. It replaces a pointer comparison with a fixed-size GUID string comparison after hash and server-name checks, and increases each cache entry by one GUID buffer; no new allocations or locks are added. Validation: git diff --cached --check passed. Worker validation passed CMake configure with unrelated unavailable components disabled and successfully compiled src/exporting/prometheus/prometheus.c through the generated compile command; full Ninja object target was blocked by unrelated missing ACLK proto checkout content. Local source-review gate verified caller chain, host lifetime, machine_guid stability, cleanup, and all-hosts behavior.
web_client_switch_host(), web_client_api_request_with_node_selection(), and web_client_process_url() lazily initialized function-static route hashes with plain writes on first request. These request paths are reachable from multiple web workers, ACLK proxy calls, and WebRTC API calls, so overlapping first requests could race on the shared hash variables. This is a current source bug: the routes are exercised by normal web/API requests today, and the first-use writes were not atomic or locked. No runtime crash report is required because the source race is deterministic. Add a private URL hash table initialized once behind a function-local spinlock and published with release/acquire atomics, then read the cached hashes through that table. Remove the dead localhost hash computation because the value was never read by the routing logic. Functional impact: none intended. Route strings, hash values, comparison order, callbacks, ACL checks, response codes, and response bodies are unchanged; only the unsafe first-use publication mechanism is replaced. Performance impact: first dispatcher use computes the hashes once under a spinlock. Steady-state requests pay one acquire load of the initialized flag and reuse cached uint32 hashes, matching the old intended compute-once fast path without the race. Validation: git diff --cached --check passed. Isolated validation passed same-failure rg scan, low-priority CMake configure, direct compile of web_client.c, and direct compile with NETDATA_INTERNAL_CHECKS; full Ninja object target was blocked by unrelated missing ACLK proto checkout content. Local source-review gate verified all routing callers and reviewer approvals.
RRDINSTANCE_ACQUIRED pins the context instance dictionary item, but the old rrdinstance_acquired_rrdset() helper returned the instance's raw RRDSET pointer without acquiring the chart dictionary item. Alert formatter and query-target paths then locked and traversed st->alerts through that raw pointer. This is a potential lifetime bug: the affected endpoints are exercised today, but the unsafe path requires a concurrent chart removal or relink while a query still holds only the acquired instance. In that race, the raw chart pointer can refer to a chart that is no longer linked to the acquired instance or is being torn down. Replace the raw-pointer helper with rrdinstance_acquired_rrdset_acquire(), which acquires the chart by host and id, verifies the acquired chart still points back to the exact acquired instance, and returns an RRDSET_ACQUIRED handle. Convert all existing alert scan callers to release that handle after the existing alert spinlock traversal. Functional impact: live charts that remain linked to the acquired instance produce the same alert counters and JSON. Removed or relinked charts are skipped instead of being read through an unsafe stale pointer, preserving the old NULL-skip intent without the race. Performance impact: each affected alert scan adds one chart dictionary acquire/release and pointer validation before the existing alert spinlock traversal and JSON or counter work. The added cost is O(1) per scanned instance and only affects alert-related query formatting paths. Validation: git diff --cached --check passed. Source scan confirmed no callers of the old helper remain and the new acquire helper has only the expected definition, declaration, and four call sites. Isolated worker validation compiled the changed objects with low priority; full netdata target was blocked by unrelated missing ACLK proto checkout content. Local reviewer gate approved the lifetime contract and caller impact; a cosmetic EOF whitespace note was checked and removed from the final staged diff.
ks_2samp() compared baseline indexes with highlight indexes scaled by base_shifts using int delta = base_idx - (high_idx << base_shifts). If base_shifts was too large, or the scaled highlight index exceeded the signed range, the left shift and subtraction could invoke undefined behavior before the later probability checks. This is a latent implementation bug in a helper used by current metric-correlation paths. Current API and MCP callers bound points and shifts so valid requests do not reach the invalid shift today, but the helper itself accepted out-of-contract internal inputs and had no local representability guard. Reject unrepresentable scaled-index inputs with NAN before sorting, build the power-of-two multiplier only after bounding base_shifts, and widen the intermediate delta/min/max comparisons to int64_t. The guard uses high_size because every high_idx returned by binary_search_bigger_than() is bounded by high_size. Functional impact: none for valid current callers. Existing API, MCP, JSON, configuration, and query contracts are unchanged; impossible or future out-of-contract inputs now return NAN, which the existing correlation result path already treats as a non-result. Performance impact: negligible. The hot path adds one unlikely guard and a bounded multiplier loop per ks_2samp() call, then uses int64_t comparisons in loops already dominated by qsort(), binary searches, and probability calculation. Validation: git diff --cached --check passed. Isolated validation configured a minimal low-priority build, linked netdata, and passed the existing -W mctest path including the new invalid-shift NAN case; broader build setup was blocked by unrelated local dependency/submodule gaps before the minimal build was prepared. Reviewer gate verified caller bounds, shift representability, NAN handling, and no caller-visible behavior change for valid requests.
api_v1_allmetrics.c and api_v1_alarms.c passed plain char values to isalnum() and toupper(). On signed-char platforms, high-bit request or metadata bytes promote to negative int values that are outside the C/POSIX ctype input domain unless they equal EOF. This is a potential current bug: the endpoints are reachable today, and the source violates the ctype contract for non-ASCII bytes, but the worker input did not include a reproduced crash or corruption artifact. Cast the byte arguments to uint8_t before calling isalnum() or toupper(), and keep the existing char assignment behavior after toupper(). This matches existing Netdata ctype-call patterns in nearby web and URL code. Functional impact: ASCII behavior, endpoint dispatch, URL tokenization, response shape, buffer ownership, and locking are unchanged. High-bit bytes move from undefined behavior to defined locale-aware ctype behavior. Performance impact: negligible. The change adds only integer casts on bytes already being processed. Validation: git diff --cached --check passed. Isolated validation passed targeted ctype scans, low-priority CMake configure, and direct compiles of the changed translation units; broad Ninja object build was blocked by unrelated missing ACLK proto checkout content. Reviewer gate verified the ctype contract, all call paths, and absence of side effects.
url_is_request_complete_and_extract_payload() parsed POST and PUT Content-Length with str2ul(), which silently stops at the first non-digit, accepts missing or sign-prefixed digits as zero, and ignores integer overflow. A malformed request could therefore be treated as complete when the buffered payload length happened to match the truncated or wrapped value. This is a current parser validation bug in the web request path. Memory corruption was not proven; the concrete fault is unsafe request framing acceptance for malformed Content-Length values on paths exercised by current web and ACLK request validation flows. Parse the field value with strtoull(), require at least one digit, reject errno overflow, reject values above SIZE_MAX, skip only HTTP optional whitespace around the value, and require CRLF immediately after the value. Keep the existing false return contract for incomplete or invalid requests. Functional impact: valid decimal Content-Length values, zero-length bodies, and SP/HTAB optional whitespace remain accepted. Missing digits, sign prefixes, trailing garbage, overflow, and non-CRLF-terminated values are rejected through the existing incomplete/invalid validation path. Performance impact: negligible. The POST/PUT completeness check adds one strtoull(), errno check, SIZE_MAX comparison, and short whitespace scans before the existing payload-length comparison. Validation: git diff --cached --check passed. Isolated worker validation compiled url.c and main.c with low priority and exercised the parser cases through a temporary url_unittest runner; full netdata target was blocked by unrelated missing ACLK proto checkout content. The final commit intentionally excludes the worker's test wiring to keep the fix surgical and avoid changing the existing -W unittest surface. Reviewer gate verified caller contracts, valid-flow preservation, malformed-flow rejection, bounds safety, and portability.
4f3ac7d to
7a94efe
Compare
|
@cubic-dev-ai please review again |
@ktsaou I have started the AI code review. It will take a few minutes to complete. |
There was a problem hiding this comment.
0 issues found across 1 file (changes from recent commits).
You've manually re-run cubic several times on this PR. Each manual re-review checks the full PR again and counts toward your usage quota. To preserve your usage limits, we recommend letting cubic automatically review new commits.
Re-trigger cubic
|
| if(unlikely(!end_keyword)) | ||
| return false; | ||
|
|
||
| size_t keyword_len = strlen(end_keyword); |
| if(unlikely(!end_keyword)) | ||
| return false; | ||
|
|
||
| size_t keyword_len = strlen(end_keyword); |
There was a problem hiding this comment.
Pull request overview
This draft PR hardens memory-safety and concurrency edges across the Agent’s web stack (WebSocket + HTTP), MCP JSON processing, and several API handlers, with the stated goal of avoiding undefined behavior/invalid access without functional changes.
Changes:
- Tighten type/shape validation for MCP table JSON inputs and improve WebSocket MCP context teardown to avoid stale pointers across lifecycle callbacks.
- Harden web/API surfaces: cap
/api/v3/settingsPUT payload size, add safer time parsing, and replace directRRDSETtimestamp writes with helper functions/locks. - Improve low-level safety in multiple places (e.g.,
ctype.husage with unsigned casts, safer WebSocket frame length encoding).
Reviewed changes
Copilot reviewed 279 out of 279 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| src/web/websocket/websocket.c | Adjusts client registration accounting and fixes debug formatting type for active client count. |
| src/web/websocket/websocket-send.c | Avoids undefined shifts by encoding 64-bit payload length explicitly for large WebSocket frames. |
| src/web/websocket/websocket-internal.h | Adds per-thread broadcast tracking fields for command/broadcast state. |
| src/web/websocket/websocket-handshake.c | Initializes new per-thread broadcast state fields during thread init. |
| src/web/server/web_client.h | Adds gzip/chunked response bookkeeping fields to support correct framing under backpressure. |
| src/web/server/static/static-threaded.c | Ensures polling re-arms read/write based on SSL “want read/write” state in callbacks. |
| src/web/mcp/mcp-tools-execute-function.c | Adds JSON type checks and fails table processing cleanly on malformed row/column structures. |
| src/web/mcp/mcp-params.c | Updates RFC3339 parsing to use a boolean success return and avoids relying on uninitialized/ambiguous results. |
| src/web/mcp/adapters/mcp-websocket.c | Centralizes MCP WebSocket context teardown and detaches context before freeing to avoid stale pointers. |
| src/web/api/v3/api_v3_settings.c | Enforces a maximum PUT payload size for settings updates. |
| src/web/api/v2/api_v2_claim.c | Serializes random session-id generation/access with a spinlock to avoid data races. |
| src/web/api/v1/api_v1_info.c | Uses RRDSET last-access helper and adds safer RRDCALC → RRDSET access with locking/unlocking. |
| src/web/api/v1/api_v1_charts.c | Replaces direct last-access timestamp write with rrdset_touch_last_accessed_time_s(). |
| src/web/api/v1/api_v1_badge/web_buffer_svg.c | Consolidates/strengthens “undefined” time-value detection and uses last-access helper. |
| src/web/api/v1/api_v1_allmetrics.c | Fixes ctype usage by casting to unsigned and adds locking to stabilize rrdset access during alert export. |
| src/web/api/v1/api_v1_alarms.c | Fixes toupper() usage by casting to unsigned and uses last-access helper. |
| src/web/api/queries/trimmed_mean/trimmed_mean.h | Changes entries type to size_t to avoid signed overflow/underflow hazards. |
| src/libnetdata/socket/socket.c | Adjusts fallback strncpyz() sizes (but needs guarding against zero-sized buffers to avoid underflow). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.





Summary
This draft PR collects surgical memory-safety fixes in netdata and closely related C code. Each fix is intended to be one verified issue per commit, with no intended functional change beyond avoiding undefined behavior or invalid memory access.
The branch will continue to receive small commits as additional scout findings are independently verified, externally reviewed, validated locally, and accepted as deterministic zero-functional-impact fixes.
Validation
Validation is recorded per commit/finding. Current local validation includes focused builds for affected targets and same-pattern scans where applicable.
Review Scope
This is intentionally draft while the remaining scout findings are triaged. Findings that require design, behavior, or risk-acceptance decisions will not be silently changed; they will be collected for discussion.
Summary by cubic
Hardens memory safety and concurrency across the Agent,
libnetdata, storage, exporters, streaming, ACLK/MQTT/WebSocket, Windows, and tools with no intended behavior changes. Adds tighter parser bounds, production guards, and better backpressure handling./api/v3/settingsPUT limited to 20 MiB; bound request-line protocol scan; validate daemon exit-status frames before applying.onewayallocsize arithmetic guarded; buffered-reader retries EINTR; boundedstring_strndupz(); object-state deactivation underflow guard; clocks: stop absolute-sleep on hard errors, reject zero resolutions, lock heartbeat stats; Windowstinysleep()actually sleeps.Written for commit 7a94efe. Summary will update on new commits.