Tags: keploy/keploy
ci(setup-private-parsers): hardcode GitHub host keys, drop ssh-keyscan (#4116)

The v3.5.0 release.yml run failed on the Windows build leg at `Add Private Parsers` → `ssh-keyscan github.com` with:

    choose_kex: unsupported KEX method sntrup761x25519-sha512@openssh.com

Windows-runner OpenSSH can't complete GitHub's KEX negotiation since GitHub started offering sntrup761 in its SSH handshake. The client aborts, known_hosts stays empty, and every subsequent git-over-SSH operation in the job fails. Darwin/Linux both happen to carry newer OpenSSH, so they didn't trip this.

Replace `ssh-keyscan` with GitHub's published SSH fingerprints (Ed25519 / ECDSA / RSA; see GitHub docs). This:

- fixes the Windows release leg (no KEX negotiation involved),
- hardens the supply chain — we never trust whatever ssh-keyscan hands back from the wire,
- matches what every hardened CI does (hashicorp, cloudflare, pulumi all bake the fingerprints in).

Rotate these literals if GitHub rotates a key. Last RSA rotation was 2023-03-24; Ed25519/ECDSA are long-lived.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
ci(workflows): drop post-#4101 INTEGRATION_REF SHA pin, default to main (#4114)

keploy/integrations#130 merged at df3b862 — the v3 parser implementation now lives on integrations/main, so the three workflows (golangci-lint, release, prepare_and_run) no longer need the temporary pinned SHA that carried the cross-repo change while #4101 and #130 were in flight.

Flips the fallback from the pinned SHA to `main`. The `vars.INTEGRATION_REF` override path is preserved for future cross-repo feat branches; set the repo var to pin to a specific SHA when needed.

Docstring cleanup: drops the "TEMPORARY" / "Revert to main" phrasing now that the code reflects the intended steady state.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
fix(replay): write mappings for streaming test cases (#4099)

* fix(replay): stop draining consumed mocks twice in streaming path

The Phase-2 streaming loop called GetConsumedMocks a second time under an "Update consumed mocks map" block after already calling it for the mock-mismatch check. MockManager.GetConsumedMocks drains the consumed list on read, so the second call returned an empty slice and overwrote consumedMocks. upsertActualTestMockMapping then early-returned on the empty slice, leaving streaming test cases missing from mappings.yaml even when they passed and consumed mocks.

The first call already populates totalConsumedMocks and mockNames, and SimulateRequest is synchronous for streaming, so a single drain is sufficient.

Closes #4098

* fix(replay): union pre-stream and in-stream consumed mocks

Follow-up to the previous commit. Removing the second GetConsumedMocks drain entirely is not sufficient: SimulateHTTPStreaming returns as soon as the response headers arrive (pkg/util.go:602 — it returns a StreamingHTTPResponse wrapping httpResp.Body without draining it), so the stream body is consumed later by CompareHTTPStream above. Mocks consumed during that body read (e.g., one backend call per SSE frame) land in MockManager's consumedList *after* the pre-stream drain ran.

Restore the second drain, but append its result to consumedMocks (and mockNames) instead of overwriting. upsertActualTestMockMapping now receives the union of both drains, so streaming test cases get complete mock_entries regardless of whether their backend calls happen before, during, or across the stream body transmission.

Verified on a local SSE repro with three streaming test cases whose backend is called once per frame:

- before this commit: mappings.yaml has only the pre-stream mock per test (in-stream mocks silently dropped)
- after this commit: mappings.yaml has all mocks, attributed to the correct owning test case

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(replay): recompute streaming mismatch after union drain; correct SimulateRequest comment

Addresses Copilot review on #4099:

- mockSetMismatch / emitFailureLogs were previously computed from only the pre-stream drain. When extra mocks are consumed while CompareHTTPStream reads the body, the subset check missed them and the mismatch decision (and its downstream failure/ignore logging) could be wrong. Re-evaluate mismatch with the unioned consumedMocks right before CompareHTTPResp. Preserve a streaming-body-mismatch forced emitFailureLogs=false (hadStreamingMismatch guard).
- The comment above SimulateRequest claimed the call "blocks until stream is done", which is inaccurate — SimulateHTTPStreaming returns a reader once headers arrive, and the body is drained later by CompareHTTPStream. Rewrite the comment so it no longer contradicts the post-stream drain below.
- Touch up the pre-stream drain comment to make its role explicit: a snapshot used only for the body-comparison log gate; the mismatch is finalized after the union drain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
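The drain-on-read contract described above is the crux of both sub-commits: the second call to GetConsumedMocks must append, never overwrite. A minimal sketch of that contract and the union pattern (type and method names are illustrative stand-ins, not keploy's actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// MockManager mimics the drain-on-read behavior described above: reading
// the consumed list also clears it, so a second read without intervening
// consumption returns nothing.
type MockManager struct {
	mu       sync.Mutex
	consumed []string
}

func (m *MockManager) Consume(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.consumed = append(m.consumed, name)
}

// GetConsumedMocks drains: it returns the list and resets it in one step.
func (m *MockManager) GetConsumedMocks() []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := m.consumed
	m.consumed = nil
	return out
}

func main() {
	mgr := &MockManager{}
	mgr.Consume("mock-1") // consumed before the stream body is read

	// Pre-stream drain: snapshot of everything consumed so far.
	consumedMocks := mgr.GetConsumedMocks()

	// Reading the stream body triggers more backend calls (one per frame).
	mgr.Consume("mock-2")
	mgr.Consume("mock-3")

	// Post-stream drain: APPEND (union), never overwrite. Overwriting here
	// was the bug that left streaming tests out of mappings.yaml.
	consumedMocks = append(consumedMocks, mgr.GetConsumedMocks()...)
	fmt.Println(consumedMocks) // [mock-1 mock-2 mock-3]
}
```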
feat(test): enable strictMockWindow by default (#4096)

Flip config.Test.StrictMockWindow to default true now that every stateful-protocol recorder classifies mocks with fine-grained Lifetime (session / connection / per-test). Legitimate cross-test sharing is encoded as Session or Connection lifetime, so the implicit out-of-window promotion that lax mode preserved is no longer needed for correctness.

Escape hatches are unchanged: set test.strictMockWindow: false in keploy.yaml, or export KEPLOY_STRICT_MOCK_WINDOW=0 — the env var still wins over config for users who hit trouble with older recordings and need to opt out without editing config.

Signed-off-by: Shubham Jain <shubhamkjain@outlook.com>
fix(models): persist HTTPReq/HTTPResp Timestamp as RFC3339Nano in BSON (#4094)

MongoDB's BSON DateTime is an int64 count of milliseconds, so the default `time.Time` BSON encoder silently drops nanoseconds. For cloud-replay this is load-bearing: HTTPReq.Timestamp and HTTPResp.Timestamp become the edges of the mock-matching time window, and the closing edge is frequently in the same millisecond as the last MongoDB mock's reqTimestampMock. After a BSON round-trip (the api-server stores test cases in Mongo while mocks round-trip through blob storage and keep full precision), the mock's ns timestamp lands a few microseconds past the ms-floored resp.ts and is pushed out of the filter pool, starving the `find` matcher.

Adds MarshalBSON / UnmarshalBSON on HTTPReq and HTTPResp that serialise Timestamp as an RFC3339Nano string, matching the strategy MockEntry in mappings.go already uses for reqTimestampMock / resTimestampMock. Field names on the shadow structs follow the v1/v2 driver default (`strings.ToLower(fieldName)`) so the BSON wire shape is unchanged for every field except `timestamp`, which flips from BSON DateTime to BSON String. UnmarshalBSON transparently accepts either shape, so existing records written before this change continue to decode — no migration required.

Adds ParseMockTimestamp as the inverse of FormatMockTimestamp (accepts both RFC3339 and RFC3339Nano for backward compatibility with older fixtures) and a decodeBSONTimestamp helper on the HTTP side that dispatches on BSON type (String / DateTime / Null).

Tests cover the three relevant cases: nanosecond round-trip preservation, legacy BSON DateTime decode, and zero-timestamp round-trip.

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>
feat: log pruned mock details during UpdateMocks (#4091)

* feat: log pruned mock details during UpdateMocks

Collect name, kind, and metadata for each mock dropped by UpdateMocks into a single slice and emit it on the existing "pruned mocks successfully" debug line. Applied to both the YAML and gob prune paths so they stay consistent. Keeps one log per prune call instead of one per mock to avoid log pollution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review on pruned-mock logging

- Gate per-mock collection on ys.Logger.Core().Enabled(zap.DebugLevel) so large test sets don't pay allocation/reflection cost when debug logs will be dropped. Falls back to a plain prunedCount integer.
- Cap prunedMocks at 100 entries (maxPrunedMocksLogged) and emit prunedMocksTruncated=true when the true pruned count exceeds the logged slice, so --debug runs don't generate multi-MB log lines on ~10^5-mock test sets.
- Applied symmetrically to both the YAML and gob prune paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
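The two review fixes (level-gated collection, capped slice) form a general pattern. A sketch using stdlib log/slog in place of zap, with hypothetical helper names; the 100-entry cap mirrors the commit's maxPrunedMocksLogged:

```go
package main

import (
	"context"
	"log/slog"
	"os"
)

const maxPrunedMocksLogged = 100 // cap mirrors the commit's constant

// capPruned bounds the per-mock detail slice and reports whether
// entries were cut off (the prunedMocksTruncated flag).
func capPruned(pruned []string) (logged []string, truncated bool) {
	if len(pruned) > maxPrunedMocksLogged {
		return pruned[:maxPrunedMocksLogged], true
	}
	return pruned, false
}

// logPruned sketches the gate: only pay for per-mock detail when debug
// logging is actually enabled; otherwise emit just the count.
func logPruned(logger *slog.Logger, pruned []string) {
	if !logger.Enabled(context.Background(), slog.LevelDebug) {
		logger.Info("pruned mocks successfully", "prunedCount", len(pruned))
		return
	}
	logged, truncated := capPruned(pruned)
	logger.Debug("pruned mocks successfully",
		"prunedCount", len(pruned),
		"prunedMocks", logged,
		"prunedMocksTruncated", truncated)
}

func main() {
	logger := slog.New(slog.NewTextHandler(os.Stderr,
		&slog.HandlerOptions{Level: slog.LevelDebug}))
	logPruned(logger, []string{"mock-1", "mock-2"})
}
```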
fix: correct mappings.yaml backfill, silence self-heal spam, restore agent mode wiring (#4084)

* fix(replay): scope mappings.yaml backfill to each test's record-time window

upsertActualTestMockMapping unconditionally appended every mock returned by GetConsumedMocks() to the current test case, so mapping attribution tracked replay-time consumption order rather than the recording. A mock captured during app startup but consumed during a later test's replay (e.g. Redis HELLO after a client reconnect) leaked into the wrong test's mock_entries in the backfilled mappings.yaml, producing non-deterministic mappings across subsequent replays.

Filter by the mock's recorded ReqTimestampMock/ResTimestampMock against the test case's HTTPReq/HTTPResp (or gRPC equivalent) timestamps, with a Created-based fallback for legacy fixtures. DNS mocks are exempt because they have no stable record-time window. Legacy behavior is preserved when either timestamp is zero (e.g. very old recordings).

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy): demote self-heal per-kind tree log to Debug

UpdateUnFilteredMock's post-update self-heal branch fires on every legitimate filtered→unfiltered promotion — the Postgres v2 matcher is the load-bearing caller, and it hits this path dozens of times per normal replay. Logging it at WARN creates noise that customers have reported as spurious failures; the subsequent insert recovers the per-kind tree state correctly.

Demote the log to Debug and reword the message to reflect what's actually happening (seeding per-kind from global on first update) so operators who do enable debug logging get a clearer signal. No semantic change to the tree mutation — the self-heal insert still runs unconditionally when the global tree updated and the per-kind tree missed.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(cli): tolerate missing --json flag in ValidateFlags so agent mode wiring still runs

ValidateFlags called cmd.Flags().GetBool("json") unconditionally and returned an error when the flag was not registered on the subcommand. The enterprise build only registers --json on UI commands, so for record/test/agent the call returned early — and the subsequent case "agent": branch that wires c.cfg.Agent.Mode from --mode never ran. The enterprise agent subprocess therefore started with mode="", bound its HTTP server on port 0, and the parent eventually timed out with "keploy-agent did not become ready in time".

Use Lookup first and fall back to false when the flag isn't present on the current subcommand, preserving the original behavior when it is registered.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* chore: re-trigger CI after stuck Windows gin-mongo runner

The previous job 72176310203 ran 30 minutes on the "Run gin-mongo application" step and was cancelled by the runner's timeout. All three committed changes on this branch are either replay-only (pkg/service/replay/*), log-level only (UpdateUnFilteredMock's self-heal Debug demotion), or a no-op on OSS builds (the ValidateFlags json flag path — OSS registers --json as a PersistentFlag on the root keploy command, so Lookup always succeeds). Record-mode behaviour is unchanged by this PR, so the Windows record-iteration-2 hang looks like infrastructure flake rather than a code defect.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
fix: generic mock flush when response complete (#4082)

* fix: generic mock flush when response complete

Signed-off-by: kapish <upadhyaykapish@gmail.com>

* chore: improved comment

Signed-off-by: kapish <upadhyaykapish@gmail.com>

---------

Signed-off-by: kapish <upadhyaykapish@gmail.com>
fix(proxy,tools): keep ERROR level, attach next_step guidance (#4078)

* fix(proxy,tools): keep ERROR level, attach next_step guidance

An earlier iteration of this PR downgraded three classes of log sites from ERROR to WARN because CI pipelines were filtering them with `grep -v` allowlists. That was the wrong direction — lowering the severity to match what the pipeline already tolerates hides the signal from operators reading production logs, not just from CI. ERROR should keep meaning "something real went wrong"; the job of the pipeline is to avoid producing ERRORs in the first place (by waiting for dependencies to be ready, using --delay correctly, etc.), not to filter them after the fact.

This change keeps the following sites at ERROR and instead attaches a structured `next_step` field so the operator — in CI or in prod — knows the exact follow-up when they see the log:

- pkg/agent/proxy/proxy.go × 4 dial sites: "failed to dial the conn to destination server" → next_step explains that the proxy only attempts one dial and does not retry, and points at the three real fixes (start the dependency first, raise --delay, or use a readiness probe).
- pkg/agent/proxy/util/util.go × 1 dial site: same guidance for the TLS-wrapped dial path.
- pkg/service/tools/tools.go: "failed to write config file" now carries the target path and a permission/ownership next_step — the common cause is a leftover read-only keploy.yml from a prior `sudo keploy` invocation that the current non-root user can no longer overwrite.

No allowlist changes here; CI-script cleanup lives in the enterprise PR.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(cli): don't create keploy.yml from the agent subcommand

The agent process is a worker spawned by the parent keploy (record/test/etc.) and talks to it over CLI flags + the agent HTTP API. It never reads keploy.yml at runtime and has no need to persist one to disk.

When the agent runs inside a docker sidecar (the normal case for `keploy record -c "docker compose up"`), the host's absolute --config-path is forwarded through argv but does not resolve to a real directory inside the container's filesystem. That made the "create config file if missing" branch in Validate reach tools.CreateConfig → os.WriteFile → ENOENT, which logged ERROR ("failed to write config file") — a noise signal the kafka-ecommerce CI had been masking with a `grep -v` allowlist for exactly this reason. Seen concretely on keploy/enterprise#1867 pipeline 2541/10 after the allowlist was dropped.

Gate the config-create branch to skip when cmd.Name() == "agent". Same pattern as the existing `cmd.Name() == "agent"` / `!= "agent"` gates at lines 613/638/650 of this file — confirming that the agent subcommand is already special-cased elsewhere for similar reasons.

No change for record/test/serve/config/etc.; only the agent worker stops attempting a persistence it never needed.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy,tools): address Copilot review on #4078

Five threads:

1. tools.CreateConfig returned nil on os.WriteFile failure, so callers (the cli/config.go handler, CmdConfigurator.CreateConfigFile) logged "Config file generated successfully" right after an ERROR line for the same op. Return the actual error — both callers already check it — and the inconsistency goes away.
2. util.go's non-TLS dial path had a bare "failed to dial the destination server" with no server address and no next_step, while the TLS path immediately above it had both. Harmonised to the same message + fields.
3. The auxiliary proxy hook behavior change (propagating the error vs. swallowing it) is now called out in the PR description so release notes pick it up.
4. proxy.go's aux-hook block used to do both LogError and fmt.Errorf("%w") — the first line logged, the second propagated; the caller logs the wrapped error too, producing double-logging. Dropped the LogError at the hook site. The hook's returned error already carries the structured "Next steps: …" remediation (enterprise sockmap_proxy.go), and the caller surfaces it once.
5. Extracted the duplicated next_step string for dial failures into util.NextStepDialDestination. All six call sites (two in util.go, four in proxy.go) now reference the const, so future edits to the guidance don't have to update six copies.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: proxyCtxCancel on StartProxy failure + clarify agent config comment

Addresses the second Copilot review round on #4078:

- service/agent.go: when StartProxy returns an error we were leaking the proxy goroutines (TCP accept loop, TCP DNS, UDP DNS) that had already been spawned by the errgroup before the failure point. They stayed alive until the outer agent context was cancelled. This matters more now that StartProxy propagates auxiliary-hook failures instead of swallowing them — failure paths are no longer dead code. Call proxyCtxCancel() on the error path so the partial proxy is torn down immediately.
- proxy/proxy.go: restore a single structured LogError at the aux-hook error site. The hook's returned error already embeds its own "Next steps:" remediation (enterprise sockmap_proxy.go wraps the BPF verifier error with kernel/flag/rebuild guidance); the LogError here adds a generic next_step field pointing at that embedded chain, so anyone reading the log sees both the actionable remediation and the raw cause without chasing the error through every caller. Still one log per failure — the caller does not re-log the same error.
- cli/provider/cmd.go: reword the agent config-skip comment. The old phrasing said the agent "never reads keploy.yml from disk", which is inaccurate — PreProcessFlags still calls viper.ReadInConfig() for agent. The precise statement is that the agent doesn't need to CREATE the file if missing, because the parent has already resolved the effective config and handed it over via flags + env. Clarified in place.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
fix(syncMock): bounded-block sendToOutChan so burst capture doesn't silently drop mocks (#4076)

* fix(syncMock): bounded-block sendToOutChan so burst capture doesn't silently drop mocks

sendToOutChan used a non-blocking send with a default-drop:

    select {
    case m.outChan <- mock:
    default:
    }

Under burst capture on an oversubscribed runner (CI, a customer's box under GC pressure, a downstream agent momentarily backpressuring the outChan), this silently dropped pre-first-request mocks. Customers saw "some calls didn't replay" with no actionable log signal — the anti-pattern the Windows redirector shutdown loop hit.

Replace with:

1. Non-blocking fast path (unchanged — the common case where the consumer is keeping up has zero latency cost).
2. On fast-path miss, fall through to a bounded block (sendBudget = 200 ms). This absorbs a GC pause or a transient downstream stall without losing data.
3. On bounded-block timeout, increment a process-global dropCount and emit a rate-limited Warn (first drop immediately, then every 1024th). Operators see the drop instead of wondering why replay is missing mocks.

The RWMutex contract is preserved: the send still runs under outChanMu.RLock, so CloseOutChan (the write-lock holder) can't interleave a close between the not-closed check and the send. The only new cost is that CloseOutChan may now wait up to sendBudget for in-flight sends — imperceptible vs. a process shutdown, and every concurrent sender was already racing the same RLock.

Fixes a recurring Redis loss-rate / channel-race flake on keploy/enterprise's CI where the tests feed syncMock via SetOutputChannel and never call SetFirstRequestSignaled, so every recorded mock flows through this path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(syncMock): plumb logger, atomic.Uint64 drop counter, drop-path tests

Addresses the Copilot feedback on the bounded-send PR:

- Plumb *zap.Logger through SyncMockManager (SetLogger) and log the overflow via that logger instead of zap.L(). syncMock is a package-level singleton that loads before any zap.ReplaceGlobals, so zap.L() silently fell back to Nop and the drop Warn never reached operators. Proxy.New now wires the proxy logger in once.
- Promote the sampled drop log from Warn to Error with a shorter, actionable message so overflow is loud and grep-friendly.
- Switch dropCount from a bare uint64 + atomic.AddUint64 to sync/atomic.Uint64 so the 32-bit alignment requirement becomes a compile-time property instead of a field-ordering footgun.
- Export DropCount() for tests and external observability.
- Add regression tests covering: no-drop when the consumer drains within sendBudget, drop + Error-log + dropCount increment past budget, sampling cadence (n=1, 1024, 2048), nil-logger Nop fallback, and concurrent-increment atomicity under load.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): robust app-ready probe + per-request retry

The Windows go-http-no-deps record phase on #4076 run 24629542035/job/72014589521 hit a 0-tests-recorded failure because the readiness probe broke on a single 200 response, and that first 200 came from a transient Docker Desktop port-publish stub rather than the in-container Go binary. Four seconds later the real traffic got "Unable to connect to the remote server" on the first request, and the single try/catch wrapping all six calls swallowed the error with $sent=0. No test-set-0 directory ever formed, and the script exited with "Test directory not found".

Three-part fix, applied identically to the docker and native variants under golang/go-dedup:

1. The readiness probe now requires $stabilityCount (=3) consecutive 200 responses with a 1s gap before declaring ready. A single flap from a port-forwarder stub can no longer race real traffic.
2. After the probe succeeds, settle for $settleSec (=5) before firing traffic so keploy's recording-proxy interception layer has time to fully wire up against the container's listener.
3. Replace the all-in-one try/catch around the six HTTP calls with an Invoke-WithRetry helper (5 attempts, 1.5x exponential backoff). A transient connection-refused on call #1 no longer aborts calls #2..#6.

If the probe never reaches the stability threshold inside the deadline, the script still proceeds so downstream record validation can surface the failure with its own clearer error message, instead of us silently pretending the app was ready.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): literal port substitution + patched YAML dump

Root cause of the #4076 run 24629876059 failure: the `(?m)("?)8080:8080\1` backref regex intermittently produces a malformed port line on Windows PowerShell 5.1 — the leading quote is stripped while the trailing one is kept, yielding `- 54652:8080"`. docker-compose parses that as `containerPort: 8080"` and rejects it with `invalid containerPort: 8080"`. No container starts, every HTTP call gets ECONNREFUSED, and the record phase records zero tests.

My earlier readiness-probe + retry work on this PR correctly surfaced the downstream symptom (consecutive 200s never observed, all six traffic requests exhaust their retry budget), but it was papering over the actual bug: the YAML itself was broken before docker-compose ever ran. No amount of client-side resilience can recover from a container that was never started.

This change:

1. Replaces the backref regex with a plain `.Replace("8080:8080", "<port>:8080")` — the samples-go/go-dedup compose file uses the double-quoted short form exclusively, so literal substitution is both simpler and provably correct.
2. Validates that the expected port fragment is present after substitution, throwing loudly with the unpatched YAML printed if the sample ever introduces a form the script doesn't handle. Silent fall-through to an unpatched run was what let this slip undetected for so long.
3. Dumps the patched YAML to the job log so any future containerPort / hostPort diagnostic lands with the exact text docker-compose parsed — no more guessing whether the regex behaved.

The native-windows script is left untouched — it runs the Go binary directly, not dockerized, and does not hit this code path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>
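The three-step send replacing the default-drop can be sketched as a standalone function. Names (sendBounded, sendBudget, dropCount) mirror the commit's description; the RLock contract, logger plumbing, and rate-limited error log are elided to keep the sketch focused on the select structure:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

const sendBudget = 200 * time.Millisecond

// Process-global drop counter, as the commit describes; atomic.Uint64
// makes 64-bit alignment a compile-time property.
var dropCount atomic.Uint64

// sendBounded: (1) non-blocking fast path, (2) bounded block to absorb
// a GC pause or transient consumer stall, (3) counted drop on timeout.
// The real code also emits a rate-limited Error log on step 3 (first
// drop immediately, then every 1024th).
func sendBounded(out chan<- string, mock string) bool {
	select {
	case out <- mock: // 1. consumer keeping up: zero added latency
		return true
	default:
	}
	select {
	case out <- mock: // 2. bounded block: data survives a short stall
		return true
	case <-time.After(sendBudget):
		dropCount.Add(1) // 3. drop is counted, never silent
		return false
	}
}

func main() {
	ch := make(chan string, 1)
	fmt.Println(sendBounded(ch, "m1")) // true: buffered fast path
	fmt.Println(sendBounded(ch, "m2")) // false after ~200ms: full, no consumer
	fmt.Println(dropCount.Load())      // 1
}
```

Note the trade-off the commit calls out: a sender can now occupy the channel path for up to sendBudget, so anything coordinating with senders (here, the real CloseOutChan) may wait that long too.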
