Tags · keploy/keploy · GitHub

Tags: keploy/keploy


v3.5.1

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
ci(setup-private-parsers): hardcode GitHub host keys, drop ssh-keyscan (#4116)

The v3.5.0 release.yml run failed on the Windows build leg at
`Add Private Parsers` → `ssh-keyscan github.com` with:

  choose_kex: unsupported KEX method sntrup761x25519-sha512@openssh.com

Windows-runner OpenSSH can't complete GitHub's KEX negotiation since
GitHub started offering sntrup761 in its SSH handshake. The client
aborts, known_hosts stays empty, and every subsequent git-over-SSH
operation in the job fails. The Darwin and Linux runners happen to
carry a newer OpenSSH, so they didn't trip this.

Replace `ssh-keyscan` with GitHub's published SSH fingerprints
(Ed25519 / ECDSA / RSA; see GitHub docs). This:
  - fixes the Windows release leg (no KEX negotiation involved),
  - hardens the supply chain — we never trust whatever ssh-keyscan
    hands back from the wire,
  - matches what every hardened CI does (hashicorp, cloudflare,
    pulumi all bake the fingerprints in).

Rotate these literals if GitHub rotates a key. Last RSA rotation
was 2023-03-24; Ed25519/ECDSA are long-lived.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

v3.5.0

ci(workflows): drop post-#4101 INTEGRATION_REF SHA pin, default to main (#4114)

keploy/integrations#130 merged at df3b862 — the v3 parser
implementation now lives on integrations/main, so the three
workflows (golangci-lint, release, prepare_and_run) no longer need
the temporary pinned SHA that carried the cross-repo change while
#4101 and #130 were in flight.

Flips the fallback from the pinned SHA to `main`. The
`vars.INTEGRATION_REF` override path is preserved for future cross-
repo feat branches; set the repo var to pin to a specific SHA when
needed.
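The fallback shape in a workflow might look like this sketch (step layout and checkout details are illustrative assumptions, not taken from the actual workflow files; only `vars.INTEGRATION_REF` comes from the commit):

```yaml
# Hypothetical sketch of the ref-resolution pattern.
- name: Checkout integrations
  uses: actions/checkout@v4
  with:
    repository: keploy/integrations
    # The repo variable wins when set; an unset variable evaluates
    # to the empty string, so the expression falls through to main.
    ref: ${{ vars.INTEGRATION_REF || 'main' }}
```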

Docstring cleanup: drops the "TEMPORARY" / "Revert to main" phrasing
now that the code reflects the intended steady state.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

v3.4.10

fix(replay): write mappings for streaming test cases (#4099)

* fix(replay): stop draining consumed mocks twice in streaming path

The Phase-2 streaming loop called GetConsumedMocks a second time under
an "Update consumed mocks map" block after already calling it for the
mock-mismatch check. MockManager.GetConsumedMocks drains the consumed
list on read, so the second call returned an empty slice and overwrote
consumedMocks. upsertActualTestMockMapping then early-returned on the
empty slice, leaving streaming test cases missing from mappings.yaml
even when they passed and consumed mocks.

The first call already populates totalConsumedMocks and mockNames, and
SimulateRequest is synchronous for streaming, so a single drain is
sufficient.

Closes #4098
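The drain-on-read contract at the heart of the bug can be sketched as follows (hypothetical types — names follow the commit text, not the real keploy code):

```go
package main

import "sync"

// MockManagerSketch illustrates the drain-on-read contract that made
// a second GetConsumedMocks call return an empty slice.
type MockManagerSketch struct {
	mu       sync.Mutex
	consumed []string
}

func (m *MockManagerSketch) Consume(name string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.consumed = append(m.consumed, name)
}

// GetConsumedMocks returns the consumed list AND clears it: a second
// call with no consumption in between yields an empty slice.
func (m *MockManagerSketch) GetConsumedMocks() []string {
	m.mu.Lock()
	defer m.mu.Unlock()
	out := m.consumed
	m.consumed = nil
	return out
}
```

Overwriting consumedMocks with the second drain's result therefore discarded the first drain; the follow-up commit below appends instead, taking the union of both drains.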

* fix(replay): union pre-stream and in-stream consumed mocks

Follow-up to the previous commit. Removing the second GetConsumedMocks
drain entirely is not sufficient: SimulateHTTPStreaming returns as soon
as the response headers arrive (pkg/util.go:602 — it returns a
StreamingHTTPResponse wrapping httpResp.Body without draining it), so
the stream body is consumed later by CompareHTTPStream above. Mocks
consumed during that body read (e.g., one backend call per SSE frame)
land in MockManager's consumedList *after* the pre-stream drain ran.

Restore the second drain, but append its result to consumedMocks (and
mockNames) instead of overwriting. upsertActualTestMockMapping now
receives the union of both drains, so streaming test cases get complete
mock_entries regardless of whether their backend calls happen before,
during, or across the stream body transmission.

Verified on a local SSE repro with three streaming test cases whose
backend is called once per frame:
  before this commit: mappings.yaml has only the pre-stream mock
                      per test (in-stream mocks silently dropped)
  after this commit:  mappings.yaml has all mocks, attributed to
                      the correct owning test case

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix(replay): recompute streaming mismatch after union drain; correct SimulateRequest comment

Addresses Copilot review on #4099:

- mockSetMismatch / emitFailureLogs were previously computed from only
  the pre-stream drain. When extra mocks are consumed while
  CompareHTTPStream reads the body, the subset check missed them and
  the mismatch decision (and its downstream failure/ignore logging)
  could be wrong. Re-evaluate mismatch with the unioned consumedMocks
  right before CompareHTTPResp. A streaming-body mismatch still forces
  emitFailureLogs=false (the hadStreamingMismatch guard).

- The comment above SimulateRequest claimed the call "blocks until
  stream is done", which is inaccurate — SimulateHTTPStreaming returns
  a reader once headers arrive and the body is drained later by
  CompareHTTPStream. Rewrite the comment so it no longer contradicts
  the post-stream drain below.

- Touch up the pre-stream drain comment to make its role explicit:
  a snapshot used only for the body-comparison log gate; the mismatch
  is finalized after the union drain.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v3.4.9

feat(test): enable strictMockWindow by default (#4096)

Flip config.Test.StrictMockWindow to default true now that every
stateful-protocol recorder classifies mocks with fine-grained Lifetime
(session / connection / per-test). Legitimate cross-test sharing is
encoded as Session or Connection lifetime, so the implicit out-of-window
promotion that lax mode preserved is no longer needed for correctness.

Escape hatches are unchanged: set test.strictMockWindow: false in
keploy.yaml, or export KEPLOY_STRICT_MOCK_WINDOW=0 — the env var
still wins over config for users who hit trouble with older
recordings and need to opt out without editing config.

Signed-off-by: Shubham Jain <shubhamkjain@outlook.com>

v3.4.8

fix(models): persist HTTPReq/HTTPResp Timestamp as RFC3339Nano in BSON (#4094)

MongoDB's BSON DateTime is an int64 count of milliseconds, so the default
`time.Time` BSON encoder silently drops nanoseconds. For cloud-replay this
is load-bearing: HTTPReq.Timestamp and HTTPResp.Timestamp become the edges
of the mock-matching time window, and the closing edge is frequently in
the same millisecond as the last MongoDB mock's reqTimestampMock. After a
BSON round-trip (api-server stores test cases in Mongo while mocks round-
trip through blob storage and keep full precision), the mock's ns
timestamp lands a few microseconds past the ms-floored resp.ts and is
pushed out of the filter pool, starving the `find` matcher.

Adds MarshalBSON / UnmarshalBSON on HTTPReq and HTTPResp that serialise
Timestamp as an RFC3339Nano string, matching the strategy MockEntry in
mappings.go already uses for reqTimestampMock / resTimestampMock. Field
names on the shadow structs follow the v1/v2 driver default
(`strings.ToLower(fieldName)`) so the BSON wire shape is unchanged for
every field except `timestamp`, which flips from BSON DateTime to BSON
String. UnmarshalBSON transparently accepts either shape, so existing
records written before this change continue to decode — no migration
required.

Adds ParseMockTimestamp as the inverse of FormatMockTimestamp (accepts
both RFC3339 and RFC3339Nano for backward compatibility with older
fixtures) and a decodeBSONTimestamp helper on the HTTP side that dispatches
on BSON type (String / DateTime / Null).

Tests cover the three relevant cases: nanosecond round-trip
preservation, legacy BSON DateTime decode, and zero-timestamp round-trip.

Signed-off-by: Ayush Sharma <kshitij3160@gmail.com>

v3.4.7

feat: log pruned mock details during UpdateMocks (#4091)

* feat: log pruned mock details during UpdateMocks

Collect name, kind, and metadata for each mock dropped by UpdateMocks
into a single slice and emit it on the existing "pruned mocks
successfully" debug line. Applied to both the YAML and gob prune
paths so they stay consistent. Keeps one log per prune call instead
of one per mock to avoid log pollution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: address Copilot review on pruned-mock logging

- Gate per-mock collection on ys.Logger.Core().Enabled(zap.DebugLevel)
  so large test sets don't pay allocation/reflection cost when debug
  logs will be dropped. Falls back to a plain prunedCount integer.
- Cap prunedMocks at 100 entries (maxPrunedMocksLogged) and emit
  prunedMocksTruncated=true when the true pruned count exceeds the
  logged slice, so --debug runs don't generate multi-MB log lines on
  ~10^5-mock test sets.
- Applied symmetrically to both the YAML and gob prune paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

v3.4.6

fix: correct mappings.yaml backfill, silence self-heal spam, restore agent mode wiring (#4084)

* fix(replay): scope mappings.yaml backfill to each test's record-time window

upsertActualTestMockMapping unconditionally appended every mock returned
by GetConsumedMocks() to the current test case, so mapping attribution
tracked replay-time consumption order rather than the recording. A mock
captured during app-startup but consumed during a later test's replay
(e.g. Redis HELLO after a client reconnect) leaked into the wrong test's
mock_entries in the backfilled mappings.yaml, producing non-deterministic
mappings across subsequent replays.

Filter by the mock's recorded ReqTimestampMock/ResTimestampMock against
the test case's HTTPReq/HTTPResp (or gRPC equivalent) timestamps, with a
Created-based fallback for legacy fixtures. DNS mocks are exempt because
they have no stable record-time window. Legacy behavior preserved when
either timestamp is zero (e.g. very old recordings).

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy): demote self-heal per-kind tree log to Debug

UpdateUnFilteredMock's post-update self-heal branch fires on every
legitimate filtered→unfiltered promotion — the Postgres v2 matcher is
the load-bearing caller, and it hits this path dozens of times per
normal replay. Logging it at WARN creates noise that customers have
reported as spurious failures; the subsequent insert recovers the
per-kind tree state correctly.

Demote the log to Debug and reword the message to reflect what's
actually happening (seeding per-kind from global on first update) so
operators who do enable debug logging get a clearer signal. No semantic
change to the tree mutation — the self-heal insert still runs
unconditionally when global updated and per-kind missed.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(cli): tolerate missing --json flag in ValidateFlags so agent mode wiring still runs

ValidateFlags called cmd.Flags().GetBool("json") unconditionally and
returned an error when the flag was not registered on the subcommand.
The enterprise build only registers --json on UI commands, so for
record/test/agent the call returned early — and the subsequent
case "agent": branch that wires c.cfg.Agent.Mode from --mode never
ran. The enterprise agent subprocess therefore started with mode="",
bound its HTTP server on port 0, and the parent eventually timed out
with "keploy-agent did not become ready in time".

Use Lookup first and fall back to false when the flag isn't present
on the current subcommand, preserving the original behavior when it is
registered.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* chore: re-trigger CI after stuck Windows gin-mongo runner

Previous job 72176310203 ran 30 minutes on the "Run gin-mongo
application" step and was cancelled by the runner's timeout. All
three committed changes on this branch are either replay-only
(pkg/service/replay/*) or log-level only (UpdateUnFilteredMock's
self-heal Debug demotion) or a no-op on OSS builds (the ValidateFlags
json flag path — OSS registers --json as a PersistentFlag on the root
keploy command so Lookup always succeeds). Record-mode behaviour is
unchanged by this PR, so the Windows record-iteration-2 hang looks
like infrastructure flake rather than a code defect.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

v3.4.5

fix: generic mock flush when response complete (#4082)

* fix: generic mock flush when response complete

Signed-off-by: kapish <upadhyaykapish@gmail.com>

* chore: improved comment

Signed-off-by: kapish <upadhyaykapish@gmail.com>

---------

Signed-off-by: kapish <upadhyaykapish@gmail.com>

v3.4.4

fix(proxy,tools): keep ERROR level, attach next_step guidance (#4078)

* fix(proxy,tools): keep ERROR level, attach next_step guidance

Earlier iteration of this PR downgraded three classes of log sites
from ERROR to WARN because CI pipelines were filtering them with
`grep -v` allowlists. That was the wrong direction — lowering the
severity to match what the pipeline already tolerates hides the
signal from operators reading production logs, not just from CI.
ERROR should keep meaning "something real went wrong"; the job of
the pipeline is to avoid producing ERRORs in the first place (by
waiting for dependencies to be ready, using --delay correctly,
etc.), not to filter them after the fact.

This change keeps the following sites at ERROR and instead attaches
a structured `next_step` field so the operator — in CI or in prod
— knows the exact follow-up when they see the log:

- pkg/agent/proxy/proxy.go × 4 dial sites: "failed to dial the
  conn to destination server" → next_step explains that the proxy
  only attempts one dial and does not retry, and points at the
  three real fixes (start the dependency first, raise --delay, or
  use a readiness probe).
- pkg/agent/proxy/util/util.go × 1 dial site: same guidance for
  TLS-wrapped dial path.
- pkg/service/tools/tools.go: "failed to write config file" now
  carries the target path and permission/ownership next_step — the
  common cause is a leftover read-only keploy.yml from a prior
  `sudo keploy` invocation that the current non-root user can no
  longer overwrite.

No allowlist changes here; CI-script cleanup lives in the
enterprise PR.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(cli): don't create keploy.yml from the agent subcommand

The agent process is a worker spawned by the parent keploy
(record/test/etc.) and talks to it over CLI flags + the agent
HTTP API. It never reads keploy.yml at runtime and has no need
to persist one to disk.

When the agent runs inside a docker sidecar (the normal case
for `keploy record -c "docker compose up"`), the host's absolute
--config-path is forwarded through argv but does not resolve to
a real directory inside the container's filesystem. That made
the "create config file if missing" branch in Validate reach
tools.CreateConfig → os.WriteFile → ENOENT, which logged ERROR
("failed to write config file") — a noise signal kafka-ecommerce
CI had been masking with a `grep -v` allowlist for exactly this
reason. Seen concretely on keploy/enterprise#1867 pipeline
2541/10 after the allowlist was dropped.

Gate the config-create branch to skip when cmd.Name() ==
"agent". Same pattern as the existing `cmd.Name() == "agent"`
/ `!= "agent"` gates at lines 613/638/650 of this file —
confirming that the agent subcommand is already special-cased
elsewhere for similar reasons.

No change for record/test/serve/config/etc.; only the agent
worker stops attempting a persistence it never needed.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(proxy,tools): address Copilot review on #4078

Five threads:

1. tools.CreateConfig returned nil on os.WriteFile failure, so
   callers (cli/config.go handler, CmdConfigurator.CreateConfigFile)
   logged "Config file generated successfully" right after an
   ERROR line for the same op. Return the actual error — both
   callers already check it — and the inconsistency goes away.

2. util.go's non-TLS dial path had a bare
   "failed to dial the destination server" with no server
   address and no next_step, while the TLS path immediately
   above it had both. Harmonised to the same message + fields.

3. Auxiliary proxy hook behavior change (propagating the error
   vs. swallowing it) is now called out in the PR description
   so release notes pick it up.

4. proxy.go's aux-hook block used to do both LogError and
   fmt.Errorf("%w") — the first line logged, the second
   propagated; the caller logs the wrapped error too, producing
   double-logging. Dropped the LogError at the hook site. The
   hook's returned error already carries the structured "Next
   steps: …" remediation (enterprise sockmap_proxy.go), and
   the caller surfaces it once.

5. Extracted the duplicated next_step string for dial failures
   into util.NextStepDialDestination. All six call sites (two
   in util.go, four in proxy.go) now reference the const, so
   future edits to the guidance don't have to update six copies.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix: proxyCtxCancel on StartProxy failure + clarify agent config comment

Addresses the second Copilot review round on #4078:

- service/agent.go: when StartProxy returns an error we were
  leaking the proxy goroutines (TCP accept loop, TCP DNS, UDP
  DNS) that had already been spawned by the errgroup before the
  failure point. They stayed alive until the outer agent context
  was cancelled. Matters more now that StartProxy propagates
  auxiliary-hook failures instead of swallowing them — failure
  paths are no longer dead code. Call proxyCtxCancel() on the
  error path so the partial proxy is torn down immediately.

- proxy/proxy.go: restore a single structured LogError at the
  aux-hook error site. The hook's returned error already embeds
  its own "Next steps:" remediation (enterprise sockmap_proxy.go
  wraps the BPF verifier error with kernel/flag/rebuild
  guidance); the LogError here adds a generic next_step field
  pointing at that embedded chain, so anyone reading the log
  sees both the actionable remediation and the raw cause without
  chasing the error through every caller. Still one log per
  failure — caller does not re-log the same error.

- cli/provider/cmd.go: reword the agent config-skip comment.
  The old phrasing said the agent "never reads keploy.yml from
  disk" which is inaccurate — PreProcessFlags still calls
  viper.ReadInConfig() for agent. The precise statement is that
  the agent doesn't need to CREATE the file if missing, because
  the parent has already resolved the effective config and
  handed it over via flags + env. Clarified in-place.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

v3.4.3

fix(syncMock): bounded-block sendToOutChan so burst capture doesn't silently drop mocks (#4076)

* fix(syncMock): bounded-block sendToOutChan so burst capture doesn't silently drop mocks

sendToOutChan used a non-blocking send with a default-drop:

    select {
    case m.outChan <- mock:
    default:
    }

Under burst capture on an oversubscribed runner (CI, a customer's
box under GC pressure, downstream agent momentarily backpressuring
the outChan), this silently dropped pre-first-request mocks.
Customers saw "some calls didn't replay" with no actionable log
signal — the anti-pattern the Windows redirector shutdown loop hit.

Replace with:

 1. Non-blocking fast path (unchanged — the common case where the
    consumer is keeping up has zero latency cost).
 2. On fast-path miss, fall through to a bounded block
    (sendBudget = 200 ms). This absorbs a GC pause or a transient
    downstream stall without losing data.
 3. On bounded-block timeout, increment a process-global dropCount
    and emit a rate-limited Warn (first drop immediately, then
    every 1024th). Operators see the drop instead of wondering why
    replay is missing mocks.

The RWMutex contract is preserved: the send still runs under
outChanMu.RLock, so CloseOutChan (write-lock holder) can't
interleave a close between the not-closed check and the send. The
only new cost is CloseOutChan may now wait up to sendBudget for
in-flight sends — imperceptible vs. a process shutdown, and every
concurrent sender was already racing the same RLock.

Fixes a recurring Redis loss-rate / channel-race flake on
keploy/enterprise's CI where the tests feed syncMock via
SetOutputChannel and never call SetFirstRequestSignaled, so every
recorded mock flows through this path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* fix(syncMock): plumb logger, atomic.Uint64 drop counter, drop-path tests

Addresses the Copilot feedback on the bounded-send PR:

- Plumb *zap.Logger through SyncMockManager (SetLogger) and log the
  overflow via that logger instead of zap.L(). syncMock is a
  package-level singleton that loads before any zap.ReplaceGlobals,
  so zap.L() silently fell back to Nop and the drop Warn never
  reached operators. Proxy.New now wires the proxy logger in once.
- Promote the sampled drop log from Warn to Error with a shorter,
  actionable message so overflow is loud and grep-friendly.
- Switch dropCount from bare uint64 + atomic.AddUint64 to
  sync/atomic.Uint64 so the 32-bit alignment requirement becomes a
  compile-time property instead of a field-ordering footgun.
- Export DropCount() for tests and external observability.
- Add regression tests covering: no-drop when the consumer drains
  within sendBudget, drop + Error-log + dropCount increment past
  budget, sampling cadence (n=1, 1024, 2048), nil-logger Nop
  fallback, and concurrent-increment atomicity under load.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): robust app-ready probe + per-request retry

The Windows go-http-no-deps record phase on #4076 run
24629542035/job/72014589521 hit a 0-tests-recorded failure because
the readiness probe exited after a single 200 response, and that first
200 came from a transient Docker Desktop port-publish stub rather
than the in-container Go binary. Four seconds later the real
traffic got "Unable to connect to the remote server" on the first
request, and the single try/catch wrapping all six calls swallowed
the error with $sent=0. No test-set-0 directory ever formed, and
the script exited with "Test directory not found".

Three-part fix, applied identically to the docker and native
variants under golang/go-dedup:

1. Readiness probe now requires $stabilityCount (=3) consecutive
   200 responses with a 1s gap before declaring ready. A single
   flap from a port-forwarder stub can no longer race real traffic.
2. After the probe succeeds, settle for $settleSec (=5) before
   firing traffic so keploy's recording-proxy interception layer
   has time to fully wire up against the container's listener.
3. Replace the all-in-one try/catch around the six HTTP calls with
   an Invoke-WithRetry helper (5 attempts, 1.5x exponential
   backoff). A transient connection-refused on call #1 no longer
   aborts calls #2..#6.

If the probe never reaches the stability threshold inside the
deadline, the script still proceeds so downstream record
validation can surface the failure with its own clearer error
message, instead of us silently pretending the app was ready.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

* ci(windows/go-dedup): literal port substitution + patched YAML dump

Root cause of the #4076 run 24629876059 failure: the
`(?m)("?)8080:8080\1` backref regex intermittently produces a
malformed port line on Windows PowerShell 5.1 — the leading quote
is stripped while the trailing one is kept, yielding
`- 54652:8080"`. docker-compose parses that as `containerPort:
8080"` and rejects it with `invalid containerPort: 8080"`. No
container starts, every HTTP call gets ECONNREFUSED, and the
record phase records zero tests.

My earlier readiness-probe + retry work on this PR correctly
surfaced the downstream symptom (consecutive 200s never observed,
all six traffic requests exhaust their retry budget), but it was
papering over the actual bug: the YAML itself was broken before
docker-compose ever ran. No amount of client-side resilience can
recover from a container that was never started.

This change:

1. Replaces the backref regex with a plain `.Replace("8080:8080",
   "<port>:8080")` — the samples-go/go-dedup compose file uses the
   double-quoted short form exclusively, so literal substitution
   is both simpler and provably correct.
2. Validates that the expected port fragment is present after
   substitution, throwing loudly with the unpatched YAML printed
   if the sample ever introduces a form the script doesn't
   handle. Silent fall-through to an unpatched run was what let
   this slip undetected for so long.
3. Dumps the patched YAML to the job log so any future
   containerPort / hostPort diagnostic lands with the exact text
   docker-compose parsed — no more guessing whether the regex
   behaved.

Native-windows script is left untouched — it runs the Go binary
directly, not dockerized, and does not hit this code path.

Signed-off-by: slayerjain <shubhamkjain@outlook.com>

---------

Signed-off-by: slayerjain <shubhamkjain@outlook.com>