web: live event stream disconnects mid-session on mobile (SSE doesn't auto-recover reliably) · Issue #5 · bencode/baton · GitHub

Description

@bencode

Goal

The live event stream (SSE) should stay connected, or recover transparently, through normal mobile usage (background/foreground, screen lock, Wi-Fi ↔ cellular handoff). Today the user sees the yellow "Live connection lost, reconnecting…" banner mid-session on mobile, even while the session is still active.

Symptom

On iOS, mid-session, the banner appears on its own (no user action). Reproduced this session; see attached.

⚠ Live connection lost, reconnecting… new messages may not appear immediately

Messages queued during the drop are delivered after reconnect (the QUEUED · N indicator works correctly), so no data loss observed — but the banner sits there and the user can't tell whether the session is still healthy.

Root cause (current implementation)

packages/web/src/features/sessions/use-session-stream.ts:

const es = new EventSource(api.sessionStreamUrl(sessionId))
es.onopen  = () => setStatus('open')
es.onerror = () => setStatus('error')   // ← once set, only the next onopen clears it

  • Relies entirely on the browser's built-in EventSource auto-retry.
  • Never force-recreates the connection on visibility/online events.
  • No timeout-based forced reconnect — if the EventSource sticks in CONNECTING indefinitely (e.g. mobile suspended JS, proxy black-holed the socket), the banner is stuck and the only fix is reloading.
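One detail worth making concrete: the error event fires both when the browser is about to retry (readyState is CONNECTING) and when it has given up for good (readyState is CLOSED), and the current handler collapses both into 'error'. A minimal sketch of telling them apart follows. The function name and the StreamStatus type are hypothetical and not taken from the repo; only the numeric readyState values come from the EventSource interface.

```typescript
// readyState values defined by the EventSource interface.
const CONNECTING = 0
const OPEN = 1
const CLOSED = 2

// Hypothetical status type; the hook in the repo currently uses 'open' / 'error'.
type StreamStatus = 'connecting' | 'open' | 'closed'

/** Map an EventSource readyState to a status the UI can act on. */
function statusFromReadyState(readyState: number): StreamStatus {
  switch (readyState) {
    case OPEN:
      return 'open'
    case CONNECTING:
      return 'connecting' // browser is still retrying on its own
    default:
      return 'closed' // CLOSED: the browser will not retry again
  }
}

// Possible use inside the hook (assumption, not the current code):
//   es.onerror = () => setStatus(statusFromReadyState(es.readyState))
```

With this split, a 'closed' status is a clear signal that only an explicit recreate can restore the stream, while 'connecting' may still resolve without intervention.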

The server already sends a keepalive comment (: keepalive\n\n) every 30 s (packages/server/src/sse.ts), so the issue isn't an idle timeout on our end. It's almost certainly one of:

  1. iOS Safari backgrounding: when the page is hidden / app backgrounded, iOS suspends JS and frequently closes the EventSource socket. On foreground the browser may not retry promptly.
  2. Network handoff: Wi-Fi ↔ cellular drops the TCP connection without a clean signal. The browser's built-in retry kicks in after its default delay (a few seconds; the spec leaves the exact value implementation-defined), but it may race with the network being mid-transition.
  3. Intermediary HTTP/2 or proxy timeouts: even with the 30 s server keepalive, some carrier proxies idle-close anyway.

Proposed fix

Make recovery explicit instead of leaving it to the browser:

  1. On visibilitychange → visible and on online events, force-close the current EventSource and create a new one (don't trust auto-retry on mobile).
  2. After ≥ N seconds (e.g. 15s) in error / connecting without a successful onopen, force close + recreate.
  3. Surface the 'connecting' vs 'error' distinction in the banner — "reconnecting…" should auto-dismiss within seconds; only a sustained drop (≥ ~10s) is a real warning.
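  4. (Optional) lastEventId is already tracked by EventSource. Confirm the server honours Last-Event-ID on reconnect to fill the gap. Note that the browser sends that header only on its own automatic retries; a newly constructed EventSource starts with an empty last event ID, so the forced recreate in steps 1 and 2 would need to pass the ID another way. If the server doesn't honour it, the on-open history fetch in use-session-stream still backstops via mergeEvents, but it would mean a brief gap window.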
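Steps 1 to 3 above could be sketched roughly as follows. This is an illustration under stated assumptions, not the hook's actual code: createResilientStream, StreamLike, bannerFor and the option names are all hypothetical, and the connection is injected through a connect callback so the logic can be exercised without a browser.

```typescript
// Minimal shape the controller needs; a real EventSource satisfies it.
interface StreamLike {
  onopen: ((ev: any) => any) | null
  onerror: ((ev: any) => any) | null
  close(): void
}

type Status = 'connecting' | 'open'

interface Options {
  connect: () => StreamLike // e.g. () => new EventSource(url)
  onStatus: (s: Status) => void
  stallMs?: number // step 2: max time without onopen (assumed default 15 s)
}

function createResilientStream({ connect, onStatus, stallMs = 15_000 }: Options) {
  let current: StreamLike | null = null
  let watchdog: ReturnType<typeof setTimeout> | undefined

  const clearWatchdog = () => {
    if (watchdog !== undefined) clearTimeout(watchdog)
    watchdog = undefined
  }

  // Start the stall timer unless one is already counting down.
  const armWatchdog = () => {
    if (watchdog === undefined) watchdog = setTimeout(reconnect, stallMs)
  }

  // Force-close the current stream and build a fresh one.
  function reconnect() {
    clearWatchdog()
    if (current) {
      current.onopen = null // detach so the stale stream cannot change status
      current.onerror = null
      current.close()
    }
    const es = connect()
    current = es
    onStatus('connecting')
    armWatchdog()
    es.onopen = () => { clearWatchdog(); onStatus('open') }
    es.onerror = () => { onStatus('connecting'); armWatchdog() }
  }

  function dispose() {
    clearWatchdog()
    current?.close()
    current = null
  }

  reconnect()
  return { reconnect, dispose }
}

// Step 3: decide what the banner shows from status and time since the drop.
type Banner = 'none' | 'reconnecting' | 'lost'

function bannerFor(status: Status, msSinceDrop: number, warnAfterMs = 10_000): Banner {
  if (status === 'open') return 'none'
  return msSinceDrop >= warnAfterMs ? 'lost' : 'reconnecting'
}

// Browser wiring for step 1 (shown as comments; needs DOM globals):
//   const stream = createResilientStream({
//     connect: () => new EventSource(api.sessionStreamUrl(sessionId)),
//     onStatus: setStatus,
//   })
//   document.addEventListener('visibilitychange', () => {
//     if (document.visibilityState === 'visible') stream.reconnect()
//   })
//   window.addEventListener('online', () => stream.reconnect())
```

Detaching the handlers before close() keeps a late event from the old stream from overwriting the status of the new one. The 15 s and 10 s defaults simply mirror the example values in the list and would need tuning on real devices.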

Verification

  • On iOS Safari: switch the page to background for 30 s and back — banner clears within ≤ 5 s without manual reload.
  • Toggle Wi-Fi off → on with the session open — banner appears briefly and clears once network is back; no manual reload needed.
  • Throttle the EventSource (DevTools / mobile network drop) for 20 s — banner shows; on restore, status flips to open and queued messages flow.
  • No regression on desktop (banner still surfaces only on a sustained drop, not on every transient blip).
  • Unit / hook test for useSessionStream covers the visibility-based reconnect (mock document.visibilityState toggling, assert a new EventSource is constructed).
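For the hook test in the last item, one possible approach is a hand-rolled fake that records every construction, installed in place of the global EventSource during test setup. The sketch below is an assumption about how such a test double might look; FakeEventSource and recreateIfVisible are hypothetical names, and recreateIfVisible only stands in for the behaviour the real test would trigger through useSessionStream and document.visibilityState.

```typescript
// Hypothetical test double; records each instance so tests can count constructions.
class FakeEventSource {
  static instances: FakeEventSource[] = []

  url: string
  readyState = 0 // CONNECTING
  closed = false
  onopen: (() => void) | null = null
  onerror: (() => void) | null = null

  constructor(url: string) {
    this.url = url
    FakeEventSource.instances.push(this)
  }

  close() {
    this.closed = true
    this.readyState = 2 // CLOSED
  }

  // Helpers for tests to drive whatever handlers the code under test registered.
  simulateOpen() {
    this.readyState = 1 // OPEN
    this.onopen?.()
  }

  simulateError() {
    this.readyState = 0 // browser would be retrying
    this.onerror?.()
  }
}

// Stand-in for the behaviour under test: recreate the stream when the page
// becomes visible, leave it alone otherwise.
function recreateIfVisible(
  visibility: 'visible' | 'hidden',
  current: FakeEventSource,
): FakeEventSource {
  if (visibility !== 'visible') return current
  current.close()
  return new FakeEventSource(current.url)
}
```

A test built on this would assert that FakeEventSource.instances grows by one after the visibility change and that the previous instance reports closed.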

Refs

  • Stream hook: packages/web/src/features/sessions/use-session-stream.ts
  • Banner: packages/web/src/features/sessions/session-detail/connection-banner.tsx
  • Server keepalive: packages/server/src/sse.ts (already sends a : keepalive comment every 30 s)
  • Server route: packages/server/src/routes/sessions.ts:316-332 (/sessions/:id/stream)
  • Related (separate UX issue, possibly same family): the QUEUED indicator already proves the message-queueing fallback is correct; this issue is purely about restoring the live tail.
