feat: Add AI-native observability and self-recovery module · Issue #148 · flatrun/agent · GitHub
Skip to content

feat: Add AI-native observability and self-recovery module #148

Description

@nfebe

Problem

The agent has no observability or recovery module. Four gaps:

  1. No deployment health. Nothing tracks whether a deployment's containers
    are running. Agent health is independent of deployment state, so a dead
    deployment produces no signal.
  2. No recovery. On startup the agent starts only the file watcher and API
    server with no deployment reconcile (cmd/agent/main.go:21-83); on shutdown
    it stops only the API server. Recovery after a host or Docker daemon restart
    depends entirely on each container's restart: policy
    (agent/templates/*/docker-compose.yml). When that does not fire (compose
    file without it, missing external network, image that will not pull at boot),
    deployments stay down with no auto-heal.
  3. No notifications. No mechanism exists to alert an operator to anything.
    Email is used only for registry credentials and certbot.
  4. Resource visibility is thin. Host CPU and memory are read point-in-time
    (internal/system/stats.go); there is no per-container monitoring and no
    spike detection.

Constraint: recovery must not restart deployments the user intentionally
stopped. Reconcile against tracked desired state, not a blind up of every
directory.

Why it matters

A reboot or a resource spike becomes a silent outage. The first signal is a
user reporting the site is down.

Proposal

One observability and recovery module that reuses the existing AI action engine.

  1. Health monitoring. Poll per-deployment container state (running / exited
    / restarting, health, exit codes), expose via API.
  2. Resource monitoring. Extend internal/system/stats.go with per-container
    CPU and memory and configurable spike thresholds.
  3. Notifications. A pluggable channel subsystem, email first, interface open
    for webhook and others. Notify on ill-health, resource spikes, and
    recovery-from-shutdown events.
  4. AI-native recovery. At boot and on detected ill-health, reconcile
    deployments to desired state. Drive remediation through the existing
    service_action suggestion engine (internal/ai/suggestions.go:
    restart / rebuild / pull) so the recovery is diagnosis-led, not blind.
    Emit a recovery event when deployments are brought back after a restart.

Action

  • Per-deployment health checker exposed via API
  • Per-container resource monitoring with configurable spike thresholds
  • Notification subsystem: email channel plus a pluggable interface for more
  • Boot-time and on-failure reconcile to desired state; skip intentionally-stopped deployments
  • Emit and notify recovery events (for example "recovered N deployments after host restart")
  • Route remediation through the AI service_action engine

Acceptance criteria

  • After a host reboot, deployments self-recover without manual intervention and a recovery notification is sent.
  • Operators are notified on deployment ill-health and on resource spikes.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Fields

No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions