feat: Add AI-native observability and self-recovery module

## Problem

The agent has no observability or recovery module. Four gaps:

1. **No deployment health.** Nothing tracks whether a deployment's containers
   are running. Agent health is independent of deployment state, so a dead
   deployment produces no signal.
2. **No recovery.** On startup the agent starts only the file watcher and API
   server and does not reconcile deployments (`cmd/agent/main.go:21-83`); on
   shutdown it stops only the API server. Recovery after a host or Docker
   daemon restart depends entirely on each container's `restart:` policy
   (`agent/templates/*/docker-compose.yml`). When that policy is absent or
   cannot take effect (compose file without it, missing external network, image
   that will not pull at boot), deployments stay down and nothing heals them.
3. **No notifications.** No mechanism exists to alert an operator to anything.
   Email is used only for registry credentials and certbot.
4. **Resource visibility is thin.** Host CPU and memory are read point-in-time
   (`internal/system/stats.go`); there is no per-container monitoring and no
   spike detection.

Constraint: recovery must not restart deployments the user intentionally
stopped. Reconcile against tracked desired state, not a blind `up` of every
directory.
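
The constraint above can be sketched as a pure planning step. This is a minimal illustration, not the agent's actual code: `DesiredState`, `PlanReconcile`, and the map-based inputs are hypothetical names, and how desired state is persisted is left open.

```go
package main

import (
	"fmt"
	"sort"
)

// DesiredState is the state the user last asked for (hypothetical type).
type DesiredState string

const (
	DesiredRunning DesiredState = "running"
	DesiredStopped DesiredState = "stopped"
)

// PlanReconcile returns the deployments that should be started: those whose
// tracked desired state is "running" but which are not currently running.
// Deployments the user stopped, and anything with no tracked state, are
// left alone.
func PlanReconcile(desired map[string]DesiredState, running map[string]bool) []string {
	var toStart []string
	for name, want := range desired {
		if want == DesiredRunning && !running[name] {
			toStart = append(toStart, name)
		}
	}
	sort.Strings(toStart) // map iteration order is random; keep output stable
	return toStart
}

func main() {
	desired := map[string]DesiredState{
		"blog":  DesiredRunning,
		"shop":  DesiredRunning,
		"stage": DesiredStopped, // stopped on purpose by the user
	}
	running := map[string]bool{"blog": true} // observed after a reboot

	fmt.Println(PlanReconcile(desired, running)) // prints [shop]
}
```

Keeping the plan separate from the code that actually starts containers makes the "do not restart intentionally stopped deployments" rule easy to unit test.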

## Why it matters

A reboot or a resource spike becomes a silent outage. The first signal is a
user reporting the site is down.

## Proposal

One observability and recovery module that reuses the existing AI action engine.

1. **Health monitoring.** Poll per-deployment container state (running / exited
   / restarting, health status, exit codes) and expose it via the API.
2. **Resource monitoring.** Extend `internal/system/stats.go` with per-container
   CPU and memory and configurable spike thresholds.
3. **Notifications.** A pluggable channel subsystem, email first, interface open
   for webhook and others. Notify on ill-health, resource spikes, and
   recovery-from-shutdown events.
4. **AI-native recovery.** At boot and on detected ill-health, reconcile
   deployments to desired state. Drive remediation through the existing
   `service_action` suggestion engine (`internal/ai/suggestions.go`:
   restart / rebuild / pull) so the recovery is diagnosis-led, not blind.
   Emit a recovery event when deployments are brought back after a restart.

## Action

- [ ] Per-deployment health checker exposed via API
- [ ] Per-container resource monitoring with configurable spike thresholds
- [ ] Notification subsystem: email channel plus a pluggable interface for more
- [ ] Boot-time and on-failure reconcile to desired state; skip intentionally stopped deployments
- [ ] Emit and notify recovery events (for example "recovered N deployments after host restart")
- [ ] Route remediation through the AI `service_action` engine
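
As a rough sketch of what "configurable spike thresholds" could mean for the resource monitoring item, the detector below alerts only after a metric stays above a threshold for several consecutive samples. `SpikeDetector` and its fields are hypothetical; the issue does not prescribe a detection rule, and a real implementation might prefer a time window or moving average.

```go
package main

import "fmt"

// SpikeDetector flags a spike once a metric stays above Threshold for
// Consecutive samples in a row (hypothetical type and field names).
type SpikeDetector struct {
	Threshold   float64 // e.g. CPU percent
	Consecutive int     // samples required above Threshold before alerting
	streak      int     // current run of samples above Threshold
}

// Observe feeds one sample and reports true exactly when the streak first
// reaches Consecutive, so a sustained spike alerts once rather than on
// every poll.
func (d *SpikeDetector) Observe(sample float64) bool {
	if sample <= d.Threshold {
		d.streak = 0
		return false
	}
	d.streak++
	return d.streak == d.Consecutive
}

func main() {
	det := &SpikeDetector{Threshold: 90, Consecutive: 3}
	for _, cpu := range []float64{95, 40, 92, 97, 99, 99} {
		fmt.Print(det.Observe(cpu), " ")
	}
	fmt.Println()
	// prints: false false false false true false
}
```

Requiring consecutive samples is one way to avoid notifying on a single noisy reading; one detector instance per container and per metric keeps the state simple.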

## Acceptance criteria

- After a host reboot, deployments self-recover without manual intervention and a recovery notification is sent.
- Operators are notified on deployment ill-health and on resource spikes.


Sunbelt Computer Software

PL/B Language Development and Support

feat: Add AI-native observability and self-recovery module #148
