# Home Monitoring Suite -- Project Reference

## Overview

Infrastructure monitoring suite for the Vanlint home lab. Custom agentless Python exporters collect metrics from all monitored systems via REST APIs and SSH, then expose them as Prometheus metrics on a single HTTP endpoint. Prometheus scrapes and stores the time series, Grafana provides dashboards and alerting, and Uptime Kuma provides independent uptime/reachability monitoring. Home Assistant integration surfaces key metrics and alerts on the HA dashboard with push notifications.

**Current version:** 0.1.0
**Deployment target:** claude-internal-tools (VM 209, ###_IP) — pending migration to LXC

## Stack

| Layer | Technology |
|---|---|
| Language | Python 3.13 |
| Metrics Library | prometheus_client 0.21.1 |
| HTTP Client (REST APIs) | requests 2.32.3, urllib (stdlib) |
| SSH Client | paramiko 3.5.1 |
| Time Series DB | Prometheus (14d retention) |
| Dashboards | Grafana OSS (provisioned) |
| Uptime Monitoring | Uptime Kuma |
| Dashboard Backend | aiohttp (async HTTP server) |
| Dashboard Frontend | Alpine.js (reactivity), ECharts (charts), vanilla CSS |
| Container Runtime | Docker Compose (6 services) + systemd (investigator) |
| HA Integration | REST sensors + automations (YAML, manual copy) |
| Uptime Kuma Setup | uptime-kuma-api 1.2.1 |

## Codebase Stats

| Metric | Count |
|---|---|
| Source files (exporters/) | 15 Python modules |
| Source lines (exporters/) | ~3,725 |
| Source files (investigator/) | 13 Python modules |
| Source files (dashboard/) | 15 Python modules |
| Source lines (dashboard/) | ~1,503 |
| Frontend files (dashboard/static/) | 9 files (HTML, CSS, JS, SVG) |
| Frontend lines (dashboard/static/) | ~3,443 |
| Scripts | 1 (setup-uptime-kuma.py, ~511 lines) |
| Test files | 31 + conftest.py |
| Alert rule files | 10 (node, storage, backup, docker, service, meta, cert, proxy, app_health, docker_health) |
| Runbook files | 9 YAML + index |
| Grafana dashboards | 7 (JSON, provisioned) |
| Uptime Kuma monitors | 30 (provisioned via script) |
| HA sensors | 33 REST sensors (30 base + 3 AI investigation) |
| HA automations | 6 (5 base + 1 AI investigation) — all deployed and active |

## Project Structure

```
/projects/internal-apps/home-monitoring/
    PROJECT_REFERENCE.md        # This file
    pyproject.toml              # Ruff, Pyright config (Python 3.13)
    requirements.txt            # Pinned production deps
    Dockerfile.exporters        # Python 3.13-slim, exporters only
    requirements-investigator.txt # Pinned investigator deps (aiohttp, pydantic, etc.)
    docker-compose.yml          # 6 services: exporters, prometheus, grafana, uptime-kuma, alertmanager, dashboard
    .env.example                # Env var template (HM_ prefix + HM_AI_ for investigator)
    .gitignore                  # Standard Python/Docker/IDE ignores
    Dockerfile.dashboard        # Python 3.13-slim, dashboard backend + static files
    requirements-dashboard.txt  # Pinned dashboard deps (aiohttp)
    dashboard/
        __init__.py             # Package init, setup_logging()
        server.py               # aiohttp entry point (port 8080), /health, /, /static/, /api/*
        prometheus_client.py    # Async Prometheus HTTP API client (query + query_range)
        cache.py                # TTLCache: in-memory key/value with per-entry TTL
        api.py                  # Route registration: register_routes() → 11 GET routes
        api_overview.py         # /api/overview — global health KPIs
        api_pve.py              # /api/pve — nodes, guests, HA, replication
        api_storage.py          # /api/storage — NAS pools/disks/services + PBS datastores/backups
        api_docker.py           # /api/docker — per-host container and resource summary
        api_network.py          # /api/network — gateways, interfaces, Tailscale, PF, WiFi, DNS, Postfix
        api_energy.py           # /api/energy — Powerwall battery, solar, grid, home
        api_media.py            # /api/media — Plex, Sonarr, Radarr, SABnzbd, Tdarr
        api_certs.py            # /api/certs — TLS certificate expiry (300s cache TTL)
        api_alerts.py           # /api/alerts + /api/investigations — Alertmanager + investigator proxy
        api_timeseries.py       # /api/timeseries — validated range query passthrough
        static/
            index.html          # Single-page dashboard (Alpine.js reactive)
            css/
                variables.css   # Design tokens, font-faces, custom properties
                layout.css      # Grid layout, responsive breakpoints
                components.css  # Panel cards, status dots, tables, badges
                animations.css  # Pulse, fade, shimmer animations
            js/
                dashboard.js    # Alpine.js component: data fetching, state, polling
                charts.js       # ECharts wrappers: CPU, network, energy time-series
                utils.js        # Formatters: bytes, uptime, relative time
            img/
                favicon.svg     # Monitor icon favicon
            vendor/
                alpine.min.js   # Alpine.js 3.x (vendored, LAN-only)
                echarts.min.js  # Apache ECharts 5.x (vendored, LAN-only)
            fonts/
                ibm-plex-sans-regular.woff2
                ibm-plex-sans-medium.woff2
                ibm-plex-sans-semibold.woff2
                jetbrains-mono-regular.woff2
                jetbrains-mono-medium.woff2
    config/
        topology.yml            # Externalised infrastructure topology (Docker hosts, NAS services, critical guests)
    exporters/
        __init__.py
        server.py               # HTTP server entry point (port 9100), /metrics + /health endpoints,
                                #   CompositeCollector with scrape state tracking, SSH self-test on startup
        common.py               # Shared utilities: PVEClient, PBSClient, OPNsenseClient,
                                #   SSHClient, load_topology(), SSL context, collector_error_handler decorator
        pve_exporter.py         # PVECollector: cluster, nodes, guests, HA, replication, storage
        nas_exporter.py         # NASCollector: ZFS pools, disks, services (via SSH)
        pbs_exporter.py         # PBSCollector: datastores, backup tasks, verify, GC, sync (REST API)
        docker_exporter.py      # DockerCollector: containers, host resources across 5 hosts
        opnsense_exporter.py    # OPNsenseCollector: interfaces, gateways, PF stats, Tailscale
        proxy_exporter.py       # ProxyCollector: dual-path (WAN+Tailscale) reachability probes + NPM host inventory
        app_health_exporter.py  # AppHealthCollector: config-driven HTTP health probes for all applications
    investigator/                   # AI-powered alert investigation package
        __init__.py                 # Package init, setup_logging()
        __main__.py                 # Entry point: asyncio.run(main())
        config.py                   # InvestigatorConfig dataclass, from_env()
        schemas.py                  # Pydantic: AlertmanagerPayload, TriageResult, InvestigationResult
        webhook.py                  # aiohttp server: /webhook/alertmanager, /metrics, /health, /api
        cli_client.py               # Claude CLI subprocess client (claude -p)
        tools.py                    # Investigation prompt template
        investigator.py             # InvestigationOrchestrator: full pipeline
        runbook_loader.py           # Load YAML runbooks, match to alert names
        notifier.py                 # Email via aiosmtplib + HA webhook POST
        history.py                  # SQLite investigation history (CRUD + stats)
        metrics.py                  # Prometheus counters/gauges/histograms for AI pipeline
        cooldown.py                 # Per-fingerprint cooldown tracker
        safety.py                   # Always-escalate list, blocked target constants
    runbooks/                       # Static investigation guides
        _index.yml                  # Alert-name -> runbook-file mapping
        generic.yml                 # Fallback for unmapped alerts
        pve_node_down.yml           # + 7 more alert-specific runbooks
    scripts/
        setup-uptime-kuma.py    # Uptime Kuma monitor provisioning + reconcile (30 monitors)
    config/
        prometheus/
            prometheus.yml      # Scrape config: custom-exporters (30s), self-monitoring
            recording_rules.yml # Pre-computed recording rules (ratios, aggregations)
            alerts/
                node_alerts.yml     # PVE node/cluster/guest/HA/replication alerts
                storage_alerts.yml  # NAS pool/disk + PBS datastore alerts
                backup_alerts.yml   # PBS backup freshness, verify, sync + ZFS replication lag
                docker_alerts.yml   # Container status/restarts/resources + host disk/memory
                service_alerts.yml  # OPNsense WAN/gateway/interface + Tailscale peer alerts
                meta_alerts.yml     # Monitoring stack self-health (scrape, storage, Alertmanager)
                cert_alerts.yml     # TLS certificate expiry (warning, soon, expired)
                proxy_alerts.yml    # External proxy WAN/Tailscale path down, degraded, slow
                app_health_alerts.yml   # Application HTTP health probe alerts
                docker_health_alerts.yml # Container healthcheck, OOM, restart loop, log bloat alerts
        alertmanager/
            alertmanager.yml    # Webhook receiver, grouping, inhibition rules
        grafana/
            provisioning/
                datasources/
                    prometheus.yml  # Prometheus datasource (auto-provisioned)
                dashboards/
                    dashboards.yml  # Dashboard file provider config
            dashboards/
                home-overview.json      # Top-level infrastructure summary (redesigned)
                cluster-overview.json   # PVE cluster detail (nodes, guests, HA, replication)
                storage-health.json     # NAS + PBS storage and backup health
                docker-hosts.json       # Docker container and host metrics
                network-firewall.json   # OPNsense network, gateway, PF, Tailscale
                ai-investigations.json  # AI investigation pipeline monitoring
                application-health.json # Application HTTP health probes + external proxy status
    ha-integration/
        sensors.yaml            # HA REST sensors querying Prometheus API (30 sensors)
        automations.yaml        # HA automations: critical/high alerts, backup overdue,
                                #   pool health, daily summary (5 automations)
        lovelace-dashboard.yaml # HA Lovelace dashboard: Infrastructure Monitoring view
        ai-investigation-sensors.yaml     # 3 REST sensors for investigator API
        ai-investigation-automations.yaml # 1 automation: investigation complete
    tests/
        __init__.py
        conftest.py             # Shared fixtures, mock data, helper functions
        test_common.py          # Tests for common.py utilities and clients
        test_pve_exporter.py    # Tests for PVECollector
        test_nas_exporter.py    # Tests for NASCollector
        test_pbs_exporter.py    # Tests for PBSCollector
        test_docker_exporter.py # Tests for DockerCollector
        test_opnsense_exporter.py # Tests for OPNsenseCollector
        test_investigator_config.py  # Tests for InvestigatorConfig
        test_schemas.py              # Tests for Pydantic schemas
        test_safety.py               # Tests for safety guardrails
        test_cooldown.py             # Tests for cooldown tracker
        test_cli_client.py           # Tests for Claude CLI subprocess client
        test_history.py              # Tests for SQLite history
        test_runbook_loader.py       # Tests for runbook loader
        test_metrics.py              # Tests for Prometheus metrics
        test_webhook.py              # Tests for aiohttp webhook server
        test_notifier.py             # Tests for email/HA notifications
        test_investigator_orchestrator.py # Tests for investigation orchestrator
        test_server_integration.py       # Integration tests for CompositeCollector + register_collectors
        test_dashboard_cache.py              # Tests for TTLCache (9 tests)
        test_dashboard_api.py                # Tests for overview, PVE, storage, docker handlers (8 tests)
        test_dashboard_server.py             # Integration tests for server, energy, certs, media, timeseries, alerts (11 tests)
        test_dashboard_validation.py     # Grafana dashboard JSON validation (24 tests)
        test_alertmanager_config.py      # Alertmanager config validation (9 tests)
```

## Architecture

### Agentless Collection Model

All metrics are collected without installing agents on monitored hosts. The exporter container reaches out to each system using its native API or SSH:

```
                          +--------------------+
                          |  Prometheus :9090  |  <-- scrapes /metrics every 30s
                          +--------------------+
                                     |
                          +--------------------+
                          |  Exporters  :9100  |  <-- single HTTP server, all collectors
                          +--------------------+
                            /     |     |     \      \
                      REST API   SSH   REST   REST     SSH
                         |        |     |      |        |
                      PVE (5)    NAS   OPN    PBS    Docker (5)
                      urllib    prmk   req  urllib    paramiko
```

**Collection methods per system:**

| System | Method | Client | Auth |
|---|---|---|---|
| PVE cluster (5 nodes) | REST API | `PVEClient` (urllib) | API token in header |
| NAS (Debian 13) | SSH + CLI | `SSHClient` (paramiko) | SSH key |
| OPNsense (REST) | REST API | `OPNsenseClient` (requests) | Base64 Basic auth |
| OPNsense (Tailscale) | SSH | `SSHClient` (paramiko, opnsense_mode) | SSH key |
| PBS | REST API | `PBSClient` (urllib) | API token in header |
| Docker hosts (5) | SSH + CLI | `SSHClient` (paramiko) | SSH key |
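The collectors follow `prometheus_client`'s custom-collector pattern: a class with a `collect()` method that the registry invokes on every scrape. The sketch below is illustrative only — host, user, and metric name are placeholders, and it omits the `collector_error_handler` wrapping the real exporters use — but it shows the shape of an agentless SSH collector built from paramiko + prometheus_client.

```python
# Illustrative sketch only -- not the project's SSHClient/collector code.
# Shows the agentless pattern: run a command over SSH, expose the result
# as a gauge that is re-collected on every Prometheus scrape.
import time

import paramiko
from prometheus_client import start_http_server
from prometheus_client.core import REGISTRY, GaugeMetricFamily


class UptimeCollector:
    """Reads /proc/uptime over SSH each time Prometheus scrapes /metrics."""

    def __init__(self, host: str, user: str, key_path: str) -> None:
        self.host, self.user, self.key_path = host, user, key_path

    def collect(self):
        gauge = GaugeMetricFamily(
            "host_uptime_seconds", "Host uptime via SSH", labels=["host"]
        )
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            client.connect(
                self.host, username=self.user, key_filename=self.key_path, timeout=10
            )
            _, stdout, _ = client.exec_command("cat /proc/uptime")
            gauge.add_metric([self.host], float(stdout.read().decode().split()[0]))
        finally:
            client.close()
        yield gauge


if __name__ == "__main__":
    REGISTRY.register(UptimeCollector("192.0.2.10", "root", "/root/.ssh/id_ed25519"))
    start_http_server(9100)  # /metrics on :9100, same port the real exporters use
    while True:
        time.sleep(60)
```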

### Prometheus + Grafana + Uptime Kuma

- **Prometheus** scrapes the custom-exporters endpoint every 30s with a 25s timeout. Alert rules are evaluated every 30s-120s depending on the rule group. Recording rules pre-compute ratios and aggregations. Alertmanager routes alerts to the AI investigator webhook.
- **Grafana** is auto-provisioned with the Prometheus datasource and 7 dashboards. SMTP alerting is configured via the ancillary-vm Postfix relay.
- **Uptime Kuma** provides independent uptime monitoring (ping, HTTP, DNS, port checks). 30 monitors provisioned via `scripts/setup-uptime-kuma.py`. Separate from Prometheus -- if exporters fail, Kuma still monitors reachability.

### Home Assistant Integration

REST sensors in `ha-integration/` query the Prometheus HTTP API on claude-internal-tools:9090. These are manually copied to the HA configuration directory. Automations provide push notifications for critical/high alerts, backup overdue, pool degradation, and a daily infrastructure summary.
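Each REST sensor boils down to a Prometheus instant query. A minimal sketch of that call, with a placeholder Prometheus address and an example PromQL expression (the real expressions live in `ha-integration/sensors.yaml`):

```python
# Shape of the Prometheus HTTP API call behind each HA REST sensor.
# The address and the PromQL expression below are placeholders.
import requests

PROMETHEUS = "http://192.0.2.20:9090"


def query_scalar(expr: str) -> float:
    """Run a PromQL instant query and return the first sample's value."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": expr}, timeout=10
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")


if __name__ == "__main__":
    # Example expression in the spirit of sensor.infra_alerts_critical;
    # the severity label is an assumption here.
    print(query_scalar('count(ALERTS{alertstate="firing", severity="critical"}) or vector(0)'))
```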

## Configuration

### Environment Variables (HM_ prefix)

All application config is read from environment variables with the `HM_` prefix. Copy `.env.example` to `.env` and fill in actual values.

| Variable | Purpose | Default |
|---|---|---|
| `HM_EXPORTER_PORT` | Exporter HTTP server port | `9100` |
| `HM_LOG_LEVEL` | Logging level | `INFO` |
| `HM_GRAFANA_ADMIN_USER` | Grafana admin username | `admin` |
| `HM_GRAFANA_ADMIN_PASSWORD` | Grafana admin password | (required) |
| `HM_KUMA_ADMIN_USER` | Uptime Kuma admin username | `admin` |
| `HM_KUMA_ADMIN_PASSWORD` | Uptime Kuma admin password | (required) |
| `HM_PVE_API_BASE` | PVE API base URL | `https://###_IP:8006/api2/json` |
| `HM_PVE_TOKEN_ID` | PVE API token ID | `root@pam!claude-code` |
| `HM_PVE_TOKEN_SECRET` | PVE API token secret | (required) |
| `HM_NAS_SSH_HOST` | NAS SSH host | `###_IP` |
| `HM_NAS_SSH_USER` | NAS SSH user | `root` |
| `HM_OPNSENSE_API_BASE` | OPNsense API base URL | `https://###_IP/api` |
| `HM_OPNSENSE_API_KEY` | OPNsense API key | (required) |
| `HM_OPNSENSE_API_SECRET` | OPNsense API secret | (required) |
| `HM_PBS_API_BASE` | PBS REST API base URL | `https://###_IP:8007/api2/json` |
| `HM_PBS_TOKEN_ID` | PBS API token ID | (required) |
| `HM_PBS_TOKEN_SECRET` | PBS API token secret (UUID) | (required) |
| `HM_PBS_TOKEN` | Pre-assembled PBS auth header (alternative) | (optional) |
| ~~`HM_PBS_HOST`~~ | Deprecated — PBS SSH host (replaced by REST API) | — |
| ~~`HM_PBS_USER`~~ | Deprecated — PBS SSH user (replaced by REST API) | — |
| `HM_DOCKER_DEV_HOST` | Docker dev host IP | `###_IP` (claude-dev DECOMMISSIONED 2026-04-24) |
| `HM_DOCKER_PROD_HOST` | Docker prod host IP | `###_IP` (claude-prod DECOMMISSIONED 2026-04-24) |
| `HM_DOCKER_INTERNAL_HOST` | Docker internal-tools host IP | `###_IP` |
| `HM_DOCKER_NAS_HOST` | Docker NAS host IP | `###_IP` |
| `HM_DOCKER_MEDIA_HOST` | Docker media-vm host IP | `###_IP` |
| `HM_DOCKER_USER` | Docker hosts SSH user | `root` |
| `HM_SSH_KEY_PATH` | SSH private key path | `/root/.ssh/id_ed25519` |
| `HM_OPNSENSE_SSH_HOST` | OPNsense SSH host | `###_IP` |
| `HM_OPNSENSE_SSH_USER` | OPNsense SSH user | `root` |
| `HM_SMTP_HOST` | SMTP relay host | `###_IP` |
| `HM_SMTP_PORT` | SMTP relay port | `25` |
| `HM_SMTP_FROM` | Alert email sender | `claude@###_DOMAIN` |
| `HM_ALERT_EMAIL` | Alert email recipient | `stuart@###_DOMAIN` |
| `HM_PROXY_WAN_IP` | External proxy public IP | `###_IP` |
| `HM_PROXY_TAILSCALE_IP` | External proxy Tailscale IP | `###_IP` |
| `HM_PROXY_CANARY_URL` | HTTPS canary URL for :443 probe | `https://ha.###_DOMAIN` |
| `HM_PROXY_PROBE_TIMEOUT` | HTTP probe timeout seconds | `5` |
| `HM_NPM_EXTERNAL_EMAIL` | NPM admin email for API auth | (required) |
| `HM_NPM_EXTERNAL_PASSWORD` | NPM admin password for API auth | (required) |
| `HM_AI_WEBHOOK_SECRET` | Shared secret for Alertmanager webhook auth | (empty = disabled) |

**Note:** The PVE token is assembled as `PVEAPIToken=${HM_PVE_TOKEN_ID}=${HM_PVE_TOKEN_SECRET}` in the `PVEClient`. The token ID in use is `root@pam!claude-code`. The `.env` file contains `HM_PVE_TOKEN` as the pre-assembled `Authorization` header value.
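A minimal sketch of that assembly and the urllib call pattern (not `PVEClient` verbatim; the endpoint is just an example):

```python
# Token assembly + urllib request with SSL verification disabled, as described above.
import json
import os
import ssl
import urllib.request

base = os.environ["HM_PVE_API_BASE"]  # e.g. https://<node>:8006/api2/json
token = f"PVEAPIToken={os.environ['HM_PVE_TOKEN_ID']}={os.environ['HM_PVE_TOKEN_SECRET']}"

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE  # self-signed certs on all internal services

req = urllib.request.Request(f"{base}/cluster/status", headers={"Authorization": token})
with urllib.request.urlopen(req, context=ctx, timeout=10) as resp:
    print(json.loads(resp.read())["data"])
```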

## Deployment

### Target

- **Host:** claude-internal-tools (VM 209, ###_IP)
- **Deploy command:** `/projects/scripts/deploy.sh home-monitoring internal`

### First-Time Setup

1. Create `.env` from `.env.example` with actual credentials
2. Deploy: `/projects/scripts/deploy.sh home-monitoring internal`
3. Wait for containers to start (~30s)
4. Create Uptime Kuma admin user via web UI at `http://###_IP:3001`
5. Run monitor provisioning:
   ```bash
   HM_KUMA_ADMIN_###_REDACTED python scripts/setup-uptime-kuma.py
   # Or reconcile existing monitors (create missing, update changed, report orphans):
   HM_KUMA_ADMIN_###_REDACTED python scripts/setup-uptime-kuma.py --reconcile
   # With --delete-orphans to remove monitors not in MONITORS list:
   HM_KUMA_ADMIN_###_REDACTED python scripts/setup-uptime-kuma.py --reconcile --delete-orphans
   # Dry-run mode (log actions without making changes):
   HM_KUMA_ADMIN_###_REDACTED python scripts/setup-uptime-kuma.py --reconcile --dry-run
   ```
6. Access Grafana at `http://###_IP:3000` (dashboards auto-provisioned)
7. Copy `ha-integration/*.yaml` files to Home Assistant config directory manually

### Services (Docker Compose + systemd)

| Service | Container Name | Port | Memory Limit | Image / Runtime | Notes |
|---|---|---|---|---|---|
| exporters | home-monitoring-exporters | 9100:9100 | 256 MB | Custom (Dockerfile.exporters) | Python 3.13-slim, all collectors, `/health` + `/metrics` |
| prometheus | home-monitoring-prometheus | 9090:9090 | 512 MB | prom/prometheus:3.10.0 | 14d retention, lifecycle API |
| grafana | home-monitoring-grafana | 3000:3000 | 384 MB | grafana/grafana-oss:12.4.0 | Auto-provisioned datasource + dashboards |
| uptime-kuma | home-monitoring-uptime-kuma | 3001:3001 | 256 MB | louislam/uptime-kuma:1 | Persistent data volume |
| alertmanager | home-monitoring-alertmanager | 9093:9093 | 128 MB | prom/alertmanager:0.31.1 | Routes alerts to investigator webhook |
| dashboard | home-monitoring-dashboard | 8080:8080 | 128 MB | Custom (Dockerfile.dashboard) | aiohttp API + Alpine.js/ECharts SPA, `/health` + `/api/*` |
| investigator | home-monitoring-investigator | 8099 | 512 MB | systemd (host) | AI investigation via `claude -p` CLI |

All services have `restart: unless-stopped`, JSON-file log rotation (10m/3 files), named volumes for persistent data, and Docker healthchecks.

### Port Allocation

| Port | Service | Access |
|---|---|---|
| 9100 | Custom Exporters | Prometheus scrape (internal) |
| 9090 | Prometheus | Web UI + API |
| 9093 | Alertmanager | Web UI + webhook receiver |
| 8080 | Dashboard | Bespoke HUD (proxied via `monitor.###_DOMAIN`) |
| 8099 | AI Investigator | Webhook + metrics + API |
| 3000 | Grafana | Dashboards |
| 3001 | Uptime Kuma | Status page + admin |

## Development Commands

### Lint

```bash
ruff check exporters/ investigator/ dashboard/ tests/
ruff format --check exporters/ investigator/ dashboard/ tests/
```

### Type Check

```bash
pyright exporters/ dashboard/
```

### Tests

```bash
pytest tests/ -v
```

### All Checks

```bash
ruff check exporters/ dashboard/ tests/ && ruff format --check exporters/ dashboard/ tests/ && pyright exporters/ dashboard/ && pytest tests/ -v
```

### YAML Lint (config files)

```bash
yamllint config/ ha-integration/
```

## Git Commands

```bash
# All git operations use -C from claudes-palace
git -C /projects/internal-apps/home-monitoring status
git -C /projects/internal-apps/home-monitoring add -A
git -C /projects/internal-apps/home-monitoring commit -m "message"
git -C /projects/internal-apps/home-monitoring log --oneline -10
```

## Monitored Systems

### PVE Cluster (5 nodes)

| Node | IP | Metrics |
|---|---|---|
| node-4 | ###_IP | CPU, memory, disk, uptime, guests |
| vanlint-ha-1 | ###_IP | CPU, memory, disk, uptime, guests |
| vanlint-ha-2 | ###_IP | CPU, memory, disk, uptime, guests |
| vanlint-ha-3 | ###_IP | CPU, memory, disk, uptime, guests |
| vanlint-ha-4 | ###_IP | CPU, memory, disk, uptime, guests |

**Additional cluster metrics:** Quorum status, HA resource states (7 resources), ZFS replication jobs (7 jobs), cluster storage overview.

**API endpoints:** `/cluster/status`, `/cluster/resources`, `/cluster/ha/resources`, `/nodes/{node}/replication`, `/storage`
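A hedged sketch of how node/guest counts can be derived from a `/cluster/resources` response (the `type`/`status` field names follow the PVE API; treat them as assumptions here):

```python
# Derive headline counts from /cluster/resources entries (illustrative only).
def summarise_cluster(resources: list[dict]) -> dict[str, int]:
    nodes = [r for r in resources if r.get("type") == "node"]
    guests = [r for r in resources if r.get("type") in ("qemu", "lxc")]
    return {
        "nodes_total": len(nodes),
        "nodes_online": sum(1 for n in nodes if n.get("status") == "online"),
        "guests_total": len(guests),
        "guests_running": sum(1 for g in guests if g.get("status") == "running"),
    }
```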

### NAS (Debian 13)

- ZFS pool health, used/free/fragmentation, scrub errors/timestamps
- Disk temperatures and SMART status (per physical disk, via smartctl -j)
- Systemd service status (smbd, nfs-server, docker, target, sshd)

**SSH commands:** `zpool list -Hp`, `zpool status`, `smartctl -A/-H /dev/sdX -j`, `lsblk -dpno NAME,TYPE`, `systemctl is-active`
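A hedged sketch of parsing `zpool list -Hp` output (tab-separated, no header). The column order shown matches recent OpenZFS releases (NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT) and should be verified against the NAS:

```python
# Parse `zpool list -Hp` output into per-pool metrics (illustrative only).
def parse_zpool_list(output: str) -> list[dict]:
    pools = []
    for line in output.strip().splitlines():
        cols = line.split("\t")
        pools.append(
            {
                "name": cols[0],
                "size_bytes": int(cols[1]),
                "alloc_bytes": int(cols[2]),
                "free_bytes": int(cols[3]),
                "frag_percent": int(cols[6].rstrip("%")),  # assumes a numeric FRAG column
                "health": cols[9],  # ONLINE / DEGRADED / FAULTED / ...
            }
        )
    return pools
```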

### Proxmox Backup Server (PBS)

- Datastore capacity (via REST API `/admin/datastore/{store}/status`)
- Backup task history: last success timestamp, duration per guest
- Task status counts (backup/verify/gc/sync OK vs error)
- Verify errors, GC run timestamps, sync job freshness

**API endpoints:** `GET /admin/datastore`, `GET /admin/datastore/{store}/status`, `GET /nodes/localhost/tasks?limit=100&typefilter=backup,verify,garbage_collection,syncjob`
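A hedged sketch of the task-status aggregation (OK vs error per task type). The `worker_type`/`status` field names reflect the PBS task API but are assumptions here:

```python
# Count finished PBS tasks by (worker_type, outcome); running tasks have no status yet.
from collections import Counter


def count_task_outcomes(tasks: list[dict]) -> Counter:
    counts: Counter = Counter()
    for task in tasks:
        status = task.get("status")
        if status is None:
            continue  # task still running
        outcome = "ok" if status == "OK" else "error"
        counts[(task.get("worker_type", "unknown"), outcome)] += 1
    return counts
```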

### OPNsense Firewall/Router

- Interface statistics (bytes in/out, errors, up/down status)
- Gateway status (online/offline, RTT, packet loss)
- PF firewall stats (state table current/limit)
- Tailscale peer status (total/online peers, per-peer status)

**API endpoints:** `/diagnostics/interface/getInterfaceStatistics`, `/routes/gateway/status`, `/diagnostics/firewall/pf_statistics`

**SSH (Tailscale):** `TAILSCALE_SOCKET=/var/run/tailscale/tailscaled.sock tailscale status --json`
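A hedged sketch of the csh workaround and the peer counting. The `sh -c` wrapping mirrors the `opnsense_mode` behaviour described under Key Design Decisions; the `Peer`/`Online` JSON fields follow `tailscale status --json` output but are assumptions here:

```python
# Wrap commands for OPNsense's csh default shell, then count Tailscale peers.
import json
import shlex


def wrap_for_opnsense(cmd: str) -> str:
    """Force a POSIX shell so env-var prefixes and quoting behave as expected."""
    return f"sh -c {shlex.quote(cmd)}"


def count_peers(status_json: str) -> tuple[int, int]:
    """Return (total peers, online peers) from `tailscale status --json` output."""
    peers = json.loads(status_json).get("Peer") or {}
    return len(peers), sum(1 for p in peers.values() if p.get("Online"))


TAILSCALE_CMD = wrap_for_opnsense(
    "TAILSCALE_SOCKET=/var/run/tailscale/tailscaled.sock tailscale status --json"
)
```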

### Docker Hosts (4)

| Host | IP | Role |
|---|---|---|
| claude-internal-tools | ###_IP | Internal tools (including this suite) — pending LXC migration |
| nas | ###_IP | NAS media stack (Sonarr, Radarr, SABnzbd, Tdarr, TVHeadend, Antennas) |
| media-vm | ###_IP | Media services (Plex, Overseerr, Tautulli) |
| email-bot-arn | ###_IP | Email bot production (ARN) |

**Per-container metrics:** Running status, CPU %, memory usage/limit, restart count
**Per-host metrics:** Total/running containers, root disk usage, memory usage, image count/size

**SSH commands:** `docker ps -a --format '{{json .}}'`, `docker stats --no-stream --format '{{json .}}'`, `df -B1 /`, `free -b`, `docker image ls --format '{{json .}}'`
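A hedged sketch of parsing the line-delimited JSON that `docker ps -a --format '{{json .}}'` emits (one object per line; the `State` field name matches current Docker CLI output but is an assumption here):

```python
# Summarise container state from `docker ps -a --format '{{json .}}'` output.
import json


def summarise_containers(output: str) -> dict[str, int]:
    containers = [json.loads(line) for line in output.strip().splitlines() if line]
    return {
        "containers_total": len(containers),
        "containers_running": sum(1 for c in containers if c.get("State") == "running"),
    }
```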

## Alert Rules (10 files, evaluated by Prometheus)

| File | Alert Count | Key Alerts |
|---|---|---|
| `node_alerts.yml` | 11 | PVE node down, cluster quorum lost, guest down, HA not started, replication failed/stale |
| `storage_alerts.yml` | 7 | NAS pool degraded, SMART failure, disk hot, pool >80%, scrub errors, PBS datastore >85% |
| `backup_alerts.yml` | 6 | Backup stale >26h, missing >48h, verify errors, sync failure, ZFS replication lag (critical VMs vs others) |
| `docker_alerts.yml` | 7 | Container not running, restarting, high memory/CPU, host disk/memory high, zero running containers |
| `service_alerts.yml` | 8 | WAN down, high RTT/packet loss, Tailscale peers reduced/down/none, interface down/errors |
| `meta_alerts.yml` | 4 | Exporter scrape slow, Prometheus storage near full, Alertmanager notifications failing, target down |
| `cert_alerts.yml` | 3 | TLS certificate expiring warning (<30d), expiring soon (<7d), expired |
| `proxy_alerts.yml` | 4 | External proxy WAN down, WAN degraded (NPM up/:443 fail), Tailscale path down, slow HTTPS response |
| `app_health_alerts.yml` | 4 | Application down, slow response (>10s), content mismatch, HTTP 5xx error |
| `docker_health_alerts.yml` | 6 | Container unhealthy (healthcheck), OOM killed, restart loop (>5), no memory limit, log bloat (>100MB), stale image (>180d) |

## Grafana Dashboards (7, auto-provisioned)

All dashboards use a consistent visual design system: category-coloured section dividers (HTML text panels), sparkline-enabled stat panels, gradient-filled timeseries, and shared crosshair tooltips (`graphTooltip: 2`).

| Dashboard | UID | Focus |
|---|---|---|
| Home Infrastructure Overview | `home-overview` | Top-level summary landing page (27 panels, 6 sections) |
| PVE Cluster | `cluster-overview` | Per-node resources, guests, HA, replication, storage |
| Storage & Backups | `storage-health` | NAS ZFS pools/disks/SMART, PBS datastores/backup freshness |
| Docker Hosts | `docker-hosts` | Container status/resources across all 4 hosts |
| Network & Firewall | `network-firewall` | OPNsense interfaces, gateways, PF stats, Tailscale |
| Application Health | `application-health` | App HTTP health probes, response times, availability history, external proxy status |
| AI Investigations | `ai-investigations` | AI investigation pipeline monitoring |

**Category colour scheme:** Proxmox=#3274D9 (blue), NAS=#1F9E89 (teal), PBS=#8F3BB8 (purple), Docker=#00B2E2 (cyan), Network=#FF9830 (orange), Alerts=#F2495C (red)

**URL:** `https://monitor.###_DOMAIN` (proxied) or `http://###_IP:3000` (direct)

## Home Assistant Integration

Files in `ha-integration/` are manually copied to the HA configuration directory. Not deployed via Docker.

### REST Sensors (30)

`sensors.yaml` defines 30 base sensors; the principal ones are listed below.

| Category | Sensor | Entity ID |
|---|---|---|
| PVE | Nodes Online | `sensor.infra_pve_nodes_online` |
| PVE | Cluster Quorum | `sensor.infra_pve_cluster_quorum` |
| PVE | Guests Running | `sensor.infra_pve_guests_running` |
| PVE | HA Resources Started | `sensor.infra_pve_ha_resources_started` |
| PVE | HA Resources Total | `sensor.infra_pve_ha_resources_total` |
| PVE | Guests Total | `sensor.infra_pve_guests_total` |
| PVE | Nodes Total | `sensor.infra_pve_nodes_total` |
| NAS | Pool Health | `sensor.infra_nas_pool_health` |
| NAS | Pool Usage % | `sensor.infra_nas_pool_usage_percent` |
| NAS | Pool Free Bytes | `sensor.infra_nas_pool_free_bytes` |
| NAS | Disk Max Temp | `sensor.infra_nas_disk_max_temp` |
| PBS | Backup Max Age | `sensor.infra_pbs_backup_max_age_hours` |
| PBS | Datastore Usage | `sensor.infra_pbs_datastore_usage` |
| PBS | Verify Errors | `sensor.infra_pbs_verify_errors` |
| PBS | Datastore Free Bytes | `sensor.infra_pbs_datastore_free_bytes` |
| Docker | Containers Running | `sensor.infra_docker_containers_running` |
| Docker | Containers Total | `sensor.infra_docker_containers_total` |
| Network | Tailscale Peers Total | `sensor.infra_tailscale_peers` |
| Network | Tailscale Peers Online | `sensor.infra_tailscale_peers_online` |
| Network | Gateway RTT | `sensor.infra_gateway_rtt` |
| Network | Gateway Loss | `sensor.infra_gateway_loss` |
| Network | Gateway Status | `sensor.infra_gateway_status` |
| Replication | Max Lag | `sensor.infra_replication_max_lag_minutes` |
| Alerts | Critical | `sensor.infra_alerts_critical` |
| Alerts | High | `sensor.infra_alerts_high` |
| Alerts | Medium | `sensor.infra_alerts_medium` |
| Alerts | Total | `sensor.infra_alerts_total` |

### Automations (5)

| Automation | Trigger | Action |
|---|---|---|
| Critical Alert | `sensor.infra_alerts_critical > 0` for 1 min | Push + persistent notification |
| High Alert | `sensor.infra_alerts_high > 0` for 2 min | Push notification |
| Backup Overdue | `sensor.infra_pbs_backup_max_age_hours > 26` for 5 min | Push notification |
| Pool Health | `sensor.infra_nas_pool_health < 1` for 1 min | Push + persistent notification |
| Daily Summary | 08:00 daily | Push with all key metrics |

### Lovelace Dashboard

Single-view dashboard with 7 sections using markdown headers: Infrastructure Health (8 KPIs), Proxmox Cluster, NAS Storage, PBS Backups, Docker Containers, Network & VPN, Alerts (conditional) + Quick Links. Uses only native HA cards (no HACS dependencies).

## Key Design Decisions

- **Agentless architecture:** No agents installed on monitored hosts. All collection uses native REST APIs or SSH with existing credentials. Reduces maintenance burden and avoids modifying production systems.
- **PVE uses urllib (not requests):** The PVE API token format `root@pam!claude-code` contains a `!` character that causes shell escaping issues with some HTTP libraries. `urllib.request` from stdlib handles it correctly.
- **OPNsense SSH uses sh -c wrapper:** OPNsense runs FreeBSD with csh as the default shell. All SSH commands are automatically wrapped in `sh -c '...'` by the `SSHClient` when `opnsense_mode=True`, ensuring POSIX shell compatibility.
- **PBS uses REST API with `PBSClient` (urllib):** PBS supports API token authentication (`PBSAPIToken` header format, mirroring PVE's `PVEAPIToken`). `PBSClient` in `common.py` follows the same pattern as `PVEClient` — urllib, no SSL verify, token in `Authorization` header. API endpoints replace the previous SSH+CLI approach entirely.
- **OPNsense API secret uses manual Base64:** The OPNsense API secret contains `/` and `+` characters. Standard HTTP basic auth handling in some libraries misparses these, so the `key:secret` pair is manually base64-encoded.
- **TAILSCALE_SOCKET must be set:** On OPNsense, the Tailscale CLI requires `TAILSCALE_SOCKET=/var/run/tailscale/tailscaled.sock` to be set before any `tailscale` command.
- **collector_error_handler for graceful degradation:** A decorator on each collector's `collect()` method catches exceptions and yields `exporter_target_up=0` and `exporter_scrape_errors_total` instead of crashing the entire metrics endpoint. One failing target does not prevent other targets from being scraped (a minimal sketch follows this list).
- **Single exporter process:** All collectors run in a single process and HTTP server (port 9100). Prometheus scrapes trigger all collectors on each request. This keeps the container count low and simplifies configuration.
- **Alertmanager for AI investigation:** Alertmanager routes firing alerts to the AI investigator webhook. Webhook authentication via shared secret (`HM_AI_WEBHOOK_SECRET`) in the `X-Webhook-Secret` header.
- **Self-signed cert handling:** All monitored systems use self-signed certificates. SSL verification is disabled in all API clients (`verify=False`, custom `ssl.SSLContext`).
- **NAS uses SSH+CLI (not REST API):** After migration from TrueNAS SCALE to bare-metal Debian 13, ZFS pool, disk, and service metrics are collected via `zpool`, `smartctl` (JSON mode), and `systemctl` CLIs over SSH. The `smartctl -j` flag provides structured JSON output, avoiding fragile text parsing across disk types.
- **Named volumes for persistence:** Prometheus data (14d), Grafana state, and Uptime Kuma data use Docker named volumes rather than bind mounts.
- **Cooldown persistence:** The CooldownTracker optionally persists cooldown state to the SQLite database (`db_path`). On process restart, non-expired cooldowns are restored from the `cooldowns` table, preventing duplicate investigations after service restarts.
- **Docker-to-host connectivity via host.docker.internal:** The AI investigator runs as a systemd service on the host (port 8099), not in Docker. Docker bridge containers (alertmanager, prometheus, dashboard) reach it via `host.docker.internal`, which is mapped to the Docker bridge gateway IP using `extra_hosts: ["host.docker.internal:host-gateway"]` in `docker-compose.yml`. A hardcoded LAN IP (e.g. `###_IP`) fails because the host nftables firewall drops ingress from Docker bridge interfaces. The required nftables rule on claude-internal-tools is: `nft add rule inet host_filter input iifname "br-*" tcp dport 8099 accept`.
- **Topology config file:** Docker hosts and NAS services are externalised to `config/topology.yml`. The `DockerCollector` and `NASCollector` load topology at startup, falling back to hardcoded defaults if the file is absent. Adding a new Docker host requires only a YAML edit + redeploy.
- **Health endpoint:** The exporter HTTP server serves `/health` returning JSON with overall status (`healthy`/`degraded`/`starting`), uptime, and per-collector last-scrape results. The Docker healthcheck uses `/health` rather than `/metrics` to avoid triggering a full scrape.
- **SSH self-test on startup:** On exporter startup, each SSH client is tested with `echo ok`. Results are logged at INFO/WARNING level. Startup is never blocked on failure -- this surfaces connectivity issues in the first log lines.
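A minimal sketch of the `collector_error_handler` pattern described in the list above. The metric names match the description; the implementation details are illustrative, not the project's code:

```python
# Graceful degradation: one failing target must not break the whole /metrics scrape.
import functools
import logging

from prometheus_client.core import CounterMetricFamily, GaugeMetricFamily

log = logging.getLogger(__name__)
_errors: dict[str, int] = {}


def collector_error_handler(target: str):
    """Wrap a collect() generator; on failure emit exporter_target_up=0 instead of raising."""

    def decorator(collect_fn):
        @functools.wraps(collect_fn)
        def wrapper(self):
            up = GaugeMetricFamily(
                "exporter_target_up",
                "1 if the last scrape of this target succeeded",
                labels=["target"],
            )
            errors = CounterMetricFamily(
                "exporter_scrape_errors_total", "Scrape errors per target", labels=["target"]
            )
            try:
                yield from collect_fn(self)
                up.add_metric([target], 1.0)
            except Exception:
                log.exception("scrape of %s failed", target)
                _errors[target] = _errors.get(target, 0) + 1
                up.add_metric([target], 0.0)
            errors.add_metric([target], float(_errors.get(target, 0)))
            yield up
            yield errors

        return wrapper

    return decorator
```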

## Uptime Kuma Monitors (30)

Provisioned via `scripts/setup-uptime-kuma.py`. Grouped by category:

| Category | Count | Monitor Type | Targets |
|---|---|---|---|
| PVE Nodes | 5 | HTTP (8006) | All 5 cluster nodes |
| VMs | 5 | Ping | All VMs (network-vm, HA, ancillary, claude-internal-tools, claudes-palace) |
| LXC | 1 | Ping | Technitium primary (CT 111) |
| Infrastructure | 4 | HTTP/Ping | NAS (ping), PBS Web UI, OPNsense, ZenWiFi |
| Services | 8 | DNS/HTTP/Port | DNS primary+secondary, NPM, HA API, Postfix, AgentDVR, iSCSI, Plex |
| Web Apps | 7 | HTTP | MTG Helper (dev+3 prod), Email Bot, Design Docs, Email Responder |
| Self-monitoring | 3 | HTTP | Prometheus, Grafana, Custom Exporters (localhost) |

**Notification:** All monitors send email via Postfix relay (###_IP:25) to stuart@###_DOMAIN.
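The provisioning flow in `setup-uptime-kuma.py` reduces to a login plus a create-if-missing pass through the `uptime-kuma-api` client listed in the stack table. A hedged sketch with placeholder host, monitor name, and URL:

```python
# Create a monitor only if it does not already exist (the core of --reconcile).
import os

from uptime_kuma_api import MonitorType, UptimeKumaApi

api = UptimeKumaApi("http://192.0.2.20:3001")
api.login(os.environ.get("HM_KUMA_ADMIN_USER", "admin"), os.environ["HM_KUMA_ADMIN_PASSWORD"])
try:
    existing = {m["name"] for m in api.get_monitors()}
    if "Grafana" not in existing:
        api.add_monitor(type=MonitorType.HTTP, name="Grafana", url="http://192.0.2.20:3000")
finally:
    api.disconnect()
```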

## Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: "custom-exporters"     # All custom collectors
    scrape_interval: 30s
    scrape_timeout: 25s
    static_configs:
      - targets: ["exporters:9100"]

  - job_name: "prometheus"           # Self-monitoring
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "grafana"              # Grafana metrics
    static_configs:
      - targets: ["grafana:3000"]

  - job_name: "investigator"         # AI investigator (systemd on the host, via host.docker.internal)
    static_configs:
      - targets: ["host.docker.internal:8099"]
```

The `investigator` job uses `host.docker.internal` (resolved to the Docker bridge gateway via `extra_hosts: ["host.docker.internal:host-gateway"]` in `docker-compose.yml`) rather than a hardcoded LAN IP. The same hostname is used in the Alertmanager webhook URL and the dashboard `/api/investigations` proxy. See **Key Design Decisions** for the nftables requirement.

## Security Notes

- **Credentials in .env file:** API tokens, passwords, and SSH key paths are stored in `.env` (not committed to git via `.gitignore`)
- **Self-signed certs:** All SSL verification disabled for internal services
- **SSH key-based auth:** All SSH connections use key-based authentication (no passwords)
- **No external exposure:** All monitoring services are LAN-only except Grafana (proxied via monitor.###_DOMAIN)
- **Grafana:** Anonymous access disabled, sign-up disabled, admin password required
