# Home Monitoring Suite — High-Level Design (current state)

**Status as of 2026-04-26:** Stack version 0.1.0. 8 Docker services + 1
systemd service. Tests post-Wave-5.1 fixes: **2098 collected, 974
passed, 7 pre-existing failures, 0 errors** (10 prior errors closed by
adding `pytest-aiohttp` to dev deps in this wave). 1 slow test suite
(`tests/test_dashboard_gaps.py`) excluded from the default run — it
needs network access. Project lives at
`/projects/apps/internal/home-monitoring/` (memory and several inline
references say `/projects/internal-apps/home-monitoring/` — that path
does not exist; see "Open contradictions").

Source footprint:

| Tree | Files | LOC |
|---|---|---|
| `exporters/` | 17 .py modules | ~4 700 |
| `dashboard/` | 15 .py modules | ~1 500 |
| `investigator/` | 14 .py modules | ~3 400 |
| `scripts/` | 4 .py | ~600 |
| `tests/` | 40 .py | ~33 000 |
| Total .py | 90 modules | ~10 000 src + ~33 000 tests |
| Grafana dashboards | 7 JSON | ~1 000 |
| Alert rule files | 10 YAML | ~600 |
| Runbooks | 16 YAML + index | ~700 |
| HA integration | 5 YAML | ~500 |

---

## 1. Architecture overview

### 1.1 Three-tier collection / storage / surface model

```
                 +----------------------------------+
                 |  Targets (agentless)             |
                 |  PVE x5, NAS, OPNsense, PBS,     |
                 |  Docker x5, Powerwall, Plex,     |
                 |  ZenWiFi, Postfix, NPM, certs    |
                 +----------------------------------+
                              ^
                              | REST / SSH / urllib / paramiko
                              |
                 +----------------------------------+
                 |  exporters/ (port 9100)          |
                 |  CompositeCollector +            |
                 |  17 per-system collectors        |
                 +----------------------------------+
                              ^
                              | HTTP scrape /metrics every 30s
                              |
       +--------------------+ | +---------------------+
       | Prometheus :9090   |<+-+ Uptime Kuma :3001   |
       | 14d retention      |   | 37 monitors         |
       | 10 alert rule files|   | (independent path)  |
       +--------------------+   +---------------------+
                |        |
        rules / |        | scrape
                v        v
       +---------------------------+      +-------------------+
       | Alertmanager :9093        |----->| investigator      |
       | route + group + inhibit   |      | :8099 (systemd)   |
       +---------------------------+      | claude -p AI loop |
                              |           +-------------------+
                              v
                +---------------------------+
                | Grafana :3000  (7 dash)   |
                | + dashboard/ (port 8080)  |
                | bespoke aiohttp + Alpine  |
                +---------------------------+
                              ^
                              | OTLP traces (HTTP 6006 / gRPC 4317)
                              |
       +--------------------+ |
       | Phoenix :6006      |-+
       | OTEL backend       |
       | (auth on, sqlite)  |
       +--------------------+
                              ^
                              | (claudes-palace claude_core.telemetry)
```

### 1.2 Operator surface

| Service | Container | Port | Purpose |
|---|---|---|---|
| `exporters` | home-monitoring-exporters | 9100 | `/metrics` (Prometheus text), `/health` (JSON) |
| `prometheus` | home-monitoring-prometheus | 9090 | Time-series DB, alert evaluator |
| `grafana` | home-monitoring-grafana | 3000 | 7 provisioned dashboards |
| `uptime-kuma` | home-monitoring-uptime-kuma | 3001 | Independent reachability monitoring (37 monitors) |
| `alertmanager` | home-monitoring-alertmanager | 9093 | Alert routing → investigator webhook |
| `dashboard` | home-monitoring-dashboard | 8080 | Bespoke aiohttp + Alpine.js HUD (`monitor.###_DOMAIN`) |
| `phoenix` | home-monitoring-phoenix | 6006 / 4317 | OTEL trace backend (auth + sqlite, used by claudes-palace) |
| `investigator` | (systemd on host) | 8099 | Webhook receiver + AI investigation loop via `claude -p` |

The Docker services are managed by `docker-compose.yml`. The investigator
runs on the host (systemd unit at `investigator/systemd/investigator.service`)
because it shells out to `claude -p` and needs filesystem access the
container can't provide.

Deploy command: `/projects/scripts/deploy.sh home-monitoring internal`
(LXC migration pending — see Open Contradictions §11.1).

---

## 2. Exporters (port 9100)

### 2.1 CompositeCollector (`exporters/server.py`, ~250 lines)

Single HTTP server serves `/metrics` and `/health`. On startup it runs
SSH self-tests (key load, peer connectivity), then registers each
collector with `prometheus_client.REGISTRY`. The composite collector
tracks per-collector scrape state and emits a `hm_collector_up{name}`
gauge so failed collectors are visible in Grafana.
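
A minimal sketch of how a custom collector can surface per-collector scrape
state as `hm_collector_up{name}`. The real `server.py` registers each
collector with `REGISTRY` itself, so treat the wrapping shape here as
illustrative only:

```python
# Illustrative only: a composite collector that delegates to per-system
# collectors and reports which of them scraped cleanly.
import time

from prometheus_client import REGISTRY, start_http_server
from prometheus_client.core import GaugeMetricFamily


class CompositeCollector:
    def __init__(self, collectors):
        self.collectors = collectors  # e.g. {"pve": PVECollector(), "nas": NASCollector()}

    def collect(self):
        up = GaugeMetricFamily(
            "hm_collector_up",
            "1 if the named collector scraped without error",
            labels=["name"],
        )
        for name, collector in self.collectors.items():
            try:
                yield from collector.collect()   # delegate to the real collector
                up.add_metric([name], 1)
            except Exception:
                up.add_metric([name], 0)         # failed collectors stay visible
        yield up


if __name__ == "__main__":
    REGISTRY.register(CompositeCollector({}))    # real code passes the per-system collectors
    start_http_server(9100)                      # serves /metrics on :9100
    while True:
        time.sleep(60)
```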

### 2.2 Per-system collectors (17 modules)

| File | Class | Target | Transport |
|---|---|---|---|
| `pve_exporter.py` | `PVECollector` | 5 PVE nodes | REST API (`PVEClient` urllib + token) |
| `nas_exporter.py` | `NASCollector` | NAS (Debian 13) | SSH (`SSHClient` paramiko) |
| `pbs_exporter.py` | `PBSCollector` | Proxmox Backup Server | REST API (`PBSClient`) + lazy `PVEClient` for stale-guest filter |
| `opnsense_exporter.py` | `OPNsenseCollector` | OPNsense firewall | REST + SSH (Tailscale peers) |
| `docker_exporter.py` | `DockerCollector` | 5 Docker hosts | SSH `docker ps`, `docker stats` |
| `proxy_exporter.py` | `ProxyCollector` | vanlint-com-nginx (LAN + WAN) | dual-path HTTPS probes + NPM API |
| `powerwall_exporter.py` | `PowerwallCollector` | Tesla Powerwall | local API |
| `media_exporter.py` | `MediaCollector` | Plex / Sonarr / Radarr / SABnzbd / Tdarr | REST APIs |
| `cert_exporter.py` | `CertCollector` | TLS endpoints | TCP + ssl.SSLSocket |
| `dns_exporter.py` | `DNSCollector` | Technitium DNS | API |
| `wifi_exporter.py` | `WiFiCollector` | ZenWiFi mesh | SSH + HTTPS |
| `postfix_exporter.py` | `PostfixCollector` | ancillary-vm | SSH + `mailq` |
| `app_health_exporter.py` | `AppHealthCollector` | apps via `config/topology.yml` | HTTP probes |
| `cron_exporter.py` | `CronCollector` | claudes-palace cron jobs | log scrape |

Shared utilities in `exporters/common.py`:
- `PVEClient`, `PBSClient`, `OPNsenseClient`, `SSHClient` —
  authenticated transport wrappers
- `load_topology(...)` — reads `config/topology.yml`, with hardcoded
  fallback if the file is absent
- `collector_error_handler("name")` — decorator turning any collector
  exception into a single emitted `hm_collector_error{name}` counter
  and a structured log line (a sketch follows this list)
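
A minimal sketch of the decorator, assuming the metric and label names
described above; the logging details are illustrative:

```python
# Illustrative sketch of collector_error_handler; exporters/common.py may
# differ in metric registration and log formatting.
import functools
import logging

from prometheus_client import Counter

COLLECTOR_ERRORS = Counter(
    "hm_collector_error", "Exceptions raised by a collector", ["name"]
)
log = logging.getLogger(__name__)


def collector_error_handler(name):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                COLLECTOR_ERRORS.labels(name=name).inc()
                log.error("collector %s failed: %s", name, exc, exc_info=True)
                return None
        return wrapper
    return decorator
```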

### 2.3 Topology config (`config/topology.yml`)

Externalised inventory:
```
docker_hosts: {claude-internal-tools, nas, media-vm, ...}
nas_services: [smbd, nfs-server, docker, target, sshd]
critical_guests: {<vmid>: <slug>, ...}
```
Loaders fall back to hardcoded defaults if the YAML is missing or fails
to parse, so a malformed config never takes the exporter offline.
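
A minimal sketch of that fallback behaviour; the default values here are
illustrative, not the real hardcoded inventory:

```python
# Illustrative sketch of load_topology(); the real defaults and keys live
# in exporters/common.py.
import logging
from pathlib import Path

import yaml

log = logging.getLogger(__name__)

_DEFAULT_TOPOLOGY = {
    "docker_hosts": ["claude-internal-tools", "nas", "media-vm"],
    "nas_services": ["smbd", "nfs-server", "docker", "target", "sshd"],
    "critical_guests": {},
}


def load_topology(path="config/topology.yml"):
    """Return the parsed topology, or the hardcoded defaults on any failure."""
    try:
        data = yaml.safe_load(Path(path).read_text())
        if not isinstance(data, dict):
            raise ValueError("topology is not a mapping")
        return data
    except Exception as exc:
        log.warning("topology load failed (%s); using hardcoded defaults", exc)
        return _DEFAULT_TOPOLOGY
```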

---

## 3. AI investigator (port 8099)

### 3.1 Pipeline

`investigator/__main__.py` runs `asyncio.run(main())` which spins up
`webhook.py`'s aiohttp server. Alertmanager POSTs to
`/webhook/alertmanager`; payload validated by `schemas.AlertmanagerPayload`;
flow:

```
Alertmanager → /webhook/alertmanager
    ↓ schemas.AlertmanagerPayload
investigator.InvestigationOrchestrator
    ↓ runbook_loader (match alert → runbook YAML)
    ↓ cooldown.CooldownTracker  (per-fingerprint; SQLite)
    ↓ safety.is_blocked / always_escalate
    ↓ tools.build_prompt
    ↓ cli_client (claude -p subprocess)
    ↓ history.HistoryStore (SQLite append)
    ↓ notifier.Notifier (email + HA push)
    ↓ metrics.* (Prometheus counters/histograms)
```
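
The cooldown step is worth a sketch: per-fingerprint, persisted to SQLite so
restarts don't reset it. The table name and window handling below are
assumptions, not the real `cooldown.py` schema:

```python
# Illustrative sketch of a per-fingerprint cooldown persisted to SQLite.
import sqlite3
import time


class CooldownTracker:
    def __init__(self, db_path, window_seconds=1800):
        self.window = window_seconds
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cooldowns "
            "(fingerprint TEXT PRIMARY KEY, last_run REAL)"
        )

    def in_cooldown(self, fingerprint):
        row = self.conn.execute(
            "SELECT last_run FROM cooldowns WHERE fingerprint = ?", (fingerprint,)
        ).fetchone()
        return bool(row) and (time.time() - row[0]) < self.window

    def mark(self, fingerprint):
        self.conn.execute(
            "INSERT INTO cooldowns (fingerprint, last_run) VALUES (?, ?) "
            "ON CONFLICT(fingerprint) DO UPDATE SET last_run = excluded.last_run",
            (fingerprint, time.time()),
        )
        self.conn.commit()
```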

### 3.2 Module map

| Module | Purpose |
|---|---|
| `webhook.py` | aiohttp server: `/webhook/alertmanager`, `/metrics`, `/health`, `/api/*` |
| `config.py` | `InvestigatorConfig.from_env()` — loads `HM_AI_*` env vars |
| `schemas.py` | Pydantic `AlertmanagerPayload`, `TriageResult`, `InvestigationResult` |
| `runbook_loader.py` | YAML runbook matcher; falls back to `runbooks/generic.yml` |
| `cooldown.py` | per-fingerprint cooldown tracker, persists to SQLite |
| `safety.py` | always-escalate alert names, blocked target hosts |
| `tools.py` | prompt template (single, parameterised) |
| `cli_client.py` | `claude -p` subprocess wrapper, structured output |
| `investigator.py` | `InvestigationOrchestrator` — full pipeline |
| `notifier.py` | aiosmtplib email + HA REST push |
| `history.py` | SQLite history with stats (`/api/investigations`) |
| `metrics.py` | Prometheus counters / histograms / gauges for the AI pipeline |
| `fact_checker.py` | post-investigation sanity check (regex over claims) |

### 3.3 Webhook auth

`HM_AI_WEBHOOK_SECRET` (HMAC-style shared secret in header).
Empty/unset → auth disabled (dev mode). Tests in
`tests/test_webhook.py::TestWebhookAuthentication` cover all four
cases.
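
A minimal sketch of the check as an aiohttp middleware; the header name is an
assumption (the real one lives in `webhook.py`):

```python
# Illustrative sketch: shared-secret check on incoming webhook requests.
# The header name "X-HM-Webhook-Secret" is hypothetical.
import hmac
import os

from aiohttp import web

SECRET = os.environ.get("HM_AI_WEBHOOK_SECRET", "")


@web.middleware
async def auth_middleware(request, handler):
    if SECRET:  # empty/unset secret disables auth (dev mode)
        supplied = request.headers.get("X-HM-Webhook-Secret", "")
        if not hmac.compare_digest(supplied, SECRET):
            raise web.HTTPUnauthorized(reason="bad or missing webhook secret")
    return await handler(request)


app = web.Application(middlewares=[auth_middleware])
```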

### 3.4 Runbooks (16 YAML files in `runbooks/`)

| File | Triggers |
|---|---|
| `app_health.yml` | App HTTP probe failures |
| `cert_expiry.yml` | TLS expiry warnings |
| `cron_integrity.yml` | Cron heartbeat misses |
| `docker_container.yml` | Container down/restart loop |
| `docker_health.yml` | Healthcheck failure / OOM |
| `generic.yml` | Fallback for unmapped alerts |
| `laptop_alerts.yml` | Family laptop alerts |
| `media_vm_failover.yml` | Media VM failover |
| `meta_monitoring.yml` | Self-health (scrape, AM, storage) |
| `nas_storage.yml` | Pool / disk / SMART |
| `network_service.yml` | OPNsense / Tailscale / WAN |
| `pbs_backup.yml` | Backup stale / verify error / sync |
| `proxy_health.yml` | External proxy WAN / Tailscale path |
| `pve_guest_down.yml` | Guest unreachable |
| `pve_node_down.yml` | Node down |
| `pve_node_resource.yml` | Node CPU / mem / disk pressure |

`_index.yml` maps alert name → runbook file.
`scripts/validate_runbooks.py` checks index ↔ files consistency
(run nightly via timer).
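
What the consistency check amounts to, sketched; the real script very likely
validates more (schema, required keys, duplicate alert names):

```python
# Illustrative sketch of the index <-> files check in validate_runbooks.py.
from pathlib import Path

import yaml

RUNBOOK_DIR = Path("runbooks")


def check_index(index_path=RUNBOOK_DIR / "_index.yml"):
    index = yaml.safe_load(index_path.read_text()) or {}   # alert name -> runbook file
    referenced = {Path(v).name for v in index.values()}
    on_disk = {p.name for p in RUNBOOK_DIR.glob("*.yml") if p.name != "_index.yml"}

    missing = referenced - on_disk    # index points at files that don't exist
    orphaned = on_disk - referenced   # runbook files no alert maps to
    return missing, orphaned
```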

---

## 4. Bespoke dashboard (port 8080)

### 4.1 Server (`dashboard/server.py`)

aiohttp app with 11 GET routes registered by
`api.register_routes(app)`. Cache layer via
`dashboard/cache.py::TTLCache` (300 s default; per-route override;
sketched after the route table). A Prometheus client wrapper
(`dashboard/prometheus_client.py`) exposes `query` / `query_range`.

| Route | File | Purpose |
|---|---|---|
| `/api/overview` | `api_overview.py` | Global health KPIs |
| `/api/pve` | `api_pve.py` | Nodes, guests, HA, replication |
| `/api/storage` | `api_storage.py` | NAS pools/disks + PBS datastores |
| `/api/docker` | `api_docker.py` | Per-host containers + resources |
| `/api/network` | `api_network.py` | Gateways, interfaces, Tailscale, PF, WiFi, DNS, Postfix |
| `/api/energy` | `api_energy.py` | Powerwall + solar + grid |
| `/api/media` | `api_media.py` | Plex / Sonarr / Radarr / SABnzbd / Tdarr |
| `/api/certs` | `api_certs.py` | TLS cert expiry (300 s TTL) |
| `/api/alerts` + `/api/investigations` | `api_alerts.py` | Alertmanager + investigator proxy |
| `/api/timeseries` | `api_timeseries.py` | Validated range query passthrough |
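
The cache pattern, sketched; `dashboard/cache.py`'s actual method names and
eviction behaviour are not confirmed here:

```python
# Illustrative sketch of a TTL cache with a default TTL and per-route override.
import time


class TTLCache:
    def __init__(self, default_ttl=300.0):
        self.default_ttl = default_ttl
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() > expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl=None):
        ttl = self.default_ttl if ttl is None else ttl  # per-route override
        self._store[key] = (time.monotonic() + ttl, value)
```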

### 4.2 Frontend (`dashboard/static/`)

Single `index.html` driven by Alpine.js. Charts via Apache ECharts
(vendored). Polls each `/api/*` endpoint on a per-section interval.
LAN-only — no external CDN.

```
static/
├── index.html
├── css/   (variables, layout, components, animations)
├── js/    (dashboard.js — Alpine component, charts.js, utils.js)
├── img/   (favicon)
├── vendor/ alpine.min.js, echarts.min.js
└── fonts/ ibm-plex-sans-* + jetbrains-mono-*
```

### 4.3 Frontend migration (per 2026 roadmap)

Memory `project_2026_short_medium_roadmap.md` directs:
> Frontend migrates entirely into Palace Command Center — no separate
> Grafana dashboard surface long-term.

The bespoke dashboard at `:8080` is the frontend that is planned to
migrate; Grafana itself stays for ad-hoc exploration. See `ROADMAP.md` H2-1.

---

## 5. Phoenix OTEL trace backend (port 6006 / 4317)

Added 2026-04-25 (settled in commit `7d339b1` of this wave).

| Aspect | Detail |
|---|---|
| Image | `arizephoenix/phoenix:14.14.0` |
| Auth | `PHOENIX_ENABLE_AUTH=true` (HM_PHOENIX_SECRET shared) |
| Storage | sqlite at `/mnt/data/phoenix.db` (volume `phoenix-data`) |
| Endpoints | UI + OTLP HTTP on 6006; OTLP gRPC on 4317 |
| mem_limit | 1 GB |
| Healthcheck | `python -c "import urllib.request; urllib.request.urlopen('http://localhost:6006/healthz')"` |
| Uptime Kuma | HTTP monitor on `/healthz` |
| Producer | `claude_core.telemetry` from claudes-palace |

Per `~/.claude/CLAUDE.md`:
> **Observability:** Phoenix OTEL available — integrate per-project
> when using LangGraph/PydanticAI

So Phoenix is a *backend* for any project that opts in. Today only
claude_core's telemetry export points at it; future LangGraph/PydanticAI
work will send traces here too.
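
A minimal sketch of what a per-project opt-in producer looks like with the
OpenTelemetry SDK; the Phoenix hostname and the exact auth header it expects
are assumptions here, not copied from `claude_core.telemetry`:

```python
# Illustrative sketch: point an OTLP/HTTP span exporter at Phoenix.
import os

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(
    endpoint="http://phoenix.example.lan:6006/v1/traces",  # hypothetical host
    headers={"authorization": f"Bearer {os.environ['HM_PHOENIX_SECRET']}"},  # assumed header
)
provider = TracerProvider(resource=Resource.create({"service.name": "my-project"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("example-span"):
    pass  # traced work goes here
```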

---

## 6. Prometheus + Grafana + Uptime Kuma

### 6.1 Prometheus

`config/prometheus/prometheus.yml`:
- `custom-exporters` job — scrape `:9100` every 30 s, 25 s timeout
- `sigil-grid` job — 4 sigil instances at `192.168.8.{236,243,244,245}:8080`
  tagged `sigil_variant={baseline,moderate,aggressive,unhinged}`. Per
  the 2026 roadmap, sigil's strategy streams will eventually merge
  into one instance with configurable postures; this scrape config
  simplifies then.
- self-monitoring: prometheus + alertmanager scrape themselves
- 14-day retention, `--web.enable-lifecycle` for hot reload

### 6.2 Alert rule files (10)

| File | Coverage |
|---|---|
| `node_alerts.yml` | PVE node / cluster / guest / HA / replication |
| `storage_alerts.yml` | NAS pool / disk + PBS datastore |
| `backup_alerts.yml` | PBS freshness, verify, sync (vector-match for sync error/ok ratio) + ZFS replication lag |
| `docker_alerts.yml` | Container status / restarts / resources + host disk/memory |
| `service_alerts.yml` | OPNsense WAN / gateway / interface + Tailscale peers |
| `meta_alerts.yml` | Monitoring stack self-health |
| `cert_alerts.yml` | TLS expiry (warning / soon / expired) |
| `proxy_alerts.yml` | External proxy WAN / Tailscale path down / degraded / slow |
| `app_health_alerts.yml` | Application HTTP health probes |
| `docker_health_alerts.yml` | Container healthchecks, OOM, restart loop, log bloat |

`recording_rules.yml` — pre-computes ratios + aggregations.
Validation: `scripts/validate_alerts.py`.

### 6.3 Alertmanager

`config/alertmanager/alertmanager.yml`:
- Webhook receiver → investigator at `:8099/webhook/alertmanager`
  (auth header `HM_AI_WEBHOOK_SECRET` if set)
- Grouping by alert + cluster + service
- Inhibition rules (critical suppresses warning on same target)

### 6.4 Grafana (7 dashboards)

`config/grafana/provisioning/`:
- Datasource: Prometheus auto-provisioned
- Dashboards file provider
- 7 JSON dashboards: home-overview, cluster-overview, storage-health,
  docker-hosts, network-firewall, ai-investigations, application-health
- (logs-overview was attempted but is currently misconfigured — uses
  loki uid where prometheus is expected; see Open Contradictions §11.4)

### 6.5 Uptime Kuma (37 monitors)

`scripts/setup-uptime-kuma.py` — declarative monitor list +
`--reconcile` mode (create missing, update changed, report orphans;
`--delete-orphans` to prune; `--dry-run` for preview). Phoenix `/healthz`
added in this wave's settle commit.
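
The reconcile loop's shape, sketched; `kuma` stands in for whatever API
client the script actually uses, and its method names are hypothetical:

```python
# Illustrative sketch of declarative reconciliation against Uptime Kuma.
def reconcile(declared, existing, kuma, delete_orphans=False, dry_run=False):
    """declared / existing both map monitor name -> settings dict."""
    for name, want in declared.items():
        have = existing.get(name)
        if have is None:
            print(f"create  {name}")
            if not dry_run:
                kuma.add_monitor(name, **want)       # hypothetical client call
        elif have != want:
            print(f"update  {name}")
            if not dry_run:
                kuma.update_monitor(name, **want)    # hypothetical client call

    for name in existing.keys() - declared.keys():
        print(f"orphan  {name}")
        if delete_orphans and not dry_run:
            kuma.delete_monitor(name)                # hypothetical client call
```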

### 6.6 Loki

`docker-compose.loki.yml` — separate compose file for log aggregation.
`promtail/promtail-claudes-palace.yaml` — promtail config (mounted
into a side container on claudes-palace; comment shows the docker run
incantation). Loki + promtail are deployed but not fully wired into
Grafana (logs-overview dashboard is the in-progress surface).

---

## 7. Home Assistant integration

`ha-integration/` (5 YAML files, manually copied to HA config):

| File | Content |
|---|---|
| `sensors.yaml` | 30 REST sensors querying Prometheus HTTP API |
| `automations.yaml` | 5 automations: critical/high alerts, backup overdue, pool degraded, daily summary |
| `lovelace-dashboard.yaml` | Lovelace "Infrastructure Monitoring" view |
| `ai-investigation-sensors.yaml` | 3 REST sensors querying investigator API |
| `ai-investigation-automations.yaml` | 1 automation: investigation complete notification |

Push notifications go via HA's built-in mobile app integration. Auth
between HA and the monitoring stack uses `HM_HA_TOKEN` (long-lived
HA token).

---

## 8. Configuration surface

### 8.1 `HM_*` env vars (~30 keys)

Categorised:

| Category | Keys |
|---|---|
| Runtime | `HM_EXPORTER_PORT`, `HM_LOG_LEVEL` |
| Grafana | `HM_GRAFANA_ADMIN_USER` / `_PASSWORD` |
| Uptime Kuma | `HM_KUMA_ADMIN_USER` / `_PASSWORD` |
| PVE | `HM_PVE_API_BASE`, `HM_PVE_TOKEN_ID`, `HM_PVE_TOKEN_SECRET` (+ pre-assembled `HM_PVE_TOKEN`) |
| PBS | `HM_PBS_API_BASE`, `HM_PBS_TOKEN_ID` / `_SECRET` (+ pre-assembled `HM_PBS_TOKEN`); optional `HM_PBS_PVE_API_BASE` for the stale-guest filter |
| OPNsense | `HM_OPNSENSE_API_BASE` / `_KEY` / `_SECRET` (+ SSH host/user) |
| NAS | `HM_NAS_SSH_HOST` / `_USER` |
| Docker | `HM_DOCKER_*_HOST` (per host), `HM_DOCKER_USER`, `HM_SSH_KEY_PATH` |
| Email | `HM_SMTP_HOST` / `_PORT` / `_FROM`, `HM_ALERT_EMAIL` |
| Proxy | `HM_PROXY_WAN_IP` / `_TAILSCALE_IP` / `_CANARY_URL` (`https://ha.###_DOMAIN`) / `_PROBE_TIMEOUT` |
| NPM | `HM_NPM_EXTERNAL_EMAIL` / `_PASSWORD` |
| Phoenix | `HM_PHOENIX_SECRET` |
| Investigator | `HM_AI_WEBHOOK_SECRET`, `HM_AI_*` (model/timeouts in `investigator/config.py`) |
| HA | `HM_HA_TOKEN` (long-lived) |

Defaults and deprecated keys are called out in PROJECT_REFERENCE.md
§"Environment Variables". The claude-dev / claude-prod env defaults
(###_IP / .200) are annotated DECOMMISSIONED 2026-04-24 — kept so
smoke tests don't fail with KeyError, but the values are dead.
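
A minimal sketch of the `from_env()` pattern used by `investigator/config.py`;
the key names other than `HM_AI_WEBHOOK_SECRET` are illustrative, not the
real ones:

```python
# Illustrative sketch of InvestigatorConfig.from_env(); field and env-var
# names beyond HM_AI_WEBHOOK_SECRET are assumptions.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class InvestigatorConfig:
    webhook_secret: str
    model: str
    timeout_seconds: int

    @classmethod
    def from_env(cls):
        return cls(
            webhook_secret=os.environ.get("HM_AI_WEBHOOK_SECRET", ""),
            model=os.environ.get("HM_AI_MODEL", "sonnet"),                # assumed key
            timeout_seconds=int(os.environ.get("HM_AI_TIMEOUT", "300")),  # assumed key
        )
```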

### 8.2 `aie-metrics.yaml`

Added in this wave's settle commit. AIE config covers:
test_pass_rate, test_coverage, lint_violations, type_errors,
cyclomatic_complexity, security_issues, dead_code.

---

## 9. Test surface

40 test files / ~33 000 LOC. Pre-Wave-5.1 baseline: 974 passed, 7
pre-existing failures (PBS sync ledger ambiguity, runbook count
drift, dashboard datasource Loki/Prometheus mismatch — see §11.4),
10 errors (all `aiohttp_client` fixture missing — fixed this wave by
adding `pytest-aiohttp` to `requirements-dev.txt`).

`tests/test_dashboard_gaps.py` — the slow one — is excluded from the
default run. It hits real network endpoints. Should be marked
`@pytest.mark.integration` so it's opt-in. Carried as ROADMAP H1-3.
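
The standard pytest recipe for making that opt-in; the option and marker
names below are illustrative, not what `conftest.py` currently does:

```python
# Illustrative conftest.py snippet: skip integration tests unless
# --run-integration is passed.
import pytest


def pytest_addoption(parser):
    parser.addoption("--run-integration", action="store_true", default=False,
                     help="run tests that need real network access")


def pytest_configure(config):
    config.addinivalue_line("markers", "integration: needs network access")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-integration"):
        return
    skip = pytest.mark.skip(reason="needs --run-integration")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip)
```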

Layout:

```
tests/
├── conftest.py          — shared mocks (mock_pve_client, mock_ssh_client,
│                          mock_orchestrator, mock_paramiko, etc.)
├── test_<module>.py     — one per source module
└── test_e2e_*           — none yet — see ROADMAP H2-2
```

---

## 10. Deployment

### 10.1 Target

- **Host:** claude-internal-tools (VM 209, ###_IP)
- **Migration:** pending — docker compose stack to LXC native (per
  `~/.claude/CLAUDE.md` "LXC Native Migration", PROJECT_REFERENCE.md §"Deployment Target").
- **Deploy command:** `/projects/scripts/deploy.sh home-monitoring internal`
- **URL:** `monitor.###_DOMAIN` (LAN-only via npm-lan)

### 10.2 Investigator on host (not Docker)

The investigator runs as a systemd service because it shells out to
`claude -p`, which needs filesystem access. The unit file lives at
`investigator/systemd/investigator.service` and is installed by `deploy.sh`.

### 10.3 First-time setup

1. `.env.example` → `.env` with credentials
2. `/projects/scripts/deploy.sh home-monitoring internal`
3. Wait ~30 s for containers
4. Create Uptime Kuma admin via web UI at `:3001`
5. `python scripts/setup-uptime-kuma.py --reconcile`
6. Copy `ha-integration/*.yaml` to HA config dir (manual)

---

## 11. Open contradictions / S-tier findings

### 11.1 LXC migration pending

PROJECT_REFERENCE.md says "claude-internal-tools (VM 209) — pending
migration to LXC". Memory `project_2026_short_medium_roadmap.md` archives
the LXC Native Migration project ("once-off complete; outstanding 3
apps not yet migrated handled ad-hoc"). Home-monitoring is one of the
"outstanding 3". ROADMAP H2-3.

### 11.2 Path-of-record drift

PROJECT_REFERENCE.md, CLAUDE.md, and `promtail/promtail-claudes-palace.yaml`
still reference the older paths `/projects/internal-apps/home-monitoring/`
and `/projects/home-monitoring/`; the actual path is
`/projects/apps/internal/home-monitoring/`. Specifically, CLAUDE.md has
`/projects/home-monitoring` and PROJECT_REFERENCE.md has
`/projects/internal-apps/home-monitoring/` (typo / older naming).
Cleanup: ROADMAP H1-1.

### 11.3 Frontend migration to Palace Command Center

Per memory roadmap: dashboard moves to Palace Command Center; backend
stack stays. ROADMAP H2-1.

### 11.4 Loki integration is half-wired

`docker-compose.loki.yml` exists. `promtail-claudes-palace.yaml` is
authored. But the `logs-overview.json` Grafana dashboard is configured
with Loki datasource UIDs while the validator
(`tests/test_dashboard_validation.py::test_datasource_references[logs-overview.json]`)
expects all dashboards to reference `prometheus`. One of two things is true:
- the dashboard is correct (Loki is real) and the validator needs an
  exception for this dashboard; or
- the dashboard was authored speculatively before Loki landed and
  should be parked or removed.

The current state is "test fails on master, has been failing long
enough that it counts as accepted". ROADMAP H1-2.

### 11.5 Test count drift / 7 pre-existing failures

Pre-existing master failures, snapshot 2026-04-26:

| Test | Suspected cause |
|---|---|
| `test_dashboard_validation.py::test_datasource_references[logs-overview.json]` | Loki/Prometheus mismatch — see §11.4 |
| `test_investigator_orchestrator.py::TestNotificationOnly::test_notification_only_returns_skipped` | Notification-only set / fixture drift |
| `test_investigator_orchestrator.py::TestNotificationOnly::test_notification_only_does_not_invoke_cli` | Same |
| `test_pbs_exporter.py::TestPBSTasks::test_sync_last_success` | Sync task fixture has multiple datastores; assertion expects single |
| `test_safety.py::TestNotificationOnlyAlerts::test_notification_only_set_exact` | safety.py constants diverged from test |
| `test_safety.py::TestNotificationOnlyAlerts::test_docker_no_memory_limit_in_set` | Same |
| `test_validate_runbooks.py::TestIndexFile::test_total_mapping_count` | Runbook index drift (16 files vs N expected) |

ROADMAP H1-4.

### 11.6 `aiohttp_client` fixture errors — FIXED this wave

10 errors in `test_webhook.py` were `fixture 'aiohttp_client' not
found`. Plugin `pytest-aiohttp` was never added as a dev dep.
`requirements-dev.txt` introduced this wave with `pytest-aiohttp>=1.1.0`
plus the rest of the test stack. After install: all 10 errors resolve.

### 11.7 Stray hive-mind docs (cleaned)

`docs/architect-design.md`, `architect-draft.md`,
`challenge-{ops,robustness,simplicity}.md` were untracked email-bot
hive-mind outputs that landed in this project's `docs/` by mistake.
Moved to `/tmp/strays/email-bot-hive-mind-from-home-monitoring/` in
the settle commit; will be triaged when email-bot work resumes (HOLD
per Stuart's instruction).

### 11.8 BACKLOG drift

PROJECT_REFERENCE.md "Codebase Stats" claims 31 test files; actual is
40. Uptime Kuma monitors: 30 stated, 37 actual, 38 declared in CLAUDE.md.
Cleanup: ROADMAP H1-5.

### 11.9 Empty `runbooks/_index.yml` mappings

Some alert names referenced in `*_alerts.yml` may not have a matching
`_index.yml` entry. `scripts/validate_runbooks.py` is the validator;
if it's been failing, it's only logging, not blocking deploys.
ROADMAP H1-6.

### 11.10 No `.pre-commit-config.yaml`

Wave 4.1 (claude-pipeline) and Wave 4.2 (AIE) added pre-commit. AIE's
shape is a good fit. ROADMAP H1-7.

---

## 12. Documentation index

* `README.md` — short landing
* `PROJECT_REFERENCE.md` — operator reference (path drift; refresh
  pending H1-1)
* `CLAUDE.md` — Watson routing notes (path drift; refresh pending H1-1)
* `IMPROVEMENT_PLAN.md` — historical wave plan (treat as archive)
* `PROJECT_BACKLOG.md` — backlog (refresh pending H1-5)
* `docs/HLD.md` — this document
* `docs/ROADMAP.md` — forward roadmap
* `docs/sweep-2026-04.md` — companion sweep findings
* `docs/FRONTEND_DESIGN.md` — original FE design (should migrate to
  Palace Command Center per H2-1)
* `docs/claude-code-best-practices-summary.md` — reference doc
* `docs/email-bot-product-overview.pdf` — out of scope (orphan)
* `docs/infrastructure-diagram*.html` — visualisations

---

End of HLD.
