# Home Monitoring Suite — Roadmap (forward only)

**Scope:** narrow forward roadmap for the home-monitoring project itself.
Cross-bundle moves (FE → Palace Command Center) are tracked here as the
canonical home-monitoring perspective; the receiving project owns its
own perspective.

**Default stance:** Opus + K2.6 implementer harness for non-trivial
work, hive-mind for any architectural redesign.

**Inputs feeding this roadmap:**

* Memory `project_2026_short_medium_roadmap.md` — backend stays here,
  frontend migrates to Palace Command Center, drop remote claude.
* Memory `project_home_monitoring.md` — project tracker.
* `sweep-2026-04.md` — current sweep findings.

---

## Horizon 0 — Wave 5.1 deltas (this commit)

| Item | Change | Status |
|---|---|---|
| H0-1 | Resolve 10 `aiohttp_client` test errors (add `pytest-aiohttp`) | ✅ this commit |
| H0-2 | `requirements-dev.txt` introduced | ✅ this commit |
| H0-3 | `ruff format` 3 files (cron_exporter, cli_client, test_cli_client) | ✅ this commit |
| H0-4 | `ruff check --fix` server.py I001 | ✅ this commit |
| H0-5 | `docs/HLD.md`, `docs/ROADMAP.md`, `docs/sweep-2026-04.md` | ✅ this commit |
| Settle | Phoenix OTEL + sigil-grid + PBS guest filter (commit `7d339b1`) | ✅ already landed |

---

## Horizon 1 — Doc / hygiene reconciliation (next 2-4 weeks)

### H1-1: Path-of-record reconciliation

`PROJECT_REFERENCE.md` references `/projects/internal-apps/...`,
`CLAUDE.md` references `/projects/home-monitoring/`, and
`promtail-claudes-palace.yaml` has the latter as a comment. The real
path is `/projects/apps/internal/home-monitoring/`. Update all three
files to that one canonical path.

* **Effort:** trivial.
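
A minimal sketch of a stale-path scan that could confirm the reconciliation is complete. The three file names and both stale prefixes come from the text above; the function name and the idea of scanning the whole tree are assumptions.

```python
from pathlib import Path

# Canonical path and stale prefixes as stated in H1-1.
CANONICAL = "/projects/apps/internal/home-monitoring/"
STALE_PREFIXES = ("/projects/internal-apps/", "/projects/home-monitoring")
FILES_OF_RECORD = ("PROJECT_REFERENCE.md", "CLAUDE.md", "promtail-claudes-palace.yaml")


def find_stale_paths(root, names=FILES_OF_RECORD):
    """Return (relative file, line number, line) for every stale path reference."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        if path.name not in names or not path.is_file():
            continue
        lines = path.read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            # The canonical path contains neither stale prefix, so it never matches.
            if any(prefix in line for prefix in STALE_PREFIXES):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

An empty result from `find_stale_paths(".")` at the repo root would indicate that every file of record now uses `CANONICAL`.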

### H1-2: Resolve Loki dashboard validator failure

`logs-overview.json` references `loki` datasource UID. Validator
expects `prometheus`. One of:
- Loki is real; teach `tests/test_dashboard_validation.py` an exception
  for `loki` datasource UIDs on `logs-overview.json` only; or
- Loki is speculative; remove the dashboard until promtail+loki are
  fully wired and the FE can render it.

The promtail + loki side is partially deployed. Pick path 1 (extend
validator) — it preserves the half-built work and unblocks the test.

* **Effort:** small. planner+1.
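
Path 1 could look roughly like the sketch below: a per-dashboard allowlist layered on top of the default `prometheus` UID. The function and constant names are hypothetical, and the real checks in `tests/test_dashboard_validation.py` may be structured differently.

```python
import json
from pathlib import Path

# Default expectation from the validator, per the text above.
DEFAULT_DATASOURCE_UIDS = {"prometheus"}
# Hypothetical exception table: extra UIDs allowed for one dashboard only.
EXTRA_UIDS_BY_DASHBOARD = {"logs-overview.json": {"loki"}}


def iter_datasource_uids(node):
    """Yield every datasource UID found anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        datasource = node.get("datasource")
        if isinstance(datasource, dict) and "uid" in datasource:
            yield datasource["uid"]
        elif isinstance(datasource, str):  # legacy string form
            yield datasource
        for value in node.values():
            yield from iter_datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_datasource_uids(item)


def unexpected_uids(filename, dashboard):
    """Return the sorted UIDs in `dashboard` that `filename` is not allowed to use."""
    allowed = DEFAULT_DATASOURCE_UIDS | EXTRA_UIDS_BY_DASHBOARD.get(filename, set())
    return sorted(set(iter_datasource_uids(dashboard)) - allowed)


def check_dashboard_file(path):
    """Load one dashboard file and return its unexpected datasource UIDs."""
    path = Path(path)
    return unexpected_uids(path.name, json.loads(path.read_text(encoding="utf-8")))
```

Keeping the exception keyed by file name means a `loki` reference in any other dashboard still fails the test.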

### H1-3: Mark `tests/test_dashboard_gaps.py` as integration

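Currently in the default test set; it runs slowly because it touches
real network endpoints. Add `@pytest.mark.integration`, register the
marker, and deselect it in the default run (for example
`-m "not integration"` in `addopts`), so the test is opt-in via
`pytest -m integration`. Update CI to opt in only on the deploy path.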

* **Effort:** trivial.

### H1-4: Fix or document the 7 pre-existing test failures

| Test | Disposition |
|---|---|
| `test_dashboard_validation.py::test_datasource_references[logs-overview.json]` | H1-2 above |
| `test_investigator_orchestrator.py::TestNotificationOnly::test_*` (×2) | Audit `safety.NOTIFICATION_ONLY_ALERTS` set vs the test fixture; reconcile |
| `test_pbs_exporter.py::TestPBSTasks::test_sync_last_success` | Test expects single datastore, fixture has two; tighten the assertion |
| `test_safety.py::TestNotificationOnlyAlerts::test_*` (×2) | Same set drift as orchestrator tests |
| `test_validate_runbooks.py::TestIndexFile::test_total_mapping_count` | Runbook count drift (16 files vs N expected) |

* **Effort:** small per test. Solo or planner+1 in a single sweep.

### H1-5: BACKLOG / count refresh

`PROJECT_REFERENCE.md` "Codebase Stats" is stale (31 → 40 test files;
30 → 37 Uptime Kuma monitors). `CLAUDE.md` says 38 monitors; the actual
count is 37. Update both files to the actual counts.

* **Effort:** trivial.

### H1-6: Runbook coverage audit

Run `scripts/validate_runbooks.py` and report any alert names
referenced in `*_alerts.yml` that don't have an `_index.yml` entry.
If any, either add the runbook or relax the validator.

* **Effort:** small.
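
The core of the audit is a set difference, sketched below on already-parsed YAML. The `groups` → `rules` → `alert` layout is the standard Prometheus rule-file shape; treating `_index.yml` as a mapping keyed by alert name is an assumption, and `scripts/validate_runbooks.py` may do this differently.

```python
def alert_names(rule_docs):
    """Collect alert names from parsed Prometheus rule files (`*_alerts.yml`)."""
    names = set()
    for doc in rule_docs:
        for group in (doc or {}).get("groups", []):
            for rule in group.get("rules", []):
                # Recording rules carry "record" instead of "alert"; skip them.
                if "alert" in rule:
                    names.add(rule["alert"])
    return names


def missing_runbooks(rule_docs, index):
    """Return the sorted alert names that have no entry in the runbook index.

    `index` is assumed to be a mapping keyed by alert name.
    """
    return sorted(alert_names(rule_docs) - set(index))
```

A non-empty result is the list to act on: add a runbook for each name, or relax the validator.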

### H1-7: Pre-commit harness

Mirror the Wave 4.2 (AIE) pattern:
`.pre-commit-config.yaml` (ruff + bandit + shellcheck + yamllint +
detect-secrets + gitleaks + std hooks), `.secrets.baseline`,
`.gitleaksignore` if needed.

* **Effort:** small.

### H1-8: Path-comment cleanup in promtail-claudes-palace.yaml

Header comment shows `docker run` example with `/projects/internal-apps/...`
— update to canonical path once H1-1 lands. Also confirm the actual
mount source on claudes-palace matches.

* **Effort:** trivial — folds into H1-1.

---

## Horizon 2 — Architecture deltas (next 2-4 months)

### H2-1: Frontend migrates to Palace Command Center

Per memory `project_2026_short_medium_roadmap.md`:
> Frontend migrates entirely into Palace Command Center — no separate
> Grafana dashboard surface long-term.

Today's bespoke dashboard at `:8080` is the FE that's planned to
migrate. Shape:
- Palace Command Center absorbs the 11 `/api/*` route shapes (same
  contracts, same JSON).
- Bespoke dashboard `:8080` retired; redirect `monitor.###_DOMAIN`
  to Palace.
- Grafana stays for ad-hoc / per-metric exploration.
- Backend stack (exporters / Prometheus / Alertmanager / Uptime Kuma /
  Phoenix / investigator) keeps running on home-monitoring's host.

* **Approach:** hive-mind design run on the contract surface +
  per-route migration; dev-team implementation per route.
* **Effort:** medium-large. Plan with Stuart before kickoff.
* **Blocks:** none, but H1-2 + H1-4 should land first so tests pass
  before the move.
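
One way to check "same contracts, same JSON" per route is to compare response shapes rather than values. This is a sketch under that assumption; the helper names are hypothetical and nothing here reflects an existing test in either project.

```python
def json_shape(value):
    """Reduce a decoded JSON value to its structure: keys and value types only."""
    if isinstance(value, dict):
        return {key: json_shape(item) for key, item in value.items()}
    if isinstance(value, list):
        # Describe a list by the distinct shapes of its elements.
        shapes = []
        for item in value:
            shape = json_shape(item)
            if shape not in shapes:
                shapes.append(shape)
        return shapes
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before number: bool is an int subclass
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


def same_contract(old_response, new_response):
    """True when both responses expose the same keys and value types."""
    return json_shape(old_response) == json_shape(new_response)
```

Feeding the same route's response from `:8080` and from Palace Command Center into `same_contract` gives a per-route pass or fail. An empty list and a populated list have different shapes, so samples need comparable data.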

### H2-2: Drop remote claude deployment for the investigator

Per memory roadmap:
> Remove the remote claude deployment used for investigator. Reconfigure
> so claude runs on claudes-palace and is invoked locally to investigate
> issues.

Today the investigator shells out to `claude -p` over SSH (or as a local
subprocess on the host). Move to a model where `claude` runs on
claudes-palace and is invoked locally there. Choices:
- (a) HTTP RPC: investigator stays here, dispatches via HTTP to a
  claudes-palace endpoint that runs `claude -p` locally and returns
  the result; or
- (b) move the whole investigator to claudes-palace and have it
  consume the alertmanager webhook over the network.

Recommended: (a) — keeps the alertmanager → investigator wire short
and on the same network as Prometheus. Adds an outbound HTTP call to
claudes-palace per investigation, which is fine.

* **Effort:** medium.
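
Option (a) could be as small as the standard-library sketch below. The `/investigate` route, the port, the JSON field names, and passing the prompt as the final argument to `claude -p` are all assumptions; authentication and TLS are deliberately left out.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.request import ProxyHandler, Request, build_opener

# `claude -p` comes from the roadmap text; the argument layout is assumed.
CLAUDE_COMMAND = ("claude", "-p")


def make_handler(command=CLAUDE_COMMAND, timeout=300):
    """Build the claudes-palace side: run the command locally, return JSON."""

    class InvestigateHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/investigate":  # hypothetical route
                self.send_error(404)
                return
            length = int(self.headers.get("Content-Length", 0))
            prompt = json.loads(self.rfile.read(length))["prompt"]
            try:
                proc = subprocess.run(
                    [*command, prompt], capture_output=True, text=True, timeout=timeout
                )
                status = 200
                payload = {"returncode": proc.returncode, "output": proc.stdout}
            except subprocess.TimeoutExpired:
                status, payload = 504, {"error": "investigation timed out"}
            except OSError as exc:  # for example, the command is not installed
                status, payload = 500, {"error": str(exc)}
            body = json.dumps(payload).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return InvestigateHandler


def serve(host="127.0.0.1", port=8765):  # hypothetical bind address and port
    """Run the endpoint on claudes-palace until interrupted."""
    ThreadingHTTPServer((host, port), make_handler()).serve_forever()


def dispatch(base_url, prompt, timeout=330):
    """Investigator side: POST the prompt and return the decoded JSON reply."""
    request = Request(
        f"{base_url}/investigate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    opener = build_opener(ProxyHandler({}))  # LAN call: bypass any configured proxy
    with opener.open(request, timeout=timeout) as response:
        return json.loads(response.read())
```

`dispatch` raises `urllib.error.HTTPError` on the 500 and 504 replies, so the investigator can treat a failed dispatch like any other failed investigation. The client timeout is set above the server-side one so the server's 504 arrives before the client gives up.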

### H2-3: LXC migration from VM 209

Per CLAUDE.md "pending migration to LXC" and memory's "outstanding 3
apps not yet migrated" callout. Move the docker compose stack from
the claude-internal-tools VM to an LXC. Touches:
- VM 209 → LXC <new-id>
- DNS: monitor.###_DOMAIN points at the new container
- npm-lan proxy host updated
- Investigator systemd unit migrated to LXC
- Phoenix data volume preserved

* **Effort:** small (1-day operation; deploy.sh handles the cutover).
* **Sequence:** after H2-1 begins (so the FE move isn't entangled
  with the VM/LXC move).

---

## Horizon 3 — Operational maturity (4-6 months)

### H3-1: E2E tests

Currently no `test_e2e_*.py` files. Per CLAUDE.md global rule
("All app functions must be end-to-end testable"), add Playwright
or curl-based E2E harness covering the dashboard's 11 `/api/*`
routes. Or — if H2-1 lands first — write E2E in Palace Command Center
and retire any duplicate here.
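
For the curl-based variant, a smoke harness might look like the sketch below. The 11 route paths are not listed in this roadmap, so the caller supplies them; "200 plus a JSON body" as the pass criterion is an assumption.

```python
import json
from urllib.error import HTTPError, URLError
from urllib.request import ProxyHandler, build_opener

# Bypass any configured proxy; the dashboard is assumed to be reachable directly.
_OPENER = build_opener(ProxyHandler({}))


def check_route(base_url, route, timeout=10):
    """Return None if the route answers 200 with a JSON body, else a short reason."""
    try:
        with _OPENER.open(f"{base_url}{route}", timeout=timeout) as response:
            body = response.read()
    except HTTPError as exc:  # must precede URLError, its parent class
        return f"HTTP {exc.code}"
    except URLError as exc:
        return f"unreachable: {exc.reason}"
    try:
        json.loads(body)
    except ValueError:
        return "body is not valid JSON"
    return None


def check_routes(base_url, routes):
    """Return {route: reason} for every route that failed its check."""
    failures = {}
    for route in routes:
        reason = check_route(base_url, route)
        if reason is not None:
            failures[route] = reason
    return failures
```

A `test_e2e_*.py` file could then assert that `check_routes(base_url, ROUTES)` is empty, with `ROUTES` holding the real `/api/*` paths.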

### H3-2: Phoenix integration sweep

Phoenix is a backend now. Ensure:
- claude_core.telemetry from claudes-palace points at it (already done)
- LangGraph / PydanticAI integrations (per `~/.claude/CLAUDE.md`
  AI Orchestration Tools) emit OTLP traces here when they go live
- Auth and SQLite retention are sane (does `/mnt/data/phoenix.db` need a size cap?)
- Add a Grafana dashboard sourcing Phoenix's OTLP backend

### H3-3: Loki + promtail finalisation

If H1-2 takes path 1 (Loki is real), finish the wiring: validate
Grafana datasource UID, build out `logs-overview.json`, ensure
promtail-claudes-palace is deployed and shipping, document the
log retention policy.

### H3-4: Investigator confidence calibration

`investigator/cli_client.py` returns structured output with confidence
scores. Today nothing reads them. Ratchet: if confidence < N, force
human-in-the-loop notification; if > M, suppress duplicate
notifications. Ties into AIE's calibration model (memory
`project_aie_ace_baseline.md`).
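
The ratchet could be a single pure function, sketched below. The threshold values stand in for N and M and are placeholders, the 0 to 1 confidence range is an assumption about what `investigator/cli_client.py` emits, and the `Action` names are hypothetical.

```python
from enum import Enum


class Action(Enum):
    ESCALATE = "escalate"  # force a human-in-the-loop notification
    NOTIFY = "notify"  # normal notification path
    SUPPRESS_DUPLICATE = "suppress_duplicate"


def route_finding(confidence, already_notified, low=0.4, high=0.85):
    """Pick a notification action from an investigation's confidence score.

    `low` and `high` play the roles of N and M; both defaults are placeholders.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence!r}")
    if confidence < low:
        return Action.ESCALATE
    if confidence > high and already_notified:
        return Action.SUPPRESS_DUPLICATE
    return Action.NOTIFY
```

Keeping the thresholds as parameters lets the calibration work tune N and M without touching the routing logic.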

### H3-5: Drop the `200|207|208` static exclude

`backup_alerts.yml` still has the static exclude as a safety net for
the dynamic stale-guest filter. Once the dynamic filter has been
running for 30 days with no false positives, drop the static exclude.

---

## Horizon 4 — Nice-to-haves / opportunistic

* **H4-1:** Move `IMPROVEMENT_PLAN.md` to `docs/history/` (it's the
  2026-03 wave plan, all done).
* **H4-2:** Consolidate `docs/infrastructure-diagram*.html` into a
  single source of truth.
* **H4-3:** Replace `docs/email-bot-product-overview.pdf` with a
  proper attribution (or move it out — it's not home-monitoring's).
* **H4-4:** Add `pyright` to pre-commit if H1-7 lands.

---

## Out of scope (explicitly)

* Email-bot or email-bot-analyst work — held per Stuart's instruction.
* General observability beyond claude_core.telemetry → Phoenix.
* Replacing Prometheus or Grafana.
* Replacing the bespoke dashboard with Grafana — the bespoke surface
  becomes Palace Command Center, not Grafana.

---

End of ROADMAP.
