From b6e4103357a518b2e6574bea31f55e489e673d4b Mon Sep 17 00:00:00 2001
From: Matthias Nott <mnott@mnsoft.org>
Date: Thu, 26 Feb 2026 11:36:09 +0100
Subject: [PATCH] docs: extract 14 additional session notes from MDF Webseiten (0013-0026)
---
Notes/2026/02/0013 - 2026-02-20 - Infrastructure Repo & Ops CLI Bootstrap.md | 59 ++++
Notes/2026/02/0014 - 2026-02-20 - Registry Naming & Backup System.md | 42 +++
Notes/2026/02/0015 - 2026-02-22 - Offsite Backup Dashboard Fix & Status Format.md | 39 +++
Notes/2026/02/0016 - 2026-02-22 - Backup Drill-Down Redesign & Restore Fix.md | 35 ++
Notes/2026/02/0017 - 2026-02-22 - Modular Sync Promote Rebuild Architecture.md | 61 +++++
Notes/2026/02/0018 - 2026-02-22 - CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI.md | 44 +++
Notes/2026/02/0019 - 2026-02-22 - Offsite Download Feature Added to Dashboard.md | 27 ++
Notes/2026/02/0020 - 2026-02-23 - Backup Coverage Audit, Registry Fixes, Container Resolution.md | 35 ++
Notes/2026/02/0021 - 2026-02-23 - Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush.md | 35 ++
Notes/2026/02/0022 - 2026-02-23 - Post-Coolify Architecture Context for Ops Toolkit.md | 52 ++++
Notes/2026/02/0023 - 2026-02-23 - Toolkit Bootstrap Starting Point.md | 44 +++
Notes/2026/02/0024 - 2026-02-23 - Toolkit and CLI Rewrite and Dashboard Migration.md | 65 +++++
Notes/2026/02/0025 - 2026-02-24 - Dashboard Bugs and SL Routing Fixes.md | 62 +++++
Notes/2026/02/0026 - 2026-02-25 - Persistent Jobs and Container Terminal.md | 69 +++++
14 files changed, 669 insertions(+), 0 deletions(-)
diff --git a/Notes/2026/02/0013 - 2026-02-20 - Infrastructure Repo & Ops CLI Bootstrap.md b/Notes/2026/02/0013 - 2026-02-20 - Infrastructure Repo & Ops CLI Bootstrap.md
new file mode 100644
index 0000000..1cf1735
--- /dev/null
+++ b/Notes/2026/02/0013 - 2026-02-20 - Infrastructure Repo & Ops CLI Bootstrap.md
@@ -0,0 +1,59 @@
+# Session 0013: Infrastructure Repo & Ops CLI Bootstrap
+
+**Date:** 2026-02-20
+**Status:** Completed
+**Origin:** MDF Webseiten session 0018
+
+---
+
+## Work Done
+
+- [x] Created infrastructure repo at `git.mnsoft.org/git/APPS/infrastructure.git`
+- [x] Local clone: `/Users/i052341/Daten/Cloud/08 - Others/MDF/Infrastruktur/Code/infrastructure/`
+- [x] Server clone: `/opt/infrastructure/`
+- [x] Wrote `ops` CLI (bash, ~250 lines) — symlinked to `/usr/local/bin/ops`
+- [x] Created `servers/hetzner-vps/registry.yaml` — single source of truth for 5 projects
+- [x] Captured 5 Traefik dynamic configs from server into git
+- [x] Wrote `monitoring/healthcheck.sh` — container health + disk checks → ntfy
+- [x] Installed `ops-healthcheck.timer` (every 5 min) on server
+- [x] Added Docker labels (`ops.project`, `ops.environment`, `ops.service`) to all MDF compose files
+- [x] Replaced hardcoded `container_name()` in `sync.py` with label-based discovery + UUID suffix fallback
+- [x] Verified: `ops status`, `ops health`, `ops disk`, `ops backup mdf prod` all working
+
+## Repo Structure Created
+
+```
+infrastructure/
+├── ops # The ops CLI (bash)
+├── servers/hetzner-vps/
+│ ├── registry.yaml # 5 projects defined
+│ ├── traefik/dynamic/ # Traefik configs captured
+│ ├── bootstrap/ # Coolify service payloads
+│ ├── scaffolding/ # Shell aliases, SSH hardening, venv setup
+│ ├── systemd/ # 6 timer/service units
+│ └── install.sh # Full fresh server setup script
+├── monitoring/
+│ ├── healthcheck.sh
+│ ├── ops-healthcheck.service
+│ └── ops-healthcheck.timer
+└── docs/architecture.md
+```
+
+## Key Decisions / Learnings
+
+- `ops` CLI uses `SCRIPT_DIR` with `readlink -f` for symlink-safe path resolution
+- `registry.yaml` uses a `name_prefix` field; container matching uses `grep` with word anchoring to prevent substring false matches
+- Label-based discovery is primary; a prefix search on the Coolify UUID-suffixed container name is the fallback
+- Docker labels added to compose files are not live on running containers until restart — noted as gap
+
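The two-stage resolution above (labels primary, name-prefix search as fallback) can be sketched as a pure function over `docker ps` output. The name `resolve_container` and the `(name, labels)` tuple shape are illustrative, not the actual `sync.py` code:

```python
from typing import Optional

# (container name, label map) pairs, as reported by `docker ps`
Container = tuple[str, dict]

def resolve_container(containers: list[Container], project: str,
                      env: str, name_prefix: str) -> Optional[str]:
    # Primary: label-based discovery via the ops.* labels.
    for name, labels in containers:
        if labels.get("ops.project") == project and labels.get("ops.environment") == env:
            return name
    # Fallback: prefix search for Coolify-managed names carrying a UUID suffix.
    for name, _labels in containers:
        if name.startswith(name_prefix):
            return name
    return None
```

The fallback matters because labels added to compose files only take effect after a container restart, as noted above.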
+## Files Changed
+
+- `/opt/infrastructure/ops` — new ops CLI (bash)
+- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — new registry
+- `/opt/infrastructure/monitoring/healthcheck.sh` — new healthcheck script
+- `Code/mdf-system/docker-compose.yaml` — added ops.* Docker labels
+- `Code/mdf-system/scripts/sync/sync.py` — label-based container discovery, domain map fix
+
+---
+
+**Tags:** #Session #OpsCLI #Infrastructure
diff --git a/Notes/2026/02/0014 - 2026-02-20 - Registry Naming & Backup System.md b/Notes/2026/02/0014 - 2026-02-20 - Registry Naming & Backup System.md
new file mode 100644
index 0000000..af29e22
--- /dev/null
+++ b/Notes/2026/02/0014 - 2026-02-20 - Registry Naming & Backup System.md
@@ -0,0 +1,42 @@
+# Session 0014: Registry Naming & Backup System
+
+**Date:** 2026-02-20
+**Status:** Completed
+**Origin:** MDF Webseiten session 0019
+
+---
+
+## Work Done
+
+- [x] Fixed `sl-website` registry placement — moved under `seriousletter.services.website` to resolve prefix collision
+- [x] Renamed all 7 Coolify services to consistent `{project}-{env/purpose}` lowercase naming
+- [x] Deleted stale stopped MDF Dev duplicate from Coolify (UUID: qw8wso0ckskccoo0kcog84c0)
+- [x] Fixed `ops backup/restore/sync` argument validation (was crashing on unbound variable)
+- [x] Fixed SL CLI path in `registry.yaml` (pointed to wrong location)
+- [x] Added `container_name()` to SL `sync.py` with label + prefix fallback (mirrors MDF pattern)
+- [x] Made `ops backup <project>` work without env arg (passes `--all` to CLI)
+- [x] Added backup summary to `ops status` — latest backup per project/env, size, age with color coding
+- [x] Consolidated backup dirs to `/opt/data/backups/{project}/{env}/` across all projects
+- [x] Updated both MDF and SL CLIs for per-env backup subdirectory structure
+- [x] Volume consolidation: all data migrated from 10GB to 50GB volume at `/opt/data`
+- [x] Updated all path references across compose files, CLIs, systemd units, registry, ops CLI
+
+## Key Decisions / Learnings
+
+- Registry was initially ambiguous about where `sl-website` lived — prefix collision with other SL services caused matching bugs. Moving it under a `services.website` key made the prefix unique.
+- Per-env backup subdirs (`/opt/data/backups/{project}/{env}/`) are the correct structure — flat dirs were the source of orphaned files.
+- `ops backup <project>` without env should be a valid shorthand — it delegates `--all` to the project CLI rather than requiring explicit env arg.
+- Container name resolution logic must be identical across project CLIs — label-based primary, prefix fallback secondary. Divergence causes mysterious "container not found" bugs.
+- Old 10GB volume was kept mounted during migration to avoid cwd-in-mountpoint issues during `umount`.
+
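The prefix-collision class of bug above is what the word-anchored matching (session 0013's `grep` anchoring) prevents. A Python analogue of that check, purely illustrative since the CLI does this in bash:

```python
import re

def matches_prefix(container_name: str, prefix: str) -> bool:
    """Word-anchored prefix match (Python analogue of the CLI's anchored grep)."""
    # The \b after the prefix rejects substring extensions: "mdf-prod" must
    # not match "mdf-production-test", but must match "mdf-prod-a1b2c3".
    return re.match(rf"{re.escape(prefix)}\b", container_name) is not None
```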
+## Files Changed
+
+- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — fixed sl-website placement, naming consistency
+- `/opt/infrastructure/ops` — fixed arg validation, `cmd_backup` without env, backup summary in status
+- `/opt/data/seriousletter/{dev,int,prod}/code/scripts/sync/sync.py` — added `container_name()` with fallback
+- `Code/mdf-system/scripts/sync/sync.py` — per-env backup subdirectory paths
+- All compose files, systemd units — `/opt/data2` → `/opt/data` path updates
+
+---
+
+**Tags:** #Session #OpsCLI #BackupSystem #Registry
diff --git a/Notes/2026/02/0015 - 2026-02-22 - Offsite Backup Dashboard Fix & Status Format.md b/Notes/2026/02/0015 - 2026-02-22 - Offsite Backup Dashboard Fix & Status Format.md
new file mode 100644
index 0000000..90a593e
--- /dev/null
+++ b/Notes/2026/02/0015 - 2026-02-22 - Offsite Backup Dashboard Fix & Status Format.md
@@ -0,0 +1,39 @@
+# Session 0015: Offsite Backup Dashboard Fix & Status Format
+
+**Date:** 2026-02-22
+**Status:** Completed
+**Origin:** MDF Webseiten session 0025
+
+---
+
+## Work Done
+
+- [x] Fixed offsite backups not showing in ops dashboard
+ - `/api/backups/offsite` was calling `run_ops_json()` (in-container execution) but `ops offsite list` requires the host Python venv
+ - Added `run_ops_host_json()` helper to `ops_runner.py` using `nsenter`-based host execution
+ - Updated `backups.py` router to use `run_ops_host_json()` for offsite listing
+ - Rebuilt and restarted ops-dashboard container
+- [x] Reformatted backup list in `ops status` CLI output
+ - Changed from flat table sorted by project to date-grouped boxes
+ - Each date gets its own Rich table: project / env / time / size / total columns
+ - Latest backup per project/env shown, grouped by date descending, sorted by project then env within each date
+- [x] Fixed SeriousLetter backup path bug (CLI-level fix, required for dashboard data correctness)
+ - SL CLI was dumping backups flat into `/opt/data/backups/` — changed `backup-all.sh` to call SL CLI per-env with explicit `--backup-dir`
+ - Moved 15 orphaned backup files to correct per-env directories
+- [x] Ran full backup cycle across all 6 environments (MDF + SL x dev/int/prod), verified offsite upload
+
+## Key Decisions / Learnings
+
+- Dashboard containers cannot run `ops` commands that require host-side Python venvs in-container — they must go through the `nsenter` bridge. This is a recurring pattern: the in-container vs host execution boundary is an important architectural distinction in the ops-dashboard.
+- Two execution helpers needed: `run_ops_json()` (in-container, fast) and `run_ops_host_json()` (host via nsenter, required for backup/offsite commands).
+- Date-grouped backup status is more readable than a flat project-sorted table — groups make it obvious if a date was missed entirely.
+
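The host-execution bridge boils down to prepending an `nsenter` invocation to the ops command. The exact flags used by `run_ops_host_json()` are an assumption here: `nsenter -t 1 -m -u -n -i` targets PID 1 and enters its mount, UTS, network, and IPC namespaces, the usual pattern for host execution from a privileged container:

```python
def host_ops_argv(ops_args: list[str]) -> list[str]:
    """Build the argv that runs `ops` inside the host's namespaces.

    Hypothetical sketch: flags and the ops binary path are assumptions,
    not the dashboard's actual helper.
    """
    return ["nsenter", "-t", "1", "-m", "-u", "-n", "-i",
            "/usr/local/bin/ops", *ops_args]
```

The resulting argv would then be handed to `subprocess.run()` with JSON output parsing, mirroring the in-container helper.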
+## Files Changed
+
+- `/opt/data/ops-dashboard/app/ops_runner.py` — added `run_ops_host_json()` helper
+- `/opt/data/ops-dashboard/app/routers/backups.py` — use host execution for offsite listing
+- `/opt/infrastructure/ops` — reformatted backup summary with date-grouped Rich tables
+
+---
+
+**Tags:** #Session #OpsDashboard #BackupSystem #Offsite
diff --git a/Notes/2026/02/0016 - 2026-02-22 - Backup Drill-Down Redesign & Restore Fix.md b/Notes/2026/02/0016 - 2026-02-22 - Backup Drill-Down Redesign & Restore Fix.md
new file mode 100644
index 0000000..9105415
--- /dev/null
+++ b/Notes/2026/02/0016 - 2026-02-22 - Backup Drill-Down Redesign & Restore Fix.md
@@ -0,0 +1,35 @@
+# Session 0016: Backup Drill-Down Redesign & Restore Fix
+
+**Date:** 2026-02-22
+**Status:** Completed
+**Origin:** MDF Webseiten session 0030
+
+---
+
+## Work Done
+
+- [x] Fixed restore API call — `mdf` CLI was falling into interactive selection because no backup filename was passed
+ - `app.js`: `startRestore()` now includes `&name=...` from `restoreCtx` in the API URL
+- [x] Implemented backups drill-down redesign (deployed as v7)
+ - Replaced flat filter state with 3-level drill state (project → env → backup file)
+ - Added cached backups to avoid re-fetching on drill-back
+ - Extracted `mergeBackups()` helper function
+ - Implemented all 13 changes from the redesign plan
+- [x] Fixed browser cache problem preventing new JS from loading after rebuild
+ - Rebuilt image and restarted container to force cache bust
+
+## Key Decisions / Learnings
+
+- Restore API must include the backup filename explicitly — passing only project/env and letting the CLI choose interactively breaks in non-TTY server context.
+- 3-level drill state (project → env → file) is the right UX pattern for hierarchical backup selection; flat filter state made navigation confusing and state management error-prone.
+- Caching fetched backup lists at each level avoids latency on drill-back and reduces server load.
+- Browser cache busting on vanilla JS apps requires either cache-control headers or a version query param — container restart alone does not always clear client caches.
+
+## Files Changed
+
+- `/opt/data/ops-dashboard/static/js/app.js` — `startRestore()` fix, 3-level drill state, `mergeBackups()` helper
+- Docker image rebuilt and container restarted
+
+---
+
+**Tags:** #Session #OpsDashboard #BackupSystem
diff --git a/Notes/2026/02/0017 - 2026-02-22 - Modular Sync Promote Rebuild Architecture.md b/Notes/2026/02/0017 - 2026-02-22 - Modular Sync Promote Rebuild Architecture.md
new file mode 100644
index 0000000..1f2e974
--- /dev/null
+++ b/Notes/2026/02/0017 - 2026-02-22 - Modular Sync Promote Rebuild Architecture.md
@@ -0,0 +1,61 @@
+# Session 0017: Modular Sync/Promote/Rebuild Architecture
+
+**Date:** 2026-02-22
+**Status:** Paused (context checkpoint)
+**Origin:** MDF Webseiten session 0032
+
+---
+
+## Work Done
+
+- [x] Fixed SL `detect_env()` — it was returning "seriousletter" instead of the env name; it now scans path components for the first valid env after "data"
+- [x] Fixed MDF `list_backups()` indentation bug — the `try` block sat at the same level as the `for` loop, so only the last backup file was parsed
+- [x] Added `promote` config to `registry.yaml` for mdf (rsync), seriousletter (git), ringsaday (git) — each defines promote type, branch mapping, post-pull behavior
+- [x] Added `promote` Typer command to SL `sync.py` — git fetch, diff preview, git pull, Dockerfile change detection, container rebuild/restart, health check; only dev→int and int→prod allowed
+- [x] Added `cmd_promote` to ops CLI — delegates to project CLI with `--from`/`--to` args
+- [x] Added `cmd_rebuild` to ops CLI — starts containers, waits for health, restores latest backup
+- [x] Created 4 new FastAPI routers in ops-dashboard:
+ - `promote.py` — SSE streaming promote endpoint
+ - `sync_data.py` — SSE streaming sync endpoint
+ - `registry.py` — exposes project list + environments + promote config as JSON
+ - `rebuild.py` — SSE streaming rebuild/disaster-recovery endpoint
+- [x] Updated `backups.py` to read project list from registry API instead of hardcoding
+- [x] Added "Operations" page to dashboard sidebar with three sections: Promote Code, Sync Data, Rebuild (Disaster Recovery)
+- [x] Operations page uses SSE modal with dry-run toggle; project/direction buttons populated dynamically from `/api/registry/`
+- [x] Verified all 7 test categories pass
+
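The promote direction rule (only dev→int and int→prod) reduces to an adjacency check over the ordered environment list. A minimal sketch; the function name is illustrative, not the actual `sync.py` code:

```python
PROMOTE_ORDER = ["dev", "int", "prod"]

def validate_promotion(src: str, dst: str) -> None:
    """Reject anything but an adjacent upward hop (dev -> int, int -> prod)."""
    if src not in PROMOTE_ORDER or dst not in PROMOTE_ORDER:
        raise ValueError(f"unknown environment: {src!r} -> {dst!r}")
    # Adjacency: the destination must sit exactly one step above the source.
    if PROMOTE_ORDER.index(dst) - PROMOTE_ORDER.index(src) != 1:
        raise ValueError(f"promotion {src} -> {dst} is not allowed")
```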
+## Key Decisions / Learnings
+
+- All long-running ops commands (promote, sync, rebuild) use SSE streaming — consistent with existing backup/restore pattern. The `stream_ops_host()` helper is the standard interface.
+- Registry is the single source of truth for project/environment/promote config. Dashboard reads it dynamically — no hardcoded project names in API routers.
+- Promote direction validation lives in the project CLI (`sync.py`), not in the ops CLI or dashboard — keeps enforcement close to the implementation.
+- `ops rebuild` is the disaster recovery entry point: bring up containers → wait for healthy → restore latest backup. Simple, composable.
+- `detect_env()` path parsing must handle the full `/opt/data/seriousletter/{env}/code/...` structure — scanning for VALID_ENVS after "data" in path components is robust.
+
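The `detect_env()` fix described above can be sketched as follows, assuming the `/opt/data/{project}/{env}/code/...` layout; scanning for the first VALID_ENVS member *after* the "data" component is what prevents the project name ("seriousletter") from being returned:

```python
from pathlib import PurePosixPath

VALID_ENVS = ("dev", "int", "prod")

def detect_env(path: str) -> str:
    """Return the first valid env name appearing after the 'data' component."""
    seen_data = False
    for part in PurePosixPath(path).parts:
        if seen_data and part in VALID_ENVS:
            return part
        if part == "data":
            seen_data = True
    raise ValueError(f"no environment in path: {path}")
```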
+## Files Changed
+
+- `/opt/data/seriousletter/{dev,int,prod}/code/scripts/sync/sync.py` — fix `detect_env`, add `promote` command
+- `Code/mdf-system/scripts/sync/sync.py` (local + deployed to dev) — fix `list_backups` indentation
+- `/opt/infrastructure/servers/hetzner-vps/registry.yaml` — add `promote` config per project
+- `/opt/infrastructure/ops` — add `cmd_promote`, `cmd_rebuild`
+- `/opt/data/ops-dashboard/app/routers/promote.py` — new SSE promote endpoint
+- `/opt/data/ops-dashboard/app/routers/sync_data.py` — new SSE sync endpoint
+- `/opt/data/ops-dashboard/app/routers/registry.py` — new registry JSON endpoint
+- `/opt/data/ops-dashboard/app/routers/rebuild.py` — new SSE rebuild endpoint
+- `/opt/data/ops-dashboard/app/routers/backups.py` — dynamic project list from registry
+- `/opt/data/ops-dashboard/app/main.py` — register 4 new routers
+- `/opt/data/ops-dashboard/static/js/app.js` — Operations page UI + SSE modal
+- `/opt/data/ops-dashboard/static/index.html` — nav link + ops-modal HTML
+
+## Next Steps (at time of pause)
+
+- [ ] Test backup creation from dashboard UI
+- [ ] Test full promote dry-run via dashboard (Operations page)
+- [ ] Test sync dry-run via dashboard
+- [ ] Commit infrastructure and code repo changes on server
+- [ ] DNS cutover mdf-system.de → .ch
+- [ ] Disaster recovery test (destroy + rebuild SL dev)
+
+---
+
+**Tags:** #Session #OpsDashboard #OpsCLI #Promote #Sync #Rebuild #Registry
diff --git a/Notes/2026/02/0018 - 2026-02-22 - CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI.md b/Notes/2026/02/0018 - 2026-02-22 - CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI.md
new file mode 100644
index 0000000..10b5b64
--- /dev/null
+++ b/Notes/2026/02/0018 - 2026-02-22 - CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI.md
@@ -0,0 +1,44 @@
+# Session 0018: CLI Contract Spec, Sync Compliance, Dashboard Bidirectional UI
+
+**Date:** 2026-02-22
+**Status:** Completed
+**Origin:** MDF Webseiten session 0033
+
+---
+
+## Work Done
+
+- [x] Defined project CLI contract (`infrastructure/docs/cli-contract.md`, 514 lines): 4 required commands (backup, restore, sync, promote), exact flags, exit codes, output format, compliance checklist, minimal shell CLI example for new projects
+- [x] MDF sync.py contract compliance: ANSI suppression (NO_COLOR env var + TTY detection), `--yes` flag for backup, 6 cancellation paths changed exit 0 → exit 2, `[error]` prefix helper for stderr
+- [x] SL sync.py contract compliance: ANSI suppression, `error_exit()` helper, backup now uses per-env subdirectories, absolute path output after backup
+- [x] Ops CLI de-hardcoding: removed stale `/opt/data2` from disk checks and healthcheck.sh, generalized hardcoded MDF-specific comments, added `find_registry()` multi-server comment
+- [x] Disaster recovery docs: fixed `install.sh` (single-volume layout, auto-detection), fixed `bootstrap.sh` (network pre-creation, local image builds, restore instructions), wrote `docs/disaster-recovery.md` (10-phase runbook)
+- [x] Dashboard JS fix: fixed syntax errors in Operations page onclick handlers (nested quotes)
+- [x] Permanent cache fix: content-hashed asset URLs so manual `?v=XX` bumps are no longer needed
+- [x] Bidirectional sync UI: `prod ↔ dev` with direction picker modal ("content flows down" / "content flows up")
+- [x] Deployed to server: ops CLI, registry, healthcheck, install.sh, bootstrap.sh, both sync.py scripts (all 3 envs), dashboard rebuilt with content hashing
+- [x] Verified: ops status, ops health, promote dry-run, restore --list, dashboard SSE streaming
+
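The permanent cache fix amounts to deriving the served asset URL from a digest of the file's bytes, so any rebuild that changes `app.js` also changes its URL. The naming scheme below is an assumption for illustration; the dashboard's actual helper may differ:

```python
import hashlib

def hashed_asset_url(path: str, content: bytes) -> str:
    """Content-hashed asset name: URL changes iff the file content changes."""
    digest = hashlib.sha256(content).hexdigest()[:10]
    stem, dot, ext = path.rpartition(".")
    # Insert the digest before the extension, or append it if there is none.
    return f"{stem}.{digest}.{ext}" if dot else f"{path}.{digest}"
```

Browsers can then cache aggressively (immutable headers) without ever serving a stale build, removing the need for manual `?v=XX` bumps.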
+## Key Decisions / Learnings
+
+- CLI contract enforces: ANSI off via `NO_COLOR` or non-TTY detection; exit codes 0 (success), 1 (error), 2 (cancelled by user); `[error]` prefix on stderr; `--yes` flag to skip prompts in automation
+- Cancellation paths must exit 2, not 0 — exit 0 was masking user-cancelled operations in the dashboard
+- Content hashing (not version query params) is the correct long-term cache-busting solution
+- `find_registry()` multi-server support is documented but not yet implemented — placeholder for future
+- The DR runbook is a 10-phase sequence, running from backup verification through server restore, dependency installation, repo cloning, and data restore to service startup and final verification
+
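The contract's output and exit-code rules are small enough to sketch directly. A minimal Python rendering, under the assumption that helper names differ per CLI (MDF and SL use their own helpers, per the compliance work above):

```python
import os
import sys

# Contract exit codes: 0 success, 1 error, 2 cancelled by user.
EXIT_OK, EXIT_ERROR, EXIT_CANCELLED = 0, 1, 2

def colors_enabled(stream=sys.stdout) -> bool:
    """ANSI output only when writing to a TTY and NO_COLOR is unset."""
    return stream.isatty() and "NO_COLOR" not in os.environ

def format_error(msg: str) -> str:
    """Contract-mandated stderr prefix."""
    return f"[error] {msg}"
```

A cancellation path would `sys.exit(EXIT_CANCELLED)` rather than 0, so the dashboard can distinguish "user said no" from success.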
+## Files Changed
+
+- `infrastructure/docs/cli-contract.md` — new, 514 lines, defines the full CLI contract
+- `infrastructure/docs/disaster-recovery.md` — new, 10-phase DR runbook
+- `infrastructure/install.sh` — single-volume layout with auto-detection
+- `infrastructure/bootstrap.sh` — network pre-creation, local image builds, restore instructions
+- `infrastructure/ops` — removed `/opt/data2`, generalized hardcoded comments, `find_registry()` note
+- `infrastructure/healthcheck.sh` — removed stale `/opt/data2` disk check
+- `Code/mdf-system/scripts/sync/sync.py` — ANSI suppression, `--yes`, exit 2 cancellations, `[error]` helper
+- `Code/seriousletter-sync/sync.py` — ANSI suppression, `error_exit()`, per-env backup dirs, absolute path output
+- `Code/ops-dashboard/` — JS onclick fix, content-hashed assets, bidirectional sync UI
+
+---
+
+**Tags:** #Session #OpsToolkit #OpsDashboard #CliContract #DisasterRecovery
diff --git a/Notes/2026/02/0019 - 2026-02-22 - Offsite Download Feature Added to Dashboard.md b/Notes/2026/02/0019 - 2026-02-22 - Offsite Download Feature Added to Dashboard.md
new file mode 100644
index 0000000..c2bf7e7
--- /dev/null
+++ b/Notes/2026/02/0019 - 2026-02-22 - Offsite Download Feature Added to Dashboard.md
@@ -0,0 +1,27 @@
+# Session 0019: Offsite Download Feature Added to Dashboard
+
+**Date:** 2026-02-22
+**Status:** Completed
+**Origin:** MDF Webseiten session 0039
+
+---
+
+## Work Done
+
+- [x] Added offsite download feature to ops dashboard: per-row download buttons on the Backups page plus action bar buttons
+- [x] Offsite download uses SSE streaming (consistent with existing backup/restore/upload patterns)
+- [x] Updated ops registry with Seafile services (adds ops-visible services to status output)
+
+## Key Decisions / Learnings
+
+- Offsite download follows the same SSE streaming pattern as backup upload — consistency across all long-running operations
+- Per-row buttons (individual file download) and action bar buttons (bulk/selected) both supported
+
+## Files Changed
+
+- `Code/ops-dashboard/` — offsite download UI (per-row + action bar) with SSE streaming
+- `infrastructure/servers/hetzner-vps/registry.yaml` — added Seafile services
+
+---
+
+**Tags:** #Session #OpsDashboard #Offsite #SSE
diff --git a/Notes/2026/02/0020 - 2026-02-23 - Backup Coverage Audit, Registry Fixes, Container Resolution.md b/Notes/2026/02/0020 - 2026-02-23 - Backup Coverage Audit, Registry Fixes, Container Resolution.md
new file mode 100644
index 0000000..9f7ef9b
--- /dev/null
+++ b/Notes/2026/02/0020 - 2026-02-23 - Backup Coverage Audit, Registry Fixes, Container Resolution.md
@@ -0,0 +1,35 @@
+# Session 0020: Backup Coverage Audit, Registry Fixes, Container Resolution
+
+**Date:** 2026-02-23
+**Status:** Completed
+**Origin:** MDF Webseiten session 0041
+
+---
+
+## Work Done
+
+- [x] Fixed ringsaday backup error
+ - Added `backup_sources` (volumes, keys, server, website, .env) and `backup` config to registry
+ - Changed `backup_dir` to `/opt/data/backups/ringsaday`
+ - Fixed `_backup_generic()` — changed the `-d` test to `-e` so individual files (not just directories) can be backed up
+ - Tested: 689 MB backup created successfully
+- [x] Full backup coverage audit: identified kioskpilot (1.3 MB) and ops-dashboard (1.5 MB) as missing backups
+- [x] Added kioskpilot backup (03:45, 30-day retention)
+- [x] Added ops-dashboard to registry + nightly backup (04:15, 30-day retention)
+- [x] Now 6 nightly backup timers: mdf, seriousletter, ringsaday, kioskpilot, ops-dashboard, coolify
+- [x] Fixed ringsaday container resolution: was showing duplicated entries in `ops status`
+ - Added `{prefix}-{env}-` matching pattern to `find_containers()` (handles ringsaday-dev-UUID style names)
+ - Added ringsaday-website as sub-service with `environments: [prod]`
+- [x] Deployed registry.yaml and ops CLI to server; 6 systemd timers active; backup dirs created
+
+## Key Decisions / Learnings
+
+- `_backup_generic()` used `-d` (directory flag) which silently skipped individual files like `.env` and SSL keys — the fix to `-e` (existence check) makes it handle both files and directories
+- Container naming for ringsaday uses `{prefix}-{env}-UUID` (Coolify-managed), different from other projects — `find_containers()` needed a second pattern to match these
+- ops-dashboard itself must be backed up — it holds its own config and data, easy to overlook
+- Backup coverage audit should be a recurring check whenever new projects are added
+
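The `-d` vs `-e` distinction is easy to see in a Python analogue of the source filter (the ops CLI itself does this in bash with `test` flags):

```python
import os

def existing_backup_sources(sources: list[str]) -> list[str]:
    """Keep every source that exists, whether plain file or directory.

    Python analogue of the bash fix: `[ -d "$src" ]` silently skipped plain
    files such as `.env` and SSL keys; `[ -e "$src" ]` (os.path.exists)
    accepts both files and directories.
    """
    return [src for src in sources if os.path.exists(src)]
```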
+## Files Changed
+
+- `infrastructure/servers/hetzner-vps/registry.yaml` — kioskpilot backup, ops-dashboard entry, ringsaday website sub-service, ringsaday backup_sources
+- `infrastructure/ops` — `_backup_generic()` -d→-e fix, `find_containers()` new UUID-style pattern
+
+---
+
+**Tags:** #Session #OpsToolkit #Backup #Registry #ContainerResolution
diff --git a/Notes/2026/02/0021 - 2026-02-23 - Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush.md b/Notes/2026/02/0021 - 2026-02-23 - Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush.md
new file mode 100644
index 0000000..b98061a
--- /dev/null
+++ b/Notes/2026/02/0021 - 2026-02-23 - Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush.md
@@ -0,0 +1,35 @@
+# Session 0021: Rebuild.py Coolify-Only Lifecycle, SSE Keepalive, Traefik Flush
+
+**Date:** 2026-02-23
+**Status:** Completed
+**Origin:** MDF Webseiten session 0044 (part 1)
+
+---
+
+## Work Done
+
+- [x] rebuild.py — removed all docker compose fallbacks
+ - recreate is now Coolify stop → wipe → Coolify start
+ - rebuild is Coolify stop → docker build → Coolify start
+ - restart stays as `docker restart` (Coolify restart prunes local images — intentional exception)
+- [x] Fixed build step: changed from `docker compose --profile {env} build` (requires all Coolify env vars) to `docker build -t {image}:{env} {context}` using registry `build_context` and `image_name` directly — no env vars needed
+- [x] Added `_coolify_start_with_retry()`: polls 60s after API call, retries up to 3 times — handles Coolify silently dropping start requests
+- [x] Container stabilization polling: `_poll_until_running` now waits for container count to be stable for 2 consecutive polls (10s) before declaring success — previously returned success on first container appearance
+- [x] "Already running/stopped" handling: Coolify API HTTP 400 with that message now treated as success, not error
+- [x] SSE keepalive for restore: restore connections were dropping during DB import (~60s silence); added `_stream_with_keepalive()` wrapper in `restore.py` — sends SSE comment `: keepalive` every 15s
+- [x] Added `responseForwarding.flushInterval: "-1"` to ops-dashboard Traefik dynamic config — Traefik was buffering SSE responses, causing keepalives to not reach the client
+
+## Key Decisions / Learnings
+
+- Coolify `restart` prunes locally-built images — `docker restart` (bypassing Coolify) is the correct approach for services with local images; this is a documented exception in rebuild.py
+- Coolify can silently queue-and-never-execute start requests — retry logic with polling is mandatory, not optional
+- "Already running" from Coolify API is a valid state (idempotent), not an error — treat HTTP 400 with that message as success
+- SSE keepalive must happen at the application level (`: keepalive` comment) AND Traefik must be configured to flush immediately (`flushInterval: "-1"`) — both are required; one alone is not enough
+- Stable polling (2 consecutive matching counts) is more reliable than "at least one container appeared"
+
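The keepalive idea can be sketched as an async-generator wrapper: re-yield upstream events, and whenever the upstream is silent longer than the interval, emit the SSE comment line. This is a sketch of the concept only; the internals of `_stream_with_keepalive()` in `restore.py` are assumed, not copied:

```python
import asyncio
from typing import AsyncIterator

async def stream_with_keepalive(source: AsyncIterator[str],
                                interval: float = 15.0) -> AsyncIterator[str]:
    """Forward `source` events, inserting `: keepalive` when idle."""
    it = source.__aiter__()
    while True:
        nxt = asyncio.ensure_future(it.__anext__())
        while True:
            try:
                # Shield so a timeout does not cancel the pending read.
                event = await asyncio.wait_for(asyncio.shield(nxt), timeout=interval)
            except asyncio.TimeoutError:
                yield ": keepalive\n\n"  # SSE comment, ignored by clients
                continue
            except StopAsyncIteration:
                return
            yield event
            break
```

As the learnings note, this only works end to end if Traefik also flushes immediately (`flushInterval: "-1"`); otherwise the keepalives sit in the proxy buffer.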
+## Files Changed
+
+- `Code/ops-dashboard/app/routers/rebuild.py` — Coolify-only lifecycle, `docker build` from registry config, `_coolify_start_with_retry()`, stable container polling, HTTP 400 success handling
+- `Code/ops-dashboard/app/routers/restore.py` — `_stream_with_keepalive()` SSE keepalive wrapper
+- Server: `/data/coolify/proxy/dynamic/ops-dashboard.yaml` — added `responseForwarding.flushInterval: "-1"`
+
+---
+
+**Tags:** #Session #OpsDashboard #Rebuild #SSE #Traefik #Coolify
diff --git a/Notes/2026/02/0022 - 2026-02-23 - Post-Coolify Architecture Context for Ops Toolkit.md b/Notes/2026/02/0022 - 2026-02-23 - Post-Coolify Architecture Context for Ops Toolkit.md
new file mode 100644
index 0000000..2d8c973
--- /dev/null
+++ b/Notes/2026/02/0022 - 2026-02-23 - Post-Coolify Architecture Context for Ops Toolkit.md
@@ -0,0 +1,52 @@
+# Session 0022: Post-Coolify Architecture Context for Ops Toolkit
+
+**Date:** 2026-02-23
+**Status:** Completed
+**Origin:** MDF Webseiten session 0044 (Coolify Removal Complete)
+
+---
+
+## Work Done
+
+- [x] Coolify fully removed from server (6 containers, 18 UUID networks, /data/coolify/ directory)
+- [x] Standalone Traefik v3.6 confirmed as proxy layer (was coolify-proxy, now independent at /opt/data/traefik/)
+- [x] All 28 containers verified operational post-removal; 17/17 domains tested
+- [x] Dynamic configs migrated: seriousletter.yaml, ringsaday.yaml moved to /opt/data/traefik/dynamic/
+- [x] SSL certificates preserved: acme.json migrated to /opt/data/traefik/acme.json
+- [x] Coolify archive retained: /opt/data/backups/coolify-final-20260223.tar.gz (125KB, 30-day window)
+
+## Key Decisions / Learnings
+
+- **Ops toolkit no longer depends on Coolify API** — all lifecycle management (start/stop/rebuild/recreate) must use Docker CLI and docker compose directly against project compose files at `/opt/data/{project}/`
+- **Container naming is now clean** — no more UUID suffixes. Pattern: `{env}-{project}-{service}` (e.g. `prod-mdf-wordpress`, `dev-seriousletter-backend`)
+- **Proxy network is `proxy`** (replaces old `coolify` network) — all Traefik-exposed containers connect to it
+- **Project descriptors at `/opt/data/{project}/project.yaml`** are the new source of truth for container config — registry.yaml is deprecated (used only by gen-timers and schedule PUT)
+- **Docker provider + file provider** coexist in Traefik: MDF services use Docker labels; SeriousLetter, RingsADay, KioskPilot use file provider configs
+- metro.ringsaday.com returns 502 — pre-existing issue unrelated to Coolify removal (no metro service in compose)
+- Docker system cleanup freed ~9GB of unused images and volumes during removal
+
+## Architecture Reference (Post-Coolify)
+
+```
+Proxy: Traefik v3.6 at /opt/data/traefik/
+Config: traefik.yaml (static), dynamic/ (file provider)
+Certs: /opt/data/traefik/acme.json
+Proxy network: proxy
+
+Projects:
+ MDF prod: /opt/data/mdf/prod/ — WordPress, MySQL, Mail, PostfixAdmin, Roundcube, Seafile
+ MDF int/dev: /opt/data/mdf/{int,dev}/ — WordPress + MySQL
+ SeriousLetter: /opt/data/seriousletter/{dev,int,prod}/
+ RingsADay: /opt/data/ringsaday/
+ KioskPilot: /opt/data/kioskpilot/
+ Ops Dashboard: /opt/data/ops-dashboard/
+```
+
+## Files Changed
+
+- Server: `/data/coolify/` — deleted (backed up first)
+- Server: `/opt/data/traefik/dynamic/` — received migrated seriousletter.yaml and ringsaday.yaml
+
+---
+
+**Tags:** #Session #OpsToolkit #Architecture #Traefik #PostCoolify
diff --git a/Notes/2026/02/0023 - 2026-02-23 - Toolkit Bootstrap Starting Point.md b/Notes/2026/02/0023 - 2026-02-23 - Toolkit Bootstrap Starting Point.md
new file mode 100644
index 0000000..397bc94
--- /dev/null
+++ b/Notes/2026/02/0023 - 2026-02-23 - Toolkit Bootstrap Starting Point.md
@@ -0,0 +1,44 @@
+# Session 0023: Toolkit Bootstrap Starting Point
+
+**Date:** 2026-02-23
+**Status:** Completed
+**Origin:** MDF Webseiten session 0045
+
+---
+
+## Work Done
+
+- [x] Created `project.yaml` descriptors for all 5 projects (mdf, seriousletter, ringsaday, kioskpilot, ops-dashboard)
+- [x] Updated `ops-dashboard` docker-compose.yaml: network `coolify` → `proxy`
+- [x] Added Alpine pre-pull with retry (4 attempts, 15s delays) to `rebuild.py` — note: this was a pre-redesign patch, superseded by Phase 5 rewrite in session 0046
+- [x] Added image verification after build to `rebuild.py`
+- [x] Identified Phase 3+4 toolkit work as next immediate task (was interrupted this session)
+
+## Context / Background
+
+This session was primarily about removing Coolify and migrating all projects to standalone Docker Compose. The OPS-relevant outcome is:
+
+- All 5 `project.yaml` descriptors now exist and are the source of truth for the toolkit
+- The `proxy` Docker network replaces the old `coolify` network — all Traefik-exposed containers connect to it
+- The toolkit build (Phase 3+4) was planned but interrupted mid-session — completed in session 0046
+- The plan was documented at: `Notes/swarm/plan.md` (since cleaned up)
+
+## Key Decisions / Learnings
+
+- `container_prefix` in `project.yaml` uses `{env}` placeholder (e.g. `"{env}-mdf"`) — the toolkit must expand this at runtime
+- SeriousLetter uses `"{env}-seriousletter"` as prefix (not `sl`)
+- ops-dashboard gets its own `project.yaml` like all other projects
+
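The `{env}` expansion and prefix matching described above can be sketched in Python. This is a minimal sketch with hypothetical helper signatures; `container_prefix_for` is named in a later session note, but the toolkit's actual API may differ:

```python
import fnmatch

def container_prefix_for(prefix_template: str, env: str) -> str:
    """Expand the {env} placeholder, e.g. "{env}-mdf" + "prod" -> "prod-mdf"."""
    return prefix_template.replace("{env}", env)

def match_containers(prefix_template: str, env: str, names: list[str]) -> list[str]:
    """Keep only containers whose names match '{prefix}-*' for this environment."""
    prefix = container_prefix_for(prefix_template, env)
    return [n for n in names if fnmatch.fnmatch(n, prefix + "-*")]
```

For example, `match_containers("{env}-seriousletter", "prod", names)` keeps `prod-seriousletter-backend` but not `int-seriousletter-backend`.
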
+## Files Changed
+
+- `/opt/data/mdf/project.yaml` — created
+- `/opt/data/seriousletter/project.yaml` — created
+- `/opt/data/ringsaday/project.yaml` — created
+- `/opt/data/kioskpilot/project.yaml` — created
+- `/opt/data/ops-dashboard/project.yaml` — created
+- `/opt/data/ops-dashboard/docker-compose.yml` — network coolify→proxy
+- `app/routers/rebuild.py` — Alpine retry + image verify (pre-redesign, superseded)
+
+---
+
+**Tags:** #Session #OpsToolkit #Infrastructure
diff --git a/Notes/2026/02/0024 - 2026-02-23 - Toolkit and CLI Rewrite and Dashboard Migration.md b/Notes/2026/02/0024 - 2026-02-23 - Toolkit and CLI Rewrite and Dashboard Migration.md
new file mode 100644
index 0000000..d45120e
--- /dev/null
+++ b/Notes/2026/02/0024 - 2026-02-23 - Toolkit and CLI Rewrite and Dashboard Migration.md
@@ -0,0 +1,65 @@
+# Session 0024: Toolkit and CLI Rewrite and Dashboard Migration
+
+**Date:** 2026-02-23
+**Status:** Completed
+**Origin:** MDF Webseiten session 0046
+
+---
+
+## Work Done
+
+### Phase 3: Shared Toolkit
+
+- [x] Completed 5 missing toolkit modules at `/opt/infrastructure/toolkit/`:
+ - `cli.py` — main CLI entry point with all commands (status, start, stop, build, rebuild, destroy, backup, restore, sync, promote, logs, health, disk, backups, offsite, gen-timers, init)
+ - `output.py` — formatted output (Rich tables, JSON mode, plain text fallback)
+ - `restore.py` — restore operations with CLI delegation support
+ - `sync.py` — data sync between environments with CLI delegation
+ - `promote.py` — code promotion (git, rsync, script) with adjacency enforcement
+- [x] 7 modules already existed from prior sessions: `__init__.py`, `descriptor.py`, `docker.py`, `backup.py`, `database.py`, `health.py`, `discovery.py`
+
+### Phase 4: Ops CLI Rewrite
+
+- [x] Replaced 950-line bash ops CLI with 7-line bash shim → `python3 -m toolkit.cli`
+- [x] Old CLI backed up as `ops.bak.20260223`
+- [x] New commands added: `start`, `stop`, `build`, `destroy`, `logs`, `restart`, `init`
+- [x] All commands read from `project.yaml` descriptors — no `registry.yaml` dependency
+- [x] Container prefix matching fixed: handles `{env}` placeholder expansion in `container_prefix`
+
+### Phase 5: Dashboard Adaptation
+
+- [x] Rewrote 4 dashboard routers to use project.yaml:
+ - `registry.py` — imports `toolkit.discovery.all_projects()` instead of parsing registry.yaml
+ - `services.py` — uses `toolkit.descriptor.find()` for container name resolution
+ - `rebuild.py` — massive rewrite: 707 → 348 lines, removed ALL Coolify API code, uses direct docker compose
+ - `schedule.py` — reads from descriptors for GET, still writes to registry.yaml for PUT (gen-timers compatibility)
+- [x] Verified all API endpoints working:
+ - `/api/registry/` — returns all 5 projects from descriptors
+ - `/api/status/` — shows 25 containers
+ - `/api/schedule/` — shows backup schedules for all 5 projects
+ - `/api/services/logs/mdf/prod/wordpress` — correctly resolves container name
+
+## Key Decisions / Learnings
+
+- `rebuild.py` now uses `_compose_cmd()` helper that finds compose file (.yaml/.yml), env-file (.env.{env}/.env), and adds `--profile {env}` — removes all Coolify API dependency
+- Dashboard container has `/opt/infrastructure` mounted → can import toolkit directly via Python
+- pyyaml 6.0.3 confirmed available in dashboard container
+- `schedule.py` still writes to `registry.yaml` for PUT/gen-timers — full descriptor migration is a future task
+- `container_prefix_for(env)` expands `{env}` in prefix, then matches `{prefix}-*` containers
+
+## Files Changed
+
+- `/opt/infrastructure/toolkit/cli.py` — new (all CLI commands)
+- `/opt/infrastructure/toolkit/output.py` — new (Rich/JSON/plain output)
+- `/opt/infrastructure/toolkit/restore.py` — new
+- `/opt/infrastructure/toolkit/sync.py` — new
+- `/opt/infrastructure/toolkit/promote.py` — new
+- `/usr/local/bin/ops` — rewritten as 7-line bash shim
+- `app/routers/registry.py` — uses toolkit.discovery
+- `app/routers/services.py` — uses toolkit.descriptor
+- `app/routers/rebuild.py` — 707→348 lines, Coolify removed
+- `app/routers/schedule.py` — descriptor-backed GET
+
+---
+
+**Tags:** #Session #OpsToolkit #OpsCLI #OpsDashboard
diff --git a/Notes/2026/02/0025 - 2026-02-24 - Dashboard Bugs and SL Routing Fixes.md b/Notes/2026/02/0025 - 2026-02-24 - Dashboard Bugs and SL Routing Fixes.md
new file mode 100644
index 0000000..612aaad
--- /dev/null
+++ b/Notes/2026/02/0025 - 2026-02-24 - Dashboard Bugs and SL Routing Fixes.md
@@ -0,0 +1,62 @@
+# Session 0025: Dashboard Bugs and SL Routing Fixes
+
+**Date:** 2026-02-24
+**Status:** Completed
+**Origin:** MDF Webseiten session 0048 (Part 2 only — DNS cutover and mail recovery sections skipped)
+
+---
+
+## Work Done
+
+### Operations Page: Recreate Replaced by Backup + Restore
+
+- [x] Removed "Recreate" lifecycle action (redundant with Rebuild for bind-mount projects)
+- [x] Added **Backup** button (blue): opens lifecycle modal with SSE streaming to `/api/backups/stream/{project}/{env}`
+- [x] Added **Restore** button (purple): navigates to Backups page at drill level 2 for that project/env
+- [x] Added cache invalidation on backup success
+
+### SeriousLetter Bad Gateway Fix
+
+- [x] Diagnosed root cause: the SL containers were only on `seriousletter-network`, not on the `proxy` network that Traefik uses
+- [x] Permanent fix: added `proxy` network to docker-compose.yaml for all 3 SL envs (prod/int/dev)
+ - `backend` and `frontend` services get `proxy` in networks list
+ - `proxy: external: true` added to networks section
+- [x] Added health checks for both services:
+ - Backend: `python3 urllib.request.urlopen("http://localhost:8000/docs")`
+  - Frontend: `wget --spider -q http://127.0.0.1:3000/` (explicit `127.0.0.1`, not `localhost`, which Alpine resolves to IPv6 `::1`)
+
+### Sync Routing Bug Fix
+
+- [x] Fixed sync section only showing MDF (not SeriousLetter)
+- [x] Root cause (two-part):
+  1. `registry.py` checked `desc.sync.get("type") == "cli"`; SL had `sync.type: toolkit`, so the check evaluated to `False`
+ 2. SL's `toolkit` type was itself wrong — should be `cli` with a CLI path
+- [x] Fix in `registry.py`: `"has_cli": desc.sync.get("type") == "cli"` → `"has_cli": bool(desc.sync.get("type"))`
+- [x] Fix in `/opt/data/seriousletter/project.yaml`: `sync.type: toolkit` → `type: cli` with `cli:` path
+
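The `has_cli` change is small enough to show inline. A sketch, with `sync` standing in for the parsed `desc.sync` dict:

```python
def has_cli(sync: dict) -> bool:
    # Before: sync.get("type") == "cli" silently dropped type "toolkit".
    # After: any truthy sync type means the project supports sync at all.
    return bool(sync.get("type"))
```
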
+### Backup Date Inconsistency Fix
+
+- [x] Fixed overview card showing stale "INT Latest" date while drill-down showed correct newer backups
+- [x] Root cause: string comparison between incompatible date formats:
+ - Compact (MDF CLI): `20260220_195300`
+ - ISO (toolkit): `2026-02-24T03:00:42`
+  - Since `'0' > '-'` in ASCII, compact dates always "won" the `>` comparison
+- [x] Fix: added `normalizeBackupDate()` function to convert all dates to ISO format at merge time in `mergeBackups()`
+
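The real fix lives in JavaScript (`normalizeBackupDate()` in `app.js`); the idea, rendered here as a Python sketch, is to rewrite compact timestamps into ISO form before any comparison:

```python
import re

def normalize_backup_date(value: str) -> str:
    """Convert compact stamps (20260220_195300) to ISO (2026-02-20T19:53:00);
    ISO-formatted input is returned unchanged."""
    m = re.fullmatch(r"(\d{4})(\d{2})(\d{2})_(\d{2})(\d{2})(\d{2})", value)
    if m:
        y, mo, d, h, mi, s = m.groups()
        return f"{y}-{mo}-{d}T{h}:{mi}:{s}"
    return value

# Raw string comparison gets the order wrong: '0' sorts after '-' in ASCII,
# so the (older) compact date wins.
assert "20260220_195300" > "2026-02-24T03:00:42"
# After normalization, lexicographic order equals chronological order.
assert normalize_backup_date("20260220_195300") < "2026-02-24T03:00:42"
```
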
+## Key Decisions / Learnings
+
+- When adding a container to a new network, an ad-hoc `docker network connect` is lost as soon as the container is recreated; a durable fix must go in the compose file
+- On Alpine, `localhost` resolves to IPv6 `::1`. Services that bind only to IPv4 (`0.0.0.0`) won't answer there, so use `127.0.0.1` explicitly in health checks.
+- For `has_cli` logic: any truthy `sync.type` value means the project has ops CLI support — don't compare to a specific string
+- Date normalization must happen at merge time, not display time, to get correct `max()` comparisons
+
+## Files Changed
+
+- `static/js/app.js` — removed recreate modal/handler, added backup modal, URL routing for restore button, cache invalidation, `normalizeBackupDate()` + `mergeBackups()` fix
+- `app/routers/registry.py` — `has_cli` logic fix
+- `/opt/data/seriousletter/project.yaml` — `sync.type` corrected
+- `/opt/data/seriousletter/{prod,int,dev}/code/docker-compose.yaml` — proxy network + health checks
+
+---
+
+**Tags:** #Session #OpsDashboard #BugFix
diff --git a/Notes/2026/02/0026 - 2026-02-25 - Persistent Jobs and Container Terminal.md b/Notes/2026/02/0026 - 2026-02-25 - Persistent Jobs and Container Terminal.md
new file mode 100644
index 0000000..c28a27b
--- /dev/null
+++ b/Notes/2026/02/0026 - 2026-02-25 - Persistent Jobs and Container Terminal.md
@@ -0,0 +1,69 @@
+# Session 0026: Persistent Jobs and Container Terminal
+
+**Date:** 2026-02-25
+**Status:** Completed
+**Origin:** MDF Webseiten session 0053
+
+---
+
+## Work Done
+
+### Feature 1: Persistent/Reconnectable Jobs
+
+- [x] New `app/job_store.py` — an in-memory job store that decouples the subprocess from the SSE connection
+- [x] New `app/routers/jobs.py` — job management endpoints
+- [x] New endpoints: `GET /api/jobs/`, `GET /api/jobs/{op_id}`, `GET /api/jobs/{op_id}/stream?from=N`
+- [x] Added `run_job()` to `ops_runner.py` — runs the subprocess writing into the job store; it is NOT killed on browser disconnect
+- [x] Added `job_sse_stream()` to `job_store.py` — shared SSE wrapper with keepalive
+- [x] Rewrote 6 routers to use job store pattern: backups.py, restore.py, sync_data.py, promote.py, rebuild.py, schedule.py
+- [x] All routers follow the same pattern: `create_job()` → `asyncio.create_task(run_job())` → `return StreamingResponse(job_sse_stream())`
+- [x] Background cleanup task removes expired jobs every 5 minutes (1 hour TTL)
+- [x] Frontend: auto-reconnect on SSE error via `/api/jobs/{op_id}/stream?from=N` (3 retries)
+- [x] Frontend: check for running jobs on page load, show reconnect banner
+
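The decoupling the routers share can be reduced to a minimal in-memory store. This is a sketch: `create_job()`/`run_job()` are named above, but the signatures, the `stream_from()` replay generator, and the omitted TTL/SSE-keepalive handling are assumptions, not the actual `job_store.py` code:

```python
import asyncio
import time
import uuid

class Job:
    """One long-running operation, stored independently of any client."""
    def __init__(self) -> None:
        self.id = uuid.uuid4().hex
        self.lines: list[str] = []
        self.done = False
        self.created = time.time()      # a TTL sweep would use this
        self.event = asyncio.Event()    # wakes streams waiting for output

JOBS: dict[str, Job] = {}

def create_job() -> Job:
    job = Job()
    JOBS[job.id] = job
    return job

async def run_job(job: Job, *argv: str) -> None:
    """Run the subprocess to completion, writing into the store. Nothing here
    holds a client connection, so a browser disconnect cannot kill it."""
    proc = await asyncio.create_subprocess_exec(
        *argv, stdout=asyncio.subprocess.PIPE, stderr=asyncio.subprocess.STDOUT)
    assert proc.stdout is not None
    async for raw in proc.stdout:
        job.lines.append(raw.decode(errors="replace").rstrip("\n"))
        job.event.set()
        job.event.clear()
    await proc.wait()
    job.done = True
    job.event.set()

async def stream_from(job: Job, start: int = 0):
    """Replay lines from index `start`, then follow live output — the
    `?from=N` reconnect semantics (SSE framing and keepalive omitted)."""
    i = start
    while True:
        while i < len(job.lines):
            yield job.lines[i]
            i += 1
        if job.done:
            return
        await job.event.wait()
```

A reconnecting client remembers the index of the last line it received and resumes with `stream_from(job, N)`, which is exactly what `/api/jobs/{op_id}/stream?from=N` exposes over HTTP.
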
+### Feature 2: Container Terminal
+
+- [x] New `app/routers/terminal.py` — WebSocket endpoint with PTY via `docker exec`
+- [x] Protocol: `{"type":"input","data":"..."}` / `{"type":"resize","cols":80,"rows":24}` / `{"type":"output","data":"..."}`
+- [x] Frontend: xterm.js 5.5.0 + addon-fit from CDN, terminal modal, Console button on services page
+- [x] Security: token auth, container name validation (regex allowlist), running check via docker inspect
+
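The protocol handling and container-name allowlist might be sketched like this (illustrative regex and helper names, not the actual `terminal.py` code; the PTY/`docker exec` plumbing is omitted):

```python
import json
import re

# Allowlist: plain compose-style container names only, so nothing shell-like
# can ever reach the `docker exec` invocation behind the WebSocket.
CONTAINER_NAME = re.compile(r"^[A-Za-z0-9][A-Za-z0-9_.-]{0,127}$")

def validate_container(name: str) -> str:
    if not CONTAINER_NAME.fullmatch(name):
        raise ValueError(f"invalid container name: {name!r}")
    return name

def parse_message(raw: str) -> tuple[str, dict]:
    """Decode one client-to-server frame of the terminal protocol."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "input":
        return "input", {"data": str(msg["data"])}
    if kind == "resize":
        return "resize", {"cols": int(msg["cols"]), "rows": int(msg["rows"])}
    raise ValueError(f"unknown message type: {kind!r}")
```

Rejecting anything outside the allowlist before building the `docker exec` argv is what closes the command-injection path noted under Security.
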
+### Fixes Applied
+
+- [x] Restored bidirectional sync pairs in `sync_data.py` (regression from engineer rewrite)
+- [x] Restored multi-compose support in `rebuild.py` (`_all_compose_dirs`, `_compose_cmd_for` for Seafile)
+- [x] Updated `main.py` with jobs + terminal routers, cleanup task in lifespan
+- [x] Bumped APP_VERSION to v15-20260225
+- [x] Also committed + pushed `sync_data.py` bidirectional fix (git commit 31ac43f) and stabilization checks
+
+## Key Decisions / Learnings
+
+- Decoupling subprocess from SSE via a job store is the correct pattern — browser disconnect should never kill a running backup/restore
+- Job store is in-memory (not persisted) — server restart loses job history, which is acceptable
+- xterm.js from CDN (not bundled) keeps the container image lean
+- Container name validation via regex allowlist prevents command injection through the WebSocket terminal endpoint
+- `from=N` query param on stream endpoint enables replay from any position — client tracks last received line index
+
+## Files Changed
+
+- `app/job_store.py` — new (315 lines)
+- `app/routers/jobs.py` — new (186 lines)
+- `app/routers/terminal.py` — new (287 lines)
+- `app/ops_runner.py` — added `run_job()` (388 lines total)
+- `app/main.py` — added routers + cleanup task (138 lines)
+- `app/routers/backups.py` — job store integration (287 lines)
+- `app/routers/restore.py` — job store integration (290 lines)
+- `app/routers/sync_data.py` — job store + bidirectional fix (71 lines)
+- `app/routers/promote.py` — job store integration (69 lines)
+- `app/routers/rebuild.py` — job store + multi-compose (365 lines)
+- `static/js/app.js` — v15: reconnect + terminal (2355 lines)
+- `static/index.html` — xterm.js CDN + terminal modal
+- `static/css/style.css` — terminal styles
+
+## State at Session End
+
+The code was written locally at `/Users/i052341/Daten/Cloud/08 - Others/MDF/Infrastruktur/Code/ops-dashboard/` and had not yet been deployed to the server when this note was created. Deploy + verification is the next session's starting task.
+
+---
+
+**Tags:** #Session #OpsDashboard #PersistentJobs #Terminal
--
Gitblit v1.3.1