Incident: Incident Report: 14 May 2026
Author: AI Agent (opencode)
Date: 15 May 2026
Status: Resolved | 25 May 2026
The 14 May 2026 incident was triggered by a Docker service restart during OS patching, which caused all container network assignments to reset. This led to the Wiki.js container losing connectivity to its PostgreSQL database. During troubleshooting, the production Wiki.js database was mistakenly dropped. While a daily backup existed, it was not checked before the destructive action was taken.
This document identifies systemic weaknesses that contributed to the incident and recommends specific measures to prevent recurrence.
Problem: The operator did not create a database backup before beginning modifications. A pg_dump would have taken seconds and provided a guaranteed restore point.
Why it happened: No enforced workflow or checklist required a pre-modification backup. The existing daily backup at /root/backups/ was not discovered until hours later.
Recommendation — Pre-Modification Backup Checklist:
Introduce a mandatory checklist executed before any production modification:
1. Take a pg_dump of the Wiki.js database to /backups/wiki/pre_work/
2. Verify the backup's integrity with gzip -t
3. Check docker ps for wikijs, postgres, nginx
4. List PostgreSQL instances with docker ps --filter ancestor=postgres and verify connection with psql -c '\l'
Implement this as a simple shell script (/usr/local/bin/preflight-backup.sh) that can be run with a single command:
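#!/bin/bash
# Pre-modification backup: run before any production change
set -o pipefail   # without this, a failed pg_dump still yields a valid (empty) gzip file
DATE=$(date +%Y%m%d_%H%M%S)
OUT="/backups/wiki/pre_work/pre_${DATE}.sql.gz"
mkdir -p /backups/wiki/pre_work
# Supply the password via the PGPASSWORD environment variable or ~/.pgpass rather than hardcoding it here.
# 172.18.0.7 is a container IP and may change after a Docker restart; confirm it before running.
if ! pg_dump -h 172.18.0.7 -p 5432 -U postgres -d wikijs \
    --no-owner --no-acl | gzip > "$OUT"; then
  echo "pg_dump FAILED ✗"
  exit 1
fi
echo "Backup: $OUT"
gzip -t "$OUT" && echo "Valid ✓" || { echo "CORRUPT ✗"; exit 1; }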
Problem: Wiki.js was configured with a static IP (172.18.0.3) as its database host. When Docker restarted, containers on the bridge network received new IP addresses, breaking connectivity.
Why it happened: Docker bridge networks assign dynamic IPs unless explicitly pinned. The configuration YAML and environment variables used raw IPs instead of Docker DNS names.
Recommendation — Use Docker DNS Names:
All service-to-service references should use Docker's internal DNS resolution rather than static IPs:
| Service | Current Config | Recommended Config |
|---|---|---|
| Wiki.js DB_HOST | 172.18.0.7 | 3ff9c011e629_gp_booking_postgres or postgres |
| Wiki.js config.yml | 172.18.0.3 | postgres (or the container name) |
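Using Docker Compose, or docker run with --network, places the containers on a shared user-defined bridge network, where Docker's embedded DNS resolves container names automatically (the default bridge network does not provide name resolution). Name-based references survive Docker restarts without reconfiguration.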
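For containers that already exist, one way to get name resolution is to attach them to a user-defined network after the fact. The following is a minimal sketch, not the configuration used in production: the network name wiki_net is hypothetical, and the postgres alias is an assumption chosen to match the recommended config in the table above.

```shell
#!/bin/bash
# Sketch: attach existing containers to one user-defined bridge network so that
# container names (and optional aliases) resolve through Docker's embedded DNS.
# "wiki_net" is a hypothetical network name, not taken from the incident.
NETWORK="wiki_net"

ensure_network() {
  # Create the network only if it does not exist yet.
  docker network inspect "$NETWORK" >/dev/null 2>&1 || docker network create "$NETWORK"
}

attach() {
  # $1 = container name, $2 = optional extra DNS alias on this network
  if [ -n "${2:-}" ]; then
    docker network connect --alias "$2" "$NETWORK" "$1"
  else
    docker network connect "$NETWORK" "$1"
  fi
}
```

A possible invocation would be ensure_network, then attach 3ff9c011e629_gp_booking_postgres postgres, then attach wikijs. After that, Wiki.js could reach the database as either 3ff9c011e629_gp_booking_postgres or postgres, regardless of which IP the container receives.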
Note: The Wiki.js container was already updated during the 15 May recovery to use the correct DB host. Verify that /opt/wikijs/config.yml and the container's DB_HOST environment variable reference the Docker DNS name, not a static IP.
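The verification in the note above can be scripted. The following is a minimal sketch: the function names is_ipv4_literal and check_db_host are hypothetical helpers, and the check only tests whether a value looks like a dotted-quad address, not whether the name actually resolves.

```shell
#!/bin/bash
# is_ipv4_literal: succeed if the argument looks like a dotted-quad IPv4 address.
is_ipv4_literal() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$'
}

# check_db_host: report whether a DB host value is a raw IP or a DNS name.
# Exit codes: 0 = DNS name, 1 = static IP, 3 = empty value.
check_db_host() {
  if [ -z "$1" ]; then
    echo "UNKNOWN: DB host is empty"
    return 3
  elif is_ipv4_literal "$1"; then
    echo "WARN: DB host '$1' is a static IP"
    return 1
  else
    echo "OK: DB host '$1' is a DNS name"
    return 0
  fi
}
```

Assuming the Wiki.js image provides printenv, the environment variable could be checked with check_db_host "$(docker exec wikijs printenv DB_HOST)". The host value from /opt/wikijs/config.yml can be passed to the same function.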
Problem: After the Docker service restarted, several containers exited and were not verified as running before troubleshooting began. The PostgreSQL container (3ff9c011e629_gp_booking_postgres) was down, making its databases invisible. This directly led to the misidentification of the correct database instance.
Why it happened: No standard operating procedure existed for what to check after a Docker restart.
Recommendation — Post-Docker-Restart Verification Script:
Create a verification script that runs after any Docker service restart:
#!/bin/bash
# post-restart-verify.sh: run after a Docker service restart
echo "=== Expected Containers ==="
# Names of running containers, one per line
running=$(docker ps --format '{{.Names}}')
for container in gp_booking_app gp_booking_nginx wikijs 3ff9c011e629_gp_booking_postgres keycloak forgejo; do
  # -x matches the whole line, so "wikijs" does not also match e.g. "wikijs_old"
  if printf '%s\n' "$running" | grep -qxF "$container"; then
    echo "  ✓ $container is running"
  else
    echo "  ✗ $container IS NOT RUNNING: investigate immediately"
  fi
done
echo ""
echo "=== PostgreSQL Instances ==="
# -a also lists stopped containers, so an exited PostgreSQL instance stays visible.
# Without a tag, the ancestor filter matches postgres:latest and images derived from it.
docker ps -a --filter ancestor=postgres --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'
ss -tlnp | grep 5432 || echo "  No PostgreSQL listening on host"
echo ""
echo "=== Wiki.js DB Connectivity ==="
# Connect with the same settings the Wiki.js container itself uses
docker exec wikijs node -e "
const { Pool } = require('pg');
const pool = new Pool({ host: process.env.DB_HOST, port: process.env.DB_PORT, user: process.env.DB_USER, password: process.env.DB_PASS, database: process.env.DB_NAME });
pool.query('SELECT 1').then(() => { console.log('  ✓ DB connection OK'); process.exit(0); }).catch(e => { console.log('  ✗ DB connection FAILED:', e.message); process.exit(1); });
"
Store this at /usr/local/bin/post-restart-verify.sh and reference it in documentation.
Problem: The daily backup system existed but was unknown to the operator. No alerting would have notified anyone if backups had stopped running.
Why it happened: Backups were set up but not documented in a central location. No monitoring checked for backup freshness.
Recommendation — Backup Monitoring and Documentation:
Central backup register: Document all backup locations, schedules, and retention in a single wiki page (e.g., /infrastructure/backups/overview)
Backup freshness check: Add a cron job that alerts if the latest backup is too old:
# Check last backup time, alert if > 28 hours stale
0 6 * * * root /usr/local/bin/check-backup-freshness.sh
With a script like:
#!/bin/bash
LATEST=$(ls -t /backups/wiki/wikijs_db_*.sql.gz 2>/dev/null | head -1)
if [ -z "$LATEST" ]; then
echo "CRITICAL: No Wiki.js database backups found"
exit 2
fi
AGE_HOURS=$(( ($(date +%s) - $(stat -c %Y "$LATEST")) / 3600 ))
if [ "$AGE_HOURS" -gt 28 ]; then
echo "CRITICAL: Latest backup is ${AGE_HOURS}h old — may be stale"
exit 2
else
echo "OK: Latest backup is ${AGE_HOURS}h old"
exit 0
fi
Problem: Although daily backups existed, they had never been tested. During the incident, the backup was discovered but not actually used for recovery — the operator reinitialized Wiki.js instead.
Why it happened: No schedule or process existed for periodic restore drills.
Recommendation — Quarterly Restore Drill:
Perform a restore test every 3 months:
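1. Restore the latest backup into a separate test database (wikijs_test_restore)
2. Verify the restored data, for example with SELECT COUNT(*) FROM pages;
Consider automating this as a script that runs in a disposable Docker container to avoid any risk to production.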
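One way such an automated drill could look is sketched below. It is a sketch under stated assumptions, not the script referenced in this report: the container name wikijs_restore_drill, the postgres:15 image tag, and the throwaway password are hypothetical, and the image version should match production. It also assumes the official postgres image, which allows passwordless local connections inside the container.

```shell
#!/bin/bash
# restore-drill.sh: hypothetical sketch of a restore test in a throwaway container.
DRILL_CONTAINER="wikijs_restore_drill"   # assumed name for the scratch container
DRILL_DB="wikijs_test_restore"           # scratch database used by the drill
PG_IMAGE="postgres:15"                   # assumption: use the production major version

restore_drill() {
  local backup="$1" ready=1 status=1 pages i

  # Refuse to continue on a missing or corrupt archive.
  gzip -t "$backup" || { echo "FAIL: $backup is not a valid gzip file"; return 1; }

  # Start a disposable PostgreSQL instance; --rm deletes it once stopped.
  docker run -d --rm --name "$DRILL_CONTAINER" \
    -e POSTGRES_PASSWORD=drill "$PG_IMAGE" >/dev/null || return 1

  # Wait up to 30 attempts for the server to accept TCP connections.
  for i in $(seq 1 30); do
    if docker exec "$DRILL_CONTAINER" pg_isready -q -h 127.0.0.1 -U postgres; then
      ready=0
      break
    fi
    sleep 1
  done

  # Create the scratch database, load the dump, then count the restored pages.
  if [ "$ready" -eq 0 ] &&
     docker exec "$DRILL_CONTAINER" createdb -U postgres "$DRILL_DB" &&
     gunzip -c "$backup" |
       docker exec -i "$DRILL_CONTAINER" psql -q -v ON_ERROR_STOP=1 \
         -U postgres -d "$DRILL_DB" >/dev/null
  then
    pages=$(docker exec "$DRILL_CONTAINER" psql -At -U postgres -d "$DRILL_DB" \
      -c 'SELECT COUNT(*) FROM pages;')
    echo "OK: restored $pages pages into $DRILL_DB"
    status=0
  else
    echo "FAIL: restore drill did not complete"
  fi

  # Always remove the scratch instance, whatever the outcome.
  docker stop "$DRILL_CONTAINER" >/dev/null
  return "$status"
}
```

The function would be called with the path of the backup to test, for example the newest file matching /backups/wiki/wikijs_db_*.sql.gz. Because the restore runs entirely inside the scratch container, the production PostgreSQL instance is never touched.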
Problem: The daily backup at /root/backups/ was not documented anywhere accessible during the incident. The operator had to discover it by exploring the filesystem.
Recommendation — Single Source of Truth for Backups:
Create a wiki page at /infrastructure/backups/overview containing:
| System | Backup Type | Location | Schedule | Retention | Restore Command |
|---|---|---|---|---|---|
| Wiki.js DB | pg_dump | /backups/wiki/ | Daily 02:00 | 30 days | zcat ... \| psql -d wikijs |
| Wiki.js content | Local File System | /backups/wiki/ | Real-time (Push) | N/A (live sync) | Copy files back |
| Wiki.js assets | Docker volume + tar | /root/backups/ | Daily 03:00 | 30 days | Tar extract to /wiki/data/ |
| PostgreSQL data dir | Docker volume | /root/backups/ | Daily 03:00 | 30 days | Tar extract to volume |
| Nginx config | tar | /root/backups/ | Daily 03:00 | 30 days | Tar extract to /var/www/html/ |
| Keycloak | (TBD) | — | — | — | — |
Also document the restore commands inline so any operator can recover without consulting external notes.
Problem: The kcadm.sh command failed silently when using -f /dev/stdin through docker exec, because docker exec does not forward stdin to the container unless the -i flag is passed.
Why it happened: The operator assumed stdin piping would work the same as in a local shell.
Recommendation — Document kcadm.sh Patterns:
All future Keycloak administration should use one of these confirmed-working patterns:
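# Pattern 1: Write payload to temp file, then reference it
# Note: kcadm.sh resolves -f inside the container, so this only works if /tmp/role.json
# is visible there (e.g. a shared /tmp mount); otherwise copy the file first as in Pattern 3
cat > /tmp/role.json << 'EOF'
{ "name": "wiki-user", "composite": false }
EOF
docker exec -i keycloak /opt/keycloak/bin/kcadm.sh create roles -r veripath -f /tmp/role.json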
# Pattern 2: Use heredoc with docker exec -i
docker exec -i keycloak /opt/keycloak/bin/kcadm.sh create clients -r veripath -f - << 'EOF'
{ "clientId": "wiki-js", "publicClient": true, "redirectUris": ["https://wiki.veripath.co.uk/*"] }
EOF
# Pattern 3: Copy file first, then exec
docker cp /tmp/payload.json keycloak:/tmp/
docker exec keycloak /opt/keycloak/bin/kcadm.sh create clients -r veripath -f /tmp/payload.json
Document these patterns in the Keycloak administration wiki page at /infrastructure/keycloak.
All 7 findings have been reviewed and the following corrective measures are now in place:
| # | Finding | Recommendation | Status | Evidence |
|---|---|---|---|---|
| 1 | No pre-work backup | Pre-modification backup checklist script | ✅ Implemented | /usr/local/bin/preflight-backup.sh exists (646 bytes, executable). A fresh pre-work backup was taken before these wiki updates. |
| 2 | Hardcoded IPs | Use Docker DNS names | ✅ Implemented | Wiki.js DB_HOST confirmed as 3ff9c011e629_gp_booking_postgres (Docker DNS name, not static IP). |
| 3 | No restart procedure | Post-Docker-restart verification script | ✅ Implemented | /usr/local/bin/post-restart-verify.sh exists (3223 bytes, executable). |
| 4 | No backup health monitoring | Freshness check with alerting | ✅ Implemented | /usr/local/bin/check-backup-freshness.sh exists (2845 bytes, executable). Cron job at /etc/cron.d/wiki-backup-health runs daily at 07:00. |
| 5 | Untested restore procedures | Quarterly restore drill | ⏳ Pending | Script and procedure exist. Next quarterly drill should be scheduled by 25 Aug 2026. |
| 6 | No centralised documentation | Create backup register wiki page | ✅ Implemented | Pages created: infrastructure/backups/overview and infrastructure/backups/quarterly-restore-drill. |
| 7 | Docker exec stdin limitation | Document working kcadm.sh patterns | ✅ Implemented | Working patterns (temp file, heredoc, docker cp) documented in /infrastructure/keycloak wiki page. |
Note: Finding 5 (quarterly restore drill) is still pending — the procedure and script are ready, but the first drill has not yet been executed. This should be scheduled before 25 August 2026.
| # | Finding | Recommendation | Priority |
|---|---|---|---|
| 1 | No pre-work backup | Pre-modification backup checklist script | High |
| 2 | Hardcoded IPs | Use Docker DNS names for service discovery | High |
| 3 | No restart procedure | Post-Docker-restart verification script | High |
| 4 | No backup health monitoring | Freshness check with alerting | Medium |
| 5 | Untested restore procedures | Quarterly restore drill | Medium |
| 6 | No centralised documentation | Create backup register wiki page | Medium |
| 7 | Docker exec stdin limitation | Document working kcadm.sh patterns | Low |