Watchdog Help

How the watchdog operates, how maintenance mode works, and why the email protocol is designed to stay reliable without causing alert fatigue.

How It Operates

  • The watchdog collector runs at regular intervals and writes JSON artifacts for hosts, services, storage, network, devices, and collector events.
  • The dashboard reads those artifacts from `/home/shared/webprojects/operations/watchdog/output` and summarizes current health.
  • Each artifact contains raw check tones, while the summary can apply maintenance-aware effective tones for operator-facing triage.
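
As a rough sketch of how a dashboard could tally tones from those artifacts: the example below assumes each artifact is a JSON object with a top-level `tone` field, as the summary artifact on this page has. The function name `count_tones` and the rule that unreadable or unrecognized artifacts count as `unknown` are illustrative assumptions, not the watchdog's actual code.

```python
import json
from collections import Counter
from pathlib import Path

KNOWN_TONES = ("healthy", "warning", "critical", "unknown")

def count_tones(output_dir):
    """Tally the top-level tone of every *.json artifact in output_dir."""
    counts = Counter({tone: 0 for tone in KNOWN_TONES})
    for path in sorted(Path(output_dir).glob("*.json")):
        try:
            artifact = json.loads(path.read_text(encoding="utf-8"))
            tone = artifact.get("tone", "unknown")
        except (OSError, ValueError, AttributeError):
            # Assumption: an unreadable or malformed artifact counts as unknown.
            tone = "unknown"
        if tone not in KNOWN_TONES:
            tone = "unknown"
        counts[tone] += 1
    return dict(counts)
```

A real dashboard would probably skip the summary artifact itself when counting, so that it is not tallied alongside the artifacts it describes.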

Email Safety Protocol

  • Immediate email is sent only when a new active critical appears, when the active critical set materially changes, or when a low-frequency reminder is due.
  • Recovery email is sent when active criticals clear.
  • A second monitoring path checks that collection and alert evaluation are still fresh, so silent failure of the primary alert flow is less likely to go unnoticed.
  • A weekly monitoring-health digest proves the monitoring path is alive and reminds you about maintained incidents.
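
The send rules above can be sketched as a small decision function. This is only an illustration under stated assumptions: `decide_email` is a hypothetical name, any change to the set of active critical IDs is treated as a "material" change, and the 24-hour reminder interval is a placeholder rather than the watchdog's configured value.

```python
from datetime import datetime, timedelta

def decide_email(previous_ids, current_ids, last_sent, now,
                 reminder_interval=timedelta(hours=24)):
    """Return 'alert', 'recovery', 'reminder', or None (send nothing)."""
    previous_ids, current_ids = set(previous_ids), set(current_ids)
    if current_ids != previous_ids:
        # The set changed. An empty current set means every critical cleared.
        return "alert" if current_ids else "recovery"
    if current_ids and (last_sent is None or now - last_sent >= reminder_interval):
        # Same criticals still active and the low-frequency reminder is due.
        return "reminder"
    return None  # healthy and unchanged, or unchanged and recently mailed
```

Because unchanged state returns `None`, a healthy interval never produces mail, which is the behavior the protocol relies on to avoid alert fatigue.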

Maintenance Mode

  • Maintenance mode does not hide the raw issue.
  • It keeps the underlying artifact or check tone intact, but suppresses immediate critical paging and turns the top-level summary into a warning instead of a critical.
  • Maintained incidents are still listed in the dashboard and the weekly monitoring-health email so they are not forgotten.
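
One way to picture the raw versus effective tone split is the sketch below. It assumes checks are dictionaries with `id` and `tone` keys; the function names, the severity ordering, and the `disk_root` ID in the usage note are hypothetical choices made for the example.

```python
SEVERITY = {"healthy": 0, "unknown": 1, "warning": 2, "critical": 3}  # assumed ordering

def effective_tone(raw_tone, maintained):
    """Downgrade a maintained critical to a warning; leave other tones alone."""
    if maintained and raw_tone == "critical":
        return "warning"
    return raw_tone

def summarize(checks, maintained_ids):
    """Compute the top-level tone without modifying any raw check tone."""
    top = "healthy"
    maintained_items = []
    for check in checks:
        is_maintained = check["id"] in maintained_ids
        tone = effective_tone(check["tone"], is_maintained)
        if is_maintained and check["tone"] == "critical":
            maintained_items.append(check["id"])  # still listed, never hidden
        if SEVERITY[tone] > SEVERITY[top]:
            top = tone
    return {"tone": top, "maintained_items": maintained_items}
```

With `backuppc_last_success` critical and a hypothetical `disk_root` healthy, `summarize(checks, set())` gives a critical summary, while passing `{"backuppc_last_success"}` as the maintained set gives a warning summary that still lists the maintained check.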

Why This Is Safer

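  • No mail is sent for healthy intervals, because repeated all-clear mail trains operators to delete alerts without reading them.
  • Failure-only mail has one weakness: silence can mean either that everything is healthy or that monitoring itself has stopped. The second freshness monitor and the weekly proof-of-life email cover that gap.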
  • Maintained issues remain reviewable, so maintenance mode is not a silent bypass.
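
A freshness monitor of the kind described here can be as simple as checking a file's modification time. The sketch below is a minimal illustration; `is_stale` and its parameters are hypothetical, and treating a missing file as stale is an assumption chosen so that absence reads as failure rather than as silence.

```python
import os
import time

def is_stale(path, max_age_seconds, now=None):
    """Return True if path is missing or was last modified too long ago."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return True  # missing or unreadable file counts as a failure
    return age > max_age_seconds
```

Running a check like this from a scheduler that is independent of the collector is what makes it a second path: if the collector stops writing, the file ages out and the monitor can raise the alarm.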

Config path: /home/shared/webprojects/operations/watchdog/config/watchdog.conf. This page can edit the maintenance rules stored in that file.

Critical

Current watchdog summary across generated artifacts: 8 healthy, 0 warning, 3 critical, 0 unknown.

Generated: 2026-05-06 22:19
Watchdog Path: /home/shared/webprojects/operations/watchdog/output
Output Path Exists: Yes

Storage & Backups

Disk integrity, RAID or pool health, and backup freshness.

CRITICAL
Latest Artifact: backuppc.json
Last Updated: 2026-05-06 22:15
Artifacts: 2

Live Check Summary

Artifact Healthy: 1
Artifact Warning: 0
Artifact Critical: 1
Artifact Unknown: 0
Maintained Artifacts: 0
Maintained Criticals: 0
Checks Healthy: 2
Checks Warning: 0
Checks Critical: 1
Checks Unknown: 0
Maintained Checks: 0

Operator Notes

  • Watch RAID, ZFS pool health, scrub age, SMART signals, and filesystem free space.
  • Track BackupPC freshness, failed clients, and backup storage headroom.

Collected Artifacts

Latest artifact details and the individual checks reported for this section.

BackupPC

BackupPC freshness and storage checks.

CRITICAL
Artifact File: backuppc.json
Updated: 2026-05-06 22:15
Checks: 1
Maintenance: Off
Enter Maintenance Mode

This writes a maintenance rule into the watchdog config so the issue remains visible but no longer pages as an active critical.

  • BackupPC host homefile-data CRITICAL
    Latest completed successful backup for BackupPC host homefile-data is full #2685 from 2026-04-30 22:02 (age 6.0 days).
    Recent successful backups:
    #2685 full 2026-04-30 22:02 (6.0 days old);
    #2683 incr 2026-03-16 22:17 (51.0 days old);
    #2682 incr 2026-03-15 22:18 (52.0 days old).
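
The ages reported above follow from simple date arithmetic. As a sketch, assuming the check ran at the artifact's update time of 2026-05-06 22:15 and that ages are rounded to one decimal place (the helper name `age_in_days` is hypothetical):

```python
from datetime import datetime

def age_in_days(last_success, checked_at):
    """Age of a backup in days, rounded to one decimal place."""
    return round((checked_at - last_success).total_seconds() / 86400, 1)

checked_at = datetime(2026, 5, 6, 22, 15)  # assumed check time
print(age_in_days(datetime(2026, 4, 30, 22, 2), checked_at))   # 6.0
print(age_in_days(datetime(2026, 3, 16, 22, 17), checked_at))  # 51.0
print(age_in_days(datetime(2026, 3, 15, 22, 18), checked_at))  # 52.0
```

The 45-day gap between #2683 and #2685 suggests that backups for this host were already failing for an extended period before the latest full backup.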

Storage

Disk capacity and storage integrity checks.

HEALTHY
Artifact File: capacity-and-storage.json
Updated: 2026-05-06 22:15
Checks: 2
Maintenance: Off

  • northbridge-main HEALTHY
    Storage used on / is 67.44%.
  • northbridge-main HEALTHY
    Storage used on /var is 67.44%.

Summary Artifact

{
    "label": "Critical",
    "tone": "critical",
    "detail": "Watchdog collector completed with 11 artifacts, including 0 warning, 3 critical, 0 maintained critical, and 0 unknown states.",
    "updated_at": "2026-05-06T12:15:02+00:00",
    "counts": {
        "healthy": 8,
        "warning": 0,
        "critical": 3,
        "unknown": 0,
        "suppressed_critical": 0,
        "maintained": 0
    },
    "alert_items": [
        {
            "kind": "check",
            "id": "backuppc_last_success",
            "label": "BackupPC / BackupPC host homefile-data",
            "detail": "Latest completed successful backup for BackupPC host homefile-data is full #2685 from 2026-04-30 22:02 (age 6.0 days). Recent successful backups: #2685 full 2026-04-30 22:02 (6.0 days old); #2683 incr 2026-03-16 22:17 (51.0 days old); #2682 incr 2026-03-15 22:18 (52.0 days old)."
        },
        {
            "kind": "check",
            "id": "arduino-pool_freshness",
            "label": "Arduino Pool Controller / Arduino Pool Controller",
            "detail": "Freshness file not found: /home/shared/webprojects/operations/watchdog/heartbeats/arduino-pool.heartbeat"
        },
        {
            "kind": "check",
            "id": "solar-inverter_freshness",
            "label": "Solar Inverter / Solar Inverter",
            "detail": "Freshness age is 77029 minutes for /home/shared/webprojects/operations/watchdog/heartbeats/solar-inverter.heartbeat."
        }
    ],
    "maintained_items": []
}
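
A quick sanity check on a summary artifact like the one above is to confirm that the per-tone counts add up to the number of artifacts the collector reported. The sketch below assumes the four tone counts partition the artifacts, and that `suppressed_critical` and `maintained` are overlays rather than additional buckets; `summary_problems` is a hypothetical helper.

```python
def summary_problems(summary, expected_artifacts):
    """Return a list of human-readable inconsistencies (empty if none found)."""
    counts = summary["counts"]
    problems = []
    total = sum(counts[tone] for tone in ("healthy", "warning", "critical", "unknown"))
    if total != expected_artifacts:
        problems.append(
            f"tone counts sum to {total}, expected {expected_artifacts} artifacts"
        )
    return problems
```

For the artifact shown here, the counts are 8 + 0 + 3 + 0 = 11, which matches the 11 artifacts named in its `detail` field, so the function returns an empty list when called with `expected_artifacts=11`.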