Building Production-Grade System Observability in Go

· 6 min read
Susant Sahani
Lead Developer

When we set out to build HyperSDK's observability layer, we had a clear constraint: no external dependencies. No Prometheus node exporter, no collectd, no StatsD sidecar. The daemon had to collect, store, analyze, and serve system metrics entirely on its own. This is the story of how we built a self-contained observability stack in Go using nothing but /proc, /sys, and a ring buffer.

Collecting Metrics from /proc and /sys

Linux exposes virtually everything about system state through its pseudo-filesystems. CPU utilization comes from /proc/stat, memory from /proc/meminfo, disk I/O from /proc/diskstats, and network throughput from /proc/net/dev. We parse these files at a configurable interval (default 15 seconds) and compute derived metrics like CPU percentage, memory pressure, and I/O wait.

The key insight is that most /proc files report cumulative counters, not instantaneous values. CPU times in /proc/stat are monotonically increasing tick counts. To compute utilization percentage, you need two readings and some arithmetic: usage = (active_delta / total_delta) * 100. We store the previous reading in memory and compute deltas on each collection cycle. The same pattern applies to disk I/O (sectors read/written) and network (bytes received/transmitted).
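The delta arithmetic can be sketched in a few lines. This is a minimal illustration, not the daemon's actual code; it assumes the eight standard tick fields on the aggregate "cpu" line of /proc/stat, with the sample readings invented for the example:

```go
package main

import "fmt"

// cpuPercent computes utilization from two snapshots of the tick counters
// on the "cpu" line of /proc/stat (user, nice, system, idle, iowait, irq,
// softirq, steal). Idle time is idle + iowait; everything else is active.
func cpuPercent(prev, cur [8]uint64) float64 {
	var prevTotal, curTotal uint64
	for i := range prev {
		prevTotal += prev[i]
		curTotal += cur[i]
	}
	totalDelta := float64(curTotal - prevTotal)
	idleDelta := float64((cur[3] + cur[4]) - (prev[3] + prev[4]))
	if totalDelta == 0 {
		return 0 // no ticks elapsed between the two readings
	}
	return (totalDelta - idleDelta) / totalDelta * 100
}

func main() {
	// Two hypothetical readings taken one collection cycle apart.
	prev := [8]uint64{4000, 0, 1000, 9000, 500, 0, 100, 0}
	cur := [8]uint64{5200, 0, 1300, 9800, 600, 0, 100, 0}
	fmt.Printf("cpu %.1f%%\n", cpuPercent(prev, cur)) // cpu 62.5%
}
```

The same two-snapshot shape carries over to the disk and network counters; only the fields being differenced change.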

One subtlety we discovered early: /proc/meminfo reports MemAvailable on modern kernels (3.14+), which is a much better indicator of actual available memory than MemFree. The latter ignores buffers, caches, and reclaimable slab memory that the kernel will happily give back under pressure. We use MemAvailable when present and fall back to MemFree + Buffers + Cached on older kernels.
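The fallback logic might look like the sketch below. It parses /proc/meminfo contents passed in as a string (the sample input is fabricated for illustration):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// availableKB returns available memory in kB given the contents of
// /proc/meminfo. It prefers MemAvailable (kernels 3.14+) and falls back
// to MemFree + Buffers + Cached on older kernels.
func availableKB(meminfo string) uint64 {
	fields := map[string]uint64{}
	for _, line := range strings.Split(meminfo, "\n") {
		parts := strings.Fields(line)
		if len(parts) < 2 {
			continue
		}
		if v, err := strconv.ParseUint(parts[1], 10, 64); err == nil {
			fields[strings.TrimSuffix(parts[0], ":")] = v
		}
	}
	if v, ok := fields["MemAvailable"]; ok {
		return v
	}
	return fields["MemFree"] + fields["Buffers"] + fields["Cached"]
}

func main() {
	oldKernel := "MemTotal: 8000000 kB\nMemFree: 1000000 kB\nBuffers: 200000 kB\nCached: 800000 kB\n"
	fmt.Println(availableKB(oldKernel)) // fallback path: 2000000
}
```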

For per-process metrics, we read /proc/[pid]/stat and /proc/[pid]/status for each process. This gives us per-process CPU time, resident set size, virtual memory size, and thread count. We sort by CPU and memory usage and expose the top N processes through the API and dashboard.
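Parsing /proc/[pid]/stat has one well-known trap: the comm field is wrapped in parentheses and can itself contain spaces or parentheses, so naive whitespace splitting corrupts every field after it. A sketch of a safe parser, with a shortened hypothetical stat line as input:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePIDStat extracts the comm, utime, and stime fields (2, 14, and 15)
// from one /proc/[pid]/stat line, splitting at the LAST ')' so that a comm
// containing spaces or parentheses cannot shift the later fields.
func parsePIDStat(line string) (comm string, utime, stime uint64, err error) {
	open := strings.IndexByte(line, '(')
	end := strings.LastIndexByte(line, ')')
	if open < 0 || end < open {
		return "", 0, 0, fmt.Errorf("malformed stat line")
	}
	comm = line[open+1 : end]
	rest := strings.Fields(line[end+1:]) // field 3 (state) onward
	if len(rest) < 13 {
		return "", 0, 0, fmt.Errorf("too few fields")
	}
	// Field N of /proc/[pid]/stat is rest[N-3]: utime is 14, stime is 15.
	if utime, err = strconv.ParseUint(rest[11], 10, 64); err != nil {
		return
	}
	stime, err = strconv.ParseUint(rest[12], 10, 64)
	return
}

func main() {
	// A shortened, hypothetical line: pid, (comm), then fields 3 through 15.
	line := "1234 (qemu-img convert) S 1 1234 1234 0 -1 4194304 100 0 0 0 5400 1200"
	comm, utime, stime, _ := parsePIDStat(line)
	fmt.Println(comm, utime, stime) // qemu-img convert 5400 1200
}
```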

The Health Score Algorithm

Raw metrics are useful for monitoring tools, but operators want a quick answer: is this system healthy? We distill all collected metrics into a single health score from 0 to 100.

The algorithm applies weighted penalties for resource exhaustion. Each resource category (CPU, memory, disk, network) has a threshold and a penalty function. If CPU usage exceeds 90%, the penalty is proportional to how far above the threshold it is. If disk usage exceeds 90%, the penalty is higher because disk exhaustion is harder to recover from.

The formula is straightforward: start at 100, subtract penalties. A system running at 95% CPU, 60% memory, 40% disk, and normal network gets a penalty of roughly 10 points for CPU, resulting in a score of 90. A system at 95% CPU and 92% disk gets penalties from both, dropping to around 70. The score naturally reflects the severity and breadth of resource pressure.
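A sketch of the scoring loop is below. The thresholds and weights are illustrative guesses chosen so the two worked examples above come out to 90 and 70; they are not the production tuning:

```go
package main

import "fmt"

// penalty is a threshold plus a weight: each percentage point above the
// threshold costs weight points. These numbers are illustrative only.
type penalty struct {
	threshold, weight float64
}

var penalties = map[string]penalty{
	"cpu":     {threshold: 90, weight: 2},
	"memory":  {threshold: 85, weight: 2},
	"disk":    {threshold: 90, weight: 10}, // disk exhaustion is harder to recover from
	"network": {threshold: 90, weight: 1},
}

// healthScore starts at 100, subtracts a weighted penalty for each
// resource over its threshold, and reports which resources are bottlenecks.
func healthScore(usage map[string]float64) (float64, []string) {
	score := 100.0
	var bottlenecks []string
	for _, res := range []string{"cpu", "memory", "disk", "network"} {
		p := penalties[res]
		if u := usage[res]; u > p.threshold {
			score -= (u - p.threshold) * p.weight
			bottlenecks = append(bottlenecks, res)
		}
	}
	if score < 0 {
		score = 0
	}
	return score, bottlenecks
}

func main() {
	s, b := healthScore(map[string]float64{"cpu": 95, "memory": 60, "disk": 40, "network": 10})
	fmt.Println(s, b) // 90 [cpu]
	s, b = healthScore(map[string]float64{"cpu": 95, "memory": 60, "disk": 92, "network": 10})
	fmt.Println(s, b) // 70 [cpu disk]
}
```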

We also track which resources are bottlenecks and expose them in the API response. When the health score drops, the operator immediately knows whether the problem is CPU, memory, disk, or network without digging into charts.

Explain Mode: Why Is CPU High?

The most interesting feature we built is the explain mode. Instead of just showing that CPU is at 95%, explain mode answers why. It identifies the top contributing processes, correlates them with known patterns (e.g., "qemu-img process suggests a disk conversion is running"), and generates actionable recommendations.

The explain engine works in three stages. First, it collects current and recent metrics. Second, it ranks contributing factors by impact -- for CPU, this means listing processes sorted by CPU usage with context about what they are doing. Third, it applies a rules engine that matches patterns and generates recommendations. If the top CPU consumer is a qemu-img process, the recommendation might be to schedule disk conversions during off-peak hours or to use carbon-aware scheduling to shift the workload.

Explain mode is available for CPU, memory, disk, and network, each with its own set of patterns and recommendations. The rules are implemented as a simple Go slice of rule structs with match functions and response templates. Adding new rules is a matter of adding a struct to the slice.
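The shape of that rule slice might look like this. The two patterns and the recommendation text are invented for illustration; the real rule set is larger:

```go
package main

import (
	"fmt"
	"strings"
)

// proc is a simplified view of a running process for rule matching.
type proc struct {
	Name string
	CPU  float64
}

// rule pairs a match function with a recommendation template.
type rule struct {
	Match     func(top proc) bool
	Recommend string
}

// cpuRules holds the CPU patterns; adding a rule is adding one struct.
var cpuRules = []rule{
	{
		Match:     func(p proc) bool { return strings.HasPrefix(p.Name, "qemu-img") },
		Recommend: "A disk conversion is running; schedule conversions off-peak or shift them with carbon-aware scheduling.",
	},
	{
		Match:     func(p proc) bool { return p.CPU > 80 },
		Recommend: "A single process dominates the CPU; inspect it before scaling the host.",
	},
}

// explainCPU returns every recommendation whose pattern matches the top
// CPU consumer.
func explainCPU(top proc) []string {
	var recs []string
	for _, r := range cpuRules {
		if r.Match(top) {
			recs = append(recs, r.Recommend)
		}
	}
	return recs
}

func main() {
	for _, rec := range explainCPU(proc{Name: "qemu-img", CPU: 92}) {
		fmt.Println(rec)
	}
}
```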

Time-Series Storage: The Ring Buffer

We store 24 hours of metric history in a ring buffer. The implementation is a fixed-size slice with a write pointer that wraps around. Each data point is a timestamp-value pair. With a 15-second collection interval, we store approximately 5,760 points per metric per day, consuming roughly 92 KB per metric (16 bytes per point).

The ring buffer has several advantages over a database. It requires no external dependencies, has O(1) insert and O(n) scan performance, naturally evicts old data without cleanup jobs, and uses a fixed, predictable amount of memory. For a system that stores 10 metrics, the total memory footprint is under 1 MB.
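A minimal sketch of the structure (single-writer, with locking and the per-metric map omitted for brevity):

```go
package main

import "fmt"

// point is a timestamp-value pair: two 8-byte fields, 16 bytes per point.
type point struct {
	TS  int64
	Val float64
}

// ring is a fixed-size circular buffer; the write index wraps and old
// points are overwritten in place, so eviction needs no cleanup job.
type ring struct {
	buf  []point
	next int
	full bool
}

func newRing(size int) *ring { return &ring{buf: make([]point, size)} }

// add inserts a point in O(1).
func (r *ring) add(p point) {
	r.buf[r.next] = p
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// points returns the stored points oldest-first (an O(n) scan).
func (r *ring) points() []point {
	if !r.full {
		return append([]point(nil), r.buf[:r.next]...)
	}
	return append(append([]point(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}

func main() {
	r := newRing(3)
	for ts := int64(0); ts < 5; ts++ {
		r.add(point{TS: ts, Val: float64(ts) * 10})
	}
	// The two oldest points were overwritten; only ts 2, 3, 4 remain.
	fmt.Println(r.points()) // [{2 20} {3 30} {4 40}]
}
```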

Time-series queries support a step parameter for downsampling. When querying with step=5m, the API averages all data points within each 5-minute window and returns one point per window. This keeps response sizes manageable when graphing 24 hours of data in the dashboard.
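The windowed averaging can be sketched as follows, assuming the input is ordered oldest-first as the ring buffer yields it (the sample series is invented):

```go
package main

import "fmt"

// point is a timestamp (Unix seconds) and value pair.
type point struct {
	TS  int64
	Val float64
}

// downsample averages raw points into fixed windows of step seconds and
// returns one point per non-empty window, stamped at the window start.
func downsample(pts []point, step int64) []point {
	var out []point
	var sum float64
	var count int
	var winStart int64
	for _, p := range pts {
		ws := p.TS - p.TS%step // start of the window this point falls in
		if count > 0 && ws != winStart {
			out = append(out, point{TS: winStart, Val: sum / float64(count)})
			sum, count = 0, 0
		}
		winStart = ws
		sum += p.Val
		count++
	}
	if count > 0 { // flush the final window
		out = append(out, point{TS: winStart, Val: sum / float64(count)})
	}
	return out
}

func main() {
	// 15-second samples downsampled with step=60 (one minute per window).
	raw := []point{{0, 10}, {15, 20}, {30, 30}, {60, 40}, {75, 60}}
	fmt.Println(downsample(raw, 60)) // [{0 20} {60 50}]
}
```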

Smart Alerts

The alert engine evaluates rules against collected metrics on each collection cycle. Default rules cover common failure modes: CPU above 90% for 2 minutes, memory above 85%, disk above 90%, swap above 50%, and OOM kills. Each alert has a severity (info, warning, critical) and a suppression window to prevent duplicate alerts from flooding the notification system.

Alerts are delivered through two channels: the REST API (polled by the dashboard) and webhooks (pushed to external systems). Webhook delivery is resilient to transient failures with exponential backoff retry. The webhook payload includes the alert details, current metric values, and a link to the explain mode endpoint for the affected resource.

Lessons Learned

Building observability from scratch taught us several things. First, /proc parsing is cheap -- reading and parsing all system metrics takes under a millisecond on modern hardware. There is no reason to rely on external agents for basic system metrics. Second, a ring buffer is an excellent data structure for bounded time-series storage when you do not need persistence across restarts. Third, the explain mode concept -- turning raw metrics into structured diagnostics with recommendations -- is far more useful to operators than raw dashboards. It eliminates the step where someone has to look at five charts and reason about what they mean together.

The observability layer now serves as the foundation for carbon-aware scheduling (which needs to know current system load), the health check endpoint (which powers the dashboard home page), and the alert system (which drives webhook notifications). Building it self-contained means the entire stack deploys as a single binary with zero runtime dependencies.