Monitoring¶
What to watch and how to hook it into your existing stack.
Built-in signals¶
shard-db exposes three layers of observability. All are query-driven — no dedicated metrics port.
1. stats — global snapshot¶
Returns (JSON):
```json
{
  "uptime_ms": 3612450,
  "connections": {"active": 4, "total": 18231},
  "in_flight_writes": 0,
  "ucache": {"used": 128, "total": 4096, "hits": 1820391, "misses": 4102},
  "bt_cache": {"used": 48, "total": 256, "hits": 923104, "misses": 847},
  "slow_queries": [
    {"mode": "find", "object": "orders", "duration_ms": 1347, "at": "20260418153012"}
  ]
}
```
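To pull this snapshot from the shell, the same wire pattern as the stats-prom script further down should work; treat the `{"mode":"stats"}` payload here as an assumption inferred from that example rather than a documented request shape:

```bash
# Fetch the global stats snapshot. Port 9199 matches the stats-prom example
# below; the {"mode":"stats"} payload is assumed from that same pattern.
echo '{"mode":"stats"}' | nc -q1 localhost 9199 | sed -e 's/\x00//g' | jq .
```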
Key metrics to scrape:

| Metric | Alert when |
|---|---|
| `uptime_ms` | Drops unexpectedly (process restart). |
| `connections.active` | Near the `WORKERS` cap for extended periods. |
| `in_flight_writes` | Stays elevated (> 0) when traffic is idle — indicates stuck writes. |
| `ucache.hits / (hits + misses)` | Drops below 90% on a read-heavy workload — raise `FCACHE_MAX`. |
| `bt_cache.hits / (hits + misses)` | Drops below 90% on indexed queries — raise `FCACHE_MAX` (`bt_cache` is derived as `FCACHE_MAX/4` since 2026.05.1; not separately configurable). |
| `slow_queries[].duration_ms` | Any cross their SLO. |
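The two cache ratios are not reported as fields; a quick way to derive them from the snapshot (field names as shown above; the request payload is the same assumption as in the earlier sketch):

```bash
# Derive cache hit rates from the stats counters; alert when either drops
# below 0.90. The {"mode":"stats"} payload is an assumed request shape.
echo '{"mode":"stats"}' | nc -q1 localhost 9199 | sed -e 's/\x00//g' \
  | jq '{ucache_hit_rate: (.ucache.hits / (.ucache.hits + .ucache.misses)),
         bt_cache_hit_rate: (.bt_cache.hits / (.bt_cache.hits + .bt_cache.misses))}'
```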
2. Slow query log¶
Queries exceeding SLOW_QUERY_MS (default 500 ms, floor 100 ms):
- Last 64 kept in memory (visible in `stats`).
- Persisted to `$LOG_DIR/slow-YYYY-MM-DD.log`, one JSON object per line.
Use a log-based check (jq plus your monitoring stack, or a log shipper like Vector / Fluent Bit) to alert on duration or frequency spikes.
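A minimal sketch of such a check, assuming GNU date and the YYYYMMDDHHMMSS `at` format shown in the stats example (the 10-per-minute threshold matches the starter alert further down):

```bash
#!/bin/bash
# Exit non-zero if more than 10 slow queries were logged in the last minute.
# Assumes GNU date and the "at" timestamp format shown in the stats snapshot.
LOG="${LOG_DIR:-./logs}/slow-$(date +%F).log"
SINCE=$(date -d '1 minute ago' +%Y%m%d%H%M%S)
COUNT=$(jq -r --arg since "$SINCE" 'select(.at >= $since) | .at' "$LOG" 2>/dev/null | wc -l)
if [ "$COUNT" -gt 10 ]; then
  echo "ALERT: $COUNT slow queries in the last minute" >&2
  exit 1
fi
```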
3. shard-stats — per-shard detail¶
Per shard: slot count, active records, tombstoned, load factor. Useful for:
- Diagnosing skew (one shard with 2× the load of others).
- Verifying `vacuum --splits N` rebalanced correctly.
- Confirming a shard is about to grow (load > 50%); see the sketch below.
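A sketch of a skew/growth check; the `{"mode":"shard-stats"}` payload and the `.shards[].load_factor` path are illustrative assumptions, since only the metric names are documented above:

```bash
# Flag shards whose load factor exceeds 0.5 (about to grow, or skewed).
# The request payload and the .shards[].load_factor path are assumed names,
# not a confirmed response schema.
echo '{"mode":"shard-stats","object":"orders"}' | nc -q1 localhost 9199 \
  | sed -e 's/\x00//g' \
  | jq '.shards[]? | select(.load_factor > 0.5)'
```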
Log files¶
All logs live under $LOG_DIR (default ./logs/).
| File | Content | Rotation |
|---|---|---|
| `info-YYYY-MM-DD.log` | Server start/stop, schema mutations, vacuum runs, connection accept/close at debug level. | Daily. Pruned after `LOG_RETAIN_DAYS` (default 7). |
| `error-YYYY-MM-DD.log` | Auth failures, malformed requests, write errors, crashes. | Daily. Same retention. |
| `slow-YYYY-MM-DD.log` | Queries exceeding `SLOW_QUERY_MS`. | Daily. Same retention. |
Set LOG_RETAIN_DAYS=0 to disable auto-prune and use logrotate instead (see Deployment → Log rotation).
Scraping into Prometheus¶
shard-db speaks Prometheus text-format natively via stats-prom. Two integration shapes:
A. Native stats-prom → textfile exporter¶
```bash
#!/bin/bash
# /usr/local/bin/shard-db-metrics.sh
# Pull stats-prom output and publish it atomically (write temp file, then mv)
# for node_exporter's textfile collector.
set -euo pipefail
TMP=/var/lib/node_exporter/textfile/shard_db.prom.$$
echo '{"mode":"stats-prom"}' | nc -q1 localhost 9199 \
  | sed -e 's/\x00//g' > "$TMP"
mv "$TMP" /var/lib/node_exporter/textfile/shard_db.prom
```
Run it via cron or a systemd timer (every 30 s is ample). node_exporter picks it up when started with `--collector.textfile.directory` pointing at that directory. The output is already in Prometheus exposition format — no jq, no per-metric assembly.
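If you go the systemd route, a pair of units like the following gives the 30-second cadence (unit names and paths are illustrative):

```ini
# /etc/systemd/system/shard-db-metrics.service
[Unit]
Description=Export shard-db stats-prom for the node_exporter textfile collector

[Service]
Type=oneshot
ExecStart=/usr/local/bin/shard-db-metrics.sh

# /etc/systemd/system/shard-db-metrics.timer
[Unit]
Description=Run shard-db metrics export every 30 seconds

[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
AccuracySec=1s

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now shard-db-metrics.timer`.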
CLI shortcut:
B. Dedicated exporter¶
Tiny Python / Go daemon that opens a long-lived connection, runs stats-prom on each scrape, and exposes /metrics directly. Saves the textfile dance and keeps the connection warm.
Metrics emitted¶
stats-prom exposes:
- shard_db_uptime_seconds, shard_db_active_threads, shard_db_in_flight_writes
- shard_db_ucache_used / _capacity / _bytes / _hits_total / _misses_total
- shard_db_bt_cache_used / _capacity / _bytes / _hits_total / _misses_total
- shard_db_slow_query_threshold_milliseconds, shard_db_slow_query_total
All counters use the _total convention; gauges have the unit in the name where applicable.
Alerting rules (starter set)¶
Adapt to your stack; these are the ones worth setting up first.
| Rule | Trigger | Severity |
|---|---|---|
| Server down | No stats response for 60 s | Critical |
| High write backlog | `in_flight_writes` > 50 for 5 min | High |
| ucache miss rate | `ucache_miss_rate` > 15% for 10 min | Medium |
| btcache miss rate | `bt_cache_miss_rate` > 15% for 10 min | Medium |
| Slow query rate | More than 10 slow queries per minute | Medium |
| Disk full | `$DB_ROOT` filesystem > 85% | High |
| Vacuum needed | `vacuum-check` returns any object for N days | Low |
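As a concrete starting point for the two cache rows, a Prometheus rule built from the stats-prom counter names listed above (group and alert names are illustrative; the bt_cache rule is analogous):

```yaml
groups:
  - name: shard-db
    rules:
      - alert: ShardDbUcacheMissRateHigh
        expr: |
          rate(shard_db_ucache_misses_total[5m])
            / (rate(shard_db_ucache_hits_total[5m]) + rate(shard_db_ucache_misses_total[5m]))
            > 0.15
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "shard-db ucache miss rate above 15% for 10 minutes"
```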
Health check¶
For load balancers, a shallow probe is enough: exit non-zero on connect failure; any JSON back = server alive.
For a deeper liveness probe, parse stats and fail if in_flight_writes > 100 for several consecutive checks.
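A sketch of that deeper probe, using the same assumed request shape as the earlier examples and a small state file to track consecutive failures:

```bash
#!/bin/bash
# Deeper liveness probe: fail only after in_flight_writes has exceeded 100
# on several consecutive checks. Request shape assumed as in earlier examples.
STATE=/var/run/shard-db-health.count   # illustrative path
LIMIT=3                                # consecutive bad checks before failing
WRITES=$(echo '{"mode":"stats"}' | nc -q1 localhost 9199 \
  | sed -e 's/\x00//g' | jq -r '.in_flight_writes' 2>/dev/null)

[ -z "$WRITES" ] && exit 1   # no response at all: hard failure

if [ "$WRITES" -gt 100 ]; then
  COUNT=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  echo "$COUNT" > "$STATE"
  [ "$COUNT" -ge "$LIMIT" ] && exit 1
else
  echo 0 > "$STATE"
fi
exit 0
```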
Tracing¶
No built-in distributed tracing. The slow-query log is the nearest thing. If you need per-request tracing:
- Wrap client calls to record start/end times and correlation IDs.
- Emit to your tracer (OpenTelemetry, Jaeger, whatever).
- Cross-reference with `slow-*.log` when investigating.
Key questions your monitoring should answer¶
- Is the server alive? — `db-dirs` probe + process monitoring.
- Are reads fast? — cache hit rates, slow query log.
- Are writes flowing? — `in_flight_writes`, write latency from client metrics.
- Is anyone hammering auth? — count `error-*.log` lines with `auth failed` (see the one-liner below).
- Is disk filling? — per-object `size` + filesystem usage.
- Are any objects overdue for maintenance? — daily `vacuum-check`.
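For the auth question specifically, a cron-friendly one-liner over today's error log is usually enough (log naming as in the table above):

```bash
# Count "auth failed" lines logged today; alert if the count spikes.
grep -c 'auth failed' "${LOG_DIR:-./logs}/error-$(date +%F).log" 2>/dev/null || true
```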