Monitoring¶
What to watch and how to hook it into your existing stack.
Built-in signals¶
shard-db exposes three layers of observability. All are query-driven — no dedicated metrics port.
1. stats — global snapshot¶
Returns (JSON):
```json
{
  "uptime_ms": 3612450,
  "connections": {"active": 4, "total": 18231},
  "in_flight_writes": 0,
  "ucache": {"used": 128, "total": 4096, "hits": 1820391, "misses": 4102},
  "bt_cache": {"used": 48, "total": 256, "hits": 923104, "misses": 847},
  "slow_queries": [
    {"mode": "find", "object": "orders", "duration_ms": 1347, "at": "20260418153012"}
  ]
}
```
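To pull this snapshot from the shell, the same wire pattern as the stats-prom script further down should work; treat the `{"mode":"stats"}` payload here as an assumption inferred from that example rather than a documented request shape:

```bash
# Fetch the global stats snapshot. Port 9199 matches the stats-prom example
# below; the {"mode":"stats"} payload is assumed from that same pattern.
echo '{"mode":"stats"}' | nc -q1 localhost 9199 | sed -e 's/\x00//g' | jq .
```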
Key metrics to scrape:

| Metric | Alert when |
|---|---|
| `uptime_ms` | Drops unexpectedly (process restart). |
| `connections.active` | Near the `WORKERS` cap for extended periods. |
| `in_flight_writes` | Stays elevated (> 0) when traffic is idle — indicates stuck writes. |
| `ucache.hits / (hits + misses)` | Drops below 90% on a read-heavy workload — raise `FCACHE_MAX`. |
| `bt_cache.hits / (hits + misses)` | Drops below 90% on indexed queries — raise `FCACHE_MAX` (`bt_cache` is derived as `FCACHE_MAX/4` since 2026.05.1; not separately configurable). |
| `slow_queries[].duration_ms` | Any cross their SLO. |
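The two cache ratios are not reported as fields; a quick way to derive them from the snapshot (field names as shown above; the request payload is the same assumption as in the earlier sketch):

```bash
# Derive cache hit rates from the stats counters; alert when either drops
# below 0.90. The {"mode":"stats"} payload is an assumed request shape.
echo '{"mode":"stats"}' | nc -q1 localhost 9199 | sed -e 's/\x00//g' \
  | jq '{ucache_hit_rate: (.ucache.hits / (.ucache.hits + .ucache.misses)),
         bt_cache_hit_rate: (.bt_cache.hits / (.bt_cache.hits + .bt_cache.misses))}'
```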
2. Slow query log¶
Queries exceeding SLOW_QUERY_MS (default 500 ms, floor 100 ms):
- Last 64 kept in memory (visible in `stats`).
- Persisted to `$LOG_DIR/slow-YYYY-MM-DD.log`, one JSON object per line.
Use a log-based check (jq plus your monitoring stack, or a log shipper like Vector / Fluent Bit) to alert on duration or frequency spikes.
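A minimal sketch of such a check, assuming GNU date and the YYYYMMDDHHMMSS `at` format shown in the stats example (the 10-per-minute threshold matches the starter alert further down):

```bash
#!/bin/bash
# Exit non-zero if more than 10 slow queries were logged in the last minute.
# Assumes GNU date and the "at" timestamp format shown in the stats snapshot.
LOG="${LOG_DIR:-./logs}/slow-$(date +%F).log"
SINCE=$(date -d '1 minute ago' +%Y%m%d%H%M%S)
COUNT=$(jq -r --arg since "$SINCE" 'select(.at >= $since) | .at' "$LOG" 2>/dev/null | wc -l)
if [ "$COUNT" -gt 10 ]; then
  echo "ALERT: $COUNT slow queries in the last minute" >&2
  exit 1
fi
```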
3. shard-stats — per-shard detail¶
Per shard: slot count, active records, tombstoned, load factor. Useful for:
- Diagnosing skew (one shard with 2× the load of others).
- Verifying `vacuum --splits N` rebalanced correctly.
- Confirming a shard is about to grow (load > 50%); see the sketch below.
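A sketch of a skew/growth check; the `{"mode":"shard-stats"}` payload and the `.shards[].load_factor` path are illustrative assumptions, since only the metric names are documented above:

```bash
# Flag shards whose load factor exceeds 0.5 (about to grow, or skewed).
# The request payload and the .shards[].load_factor path are assumed names,
# not a confirmed response schema.
echo '{"mode":"shard-stats","object":"orders"}' | nc -q1 localhost 9199 \
  | sed -e 's/\x00//g' \
  | jq '.shards[]? | select(.load_factor > 0.5)'
```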
Log files¶
All logs live under $LOG_DIR (default ./logs/).
| File | Content | Rotation |
|---|---|---|
| `info-YYYY-MM-DD.log` | Server start/stop, schema mutations, vacuum runs, connection accept/close at debug level. | Daily. Pruned after `LOG_RETAIN_DAYS` (default 7). |
| `error-YYYY-MM-DD.log` | Auth failures, malformed requests, write errors, crashes. | Daily. Same retention. |
| `slow-YYYY-MM-DD.log` | Queries exceeding `SLOW_QUERY_MS`. | Daily. Same retention. |
Set LOG_RETAIN_DAYS=0 to disable auto-prune and use logrotate instead (see Deployment → Log rotation).
Scraping into Prometheus¶
shard-db speaks Prometheus text-format natively via stats-prom. Two integration shapes:
A. Native stats-prom → textfile exporter¶
```bash
#!/bin/bash
# /usr/local/bin/shard-db-metrics.sh
# Pull stats-prom output and publish it atomically (write temp file, then mv)
# for node_exporter's textfile collector.
set -euo pipefail
TMP=/var/lib/node_exporter/textfile/shard_db.prom.$$
echo '{"mode":"stats-prom"}' | nc -q1 localhost 9199 \
  | sed -e 's/\x00//g' > "$TMP"
mv "$TMP" /var/lib/node_exporter/textfile/shard_db.prom
```
Run it via cron or a systemd timer (every 30 s is ample). node_exporter picks it up when started with `--collector.textfile.directory` pointing at that directory. The output is already in Prometheus exposition format — no jq, no per-metric assembly.
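If you go the systemd route, a pair of units like the following gives the 30-second cadence (unit names and paths are illustrative):

```ini
# /etc/systemd/system/shard-db-metrics.service
[Unit]
Description=Export shard-db stats-prom for the node_exporter textfile collector

[Service]
Type=oneshot
ExecStart=/usr/local/bin/shard-db-metrics.sh

# /etc/systemd/system/shard-db-metrics.timer
[Unit]
Description=Run shard-db metrics export every 30 seconds

[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
AccuracySec=1s

[Install]
WantedBy=timers.target
```

Enable it with `systemctl enable --now shard-db-metrics.timer`.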
CLI shortcut:
B. Dedicated exporter¶
Tiny Python / Go daemon that opens a long-lived connection, runs stats-prom on each scrape, and exposes /metrics directly. Saves the textfile dance and keeps the connection warm.
Metrics emitted¶
stats-prom exposes:
- shard_db_uptime_seconds, shard_db_active_threads, shard_db_in_flight_writes
- shard_db_ucache_used / _capacity / _bytes / _hits_total / _misses_total
- shard_db_bt_cache_used / _capacity / _bytes / _hits_total / _misses_total
- shard_db_slow_query_threshold_milliseconds, shard_db_slow_query_total
All counters use the _total convention; gauges have the unit in the name where applicable.
Alerting rules (starter set)¶
Adapt to your stack; these are the ones worth setting up first.
| Rule | Trigger | Severity |
|---|---|---|
| Server down | No stats response for 60 s | Critical |
| High write backlog | `in_flight_writes` > 50 for 5 min | High |
| ucache miss rate | `ucache_miss_rate` > 15% for 10 min | Medium |
| btcache miss rate | `bt_cache_miss_rate` > 15% for 10 min | Medium |
| Slow query rate | More than 10 slow queries per minute | Medium |
| Disk full | `$DB_ROOT` filesystem > 85% | High |
| Vacuum needed | `vacuum-check` returns any object for N days | Low |
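As a concrete starting point for the two cache rows, a Prometheus rule built from the stats-prom counter names listed above (group and alert names are illustrative; the bt_cache rule is analogous):

```yaml
groups:
  - name: shard-db
    rules:
      - alert: ShardDbUcacheMissRateHigh
        expr: |
          rate(shard_db_ucache_misses_total[5m])
            / (rate(shard_db_ucache_hits_total[5m]) + rate(shard_db_ucache_misses_total[5m]))
            > 0.15
        for: 10m
        labels:
          severity: medium
        annotations:
          summary: "shard-db ucache miss rate above 15% for 10 minutes"
```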
Health check¶
For load balancers, a shallow probe is enough: exit non-zero on connect failure; any JSON back = server alive.
For a deeper liveness probe, parse stats and fail if in_flight_writes > 100 for several consecutive checks.
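A sketch of that deeper probe, using the same assumed request shape as the earlier examples and a small state file to track consecutive failures:

```bash
#!/bin/bash
# Deeper liveness probe: fail only after in_flight_writes has exceeded 100
# on several consecutive checks. Request shape assumed as in earlier examples.
STATE=/var/run/shard-db-health.count   # illustrative path
LIMIT=3                                # consecutive bad checks before failing
WRITES=$(echo '{"mode":"stats"}' | nc -q1 localhost 9199 \
  | sed -e 's/\x00//g' | jq -r '.in_flight_writes' 2>/dev/null)

[ -z "$WRITES" ] && exit 1   # no response at all: hard failure

if [ "$WRITES" -gt 100 ]; then
  COUNT=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
  echo "$COUNT" > "$STATE"
  [ "$COUNT" -ge "$LIMIT" ] && exit 1
else
  echo 0 > "$STATE"
fi
exit 0
```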
Tracing¶
No built-in distributed tracing. The slow-query log is the nearest thing. If you need per-request tracing:
- Wrap client calls to record start/end times and correlation IDs.
- Emit to your tracer (OpenTelemetry, Jaeger, whatever).
- Cross-reference with `slow-*.log` when investigating.
Key questions your monitoring should answer¶
- Is the server alive? — `db-dirs` probe + process monitoring.
- Are reads fast? — cache hit rates, slow query log.
- Are writes flowing? — `in_flight_writes`, write latency from client metrics.
- Is anyone hammering auth? — count `error-*.log` lines with `auth failed` (see the one-liner below).
- Is disk filling? — per-object `size` + filesystem usage.
- Are any objects overdue for maintenance? — daily `vacuum-check`.
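For the auth question specifically, a cron-friendly one-liner over today's error log is usually enough (log naming as in the table above):

```bash
# Count "auth failed" lines logged today; alert if the count spikes.
grep -c 'auth failed' "${LOG_DIR:-./logs}/error-$(date +%F).log" 2>/dev/null || true
```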