Tuning¶
What to change in db.env and when. Pair with stats — tune from observed cache hit rates, not guesses.
The short version¶
# db.env
export THREADS=0 # auto = online CPU count (nproc)
export WORKERS=0 # auto = max(CPU count, 4)
export FCACHE_MAX=4096 # raise if shard-mmap cache hit rate < 90%; allow-list {4096, 8192, 12288, 16384}
# BT_CACHE_MAX is no longer configurable — derived as FCACHE_MAX/4 (2026.05.1+)
export MAX_REQUEST_SIZE=33554432 # 32 MB; per-connection read buffer (memory planning!)
export QUERY_BUFFER_MB=500 # per-query intermediate buffer cap (collect-hash, KeySet, etc.)
export GLOBAL_LIMIT=100000 # soft result-set cap
export SLOW_QUERY_MS=500 # slow query threshold (floor 100 ms)
export TIMEOUT=0 # statement timeout (seconds, 0=off; per-request `timeout_ms` overrides)
export INDEX_PAGE_SIZE=4096 # B+ tree page size (power-of-2, 1024–65536)
Only change what the data says to change.
THREADS — scan parallelism¶
Number of worker threads spawned for parallel shard scans (find, count, aggregate, index builds).
- Default 0 = online CPU count (`nproc`).
- Lower it if the host has other workloads sharing CPU.
- Higher rarely helps — the work is CPU-bound on mmap reads; going above physical cores causes cache thrash.
When to care: large full scans. For indexed queries (which touch tiny candidate sets), THREADS barely matters.
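If the host is shared, a small db.env sketch like the following keeps headroom for the co-located workload — the half-the-cores split is an assumption to adjust, not a recommendation from the daemon itself:

```bash
# Illustrative db.env fragment for a shared host: cap scan parallelism at half
# the cores (the 1/2 fraction is an assumption; 0 keeps the auto behaviour).
cores=$(nproc)
export THREADS=$(( cores / 2 > 0 ? cores / 2 : 1 ))   # never 0 — 0 means "auto"
```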
WORKERS — server thread pool¶
Number of threads servicing incoming connections.
- Default 0 = `max(online CPUs, 4)`.
- Bump for heavy pipelining / high connection concurrency.
- Read-buffer memory scales with concurrent connections (`MAX_REQUEST_SIZE` × connections), not with the worker count.
When to care: stats shows active_threads consistently at the cap. Usually not the bottleneck — threads are cheap; memory and disk are the limits.
FCACHE_MAX — unified shard mmap cache (drives BT_CACHE_MAX too)¶
Capacity (in entries, not bytes) of the shared shard mmap cache (ucache). Every entry is one shard's mmap region. Since 2026.05.1, BT_CACHE_MAX is derived from this as FCACHE_MAX / 4 and is no longer configurable on its own.
- Default 4096 (so `bt_cache` capacity = 1024).
- Strict allow-list: `{4096, 8192, 12288, 16384}`. Invalid values fall back to the default with a warning.
- Each object has `splits` shards in `ucache`, plus `splits/4` per-field idx files in `bt_cache` (per-shard btree layout).
- Raise if either `ucache.hits / (hits + misses) < 90%` (read-heavy) or `bt_cache.hits / (hits + misses) < 90%` (indexed-query heavy).
- Lowering is not possible — `4096` is the floor of the allow-list.
When to care:
- Many objects × many splits, with query latency higher than expected.
- Sum objects × avg(splits) for ucache sizing; objects × avg(indexes) × avg(splits/4) for bt_cache sizing.
- Bumping FCACHE_MAX from 4096 → 8192 doubles both caches. With 100 objects × 64 splits × 14 indexes, the per-shard layout creates 100 × 64 + 100 × 14 × 16 = 28 800 mmap entries — comfortably above 4096; bump to 16384 for full residency.
BT_CACHE_MAX set in db.env is ignored with a stderr warning.
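A quick sketch of the sizing arithmetic above, using the example's counts (100 objects, 64 splits, 14 indexes — all illustrative) so you can plug in your own deployment:

```bash
# Estimate cache-entry demand before picking FCACHE_MAX (inputs are illustrative).
objects=100     # objects in the deployment
splits=64       # average splits per object
indexes=14      # average indexed fields per object

ucache_entries=$(( objects * splits ))              # one mmap entry per shard
bt_entries=$(( objects * indexes * splits / 4 ))    # splits/4 idx files per field
echo "ucache entries needed:   $ucache_entries"
echo "bt_cache entries needed: $bt_entries"
echo "total mmap entries:      $(( ucache_entries + bt_entries ))"
# Compare ucache demand against FCACHE_MAX and bt_cache demand against
# FCACHE_MAX/4, then pick from the allow-list {4096, 8192, 12288, 16384}.
```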
MAX_REQUEST_SIZE — per-request ceiling¶
Maximum bytes in one JSON request line. Oversized requests are drained and rejected with {"error":"Request too large (max N bytes)"}.
- Default 32 MB (`33554432`).
- Every active connection allocates a read buffer this size: 100 connections × 32 MB = 3.2 GB resident.
- Raise for:
  - Large bulk-inserts (`MAX_REQUEST_SIZE / avg_record_size` = max records per request).
  - Large file uploads (base64 inflates payloads by 4/3, so 32 MB ≈ 24 MB effective file).
- Don't raise blindly — memory cost scales with concurrency.
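The arithmetic behind those bullets, as a throwaway sketch — the average record size and connection count are assumptions to replace with your own:

```bash
# MAX_REQUEST_SIZE planning arithmetic (record size and connections are assumptions).
max_request=33554432          # 32 MB default
avg_record_size=512           # bytes per record — illustrative
connections=100               # expected concurrent connections — illustrative

echo "max records per bulk request: $(( max_request / avg_record_size ))"
echo "max effective base64 upload:  $(( max_request * 3 / 4 )) bytes"
echo "read-buffer RAM at ${connections} conns: $(( max_request * connections / 1048576 )) MB"
```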
Sizing rule of thumb¶
Leave room for the ucache, bt_cache, page cache, and working memory.
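A back-of-envelope budget under assumed inputs (connection count, concurrent heavy queries) that you'd substitute per deployment:

```bash
# Host memory budget sketch — every input here is an assumption to replace.
connections=100        # concurrent connections
heavy_queries=4        # queries allowed to approach the buffer cap at once
max_request_mb=32      # MAX_REQUEST_SIZE, in MB
query_buffer_mb=500    # QUERY_BUFFER_MB

buffers_mb=$(( connections * max_request_mb ))
query_mb=$(( heavy_queries * query_buffer_mb ))
echo "connection read buffers: ${buffers_mb} MB"
echo "query working buffers:   ${query_mb} MB"
echo "subtotal:                $(( buffers_mb + query_mb )) MB"
# ...then leave room on top for ucache/bt_cache mmaps, the OS page cache,
# and general working memory, as noted above.
```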
GLOBAL_LIMIT — default result limit¶
Default limit applied when a query omits one (or sets limit ≤ 0).
- Default 100 000.
- Pure fallback — there is no server-side clamp. A per-query `"limit": 500000` returns 500K records even with `GLOBAL_LIMIT=100000`.
- Prevents runaway queries from accidentally materializing the whole object when a caller forgets `limit`.
If you need a hard ceiling that cannot be overridden, that's not what GLOBAL_LIMIT is — enforce it in the application layer or behind a reverse proxy. Pair the limit-default with QUERY_BUFFER_MB (a real cap on intermediate-buffer memory; the query aborts if exceeded) for memory protection.
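A db.env fragment showing that pairing — the numbers are placeholders, not recommendations:

```bash
# Illustrative pairing: conservative default result limit + a real memory cap.
export GLOBAL_LIMIT=20000      # fallback only; any per-query "limit" overrides it
export QUERY_BUFFER_MB=250     # hard per-query buffer cap — the query aborts past it
```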
SLOW_QUERY_MS — slow query threshold¶
Queries exceeding this duration get logged to the in-memory slow-query ring (visible via stats) and to slow-YYYY-MM-DD.log.
- Default 500 ms. Minimum 100 ms (lower nonzero values are ignored).
- `0` = disable.
- Raise it if normal queries routinely cross the threshold and the log is noise.
TIMEOUT — statement timeout¶
Seconds before a long-running query is cancelled cooperatively. Checked every 1024 iterations inside scan loops, so precision is coarse (tens of milliseconds, not microseconds).
- Default 0 = disabled.
- Recommended: set to a protective upper bound (e.g., 30 seconds) to prevent stuck queries blocking worker threads.
- Per-request `"timeout_ms": N` overrides it for `find`/`count`/`aggregate`/`bulk-delete`/`bulk-update`. Use a tight `timeout_ms` for interactive callers without lowering the global default.
- Applies to scans + bulk-update/delete; does not apply to single-record writes.
Response on timeout: {"error":"query_timeout"}.
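A db.env fragment following the protective-upper-bound recommendation; interactive callers would then pass their own tighter `timeout_ms` per request:

```bash
# Protective global ceiling; per-request "timeout_ms" stays the tool for
# interactive callers (values here are examples, not prescriptions).
export TIMEOUT=30              # seconds; 0 disables the statement timeout
export SLOW_QUERY_MS=500       # keep the slow-query log meaningful alongside it
```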
QUERY_BUFFER_MB — per-query memory cap¶
Upper bound on intermediate buffers any single query can hold — collect-hash arrays, KeySets (OR union, AND intersection), aggregate hash tables, ordered-find sort buffers, bulk-delete/update key lists.
- Default 500 MB.
- Checked at every collection site. Exceeding it triggers `{"error":"query memory buffer exceeded; narrow criteria, add limit/offset, or stream via fetch+cursor"}` and the server keeps serving.
- Peak RAM per query ≈ 1 × `QUERY_BUFFER_MB` (true RSS runs ~10–15% higher from malloc overhead).
When to care: heavy ad-hoc analytics that legitimately need bigger working sets. Multiply by expected concurrent heavy queries when sizing the host. Pair with whole-process containment (systemd MemoryMax=, cgroup memory.max, ulimit -v) as a backstop.
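For the whole-process backstop, a sketch assuming the daemon runs as a systemd unit called `shard-db.service` (the unit name is an assumption — substitute yours):

```bash
# Whole-process memory backstops (the unit name is an assumption).
sudo systemctl set-property shard-db.service MemoryMax=8G

# Or, when launching from a shell, cap the address space before starting:
ulimit -v $(( 12 * 1024 * 1024 ))   # ulimit -v takes KiB; this is 12 GiB
```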
INDEX_PAGE_SIZE — B+ tree page size¶
- Default 4096 bytes (matches typical page cache granularity).
- Valid range 1024–65536, must be power-of-2.
- Larger pages: fewer levels (faster descent), but more data read per page (wastes I/O on small scans).
- Smaller pages: more levels, more compact per-page.
Don't change without a specific reason and a benchmark to prove it. 4096 is a sweet spot on Linux x86_64.
Disk layout¶
- ext4 / xfs — both tested, both fine.
- SSD — strongly recommended. shard-db is I/O-bound on scans once beyond page cache.
- NVMe — helps for bulk loads; marginal for steady-state queries.
Don't use network filesystems (NFS, CIFS) for $DB_ROOT unless you deeply trust their mmap semantics. Latency and coherence bugs compound.
Sizing splits¶
splits is the number of shard files per object, fixed at create-object time and changeable only via vacuum --splits=N (re-hashes every record). The right value is driven by records-per-shard, not by load factor.
Sweet spot: 78K–200K records/shard. Acceptable to ~500K; past ~1M you're saturating this design. (Validated on the parallel K/V bench: 10M rows at 128 splits = 78K rec/shard completed in 3.488s. 64 splits: 3.605s. 256 splits: 3.986s. 1024 splits: 5.454s — too many shards = too many small files = more syscalls per query.)
| Expected rows | Recommended splits | Records/shard at cap |
|---|---|---|
| up to 1M | 8 | ~125K |
| 1–4M | 16 | ~250K |
| 4–10M | 32 | ~313K |
| 10–25M | 64 | ~391K |
| 25–60M | 128 | ~469K |
| 60–125M | 256 | ~488K |
| 125–250M | 512 | ~488K |
| 250–500M | 1024 | ~488K |
| 500M–1B | 2048 | ~488K |
| 1B–2B+ | 4096 (MAX_SPLITS) | ~488K and up |
Defaults: create-object with no splits gives 8 (fine for sub-1M-row objects — test/demo/prototype/small-app, the ~70% case). For 1M+ rows set splits explicitly per the table above; otherwise re-split later with vacuum --splits=N. The shard-stats nag fires at 500K records/shard, so on the default 8 splits you get ~4M rows of headroom before the daemon tells you to re-split.
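A hypothetical helper that reproduces the table: pick the smallest power-of-two split count (≥ 8, ≤ `MAX_SPLITS=4096`) that keeps records/shard under roughly 490K. The cutoff constant is read off the table's last column, not a documented limit.

```bash
#!/usr/bin/env bash
# pick-splits.sh <expected-rows> — smallest power-of-two splits (8..4096) that
# keeps records/shard under ~490K. Reproduces the recommendation table above.
rows=${1:?usage: pick-splits.sh <expected-rows>}
splits=8
while (( rows / splits > 490000 && splits < 4096 )); do
  splits=$(( splits * 2 ))
done
echo "splits=$splits (~$(( rows / splits )) records/shard)"
```

For example, 10M rows lands on 32 splits (~313K/shard), matching the table's 4–10M tier.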
The daemon will tell you when to re-split. Run `./shard-db shard-stats <dir> <object>` periodically. When `avg_rec_per_shard` crosses 500K, the output emits `hint: records-per-shard approaching upper band (>500K) — consider vacuum --splits=N`. Past 1M it escalates to `hint: re-split with vacuum --splits=N`. Past 1M and already at `MAX_SPLITS=4096`, the hint switches to `partition by object` — at that scale you want multiple B+ trees, not a larger one. Tier transitions in the table above land before the 500K nag, so following the table keeps you out of the warning zone.
What else shard-stats flags¶
Beyond the records-per-shard nags above, shard-stats also surfaces shard-load skew:
- `max_records > 4 × min_records` → hint: `shard load is skewed — check key distribution`. Usually means non-random keys — prefix-heavy keys like `ord_0001`, `ord_0002` hash fine, but raw sequential integers cluster badly. Inspect the per-shard table in the output.
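A cron-able sketch that surfaces any hint lines — `DB_ROOT` and the object names are placeholders:

```bash
#!/usr/bin/env bash
# Periodic health check: print any shard-stats hints (placeholders throughout).
DB_ROOT=${DB_ROOT:-/var/lib/shard-db}      # assumption — point at your $DB_ROOT
for obj in orders customers; do            # illustrative object names
  ./shard-db shard-stats "$DB_ROOT" "$obj" | grep -i 'hint:' \
    && echo "^ ${obj}: needs attention"
done
```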
When to run vacuum¶
- Tombstoned / active ratio > 10% and active > 1000 → `vacuum-check` flags it. Run `vacuum` (fast in-place reclaim).
- After `remove-field` → `vacuum --compact` to shrink `slot_size`.
- Shard skew (imbalanced load) or records-per-shard past the sweet spot → `vacuum --splits N` to reshard.
vacuum takes a write lock for the rebuild duration. Schedule during low-traffic windows on big objects.
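A low-traffic maintenance sketch along those lines — the argument order mirrors the `shard-stats` invocation above and should be confirmed against your build's CLI help before use:

```bash
#!/usr/bin/env bash
# Maintenance-window sketch (argument order is an assumption — verify first).
DB_ROOT=${DB_ROOT:-/var/lib/shard-db}       # placeholder path
obj=orders                                  # illustrative object name

./shard-db vacuum-check "$DB_ROOT" "$obj"   # reports tombstone ratio / skew
# If it flags the object, reclaim in place (write lock held for the rebuild):
./shard-db vacuum "$DB_ROOT" "$obj"
```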
Benchmarks¶
Benchmark scripts are at the repo root:
- `./bench-kv.sh`, `./bench-kv-parallel.sh` — insert / get / update throughput.
- `./bench-queries.sh` — find / count / aggregate latency.
- `./bench-joins.sh [count]` — join throughput.
- `./bench-invoice.sh`, `./bench-parallel.sh` — 14-index realistic scenario.
Use them to validate tuning changes empirically, not theoretically.
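The simplest discipline is an A/B run of the same script around each change — assuming the scripts run with their defaults:

```bash
#!/usr/bin/env bash
# A/B a tuning change with the same benchmark before and after.
./bench-queries.sh | tee bench-before.log   # baseline with the current db.env

# ...edit db.env (e.g. FCACHE_MAX), restart the daemon, then:
./bench-queries.sh | tee bench-after.log
diff bench-before.log bench-after.log       # eyeball the latency deltas
```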