Benchmarks

Five canonical workloads on AMD Ryzen 7 7840U (8C / 16T) · 32 GB · NVMe ext4 · Linux 6.19 · gcc 15.2 -O2. Each scenario is a standalone script in bench/. All numbers are from end-to-end runs with the server over TCP — request parse, auth, encode, disk write, ACK are all in the measurement. Nothing is bypassed.

1. K/V single-threaded — 10M records

bench-kv.sh 10000000, SPLITS=128.

Schema: 16-byte hex key, one varchar(100) value — the same record shape used by LMDB / LevelDB / RocksDB db_bench so numbers compare directly. Unlike those embedded libraries, every request below crosses a TCP socket and goes through JSON/CSV parsing on the server.

Operation	Throughput / Latency
Bulk insert (JSON, 10M in one request)	1.97 M inserts/sec (5.08 s)
Bulk insert (CSV, 10M in one request)	2.39 M inserts/sec (4.19 s)
GET ×10,000 (pipelined, 1 conn)	22.1 k ops/sec (453 ms)
EXISTS ×10,000 hits (pipelined)	22.8 k ops/sec (439 ms)
EXISTS ×10,000 all-miss (cold probe)	66.7 k ops/sec (150 ms)
UPDATE ×10,000 (pipelined)	17.8 k ops/sec (561 ms)
DELETE ×10,000 (pipelined)	9.05 k ops/sec (1.11 s)
Parallel GET (5 conns × 10k)	91.2 k ops/sec (548 ms)
Parallel UPDATE (5 conns × 10k)	62.9 k ops/sec (795 ms)
Disk footprint	2.3 GB

2. K/V multi-threaded — 10M records, scaling across connections

bench-kv-parallel.sh 10000000 1000000 10, SPLITS=128.

Same schema, bulk insert fanned out across TCP connections. SPLITS=128 is the sweet spot for 10M rows (≈78K records/shard — see the splits sizing table); going to 1024 at this scale slows the benchmark.

Scenario	Time	Throughput
Single JSON, 10M	4.86 s	2.06 M/sec
Single CSV, 10M	4.27 s	2.34 M/sec
Parallel JSON, 10 conns × 1M	3.73 s	2.68 M/sec (1.30× single)
Parallel CSV, 10 conns × 1M	3.67 s	2.72 M/sec (1.16× single)

Shard load distribution (128 splits): avg 0.596, records stddev 1.6 %, 9 grows per shard.

How to read these numbers. On 16 B / 100 B records LMDB publishes ~1 M on-disk inserts/sec (embedded, no network). shard-db sustains 2.34 M/sec single-connection (CSV) and 2.72 M/sec across 10 connections, over TCP with CSV parsing on the server. Single-connection → parallel ratio is now only 1.16–1.30× because the single-connection path is already fast enough that CPU-bound mmap writeback is the ceiling — adding connections mostly shortens the wall clock, not the cost per record.
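The throughput and speedup figures in the table can be re-derived from the wall-clock times alone. The sketch below does only that arithmetic; the helper name `throughput` is illustrative and not part of the bench scripts.

```python
# Re-derive throughput and parallel speedup from the scenario 2 timings.
RECORDS = 10_000_000

timings_s = {
    "single_json": 4.86,
    "single_csv": 4.27,
    "parallel_json": 3.73,
    "parallel_csv": 3.67,
}


def throughput(records: int, seconds: float) -> float:
    """Records per second for one run."""
    return records / seconds


rates = {name: throughput(RECORDS, t) for name, t in timings_s.items()}

# Speedup of the 10-connection run over the single-connection run.
json_speedup = rates["parallel_json"] / rates["single_json"]
csv_speedup = rates["parallel_csv"] / rates["single_csv"]

for name, rate in rates.items():
    print(f"{name:14s} {rate / 1e6:.2f} M/sec")
print(f"JSON speedup {json_speedup:.2f}x, CSV speedup {csv_speedup:.2f}x")
```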

3. Queries on 1M users

bench-queries.sh.

13 typed fields (varchar, int, long, short, double, bool, byte, date, datetime, numeric, currency). Indexes on username, email, age, active, birthday.

Operation class	Latency band
`count` metadata (no criteria)	3 ms (O(1) counter file)
`count` indexed eq / between / in / lt / gt / lte / gte	3–12 ms
`count` indexed `starts` / `exists`	3–21 ms
`count` indexed `contains` / `ends` / `ncontains` (leaf scan)	41–44 ms
`count` full-scan (non-indexed field)	7–10 ms (scan-path is lock-free; each shard runs concurrently)
`count` indexed + secondary filter	16–48 ms
`find` limit 10 — any indexed op	2–4 ms
`find` limit 10 — full scan on non-indexed	2–3 ms (Zone A probe + typed compare)
`find` indexed + secondary filter	2–3 ms
`aggregate count` (metadata)	3 ms
`aggregate` where indexed-eq	12–21 ms
`aggregate` where indexed range	54–114 ms
`aggregate` full-scan (sum/avg/min/max)	419 ms
`aggregate` group_by on full scan	221–279 ms
`aggregate` group_by + having	326 ms
`find` cursor page 1 (ASC/DESC, indexed `order_by`)	2–4 ms
`find` cursor continuation (mid-range seek)	2–3 ms

All 17 search operators use indexes when available. Full scans stay fast because Zone A (24-byte metadata headers) remains resident in the page cache and typed binary records in Zone B are compared without JSON parsing.

4. Invoice single-threaded — 1M records, 64 fields, 14 indexes

bench-invoice.sh 1000000 persistent, SPLITS=64.

Realistic wide-object schema (~1.9 KB/record). Composite indexes include irbmStatus+pdfSent, status+source, status+createdAt, status+invoiceDate.

Operation	Result
Bulk insert (no indexes)	117 k/sec (8.55 s)
Bulk insert (with 14 indexes)	90 k/sec (11.13 s) — 23 % index overhead
Add 14 indexes post-insert	2.85 s (per-shard parallel build — 14 × splits/4 workers)
GET ×1000 (pipelined)	42 k ops/sec (24 ms)
EXISTS ×1000 (pipelined)	48 k ops/sec (21 ms)
Indexed eq `find` (any of 14 indexes, limit 10)	5 ms
Indexed `contains` via leaf scan	5–15 ms
Indexed IN (2 values)	5 ms
Composite index eq / starts	4–5 ms
Indexed `range`	3 ms
Fetch page of 100 @ offset 5000	5 ms
Keys (first 100)	4 ms
Single DELETE ×1000 (with 14 indexes)	7.8 k/sec (129 ms) — 2.7× faster vs pre-2026.05.1
Bulk DELETE ×1000	77 k/sec (13 ms) — 16× faster vs pre-2026.05.1
VACUUM	8 ms
Disk footprint	1.6 GB

The delete speedups come from bulk_del_shard_worker and single_delete paths now going through the unified shard cache (ucache_get_write per shard). Pre-2026.05.1 they did per-call open + flock + mmap MAP_SHARED + munmap, paying full page-fault tax per request.
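The two access patterns can be illustrated outside the server. The sketch below is not shard-db code: `write_uncached`, `ShardCache`, and the one-byte flag written per slot are hypothetical stand-ins. Only the system-call pattern (open + flock + mmap + unmap per call, versus a mapping that stays open) mirrors the description above.

```python
import fcntl
import mmap
import os
import tempfile


def write_uncached(path: str, offset: int, value: int) -> None:
    """Per-call path: open, lock, map, write, unmap on every request."""
    fd = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        with mmap.mmap(fd, 0, flags=mmap.MAP_SHARED) as m:
            m[offset] = value
    finally:
        os.close(fd)  # closing the descriptor also releases the flock


class ShardCache:
    """Hypothetical cache: one long-lived writable mapping per shard file.

    Locking is omitted here for brevity; a real cache needs its own.
    """

    def __init__(self) -> None:
        self._maps = {}  # path -> (fd, mmap object)

    def get_write(self, path: str) -> mmap.mmap:
        entry = self._maps.get(path)
        if entry is None:
            fd = os.open(path, os.O_RDWR)
            entry = (fd, mmap.mmap(fd, 0, flags=mmap.MAP_SHARED))
            self._maps[path] = entry
        return entry[1]

    def close(self) -> None:
        for fd, m in self._maps.values():
            m.close()
            os.close(fd)
        self._maps.clear()


with tempfile.TemporaryDirectory() as tmp:
    shard = os.path.join(tmp, "shard.bin")
    with open(shard, "wb") as f:
        f.write(bytes(4096))          # one zeroed page standing in for a shard

    write_uncached(shard, 0, 1)       # pays open + flock + mmap + unmap

    cache = ShardCache()
    for slot in range(1, 4):          # later writes reuse the same mapping
        cache.get_write(shard)[slot] = 1
    cache.get_write(shard).flush()
    cache.close()

    with open(shard, "rb") as f:
        flags = f.read(4)

print(flags)
```

In the cached path the mapping is created once, so later writes touch pages that are already mapped instead of faulting them in again on each request.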

5. Invoice multi-threaded — 1M records, 64 fields, 14 indexes

bench-parallel.sh 1000000 200000 5, SPLITS=64.

Same schema, 5 connections × 200 k records each — the sweet spot for indexed bulk inserts at this scale (see chunk-size tuning below).

Scenario	Time	Throughput
Single JSON, 1M, no indexes	8.20 s	122 k/sec
Single CSV, 1M, no indexes	4.20 s	238 k/sec
Parallel JSON, 5 conns, no indexes	3.91 s	256 k/sec
Parallel JSON, 5 conns, pre-existing 14 indexes	6.21 s	161 k/sec
Parallel CSV, 5 conns, no indexes	3.62 s	276 k/sec
Parallel CSV, 5 conns, pre-existing 14 indexes	5.66 s	177 k/sec
Add 14 indexes after parallel bulk insert	~2.85 s	(per-shard parallel build)
Disk footprint (with 14 indexes)	1.7 GB

Indexed bulk-insert chunk-size tuning

The per-shard btree layout (2026.05.1+) makes indexed bulk-insert sensitive to the number of bulk-insert REQUESTS, because each request triggers a sequential bulk_merge cycle per (field, shard). Cumulative extract work scales O(R²) where R is request count. Measured on this 1M dataset:

Shape	Requests	TEST 3 (JSON+idx)	TEST 5 (CSV+idx)
10 conn × 100 k	10	7.34 s	7.04 s
5 conn × 200 k	5	6.21 s	5.66 s
5 conn × 100 k (queued)	10	8.35 s	6.42 s
1 × 1 M (single call)	1	11.13 s (single-thread)	—

Bigger chunks = fewer merge cycles. 5 connections × 200 k each is the sweet spot for 1M records. Connection count above 5 doesn't speed up phase-2 data writes meaningfully (the 8-core / 16-thread box doesn't saturate at 5 writers), and raising the request count increases phase-4 merge cost. Recommended for indexed bulk-insert: requests ≈ N / 200_000 rounded down, with 5 ≤ connections ≤ requests. For N = 1M that gives 5 requests on 5 connections.
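The rule of thumb can be written as a small planning helper. This is a sketch of the recommendation as stated, not a shard-db API: `plan_bulk_insert` is a hypothetical name, and the behaviour for datasets under 1M rows (where N / 200_000 falls below 5) is an assumption, since the recommendation does not cover that case.

```python
CHUNK = 200_000        # records per bulk-insert request, from the recommendation
MIN_CONNECTIONS = 5    # lower bound on connections, from the recommendation


def plan_bulk_insert(n_records: int) -> tuple[int, int]:
    """Return (requests, connections) for an indexed bulk insert."""
    requests = max(1, n_records // CHUNK)   # N / 200_000, rounded down
    # Assumption: with fewer than 5 requests, use one connection per request,
    # because a connection without a request to send would sit idle.
    connections = min(MIN_CONNECTIONS, requests)
    return requests, connections


for n in (1_000_000, 2_500_000, 400_000):
    requests, connections = plan_bulk_insert(n)
    print(f"{n:>9,} rows -> {requests} requests on {connections} connections")
```

The helper picks the low end of the allowed connection range because the measurements show little gain from going above 5 connections.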

Load-then-index now wins for static schemas

With the per-shard layout, post-hoc add-indexes parallelises 14× wider (14 fields × splits/4 shards = 14 × 16 = 224 workers from a single-pass scan), making it 25–30 % faster than pre-2026.05.1. At 1M × 14 indexes: load CSV (3.62 s) + add-indexes (2.85 s) = 6.47 s → 155 k/sec. Insert CSV with pre-existing indexes: 5.66 s → 177 k/sec. Pre-existing indexes still win on absolute throughput by ~14 %, but the gap has shrunk; load-then-index is preferred when feasible because the merge-into-existing path scales worse with parallelism.
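The comparison is plain arithmetic on the measured times, and the sketch below reproduces it. The variable names are illustrative; the timings, split count, and the splits/4 worker formula are the ones quoted in this section.

```python
RECORDS = 1_000_000
SPLITS = 64
INDEXES = 14

load_csv_s = 3.62          # parallel CSV, no indexes
add_indexes_s = 2.85       # post-hoc build of all 14 indexes
preindexed_csv_s = 5.66    # parallel CSV into 14 pre-existing indexes

# Path A: load first, then build the indexes.
load_then_index_s = load_csv_s + add_indexes_s
load_then_index_rate = RECORDS / load_then_index_s

# Path B: insert into indexes that already exist.
preindexed_rate = RECORDS / preindexed_csv_s

# Relative throughput advantage of path B over path A.
advantage = preindexed_rate / load_then_index_rate - 1

# Index-build workers: one per (field, index shard), with splits/4 index shards.
index_workers = INDEXES * (SPLITS // 4)

print(f"load-then-index: {load_then_index_s:.2f} s, "
      f"{load_then_index_rate / 1000:.0f} k/sec")
print(f"pre-existing:    {preindexed_csv_s:.2f} s, "
      f"{preindexed_rate / 1000:.0f} k/sec")
print(f"advantage: {advantage:.0%}, index workers: {index_workers}")
```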

Recommended: load-then-index for static schemas at 1M+ records; pre-existing indexes for streaming workloads.

Disk footprint

Per-shard btree layout adds ~25 % to indexed-object disk usage vs pre-2026.05.1 (1.3 → 1.6–1.7 GB on the invoice schema). Sources:

Each btree starts at 2 × bt_page_size = 8 KB. With 14 indexes × 16 idx shards = 224 trees minimum, that's ~1.8 MB of header overhead before any data (vs 14 × 8 KB = 112 KB for the old single-tree layout).
Reduced prefix-compression effectiveness: each leaf page has 1/16 the entries to share prefixes with, so per-entry compression savings drop ~15–25 %.
Page-allocation rounding: each btree's pages are 4 KB; trailing slack accumulates across 16× more trees.

Real space cost on production datasets typically lands at +20–30 % vs the legacy layout.
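The fixed header overhead quoted in the first source can be checked with a few lines of arithmetic. The constant names below are illustrative; the values (4 KB pages, 2 initial pages per tree, 14 indexes, 16 index shards) are the ones stated in this section.

```python
BT_PAGE_SIZE = 4 * 1024   # 4 KB btree pages
INITIAL_PAGES = 2         # each btree starts at 2 pages
INDEXES = 14
IDX_SHARDS = 16           # index shards per field (splits/4 at SPLITS=64)

tree_bytes = INITIAL_PAGES * BT_PAGE_SIZE

# Per-shard layout: one tree per (index, index shard).
per_shard_trees = INDEXES * IDX_SHARDS
per_shard_overhead = per_shard_trees * tree_bytes

# Legacy layout: one tree per index.
legacy_overhead = INDEXES * tree_bytes

print(f"per-shard: {per_shard_trees} trees, "
      f"{per_shard_overhead / 1e6:.1f} MB before any data")
print(f"legacy:    {INDEXES} trees, {legacy_overhead // 1024} KB")
```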

Notes

File-descriptor limit. At SPLITS ≥ 512, ucache_grow_shard briefly holds 2 fds per shard during migration, so peak can hit ~8,256 fds at the default FCACHE_MAX=4096. The server auto-raises its soft limit to the hard limit at startup (no privilege needed); if the hard limit itself is too low (shells default to 1024 on many distros), the startup WARN tells you exactly what to put in /etc/security/limits.conf or as LimitNOFILE= in a systemd unit.
CSV vs JSON. CSV bulk insert is faster because the CSV path parses directly against the mmap'd file via (ptr, len) spans with zero per-line memcpy, while the JSON path materializes a JsonObj per record.
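The soft-limit raise described in the file-descriptor note can be tried from any unprivileged process. The sketch below uses Python's standard `resource` module rather than the server's own startup code, and `raise_nofile_soft_limit` is a hypothetical helper name. It assumes a Linux host.

```python
import resource

PEAK_FDS = 8_256   # peak quoted above for the default FCACHE_MAX=4096


def raise_nofile_soft_limit() -> tuple[int, int]:
    """Raise the soft RLIMIT_NOFILE to the hard limit; return (soft, hard)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        # Raising soft up to hard needs no privilege; raising hard does.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_nofile_soft_limit()
if hard != resource.RLIM_INFINITY and hard < PEAK_FDS:
    print(f"hard limit {hard} is below {PEAK_FDS}: raise nofile in "
          "/etc/security/limits.conf or set LimitNOFILE= in the systemd unit")
else:
    print(f"soft limit is now {soft}")
```

Running it shows whether the hard limit on a given host already covers the quoted peak or needs to be raised by an administrator.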

Splits tuning

Size splits to keep records-per-shard in the 78K–200K sweet spot (acceptable up to ~500K, degradation past ~1M). create-object defaults to splits=16 when omitted — fine for test/demo loads, too low for anything above a few million rows. Pick from expected row count:

Expected rows	Recommended `splits`	Records/shard at target
< 1M	8–32	up to ~125K
1–10M	64	~16K–156K
10–25M	128	~78K–195K (optimal band)
25–50M	256	~98K–195K
50–100M	512	~98K–195K
100–250M	512	~195K–488K (acceptable)
250–500M	1024	~244K–488K
500M–1B	2048	~244K–488K
1–4B	4096 (MAX_SPLITS)	~244K–976K (at limit)

The timings behind this table come from the parallel K/V bench on 10M rows (128 splits fastest at 3.488 s; 64 splits 3.605 s; 256 splits 3.986 s; 1024 splits 5.454 s). Counter-intuitively, raising splits beyond the sweet spot slows things down even at 10 parallel connections: more shard files mean more syscalls and mmap page faults per query, and shard-lock contention isn't the bottleneck at this scale. If you exceed ~1M records/shard you've saturated this design; split across multiple objects (or tenant dirs) rather than climbing past MAX_SPLITS=4096.
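The sizing table can be turned into a lookup. The sketch below is not part of shard-db: `recommended_splits` and `records_per_shard` are hypothetical helpers. Two choices are assumptions: a row count exactly on a band boundary takes the higher band (matching the 10M benchmark, which runs at 128 splits), and tables under 1M rows get 16, the `create-object` default, from the 8–32 range.

```python
MAX_SPLITS = 4096

# (exclusive upper bound on expected rows, recommended splits)
SIZING = [
    (1_000_000, 16),          # assumption: default, picked from the 8-32 range
    (10_000_000, 64),
    (25_000_000, 128),
    (50_000_000, 256),
    (250_000_000, 512),       # covers both the 50-100M and 100-250M rows
    (500_000_000, 1024),
    (1_000_000_000, 2048),
]


def recommended_splits(expected_rows: int) -> int:
    """Pick a splits value from the sizing table for an expected row count."""
    for upper, splits in SIZING:
        if expected_rows < upper:
            return splits
    return MAX_SPLITS


def records_per_shard(rows: int, splits: int) -> int:
    """Average records per shard, assuming an even spread across shards."""
    return rows // splits


for rows in (1_000_000, 10_000_000, 100_000_000, 2_000_000_000):
    splits = recommended_splits(rows)
    print(f"{rows:>13,} rows -> splits={splits:<5} "
          f"~{records_per_shard(rows, splits):,} records/shard")
```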

Reproduce

./bench/bench-kv.sh 10000000                          # scenario 1 (default SPLITS=128)
./bench/bench-kv-parallel.sh 10000000 1000000 10      # scenario 2 (default SPLITS=128)
./bench/create-user-object.sh && \
  ./bench/insert-users.sh 1000000 && \
  ./bench/bench-queries.sh                            # scenario 3
./bench/bench-invoice.sh 1000000 persistent           # scenario 4 (default SPLITS=64)
./bench/bench-parallel.sh 1000000 200000 5            # scenario 5 (default SPLITS=64)

Scripts self-resolve to the repo root regardless of CWD and start/stop the server automatically. All scripts honour SPLITS=N to override their per-script default (128 for the K/V scripts, 64 for the invoice scripts — matched to the splits sizing table for each record count).