Benchmarks
Five canonical workloads on AMD Ryzen 7 7840U (8C / 16T) · 32 GB · NVMe ext4 · Linux 6.19 · gcc 15.2 -O2. Each scenario is a standalone script in bench/. All numbers are from end-to-end runs with the server over TCP — request parse, auth, encode, disk write, ACK are all in the measurement. Nothing is bypassed.
1. K/V single-threaded — 10M records
bench-kv.sh 10000000, SPLITS=128.
Schema: 16-byte hex key, one varchar(100) value — the same record shape used by LMDB / LevelDB / RocksDB db_bench so numbers compare directly. Unlike those embedded libraries, every request below crosses a TCP socket and goes through JSON/CSV parsing on the server.
| Operation | Throughput / Latency |
|---|---|
| Bulk insert (JSON, 10M in one request) | 1.97 M inserts/sec (5.08 s) |
| Bulk insert (CSV, 10M in one request) | 2.39 M inserts/sec (4.19 s) |
| GET ×10,000 (pipelined, 1 conn) | 22.1 k ops/sec (453 ms) |
| EXISTS ×10,000 hits (pipelined) | 22.8 k ops/sec (439 ms) |
| EXISTS ×10,000 all-miss (cold probe) | 66.7 k ops/sec (150 ms) |
| UPDATE ×10,000 (pipelined) | 17.8 k ops/sec (561 ms) |
| DELETE ×10,000 (pipelined) | 9.05 k ops/sec (1.11 s) |
| Parallel GET (5 conns × 10k) | 91.2 k ops/sec (548 ms) |
| Parallel UPDATE (5 conns × 10k) | 62.9 k ops/sec (795 ms) |
| Disk footprint | 2.3 GB |
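Each throughput figure in the table above is the operation count divided by wall-clock time. A minimal sketch of that conversion, assuming a hypothetical helper named `kops_per_sec` (it is not part of bench/) and times read off the table in milliseconds:

```shell
# kops_per_sec COUNT MILLIS
# Hypothetical helper, not part of bench/: converts an operation count and a
# wall-clock time in milliseconds into thousands of operations per second.
# Operations per millisecond is numerically the same as k ops/sec.
kops_per_sec() {
    awk -v n="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", n / ms }'
}

kops_per_sec 10000 453    # GET x10,000 in 453 ms
kops_per_sec 10000 150    # EXISTS x10,000 all-miss in 150 ms
kops_per_sec 50000 548    # parallel GET, 5 conns x 10k in 548 ms
```

The three calls print 22.1, 66.7 and 91.2, which match the GET, all-miss EXISTS and parallel GET rows.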
2. K/V multi-threaded — 10M records, scaling across connections
bench-kv-parallel.sh 10000000 1000000 10, SPLITS=128.
Same schema, bulk insert fanned out across TCP connections. SPLITS=128 is the sweet spot for 10M rows (≈78K records/shard — see the splits sizing table); going to 1024 at this scale slows the benchmark.
| Scenario | Time | Throughput |
|---|---|---|
| Single JSON, 10M | 4.86 s | 2.06 M/sec |
| Single CSV, 10M | 4.27 s | 2.34 M/sec |
| Parallel JSON, 10 conns × 1M | 3.73 s | 2.68 M/sec (1.30× single) |
| Parallel CSV, 10 conns × 1M | 3.67 s | 2.72 M/sec (1.16× single) |
Shard load distribution (128 splits): avg 0.596, records stddev 1.6 %, 9 grows per shard.
How to read these numbers. On 16 B / 100 B records LMDB publishes ~1 M on-disk inserts/sec (embedded, no network). shard-db sustains 2.34 M/sec single-connection (CSV) and 2.72 M/sec across 10 connections, over TCP with CSV parsing on the server. Single-connection → parallel ratio is now only 1.16–1.30× because the single-connection path is already fast enough that CPU-bound mmap writeback is the ceiling — adding connections mostly shortens the wall clock, not the cost per record.
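The single-connection to parallel ratios quoted above can be rechecked from the wall-clock times in the scenario table. A small sketch, assuming a hypothetical `speedup` helper that is not shipped with the bench scripts:

```shell
# speedup SINGLE_SECONDS PARALLEL_SECONDS
# Hypothetical helper: for the same record count, the throughput ratio equals
# the inverse ratio of the wall-clock times.
speedup() {
    awk -v single="$1" -v parallel="$2" 'BEGIN { printf "%.2f\n", single / parallel }'
}

speedup 4.86 3.73    # JSON: single 4.86 s vs 10 conns 3.73 s
speedup 4.27 3.67    # CSV:  single 4.27 s vs 10 conns 3.67 s
```

This prints 1.30 for JSON and 1.16 for CSV, the same ratios shown in the table.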
3. Queries on 1M users
bench-queries.sh.
13 typed fields (varchar, int, long, short, double, bool, byte, date, datetime, numeric, currency). Indexes on username, email, age, active, birthday.
| Operation class | Latency band |
|---|---|
| count metadata (no criteria) | 3 ms (O(1) counter file) |
| count indexed eq / between / in / lt / gt / lte / gte | 3–12 ms |
| count indexed starts / exists | 3–21 ms |
| count indexed contains / ends / ncontains (leaf scan) | 41–44 ms |
| count full-scan (non-indexed field) | 7–10 ms (scan-path is lock-free; each shard runs concurrently) |
| count indexed + secondary filter | 16–48 ms |
| find limit 10 — any indexed op | 2–4 ms |
| find limit 10 — full scan on non-indexed | 2–3 ms (Zone A probe + typed compare) |
| find indexed + secondary filter | 2–3 ms |
| aggregate count (metadata) | 3 ms |
| aggregate where indexed-eq | 12–21 ms |
| aggregate where indexed range | 54–114 ms |
| aggregate full-scan (sum/avg/min/max) | 419 ms |
| aggregate group_by on full scan | 221–279 ms |
| aggregate group_by + having | 326 ms |
| find cursor page 1 (ASC/DESC, indexed order_by) | 2–4 ms |
| find cursor continuation (mid-range seek) | 2–3 ms |
All 17 search operators use indexes when available. Full scans stay fast because Zone A (24-byte metadata headers) remains resident in the page cache and typed binary records in Zone B are compared without JSON parsing.
4. Invoice single-threaded — 1M records, 64 fields, 14 indexes
bench-invoice.sh 1000000 persistent, SPLITS=64.
Realistic wide-object schema (~1.9 KB/record). Composite indexes include irbmStatus+pdfSent, status+source, status+createdAt, status+invoiceDate.
| Operation | Result |
|---|---|
| Bulk insert (no indexes) | 117 k/sec (8.55 s) |
| Bulk insert (with 14 indexes) | 90 k/sec (11.13 s) — 23 % index overhead |
| Add 14 indexes post-insert | 2.85 s (per-shard parallel build — 14 × splits/4 workers) |
| GET ×1000 (pipelined) | 42 k ops/sec (24 ms) |
| EXISTS ×1000 (pipelined) | 48 k ops/sec (21 ms) |
| Indexed eq find (any of 14 indexes, limit 10) | 5 ms |
| Indexed contains via leaf scan | 5–15 ms |
| Indexed IN (2 values) | 5 ms |
| Composite index eq / starts | 4–5 ms |
| Indexed range | 3 ms |
| Fetch page of 100 @ offset 5000 | 5 ms |
| Keys (first 100) | 4 ms |
| Single DELETE ×1000 (with 14 indexes) | 7.8 k/sec (129 ms) — 2.7× faster vs pre-2026.05.1 |
| Bulk DELETE ×1000 | 77 k/sec (13 ms) — 16× faster vs pre-2026.05.1 |
| VACUUM | 8 ms |
| Disk footprint | 1.6 GB |
The delete speedups come from bulk_del_shard_worker and single_delete paths now going through the unified shard cache (ucache_get_write per shard). Pre-2026.05.1 they did per-call open + flock + mmap MAP_SHARED + munmap, paying full page-fault tax per request.
5. Invoice multi-threaded — 1M records, 64 fields, 14 indexes
bench-parallel.sh 1000000 200000 5, SPLITS=64.
Same schema, 5 connections × 200 k records each — the sweet spot for indexed bulk inserts at this scale (see chunk-size tuning below).
| Scenario | Time | Throughput |
|---|---|---|
| Single JSON, 1M, no indexes | 8.20 s | 122 k/sec |
| Single CSV, 1M, no indexes | 4.20 s | 238 k/sec |
| Parallel JSON, 5 conns, no indexes | 3.91 s | 256 k/sec |
| Parallel JSON, 5 conns, pre-existing 14 indexes | 6.21 s | 161 k/sec |
| Parallel CSV, 5 conns, no indexes | 3.62 s | 276 k/sec |
| Parallel CSV, 5 conns, pre-existing 14 indexes | 5.66 s | 177 k/sec |
| Add 14 indexes after parallel bulk insert | ~2.85 s | (per-shard parallel build) |
| Disk footprint (with 14 indexes) | 1.7 GB |
Indexed bulk-insert chunk-size tuning
The per-shard btree layout (2026.05.1+) makes indexed bulk-insert sensitive to the number of bulk-insert REQUESTS, because each request triggers a sequential bulk_merge cycle per (field, shard). Cumulative extract work scales O(R²) where R is request count. Measured on this 1M dataset:
| Shape | Requests | TEST 3 (JSON+idx) | TEST 5 (CSV+idx) |
|---|---|---|---|
| 10 conn × 100 k | 10 | 7.34 s | 7.04 s |
| 5 conn × 200 k | 5 | 6.21 s | 5.66 s |
| 5 conn × 100 k (queued) | 10 | 8.35 s | 6.42 s |
| 1 × 1 M (single call) | 1 | 11.13 s (single-thread) | — |
Bigger chunks = fewer merge cycles. 5 connections × 200 k each is the sweet spot for 1M records. Going above 5 connections doesn't meaningfully speed up phase-2 data writes (the 8-core / 16-thread box doesn't saturate at 5 writers), and raising the chunk count increases phase-4 merge cost. Recommended for indexed bulk-insert: requests ≈ N / 200_000 rounded down, with 5 ≤ connections ≤ requests.
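The sizing rule above can be sketched as a small planner. The function name `plan_bulk_insert` is hypothetical. The stated bound on connections cannot be met when fewer than five requests are needed, so this sketch assumes a cap of five connections and at least one request. That is an interpretation of the text, not a rule taken from the project.

```shell
# plan_bulk_insert TOTAL_RECORDS
# Hypothetical planner for indexed bulk inserts.
plan_bulk_insert() {
    local total=$1 chunk=200000 max_conns=5
    # requests ~= N / 200_000, rounded down (integer division).
    local requests=$(( total / chunk ))
    # Assumption: loads smaller than one chunk still need one request.
    if [ "$requests" -lt 1 ]; then requests=1; fi
    # Assumption: never open more connections than requests, and cap at 5
    # because more than 5 writers did not help in the measurements above.
    local conns=$requests
    if [ "$conns" -gt "$max_conns" ]; then conns=$max_conns; fi
    echo "requests=$requests connections=$conns"
}

plan_bulk_insert 1000000     # the 1M invoice dataset
plan_bulk_insert 10000000    # a 10M load
```

For the 1M dataset this prints `requests=5 connections=5`, the 5 × 200 k shape measured as fastest in the table. For 10M it prints `requests=50 connections=5`.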
Load-then-index now wins for static schemas
With the per-shard layout, post-hoc add-indexes parallelises 14× wider (14 fields × splits/4 shards = 14 × 16 = 224 workers from a single-pass scan), making it 25–30 % faster than pre-2026.05.1. At 1M × 14 indexes: parallel CSV load (3.62 s) + add-indexes (2.85 s) = 6.47 s → 155 k/sec. Parallel CSV insert with pre-existing indexes: 5.66 s → 177 k/sec. Pre-existing indexes still win on absolute throughput by ~14 %, but the gap has shrunk; load-then-index is preferred when feasible because the merge-into-existing path scales worse with parallelism.
Recommended: load-then-index for static schemas at 1M+ records; pre-existing indexes for streaming workloads.
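The comparison above is plain arithmetic on the measured times. A minimal sketch that reproduces it, assuming a hypothetical `compare_paths` helper (not part of bench/) fed with the numbers from the scenario 5 table:

```shell
# compare_paths RECORDS LOAD_SECONDS ADD_INDEX_SECONDS PREINDEXED_SECONDS
# Hypothetical helper: compares load-then-index against inserting into
# pre-existing indexes, using wall-clock times measured elsewhere.
compare_paths() {
    awk -v n="$1" -v load_s="$2" -v index_s="$3" -v pre_s="$4" 'BEGIN {
        two_step = load_s + index_s
        printf "load-then-index: %.2f s, %.0f k/sec\n", two_step, n / two_step / 1000
        printf "pre-existing: %.2f s, %.0f k/sec\n", pre_s, n / pre_s / 1000
        # How much faster the pre-existing-index path is, in percent.
        printf "gap: %.0f %%\n", (two_step / pre_s - 1) * 100
    }'
}

compare_paths 1000000 3.62 2.85 5.66
```

With the measured inputs this prints 6.47 s and 155 k/sec for load-then-index, 5.66 s and 177 k/sec for pre-existing indexes, and a gap of 14 %.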
Disk footprint
Per-shard btree layout adds ~25 % to indexed-object disk usage vs pre-2026.05.1 (1.3 → 1.6–1.7 GB on the invoice schema). Sources:
- Each btree starts at 2 × bt_page_size = 8 KB. With 14 indexes × 16 idx shards = 224 trees minimum, that's ~1.8 MB of header overhead before any data (vs 14 × 8 KB = 112 KB for the old single-tree layout).
- Reduced prefix-compression effectiveness: each leaf page has 1/16 the entries to share prefixes with, so per-entry compression savings drop ~15–25 %.
- Page-allocation rounding: each btree's pages are 4 KB; trailing slack accumulates across 16× more trees.
Real space cost on production datasets typically lands at +20–30 % vs the legacy layout.
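The fixed header overhead in the first bullet can be recomputed for other index counts. A minimal sketch, assuming a hypothetical helper named `btree_header_kb` and the 4 KB page size quoted above:

```shell
# btree_header_kb INDEXES IDX_SHARDS
# Hypothetical helper: minimum on-disk size in KB of empty btrees, assuming
# each tree starts at 2 pages of 4 KB (2 x bt_page_size = 8 KB).
btree_header_kb() {
    local indexes=$1 idx_shards=$2 page_kb=4
    echo $(( indexes * idx_shards * 2 * page_kb ))
}

btree_header_kb 14 16    # per-shard layout: 224 trees
btree_header_kb 14 1     # old single-tree layout: 14 trees
```

This prints 1792 (about 1.8 MB) for the per-shard layout and 112 for the old layout. Both are small next to the measured 1.6–1.7 GB footprint, which suggests most of the growth comes from the other two sources.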
Notes
- File-descriptor limit. At SPLITS ≥ 512, ucache_grow_shard briefly holds 2 fds per shard during migration, so peak can hit ~8,256 fds at the default FCACHE_MAX=4096. The server auto-raises its soft limit to the hard limit at startup (no privilege needed); if the hard limit itself is too low (shells default to 1024 on many distros), the startup WARN tells you exactly what to put in /etc/security/limits.conf or as LimitNOFILE= in a systemd unit.
- CSV vs JSON. CSV bulk insert is faster because the CSV path parses directly against the mmap'd file via (ptr, len) spans with zero per-line memcpy, while the JSON path materializes a JsonObj per record.
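For the file-descriptor note, the hard limit can be checked from the shell before starting the server. A minimal sketch, assuming a hypothetical `fd_limit_ok` helper and taking 8256 as the peak from the figure quoted above. The server's own startup WARN remains the authoritative source for the exact value to configure.

```shell
# fd_limit_ok HARD_LIMIT NEEDED
# Hypothetical check: succeeds when the hard open-files limit is "unlimited"
# or at least NEEDED.
fd_limit_ok() {
    local hard=$1 needed=$2
    [ "$hard" = "unlimited" ] || [ "$hard" -ge "$needed" ]
}

# `ulimit -Hn` reports the hard limit on open files for the current shell.
if fd_limit_ok "$(ulimit -Hn)" 8256; then
    echo "hard nofile limit is high enough for SPLITS >= 512"
else
    echo "hard nofile limit is below 8256; raise it in limits.conf or via LimitNOFILE="
fi
```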
Splits tuning
Size splits to keep records-per-shard in the 78K–200K sweet spot (acceptable up to ~500K, degradation past ~1M). create-object defaults to splits=16 when omitted — fine for test/demo loads, too low for anything above a few million rows. Pick from expected row count:
| Expected rows | Recommended splits | Records/shard at target |
|---|---|---|
| < 1M | 8–32 | up to ~125K |
| 1–10M | 64 | ~16K–156K |
| 10–25M | 128 | ~78K–195K (optimal band) |
| 25–50M | 256 | ~98K–195K |
| 50–100M | 512 | ~98K–195K |
| 100–250M | 512 | ~200K–488K (acceptable) |
| 250–500M | 1024 | ~244K–488K |
| 500M–1B | 2048 | ~244K–488K |
| 1–4B | 4096 (MAX_SPLITS) | ~244K–976K (at limit) |
Numbers are from the parallel K/V bench on 10M rows (128 splits fastest at 3.488s; 64 splits 3.605s; 256 splits 3.986s; 1024 splits 5.454s). Counter-intuitively, raising splits beyond the sweet spot slows things down even at 10 parallel connections — more shard files = more syscalls and mmap page faults per query, and shard-lock contention isn't the bottleneck at this scale. If you exceed ~1M records/shard you've saturated this design — split across multiple objects (or tenant dirs) rather than climbing past MAX_SPLITS=4096.
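The sizing table can be turned into a lookup. This is a sketch under stated assumptions: the function names are hypothetical, range boundaries are treated as lower-inclusive (so 10M maps to 128, as in scenario 2), and for fewer than 1M rows it returns the create-object default of 16, which sits inside the 8–32 band.

```shell
# recommend_splits EXPECTED_ROWS
# Hypothetical lookup mirroring the sizing table; lower bounds are inclusive.
recommend_splits() {
    local rows=$1
    if   [ "$rows" -lt 1000000 ];    then echo 16    # assumption: default, within 8-32
    elif [ "$rows" -lt 10000000 ];   then echo 64
    elif [ "$rows" -lt 25000000 ];   then echo 128
    elif [ "$rows" -lt 50000000 ];   then echo 256
    elif [ "$rows" -lt 250000000 ];  then echo 512
    elif [ "$rows" -lt 500000000 ];  then echo 1024
    elif [ "$rows" -lt 1000000000 ]; then echo 2048
    else echo 4096                                   # MAX_SPLITS
    fi
}

# records_per_shard ROWS SPLITS
records_per_shard() {
    echo $(( $1 / $2 ))
}

splits=$(recommend_splits 10000000)
echo "splits=$splits records_per_shard=$(records_per_shard 10000000 "$splits")"
```

For 10M rows this prints `splits=128 records_per_shard=78125`, the low end of the 78K–200K band.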
Reproduce
./bench/bench-kv.sh 10000000 # scenario 1 (default SPLITS=128)
./bench/bench-kv-parallel.sh 10000000 1000000 10 # scenario 2 (default SPLITS=128)
./bench/create-user-object.sh && \
./bench/insert-users.sh 1000000 && \
./bench/bench-queries.sh # scenario 3
./bench/bench-invoice.sh 1000000 persistent # scenario 4 (default SPLITS=64)
./bench/bench-parallel.sh 1000000 200000 5          # scenario 5 (default SPLITS=64)
Scripts self-resolve to the repo root regardless of CWD and start/stop the server automatically. All scripts honour SPLITS=N to override their per-script default (128 for the K/V scripts, 64 for the invoice scripts — matched to the splits sizing table for each record count).
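Because the scripts honour SPLITS=N, a splits comparison like the one quoted under Splits tuning can be rerun with a short loop. A minimal sketch, assuming a hypothetical wrapper named `sweep_splits`. It only prints the commands unless RUN=1 is set, so it is safe to try outside the repo.

```shell
# sweep_splits SPLITS...
# Hypothetical wrapper: runs (or, by default, just prints) the parallel K/V
# bench once per SPLITS value. Set RUN=1 to execute from the repo root.
sweep_splits() {
    local s
    for s in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            env SPLITS="$s" ./bench/bench-kv-parallel.sh 10000000 1000000 10
        else
            echo "SPLITS=$s ./bench/bench-kv-parallel.sh 10000000 1000000 10"
        fi
    done
}

sweep_splits 64 128 256 1024
```

By default the loop prints four command lines, one per SPLITS value. To run the benchmarks, call it as `RUN=1 sweep_splits 64 128 256 1024`.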