Benchmarks¶
Five canonical workloads on AMD Ryzen 7 7840U (8C / 16T) · 32 GB · NVMe ext4 · Linux 6.19 · gcc 15.2 -O2. Each scenario is a C-level bench in src/bench/ invoked via ./build/bin/shard-db-bench run <name>. All numbers are from end-to-end runs with the server over TCP — request parse, auth, encode, disk write, ACK are all in the measurement. Nothing is bypassed.
Reads are strict request-response. The single-record read benches (GET / EXISTS / UPDATE / DELETE) wait for each response before sending the next request. Real-world clients that pipeline requests on the wire (multiple in flight at once) will see meaningfully higher per-connection throughput. Treat the per-op latencies as honest single-request floors and the throughputs as the lower-bound for clients that don't pipeline.
1. K/V single-threaded — 10M records¶
shard-db-bench run bench-kv with SHARD_BENCH_COUNT=10000000, SPLITS=128.
Schema: 16-byte hex key, one varchar(100) value — the same record shape used by LMDB / LevelDB / RocksDB db_bench so numbers compare directly. Unlike those embedded libraries, every request below crosses a TCP socket and goes through JSON/CSV parsing on the server.
| Operation | Throughput / Latency |
|---|---|
| Bulk insert (JSON, 10M in one request) | 2.69 M inserts/sec (3.72 s) |
| Bulk insert (CSV, 10M in one request) | 3.98 M inserts/sec (2.51 s) |
| Bulk EXISTS (10K keys / request) | 4.10 M ops/sec (2.4 ms) |
| Bulk DELETE (10K keys / request) | 1.76 M ops/sec (5.7 ms) |
| Bulk GET (10K keys / request) | 1.49 M ops/sec (6.7 ms) |
| Bulk UPDATE (10K keys / request) | 989 k ops/sec (10.1 ms) |
| GET ×10,000 (req-resp, 1 conn) | 34 k ops/sec (p50 28µs) |
| EXISTS ×10,000 hits (req-resp) | 34 k ops/sec (p50 28µs) |
| EXISTS ×10,000 all-miss (cold probe) | 36 k ops/sec (p50 26µs) |
| UPDATE ×10,000 (req-resp) | 27 k ops/sec (p50 32µs) |
| DELETE ×10,000 (req-resp) | 24 k ops/sec (p50 39µs) |
| Parallel GET (5 conns × 10k) | 154 k ops/sec (323 ms) |
| Parallel UPDATE (5 conns × 10k) | 131 k ops/sec (381 ms) |
| Disk footprint | 2.2 GB |
Bulk multi-key paths are 30-170× faster than single-conn. The single-conn req-resp ceiling is ~30 µs/op (dominated by TCP framing + JSON parse/encode). One bulk request handling 10K keys at a time hits 1.3M+/sec on every multi-key op — bulk_lookup_in_kfshard / bulk_get_in_kfshard / bulk_upsert_in_kfshard / bulk_delete_in_kfshard each acquire one kf wrlock per worker (vs per-record), batch seg I/O sorted by (stream_id, file_id), and use parallel_for fan-out across kf shards. Use bulk wire shapes ({"mode":"get","keys":[...]}, {"mode":"bulk-delete","keys":[...]}) when you have multi-key workloads.
2. K/V multi-threaded — 10M records, scaling across connections¶
shard-db-bench run bench-kv-parallel with SHARD_BENCH_TOTAL=10000000, SHARD_BENCH_CHUNK=2000000, SPLITS=128.
Same schema, bulk insert fanned out across TCP connections. SPLITS=128 is the sweet spot for 10M rows (≈78K records/shard — see the splits sizing table); going to 1024 at this scale slows the benchmark.
| Scenario | Time | Throughput |
|---|---|---|
| Single JSON, 10M | 3.26 s | 3.07 M/sec |
| Single CSV, 10M | 2.17 s | 4.60 M/sec |
| Parallel JSON, 5 conns × 2M | 1.33 s | 7.49 M/sec (2.44× single) |
| Parallel CSV, 5 conns × 2M | 1.12 s | 8.97 M/sec (1.95× single) ← fastest |
Shard load distribution (128 splits): avg 0.298 (kf), 78K records/shard, even distribution (records stddev <1 %).
How to read these numbers. On 16 B / 100 B records LMDB publishes ~1 M on-disk inserts/sec (embedded, no network). shard-db sustains 4.60 M/sec single-connection (CSV) and scales to 8.97 M/sec at 5 connections × 2 M records each, all over TCP with CSV parsing on the server. Parallel keeps a 1.9–2.4× edge over single-conn at this scale — multiple connections amortize per-call pipeline tail (parse, bucket, dispatch) against wider write fan-out. Use single-connection for operational simplicity, parallel for headline throughput.
Pre-grow (2026.05.x): when bulk-insert receives a batch, the dispatcher reads each shard's current slots_per_shard and live record count, computes the smallest power-of-2 that holds (live + incoming), and grows each shard to that target once before workers start. Pre-grows run in parallel via the worker pool. The previous behaviour (worker grew its shard every time it overflowed during the insert) caused 9 incremental grows per shard at SPLITS=128 / 10M records, each rebucketing the existing data; now that's 1 pre-grow per shard with zero rebucket (the shards are empty when pre-grow fires for a fresh load).
3. Queries on 1M users¶
shard-db-bench run bench-queries.
13 typed fields (varchar, int, long, short, double, bool, byte, date, datetime, numeric, currency). 12 indexes (username, email, age, active, birthday, user_id, rank, score, level, created_at, balance, hourly_rate); bio left non-indexed to exercise full-scan paths. C-bench measurements; cache is hot from the same-process bulk insert.
active is a bool field — since 2026.05.7 it auto-defaults to a bitmap index (one dense bit per slot, per-distinct-value). All other typed fields fall back to btree. The planner routes eq/IN/NEQ/NOT_IN through bitmap popcount and every other op through a per-shard dict-scan (≤ 256 dict entries per shard) so any operator goes through the index — never the data file.
Insert (1M records, 1 conn, JSON, 12 indexes built upfront): 0.24 M/sec (4.12 s).
COUNT — point lookups & ranges by field type¶
eq cost scales with match cardinality — the planner walks one btree
leaf entry per match. Point lookups (1 match) are sub-millisecond; large
match sets (100K+ records) pay btree-leaf-walk cost (~10-25 ns per entry,
parallel across idx shards). Cold-cache first reads add 1-3 ms one-time
disk-load tax per indexed field.
| op | matches | warm | cold (first query against this index) |
|---|---|---|---|
eq point lookup (e.g. user_id=N) |
1 | 0.07-0.21 ms | same |
eq mid-cardinality (e.g. age=30, ~17K) |
10K-100K | 0.2-0.5 ms | 0.5-1.5 ms |
eq low-cardinality (e.g. active=false, ~143K) |
>100K | 0.5-2 ms | 2-4 ms |
eq non-indexed (full scan) |
— | 46 ms | 50 ms |
neq (negation shortcut: total − count(eq)) |
inverted | 0.1-2 ms | 0.5-3 ms |
Range single-bound (gt / lt / gte / lte) |
varies | 0.4-9 ms | 1-12 ms |
Range coalesced (gt + lt on same field) |
varies | 0.5-6 ms (planner merges into one btree range) | 1-8 ms |
between |
varies | 1.6-4 ms | 2-6 ms |
in / not_in (2-4 values) |
small set | 0.08-1.0 ms | 0.1-1.5 ms |
Note: There is no count shortcut for low-cardinality eq (e.g.
active=false) — every match increments a TLS-batched counter. For
fields where eq matches >100K records, the cost is the btree leaf
walk, not a metadata lookup.
COUNT — string ops on indexed varchar¶
Substring/suffix ops on indexed varchar walk the btree leaves once (no record fetch — the value lives in the leaf), so they're 10–20× faster than the same op on a non-indexed field even though both visit O(N) entries.
| op | indexed varchar | non-indexed varchar (bio) |
|---|---|---|
starts |
0.3–0.5 ms (prefix range scan) | 42–46 ms (full record scan) |
ends |
21 ms (full leaf scan, leaf-byte suffix match) | 42 ms |
contains |
17–23 ms (full leaf scan, leaf-byte memmem) | 52–54 ms |
not_contains |
17 ms | 58 ms |
like 'foo%' |
1.1 ms (prefix range) | — |
like '%foo%' |
— | 57 ms |
Case-insensitive variants (istarts / icontains / iends / ilike) |
12–29 ms | 51 ms |
regex |
28–42 ms | 63–175 ms |
len_* (length operators) |
12–18 ms | 49–64 ms |
The non-indexed full-scan column tops out at ~50–60 ms because the v2 record walk now runs parallel across all 64 kf shards (one worker per shard via parallel_for), and within each shard refs are pre-sorted by (stream_id, file_id) so each segment is acquired once per shard instead of per record. That replaces ~1M g_segcache_lock acquires with ~320 (5–6 unique segments × 64 shards), turning what used to be 330 ms into 50 ms.
COUNT — existence / OR / field-vs-field¶
| op | latency |
|---|---|
exists typed non-varchar (always true → live_count shortcut) |
0.03–0.09 ms |
exists indexed varchar (btree size shortcut) |
9–11 ms |
not_exists indexed varchar (= live_count − count(exists)) |
11 ms |
exists non-indexed varchar |
67 ms (full scan) |
OR 2 leaves (same field, eq) |
45 ms |
OR 5 leaves |
105 ms (per-leaf btree walk + KeySet union, parallel across leaves but writes contend on shared KeySet) |
OR cross-field (one selective + one ~50% match) |
170 ms (dominated by the 500K KeySet inserts) |
eq_field / neq_field / lt_field / gt_field family |
33–44 ms (always full scan — RHS is per-record) |
FIND — limit-bound retrieval¶
Find-with-limit terminates as soon as the limit is reached; the parallel scan layer propagates an early-exit stop_flag so siblings bail without finishing their walks.
| query | latency |
|---|---|
Indexed eq / range / between / IN — limit 10 |
0.3–0.7 ms |
Indexed starts — limit 10 |
0.7 ms |
Indexed contains — limit 10 |
10 ms (leaf scan, no early exit until 10 matches) |
Indexed regex — limit 10 |
28 ms |
Indexed not_regex — limit 10 |
0.5 ms (rare-mismatch shortcut) |
Non-indexed contains / starts / ends on bio — limit 10 |
27–34 ms |
| Indexed AND across 2–3 fields | 14–75 ms (intersect cost grows with leaf cardinality) |
| Indexed + non-indexed AND (one selective indexed leaf) | 0.5 ms |
AGGREGATE — single-fn standalone (no criteria)¶
Min/max are O(1) per shard (read the first/last leaf) — the btree being sorted by value gives them away for free. Sum/avg must visit every leaf (no algebraic shortcut), so they sit at 15–23 ms across types.
| op | latency |
|---|---|
count (metadata shortcut) |
0.13 ms |
min / max on indexed (any type) |
0.04–0.31 ms |
sum / avg on indexed (any numeric type) |
15–23 ms |
AGGREGATE — with criteria (single-spec)¶
Walk-fetch-check fast path tries the agg btree first for any single-spec MIN/MAX with criteria — when criteria selectivity is >~0.05% it finds the answer in single-digit ms by walking the agg btree in MIN/MAX order, fetching records, and stopping at the first match per shard. Falls through to keyset-build for low-selectivity cases.
| query | latency |
|---|---|
min/max field where same_field op X (same-field shortcut) |
0.08–0.51 ms |
min/max field1 where field2 op X (cross-field, walk-fetch-check) |
0.1–0.5 ms |
min/max field1 where field2 AND field3 ... (intersect path) |
0.15–0.51 ms |
agg(neq) on small-domain field (NEQ shortcut: total − eq) |
0.5 ms |
sum/avg field1 where field2 op X (parallel agg-btree walk filtered by criteria KeySet) |
44–54 ms |
AGGREGATE — bundled & grouped¶
Indexed group_by walks the group field's btree once and looks up each entry's accumulator slot in a hash16→bucket map (built from the agg field's btree walk). For multi-spec sharing an agg field, that field's btree is walked once total. Multi-field group_by builds hash16→value maps for secondary group fields.
| query | latency |
|---|---|
count(*) + sum/avg/min/max balance (no group, multi-spec) |
27 ms |
group by active, count + avg(balance) (low cardinality) |
123 ms |
group by age top 5 by count |
17 ms (group cardinality 100, early-cap) |
group by age having n > 16k |
17 ms |
group by active, sum(balance) |
123 ms |
group by active, min/max balance |
123 ms |
group by birthday, count |
18 ms |
group by username, count (varchar idx, ~1M unique buckets) |
360 ms |
group by email, sum(balance) (varchar idx, ~1M buckets) |
645 ms |
group by F where indexed criterion (criteria-filtered group_by) |
165–250 ms |
group by F where AND of indexed leaves |
6.7 ms (intersect-on-indexed seeds tiny KeySet) |
group by F1, F2 (multi-field) |
355–490 ms |
For high-cardinality varchar group_by, the bucket creation cost dominates — each unique value allocates an AggBucket from the per-query arena (50 ns/bucket) and inserts into a dynamic hash table that doubles to keep load factor ≤ 1. At 1M unique varchar buckets, that's ~50 ms of bucket setup alone, then ~50 ms per agg field btree walk.
FIND — keyset cursor pagination¶
| query | latency |
|---|---|
ASC / DESC by indexed field, limit 100 |
0.18–0.38 ms |
| Cursor continuation mid-range | 0.94 ms |
offset 50000 limit 100 (buffered pre-skip) |
2.4 ms |
The cursor path is O(log N) per page — staying sub-millisecond regardless of depth. Use cursor:null continuation for deep pagination, not offset (which collects the full prefix into the per-query buffer and aborts past QUERY_BUFFER_MB).
All 38 search operators use indexes when available. Full scans (non-indexed fields, _field ops where RHS is per-record) parallelize across kf shards with per-segment batched cache acquire — the v2 record walk hits ~50 ms / 1M records on this hardware.
3a. Queries on 25M users — scaling characteristics¶
shard-db-bench run bench-queries with SHARD_BENCH_COUNT=25000000. Same 13-field schema and 12 indexes as §3; the bench's auto-tuned tier picker selects splits=128 (195K records/shard, fits the engine's 256K initial slots — no resplit during the 25M insert).
Insert (25M records, parallel, 12 indexes built upfront, disk-backed): ~120 k/sec single-conn, ~430 k/sec parallel × 5 conns (CSV form). 25M chunked in 1M batches.
These numbers assume warm cache (kernel page cache holds the relevant kf + segment pages). First-touch queries against an indexed field's btree pay a one-time disk-read cost (1-10 ms cold, sub-ms warm). The cold column shows the worst-case first-query latency after a cold daemon start; warm is what subsequent queries see.
COUNT — selectivity scaling¶
| query | warm | cold (first-touch on this index) |
|---|---|---|
eq username (point lookup, varchar idx) |
0.1 ms | 1-2 ms |
eq active=false (bool, bitmap popcount since 2026.05.7) |
5-20 ms | 30-60 ms |
neq active!=false (bool, bitmap subtraction shortcut) |
5-10 ms | 30-60 ms |
eq age=30 (~417K matches, int) |
1-3 ms | 50-150 ms |
eq bio (full scan, ~25M records) |
800-1500 ms | 6-9 s |
gt user_id>500000 (range, ~12.5M matches) |
200-400 ms | 1-2.5 s |
between age 30..40 |
30-50 ms | 300-500 ms |
contains bio 'DevOps' (full scan + simd memmem) |
800-1000 ms | 1-3 s |
icontains bio 'devops' (full scan + simd memcasemem) |
800-1000 ms | 1-3 s |
OR 5 leaves on age |
2-3 s | 9-12 s |
OR cross-field (age=30 OR active=false) |
~320 ms (was 4-5 s pre-2026.05.7 — active is now on a bitmap, half the keyset is sized + filled from a popcount instead of a btree leaf walk) |
1-2 s |
eq_field age == rank (always full scan) |
590-700 ms | 600-700 ms (CPU-bound) |
AGGREGATE — single-fn standalone¶
The leaf-only walker + MADV_SEQUENTIAL combo shipped in 2026.05.4 brings cold sum/avg from multi-second to a few hundred ms — disc reads coalesce into 128 KB+ I/Os instead of 4 KB-per-page faults under the default MADV_RANDOM.
| query | warm | cold (2026.05.4) | cold (2026.05.3) |
|---|---|---|---|
count all (metadata) |
0.6 ms | 0.6 ms | 0.6 ms |
min/max on indexed (int/long/etc.) |
0.1-1 ms | 0.2-1 ms | 0.5-200 ms |
sum age (int) |
20-30 ms | ~180 ms | 9-11 s |
sum user_id (long) |
similar | ~220 ms | similar to int |
sum balance (numeric) |
40-50 ms | ~210 ms | 13-16 s |
avg score (double) |
40-50 ms | ~40 ms (warm path) | 0.5-2 s |
AGGREGATE — with criteria on a bitmap-indexed field¶
2026.05.7 routes criteria on bool / future-enum fields through the field's bitmap index instead of a btree leaf walk. For aggregates that filter by active=true/false (or any bitmap field), the candidate KeySet is now sized exactly via bm_count per shard (cache-friendly stride-byte popcount) and populated via a per-shard bitmap walk. Previous behaviour fell through to a btree leaf walk with under-sized KeySet (the latent O(N × cap) probe trap fixed in 2026.05.7).
| query | warm | pre-bitmap (≤ 2026.05.5) |
|---|---|---|
avg score where active=false |
~2.7 s | 15-16 s |
aggregate(count + avg) where active=false |
~2.4 s | 15-16 s |
min/max <indexed> where active=true |
<1 ms (same-field shortcut still applies) | <1 ms |
The remaining 2-3 s isn't bitmap-related — it's the per-record fetch + score-extract for ~3.5M matching records. Streaming bitmap→kf→aggregator without materializing the keyset would close most of that gap; queued as a backlog item.
AGGREGATE — bundled & grouped¶
The parallel Pass 1 + parallel merge wins are most visible at this scale. 2026.05.4 added two streaming-distinct fast paths gated on single varchar group_by + finite limit + no criteria/having/order_by — for queries that match the shape, the wins are dramatic:
| query | warm |
|---|---|
group by active (count + avg) |
~820 ms (was 3-5 s pre-2026.05.7 — group by field is bool, hits bitmap dict-scan) |
group by age where active=true (eq crit) |
~1.9 s (was 41 s pre-2026.05.7 — criteria's keyset now sized via bitmap popcount, not under-allocated and prone to O(N × cap) probe trap) |
group by active where age 25..50 AND score>50 |
~37 ms (AND-intersect-on-indexed; primary is the most selective btree leaf, bitmap finishes the filter) |
group by age top 5 (limit-bound, indexed group_by) |
~75 ms |
group by username, count limit 10 (varchar idx, high-card) |
~3 ms cold (was 4-5 s — streaming k-way merge, 2026.05.4) |
group by email, sum(balance) limit 10 (varchar idx + indexed numeric agg) |
~4 ms cold (was 5-8 s — streaming k-way merge + per-emit lookup, 2026.05.4) |
group by username, count (no limit, full enumeration) |
4-5 s |
group by email, sum(balance) (no limit) |
5-8 s |
group by birthday, count (low cardinality) |
90-110 ms |
group by F where AND-of-indexed |
30-50 ms |
group by F1, F2 (multi-field, low cardinality) |
6-9 s |
CURSOR pagination¶
| query | warm |
|---|---|
ASC by indexed field, limit 100 |
0.2-15 ms |
| DESC by age | ~0.3 ms warm (O(1)-step DESC via last_leaf_page + prev_leaf since 2026.05.3 — supersedes the older "walks full forward chain" caveat) |
| ASC + criteria, continuation | 1-3 ms |
offset 50000 limit 100 |
3-5 ms |
Cold-cache caveat¶
Aggregate full-btree walks (sum/avg/max with no criteria) and OR-cross-field queries pay 5-15× cold-vs-warm penalty at 25M scale. First query against each indexed field reads ~750 MB of btree pages from disk; subsequent queries within the same daemon lifetime hit the OS page cache. Production deployments with steady query patterns hit warm numbers; a fresh daemon's first query against any large index pays the cold cost once.
For workloads that genuinely need fast cold-start (e.g. on-demand analytics) the kf header + segment data are mmap'd MAP_SHARED with MADV_HUGEPAGE — posix_fadvise(POSIX_FADV_WILLNEED) on startup would prefetch but isn't done today.
BTRH (2026.05.5+) and read performance¶
The B+ tree format rolled 'BTRG' → 'BTRH' to add a (value, hash) lexicographic order, which makes btree_delete descent O(log N) again (btree_delete knows both the value and the record's hash and can route directly to the unique leaf). Readers — btree_search, btree_range, cursor walk — only know the target value, not the record hash, so they continue to bsearch by value only. Index-read complexity and code paths are unchanged. The format does add 16 bytes per internal-page separator (the separator's hash), narrowing fanout slightly; measured against 25M users × 12 indexes, that did not produce a visible regression on any indexed-eq, range, AND-intersect, OR, or cursor query.
4. Invoice single-threaded — 1M records, 64 fields, 14 indexes¶
shard-db-bench run bench-invoice, SPLITS=64.
Realistic wide-object schema (~1.9 KB/record). Composite indexes include irbmStatus+pdfSent, status+source, status+createdAt, status+invoiceDate.
| Operation | Result |
|---|---|
| Bulk insert (no indexes) | 340 k/sec (2.93 s) |
| Bulk insert (with 14 indexes) | 200 k/sec (4.96 s) — 41% slowdown vs no-idx |
| Add 14 indexes post-insert | 2.59 s (per-shard parallel build — 14 workers × index_splits_for(splits) files each) |
| GET ×1000 (req-resp, 1 conn) | 29 k ops/sec (mean 35µs / p50 31µs) |
| EXISTS ×1000 (req-resp) | 38 k ops/sec (mean 26µs / p50 26µs) |
eq supplierId (~10K matches, limit 10) |
1.3 ms |
eq buyerId (~2K matches, limit 10) |
0.53 ms |
contains number (idx leaf scan, limit 10) |
26.6 ms |
contains batchNumber (idx leaf scan, limit 10) |
13.7 ms |
IN indexed (2 statuses, limit 10) |
0.59 ms |
eq status + range amount (AND, limit 10) |
1.0 ms |
eq status + non-indexed currency (AND, limit 10) |
299 ms (currency not indexed — full scan after status filter) |
Composite status+invoiceDate starts (limit 10) |
0.54 ms |
Composite status+source eq (limit 10) |
0.57 ms |
Composite irbmStatus+pdfSent eq (limit 10) |
0.28 ms |
RANGE invoiceDate (gte+lte, limit 10) |
0.40 ms |
RANGE createdAt (gte+lte, limit 10) |
0.37 ms |
OR two indexed statuses |
6.7 ms |
| Fetch page of 100 @ offset 5000 | <1 ms |
| Keys (first 100) | <0.2 ms — limit-bound calls (limit ≤ 1000) route through slotcask_walk_one_shard_streaming, which fires cb per record and bails immediately on stop_flag instead of completing the buffered Pass-1 ref-collect. Buffered path stays for unbounded scans (full-scan count etc.) where per-segment batched acquire still wins. |
count full object |
<1 ms |
| Single DELETE ×1000 (with 14 indexes) | 13 k/sec (mean 77µs / p50 73µs) |
| Bulk DELETE ×1000 | 135 k/sec (7.4 ms) |
| VACUUM | 38 ms (now actually rebuilds kf to drop tombstones — the previously-published 1 ms was the pre-2026.05.x reset_deleted_count path that lied about orphaned-count; the new vacuum is honest) |
| RECOUNT | <1 ms (kf-header sum — was 13 ms walking kf entries) |
| Disk footprint | 1.9 GB |
The earlier-published indexed-range figure (3 ms) was borrowed from §3's narrow between (age 25-35) query on the user schema; the C bench's invoice range queries cover wider date spans (one-month and two-week windows on invoiceDate/createdAt) so the 65–79 ms numbers are honest cost for matching tens of thousands of records and returning the first 10 ordered by index leaf order.
The delete speedups come from bulk_del_shard_worker and single_delete paths now going through the unified shard cache (ucache_get_write per shard). Pre-2026.05.1 they did per-call open + flock + mmap MAP_SHARED + munmap, paying full page-fault tax per request.
5. Invoice multi-threaded — 1M records, 64 fields, 14 indexes¶
shard-db-bench run bench-parallel, SPLITS=64, 5 connections × 200 k records each.
| Scenario | Time | Throughput |
|---|---|---|
| Single JSON, 1M, no indexes | 2.48 s | 400 k/sec |
| Single CSV, 1M, no indexes | 1.05 s | 950 k/sec |
| Parallel JSON, 5 conns × 200K, no indexes | 0.65 s | 1.54 M/sec (3.9× single JSON) |
| Parallel JSON, 5 conns × 200K, pre-existing 14 indexes | 2.81 s | 360 k/sec |
| Parallel CSV, 5 conns × 200K, no indexes | 0.37 s | 2.67 M/sec (2.8× single CSV) ← fastest |
| Parallel CSV, 5 conns × 200K, pre-existing 14 indexes | 2.40 s | 420 k/sec |
| Add 14 indexes after bulk insert | ~2.7 s | (per-shard parallel build) |
| Disk footprint (with 14 indexes) | 1.6 GB |
Parallel CSV no-idx hits 2.48 M/sec — 2.7× single CSV's 903 k/sec, and roughly 5× the previously published bash-bench number for this scenario. Pre-grow contributes a ~2× win on every path; the parallel-vs-single benefit is on top of that and hasn't disappeared. Use parallel for max throughput.
For indexed loads at 1M / 5×200K, parallel CSV with pre-existing indexes (2.30 s) beats parallel CSV no-idx + add-indexes (0.40 + 2.6 = 3.0 s) by 1.3×, and beats single-conn load-then-index (1.11 + 2.69 = 3.80 s) by 1.65×. So pre-existing-indexes parallel wins at this scale. The R² merge-cost rule (see chunk-size tuning below) means this advantage shrinks as you push R higher: at 10M with the same 200K chunks (R=50), the merge cycles dominate and load-then-index can pull ahead. Re-bench when you change scale.
Indexed bulk-insert chunk-size tuning¶
The per-shard btree layout (2026.05.1+) makes indexed bulk-insert sensitive to the number of bulk-insert REQUESTS, because each request triggers a sequential bulk_merge cycle per (field, shard). Cumulative extract work scales O(R²) where R is request count.
Recommended at 1M+ records: prefer fewer, larger bulk-insert calls. requests ≈ N / 200_000 with 5 ≤ connections remains a sensible floor: each request triggers one merge cycle per (field, shard), so packing more rows per request keeps the cycle count down. For non-indexed data loads, parallel always pays off (see §2); for indexed loads, parallel still helps, just by less because phase-4 dominates.
Pattern comparison for indexed batch ingest (1M / 5×200K)¶
C-bench measurements:
| Pattern | 1M × 14 idx total time | Throughput |
|---|---|---|
| Parallel CSV with pre-existing 14 idx (5 conns × 200K) | 2.30 s | 435 k/sec ← winner at this scale |
| Parallel CSV no-idx → add-indexes | 0.40 + 2.62 = 3.02 s | 331 k/sec |
| Single CSV no-idx → add-indexes | 1.11 + 2.69 = 3.80 s | 263 k/sec |
The earlier "load-then-index always wins" guidance came from bash measurements that under-counted parallel throughput; with C bench, parallel + pre-existing-indexes is the fastest path at 1M / 5×200K. The R² merge-cost rule still applies — if you scale to 10M with the same 200K chunks (R=50), the merge cycles balloon and load-then-index reclaims the lead. As a rule of thumb: at the recommended R ≈ N/200K chunk count, pre-existing-indexes parallel wins for R ≤ ~10, and load-then-index wins for R ≥ ~20. Bench at your scale.
Use load-then-index when you can afford to drop indexes during the load (static schemas, batch-ingest pipelines), at very large scales where R is high, or when operational simplicity matters. Use pre-existing-indexes parallel for streaming workloads or moderate batch sizes (1M ish) where R stays small.
Disk footprint¶
Per-shard btree layout adds ~25 % to indexed-object disk usage vs pre-2026.05.1 (1.3 → 1.6–1.7 GB on the invoice schema). Sources:
- Each btree starts at
2 × bt_page_size = 8 KB. With 14 indexes × 16 idx shards = 224 trees minimum, that's ~1.8 MB of header overhead before any data (vs 14 × 8 KB = 112 KB for the old single-tree layout). - Reduced prefix-compression effectiveness: each leaf page has 1/16 the entries to share prefixes with, so per-entry compression savings drop ~15–25 %.
- Page-allocation rounding: each btree's pages are 4 KB; trailing slack accumulates across 16× more trees.
Real space cost on production datasets typically lands at +20–30 % vs the legacy layout.
Notes¶
-
File-descriptor limit. At
SPLITS ≥ 512,ucache_grow_shardbriefly holds 2 fds per shard during migration, so peak can hit ~8,256 fds at the defaultFCACHE_MAX=4096. The server auto-raises its soft limit to the hard limit at startup (no privilege needed); if the hard limit itself is too low (shells default to 1024 on many distros), the startup WARN tells you exactly what to put in/etc/security/limits.confor asLimitNOFILE=in a systemd unit. -
CSV vs JSON. CSV bulk insert is faster because the CSV path parses directly against the mmap'd file via
(ptr, len)spans with zero per-line memcpy, while the JSON path materializes aJsonObjper record.
Splits — bench-derived numbers¶
Splits-sizing guidance lives in tuning.md → Sizing splits. The numbers below are the bench raw data that table is built from.
On the parallel K/V bench at 10M rows:
splits |
Records/shard | Wall time |
|---|---|---|
| 64 | 156K | 3.605s |
| 128 | 78K | 3.488s (fastest) |
| 256 | 39K | 3.986s |
| 1024 | 9K | 5.454s |
Counter-intuitively, raising splits beyond the sweet spot slows things down even at 10 parallel connections — more shard files = more syscalls and mmap page faults per query, and shard-lock contention isn't the bottleneck at this scale. If you exceed ~1M records/shard you've saturated this design — split across multiple objects (or tenant dirs) rather than climbing past MAX_SPLITS=4096.
Reproduce¶
./build.sh
# Each bench self-spawns its own daemon on a tmp DB_ROOT and tears down on exit.
# Scale via env vars (defaults match the published numbers in the tables above).
SHARD_BENCH_COUNT=10000000 ./build/bin/shard-db-bench run bench-kv # §1
SHARD_BENCH_TOTAL=10000000 SHARD_BENCH_CHUNK=2000000 \
./build/bin/shard-db-bench run bench-kv-parallel # §2
./build/bin/shard-db-bench run bench-queries # §3 (1M users)
./build/bin/shard-db-bench run bench-invoice # §4 (1M invoice)
./build/bin/shard-db-bench run bench-parallel # §5 (1M invoice parallel)
# Or run the full suite:
./build/bin/shard-db-bench run-all
Scale-override env vars: SHARD_BENCH_COUNT (single-conn benches), SHARD_BENCH_TOTAL + SHARD_BENCH_CHUNK (parallel benches), SHARD_BENCH_USERS + SHARD_BENCH_ORDERS (bench-joins). All sub-µs precision via clock_gettime(CLOCK_MONOTONIC).