shard-db 2026.05.8¶
Correctness + perf release. Headline wins:
- 4800× speedup on selective-filter + order_by queries when the matching record's order-key lands late in the sort walk (alice.smith0 order_by age desc went from 1737 ms to 0.36 ms on 25M).
- Logging framework reshape: typed LOG_INFO/WARN/ERROR/DEBUG/AUDIT macros, new <date>-audit.log for security/admin events, structured subsystem field on every line. Operators with log-parsing regexes need to update.
- Unknown-field queries now error loudly instead of silently returning empty results. This closes a class of bugs where a typo or schema drift caused find to walk 25M rows finding nothing in 38 seconds.
- Several planner correctness fixes consolidated under pick_index_for_leaf: index dispatch is now decided at plan time, with no runtime cross-index cascade.
- cmd_drop_object takes 17–33 ms during warmup (was 3–20 s): fixed lock contention caused by g_reg_lock being held across slotcask_open.
- bulk-insert page-aligned SIGBUS fixed.
- Warmup driven through cache APIs so the registry + kfcache are properly populated at startup.
- New cookbook: docs/operations/bulk-loading.md with the R-crossover rule for load-then-index vs pre-existing-indexes at scale.
No ./migrate required. Wire-compatible with 2026.05.7.x. Drop in the new binary and restart.
Bug fixes¶
Selective-filter + order_by walked the whole order_by btree (#75, then refined in #79)¶
find criteria + order_by built a prefilter KeySet from the indexed criteria and walked the order_by btree end-to-end, calling keyset_contains(hash16) per entry. With a 1-match prefilter, this is O(N) — the walker visits every btree entry until it lands on the matching hash16. If the matching record's order-by value sat at the tail of a desc walk, that was 25M entries traversed for a limit 10 query.
Two-part fix:
- PR #75 unified the index-dispatch decision under a new pick_index_for_leaf(field, op, pattern) helper. Both the planner and the executor now consult the same picker, so there is no runtime cascade between index types. Tiered KeySet allocation (start small, grow under budget) replaces the prior fixed-size build that would fall back to PRIMARY_NONE on overflow. Net win on the common case: eq username='X' order_by age desc limit 10 went from 23 s to ~2 ms.
- PR #79 added a small-prefilter shortcut: when the prefilter KeySet has ≤ 1000 entries and there's an order_by, skip the btree walk entirely. Fetch each candidate record directly, decode the order-by field with typed_field_to_index_key (same memcmp-sortable encoding the btree uses), qsort, emit via the existing cursor_find_cb. K is bounded so this is O(K log K) regardless of N. The bench-cache-pollution run surfaced this when picking a different deterministic test username: same code path, two orders of magnitude apart based purely on which sort position the match happened to land in.
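The fetch-encode-sort step of the shortcut can be sketched as below. This is a minimal illustration, not the shard-db implementation: the Candidate struct, encode_i64_key, and sort_small_prefilter are hypothetical names, and the sign-flip big-endian encoding is one common way to get a memcmp-sortable integer key (the real typed_field_to_index_key encoding may differ).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical candidate: a record id plus its encoded order-by key. */
typedef struct {
    uint64_t record_id;
    unsigned char key[8];   /* memcmp-sortable encoding of the order-by value */
} Candidate;

/* Encode a signed 64-bit value so that memcmp order equals numeric order:
 * flip the sign bit, then store the bytes big-endian. */
static void encode_i64_key(int64_t v, unsigned char out[8]) {
    uint64_t u = (uint64_t)v ^ 0x8000000000000000ULL;
    for (int i = 7; i >= 0; i--) {
        out[i] = (unsigned char)(u & 0xFF);
        u >>= 8;
    }
}

static int cmp_asc(const void *a, const void *b) {
    return memcmp(((const Candidate *)a)->key, ((const Candidate *)b)->key, 8);
}

static int cmp_desc(const void *a, const void *b) {
    return cmp_asc(b, a);
}

/* Sort the K candidates in place and return how many to emit for the limit. */
static size_t sort_small_prefilter(Candidate *c, size_t k, int desc, size_t limit) {
    assert(k <= 1000);   /* the shortcut only applies to small prefilters */
    qsort(c, k, sizeof *c, desc ? cmp_desc : cmp_asc);
    return k < limit ? k : limit;
}
```

Because the cost depends only on K, the position of the match in the full order_by btree no longer matters.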
Unknown-field queries silently returned [] over 38 seconds (#77)¶
compile_one accepted any field name; if the typed schema didn't resolve it, cc->tf was left NULL and criteria_match returned false for every record. The query went through a full ordered walk finding 0 matches, returning [] in 38 s on 25M rows. The user observed this when a bench DB was missing the category field that the bench source referenced.
Fix: introduce a validate_criteria_tree_fields helper called at the entry of cmd_count / cmd_find / cmd_aggregate. Walks every criterion leaf, checks c->field against the typed schema, and also validates the RHS of OP_EQ_FIELD / OP_NEQ_FIELD / OP_LT_FIELD / OP_GT_FIELD / OP_LTE_FIELD / OP_GTE_FIELD. Composite-field sub-names (e.g. name+missing) are also validated — each sub-token must resolve. Order_by, group_by, projection, and aggregate spec fields validated too; aggregate order_by may name a typed field, an aggregate alias, or a group_by field.
Response now: {"error":"unknown field 'category' in criteria"} returned in microseconds instead of an empty array after a full scan.
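A minimal sketch of the validation walk, assuming a simplified criteria tree. The Crit struct, schema_has, and validate_fields are hypothetical stand-ins; the real validate_criteria_tree_fields also covers field-to-field operators, order_by, group_by, projection, and aggregate specs, which are omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical criteria tree: leaves carry a field name,
 * inner nodes (AND/OR) carry children. */
typedef struct Crit {
    const char *field;              /* NULL for inner nodes */
    const struct Crit *kids[4];
    size_t nkids;
} Crit;

static int schema_has(const char *const *schema, size_t n, const char *name) {
    for (size_t i = 0; i < n; i++)
        if (strcmp(schema[i], name) == 0) return 1;
    return 0;
}

/* Returns 0 if every leaf field resolves; otherwise writes an error
 * message into err and returns -1 before any record is scanned. */
static int validate_fields(const Crit *c, const char *const *schema, size_t n,
                           char *err, size_t errsz) {
    assert(c != NULL);
    if (c->field != NULL && !schema_has(schema, n, c->field)) {
        snprintf(err, errsz,
                 "{\"error\":\"unknown field '%s' in criteria\"}", c->field);
        return -1;
    }
    for (size_t i = 0; i < c->nkids; i++)
        if (validate_fields(c->kids[i], schema, n, err, errsz) != 0) return -1;
    return 0;
}
```

Running the check at command entry means the cost is proportional to the size of the query, not the size of the table.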
cmd_drop_object took 3–20 s during warmup (#74)¶
slotcask_registry_get held g_reg_lock across the full slotcask_open call. For high-split objects (256 splits × 8 streams × 128 MiB mmap) the open takes seconds. While warmup workers concurrently opened hn objects under that lock, any unrelated cmd_drop_object calling slotcask_registry_invalidate waited for the open to finish.
Fix: classic open-without-lock pattern. Probe under lock, drop lock, slotcask_open outside, re-take lock + re-probe to install. Loser of an install race frees its own SlotcaskDb. Same shape as kfcache_acquire's miss path. Drops during warmup now 17–33 ms; 256-split create+drop after warmup is dominated by create cost.
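The probe / unlock / open / re-probe sequence can be sketched with a one-slot registry. Db, slow_open, reg_slot, and registry_get are hypothetical stand-ins for SlotcaskDb, slotcask_open, and the real registry; a real registry would also key by object and handle invalidation and reference counts.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for SlotcaskDb. */
typedef struct { int id; } Db;

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static Db *reg_slot = NULL;   /* one-entry "registry" for brevity */
static int installs = 0;      /* only modified while reg_lock is held */

/* Stand-in for the slow open; always called with no lock held. */
static Db *slow_open(int id) {
    Db *db = malloc(sizeof *db);
    assert(db != NULL);
    db->id = id;
    return db;
}

static Db *registry_get(int id) {
    pthread_mutex_lock(&reg_lock);
    Db *hit = reg_slot;                  /* 1. probe under the lock */
    pthread_mutex_unlock(&reg_lock);     /* 2. drop it before the slow part */
    if (hit != NULL) return hit;

    Db *mine = slow_open(id);            /* 3. open outside the lock */

    pthread_mutex_lock(&reg_lock);
    if (reg_slot == NULL) {              /* 4. re-probe, then install */
        reg_slot = mine;
        installs++;
        mine = NULL;
    }
    Db *winner = reg_slot;
    pthread_mutex_unlock(&reg_lock);

    free(mine);                          /* 5. the loser frees its own copy */
    return winner;
}
```

The lock is now held only for pointer reads and writes, so an unrelated invalidate waits microseconds rather than the duration of an open.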
cmd_bulk_insert SIGBUS on page-aligned payloads (#72)¶
When the JSON payload's size aligned to a page boundary, the memfd+mmap dance in cmd_bulk_insert produced a buffer where the byte one-past-the-end was on a non-mapped page. Any parser that read one byte past len (sentinel lookups in the JSON tokenizer) hit SIGBUS. Reproducible on payloads sized exactly 4096 × N bytes.
Fix: arena-allocate a len + 1 buffer with NUL terminator, copy the mmap contents in, parse from the arena. Eliminates the bug class for the server-side path. The CLI path keeps the original mmap fast path for genuine large-file loads — those payloads aren't crafted at page boundaries.
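A minimal sketch of the copy-with-sentinel idea. It uses malloc where the fix above uses an arena, and copy_with_sentinel is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes into a fresh len + 1 buffer and NUL-terminate it, so a
 * parser that peeks at data[len] reads a valid byte. Caller frees. */
static char *copy_with_sentinel(const void *src, size_t len) {
    assert(src != NULL || len == 0);
    if (len == SIZE_MAX) return NULL;   /* len + 1 would overflow */
    char *buf = malloc(len + 1);
    if (buf == NULL) return NULL;
    if (len > 0) memcpy(buf, src, len);
    buf[len] = '\0';
    return buf;
}
```

The extra copy costs one pass over the payload, in exchange for which the one-past-the-end byte is always inside the allocation regardless of payload size.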
Composite-field validation gap (#77 follow-up commit)¶
Initial unknown-field validation rubber-stamped any name containing + because composite fields go through decode_field at runtime. But composite-field names only exist in the context of composite indexes; shard-db has no per-record dynamic-field semantics, so an unknown sub-name is the same bug class as a plain unknown field. Composite-field validation now splits on + and validates each subfield. Error message pinpoints the offending sub-name: unknown sub-field 'missing' in composite 'name+missing' (criteria).
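Splitting on + and checking each sub-name might look like the following sketch. The helper names known and validate_composite are hypothetical; only the error message shape is taken from the text above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* True if the len-byte token s exactly matches one schema field name. */
static int known(const char *const *schema, size_t n, const char *s, size_t len) {
    for (size_t i = 0; i < n; i++)
        if (strlen(schema[i]) == len && memcmp(schema[i], s, len) == 0) return 1;
    return 0;
}

/* Validate "a+b+c": every '+'-separated sub-name must be a schema field.
 * Returns 0 on success, -1 with a message in err on the first bad sub-name. */
static int validate_composite(const char *name, const char *const *schema,
                              size_t n, char *err, size_t errsz) {
    assert(name != NULL);
    const char *p = name;
    for (;;) {
        size_t len = strcspn(p, "+");   /* length of the current sub-name */
        if (!known(schema, n, p, len)) {
            snprintf(err, errsz,
                     "unknown sub-field '%.*s' in composite '%s' (criteria)",
                     (int)len, p, name);
            return -1;
        }
        if (p[len] == '\0') return 0;   /* last token was valid */
        p += len + 1;                   /* skip past the '+' */
    }
}
```

An empty sub-name (for example a trailing +) matches no schema field, so it is rejected by the same path.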
Features¶
Logging framework reshape (#73)¶
- Log line shape changed. Was 2026-05-24 21:11:54 [INFO] body; now 2026-05-24 21:11:54 INFO [subsystem] body. Subsystem is one of: server, slotcask, btree, bitmap, trigram, index, query, planner, vacuum, warmup, auth, tls, config, reindex, slow. Operators with log-parsing regexes need to update.
- New per-level macros + subsystem registry in src/db/log.h: LOG_ERROR, LOG_WARN, LOG_INFO, LOG_DEBUG, LOG_AUDIT. Compiler enforces level + subsystem at every site (typo-safe). Legacy log_msg(int level, fmt, ...) removed; every internal caller migrated.
- New audit log file. <date>-audit.log captures security/admin events: token add/remove, IP allow-list mutations, schema mutations (add/edit/remove/rename field), create-object, drop-object, add-index, remove-index, auth failures, admin-gate denials. Bypasses LOG_LEVEL filter so compliance review doesn't depend on operator config. Inherits LOG_RETAIN_DAYS retention.
- Slow-query log reformatted. Keeps filename slow-YYYY-MM-DD.log but line shape aligns with the framework. Routes through the shared async ring buffer instead of direct fopen.
- Audit-class events promoted from INFO/WARN to AUDIT. Every admin command + auth failure now lands in the audit file in addition to (or instead of) the regular per-level file.
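For tooling that consumes these logs, the new line shape can be split with a single sscanf call. This is a sketch of a consumer-side parser, not part of shard-db; LogLine and parse_log_line are hypothetical names, and the buffer sizes are assumptions that fit the levels and subsystems listed above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    char date[11];        /* YYYY-MM-DD */
    char time[9];         /* HH:MM:SS */
    char level[8];        /* ERROR, WARN, INFO, DEBUG, AUDIT */
    char subsystem[16];
    const char *body;     /* points into the caller's line */
} LogLine;

/* Parse "YYYY-MM-DD HH:MM:SS LEVEL [subsystem] body".
 * Returns 1 on success, 0 if the line does not have the new shape. */
static int parse_log_line(const char *line, LogLine *out) {
    assert(line != NULL && out != NULL);
    int off = 0;
    int got = sscanf(line, "%10s %8s %7s [%15[^]]] %n",
                     out->date, out->time, out->level, out->subsystem, &off);
    /* off stays 0 if the closing ']' was never matched. */
    if (got != 4 || off == 0) return 0;
    out->body = line + off;
    return 1;
}
```

A line in the old shape fails at the literal [ after the level token, so old and new lines are easy to tell apart during a mixed-version rollout.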
KF_RESPLIT log demoted from raw stderr to LOG_INFO(LOG_SUB_SLOTCASK,...) (#78)¶
slotcask.c's resplit event used fprintf(stderr,...) directly, bypassing LOG_LEVEL filtering entirely. 256 lines × multiple resplits during a 25M bulk-insert spammed the bench terminal. Now routed through the log framework — lands in <date>-info.log, respects LOG_LEVEL.
Warmup drives the slotcask registry + kfcache via cache APIs (#71)¶
Pre-fix, warmup's pre-touch of kf and seg files happened OUTSIDE the cache layer — it ran posix_fadvise(WILLNEED) + a touch read on every shard's kf/seg files, but never installed them into the slotcask registry or kfcache. After warmup, the first user query still paid a cold registry insert.
Now warmup goes through slotcask_registry_get + kfcache_acquire so by the time it returns, the registry and kfcache have entries for every object — the first user query is genuinely warm.
posix_fadvise is gated behind __linux__ so macOS builds compile (#69 follow-up).
Performance¶
Bitmap cardinality probe before materializing prefilter (#76)¶
build_keyset_from_plan for an ordered-find prefilter on a bitmap-indexed field would walk the entire bitmap, materialize matching hash16s into a KeySet, then discard the KeySet upstream if it exceeded ORDERED_FIND_KEYSET_MAX (100k). For a ~5M-match eq category='books' criterion that was ~37 s of materialization thrown away every time.
Fix: probe the cardinality with bm_popcount_for_crit (~ms cold, no entries enumerated) before materializing. If popcount > cap, skip the build and return NULL — the per-record walk with limit short-circuit (which is what ordered-find already does when prefilter is NULL) wins on broad criteria anyway.
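The probe-then-build order can be illustrated on a plain uint64_t bitmap. This is a sketch under simplifying assumptions: bitmap_popcount and build_keyset_capped are hypothetical names, the keys are bit positions rather than hash16 values, and the real bm_popcount_for_crit works on shard-db's own bitmap format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Count set bits without enumerating their positions (Kernighan loop). */
static size_t bitmap_popcount(const uint64_t *words, size_t nwords) {
    size_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        for (uint64_t w = words[i]; w != 0; w &= w - 1) total++;
    return total;
}

/* Returns 1 and fills *out with set-bit positions when the match count is
 * within cap. Returns 0 with *out == NULL when the criterion is too broad
 * (or allocation fails), so the caller falls back to the per-record walk. */
static int build_keyset_capped(const uint64_t *words, size_t nwords, size_t cap,
                               uint32_t **out, size_t *out_n) {
    assert(out != NULL && out_n != NULL);
    *out = NULL;
    *out_n = bitmap_popcount(words, nwords);    /* cheap probe first */
    if (*out_n > cap) return 0;                 /* skip materialization */

    uint32_t *keys = malloc((*out_n ? *out_n : 1) * sizeof *keys);
    if (keys == NULL) return 0;
    size_t k = 0;
    for (size_t i = 0; i < nwords; i++)
        for (unsigned b = 0; b < 64; b++)
            if ((words[i] >> b) & 1) keys[k++] = (uint32_t)(i * 64 + b);
    *out = keys;
    return 1;
}
```

The count pass touches the same words as the build pass but allocates nothing, which is why it is worth paying before a build that may be discarded.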
cmd_create_object parallel bitmap preallocation (in #73 bundle)¶
256-split × 3 bool-field create-object dropped from ~2.4 s to ~645 ms (~4×). Bitmap-shard preallocation now uses parallel_for; kf and seg were already parallel inside slotcask_open.
Server-side bulk-insert no longer routes through memfd+mmap (in #73 bundle)¶
Extracted the parser body into bulk_ins_run(data, len, ...). cmd_bulk_insert_string calls it directly with the in-memory JSON — drops ~5 syscalls per call and eliminates the page-aligned-mmap SIGBUS class by construction for the server path. CLI keeps the mmap path for genuine large-file loads.
pick_index_for_leaf unified dispatch (in #75)¶
Replaces the prior runtime cascade between bitmap → btree → trigram dispatchers with a single plan-time decision: per (field, op, pattern-length) tuple, pick exactly one index. No more silent "try one, fall through to another" — if the chosen index returns NULL, the caller drops to full scan. Fixes two latent silent-empty bugs (sub-3-char contains on trigram-only fields, bitmap empty-KeySet after the tiered allocation landed).
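A plan-time picker of this kind might look like the sketch below. The enums, the FieldIndexes struct, and the preference order (bitmap before btree for equality) are assumptions made for illustration; only the single-decision shape and the 3-character trigram threshold come from the description above.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { IDX_NONE, IDX_BITMAP, IDX_BTREE, IDX_TRIGRAM } IndexKind;
typedef enum { OP_EQ, OP_LT, OP_GT, OP_CONTAINS } Op;

/* Which indexes exist on a field (hypothetical flags). */
typedef struct { int has_bitmap, has_btree, has_trigram; } FieldIndexes;

/* Decide once, at plan time. IDX_NONE means "full scan";
 * the executor never falls through to a second index. */
static IndexKind pick_index(const FieldIndexes *f, Op op, size_t pattern_len) {
    assert(f != NULL);
    switch (op) {
    case OP_EQ:
        if (f->has_bitmap) return IDX_BITMAP;
        if (f->has_btree)  return IDX_BTREE;
        return IDX_NONE;
    case OP_LT:
    case OP_GT:
        return f->has_btree ? IDX_BTREE : IDX_NONE;
    case OP_CONTAINS:
        /* A trigram index cannot answer patterns shorter than 3 characters. */
        return (f->has_trigram && pattern_len >= 3) ? IDX_TRIGRAM : IDX_NONE;
    }
    return IDX_NONE;
}
```

Because planner and executor call the same function, a sub-3-character contains on a trigram-only field resolves to a full scan in both places instead of returning an empty result.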
Small-prefilter ordered-find shortcut (in #79)¶
Already covered in Bug Fixes above. K ≤ 1000 + order_by → fetch + sort in memory. eq username='alice.smith0' order_by age desc limit 10: 1737 ms → 0.36 ms.
Operator migration notes¶
- Log-parsing scripts that match \[ERROR\] etc. need to switch to ^[\d-]+ [\d:]+ ERROR \[. New per-level shape is 2026-05-25 14:33:21 ERROR [subsystem] body.
- New <date>-audit.log file in LOG_DIR. Add to log shipping if you forward shard-db logs externally. Bypasses LOG_LEVEL.
- No ./migrate required. No on-disk format changes; no wire-format changes.
- Bulk-loading at scale: docs/operations/bulk-loading.md is new. The crossover rule (R = N/200K, load-then-index for R ≥ 20) is documented with both patterns and a SHARD_BENCH_DROP_INDEXES_FIRST=1 knob in bench-queries. If you're seeding > 4M records on a fresh object, drop indexes first → bulk-insert → multi-field add-index (plural). Don't loop singular add-index per field; that's N full scans.
- Drift safety: bulk-insert and bulk-delete now emit LOG_ERROR("index.conf drift on object 'X': ...") if they ever encounter an indexed field that isn't in fields.conf. The audit confirmed this is unreachable today (all cmd_* schema mutations keep configs in sync), but future regressions will be loud instead of silent.
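The crossover rule from the bulk-loading note reduces to a one-line check. The function names are hypothetical; the constants 200K and 20 are the ones stated above, which put the threshold at N = 4,000,000 records.

```c
#include <assert.h>
#include <stdint.h>

/* R = N / 200K, as given in the bulk-loading note. */
static double crossover_ratio(uint64_t n_records) {
    return (double)n_records / 200000.0;
}

/* Load-then-index is the suggested pattern once R >= 20. */
static int should_load_then_index(uint64_t n_records) {
    assert(n_records <= (1ULL << 53));   /* keeps the double conversion exact */
    return crossover_ratio(n_records) >= 20.0;
}
```

For a 25M-record seed, R is 125, well past the crossover.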
Testing¶
- 3635/3635 C tests pass (was 3617 at 2026.05.7.1, +18 from new test_small_prefilter_orderby.c locking the shortcut's correctness across asc/desc/empty-match/range/offset/projection/dict-format).
- New test_unknown_field_validation.c (12 assertions) locks the unknown-field error contract.
- New bench_cache_pollution.c reproducer (not run by default; SHARD_BENCH_DB_ROOT=./db ./build/bin/shard-db-bench run bench-cache-pollution). Returned 0.55–1.57× ratios at 25M, within noise. Real test is at full-HN scale where working set + scan > RAM; backlog item documents the re-trigger.
- New bench-queries env knob SHARD_BENCH_DROP_INDEXES_FIRST=1 to A/B the load-then-index seed pattern. Validated 25M insert at flat 0.62 ± 0.05 M/sec vs degrading 0.35 → 0.24 M/sec with pre-existing indexes.
What didn't change¶
- Wire protocol: JSON-over-TCP, newline-delimited, \0\n-framed responses.
- On-disk format: kf header, seg layout, btree BTRH, bitmap BM01, trigram .tg all unchanged.
- CLI surface: same commands, same flags.
- db.env: no new mandatory knobs; the WARMUP= knob added in 2026.05.7 still applies.