shard-db 2026.05.8

Correctness + perf release. Headline wins:

  • 4800× speedup on selective-filter + order_by queries when the matching record's order-key lands late in the sort walk (alice.smith0 order_by age desc went from 1737 ms to 0.36 ms on 25M).
  • Logging framework reshape — typed LOG_INFO/WARN/ERROR/DEBUG/AUDIT macros, new <date>-audit.log for security/admin events, structured subsystem field on every line. Operators with log-parsing regexes need to update.
  • Unknown-field queries now error loudly instead of silently returning empty results — closes a class of bugs where a typo or schema-drift caused find to walk 25M rows finding nothing in 38 seconds.
  • Several planner correctness fixes consolidated under pick_index_for_leaf — index dispatch is now decided at plan time, no runtime cross-index cascade.
  • cmd_drop_object now takes 17–33 ms during warmup (was 3–20 s): g_reg_lock is no longer held across slotcask_open.
  • Fixed a SIGBUS in bulk-insert on page-aligned payloads.
  • Warmup driven through cache APIs so the registry + kfcache are properly populated at startup.
  • New cookbook: docs/operations/bulk-loading.md with the R-crossover rule for load-then-index vs pre-existing-indexes at scale.

No ./migrate required. Wire-compatible with 2026.05.7.x. Drop in the new binary and restart.

Bug fixes

Selective-filter + order_by walked the whole order_by btree (#75, then refined in #79)

find criteria + order_by built a prefilter KeySet from the indexed criteria and walked the order_by btree end-to-end, calling keyset_contains(hash16) per entry. With a 1-match prefilter, this is O(N) — the walker visits every btree entry until it lands on the matching hash16. If the matching record's order-by value sat at the tail of a desc walk, that was 25M entries traversed for a limit 10 query.

Two-part fix:

  1. PR #75 unified the index-dispatch decision under a new pick_index_for_leaf(field, op, pattern) helper. Both the planner and the executor now consult the same picker — no runtime cascade between index types. Tiered KeySet allocation (start small, grow under budget) replaces the prior fixed-size build that would fall back to PRIMARY_NONE on overflow. Net win on the common case: eq username='X' order_by age desc limit 10 from 23 s to ~2 ms.

  2. PR #79 added a small-prefilter shortcut: when the prefilter KeySet has ≤ 1000 entries and there's an order_by, skip the btree walk entirely. Fetch each candidate record directly, decode the order-by field with typed_field_to_index_key (same memcmp-sortable encoding the btree uses), qsort, emit via the existing cursor_find_cb. K is bounded so this is O(K log K) regardless of N. The bench-cache-pollution run surfaced this when picking a different deterministic test username — same code path, two orders of magnitude apart based purely on which sort position the match happened to land in.
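
The fetch-encode-sort step of the small-prefilter shortcut can be sketched roughly as follows. This is a minimal illustration, not the shard-db implementation: the Candidate struct, the fixed 8-byte key width, and encode_i64_sortable are assumptions standing in for typed_field_to_index_key, whose real signature is not shown in these notes. The point is only that once keys are memcmp-sortable, sorting K candidates with qsort costs O(K log K) no matter how large the order_by btree is.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KEY_LEN 8  /* assumption: fixed-width order key for this sketch */

typedef struct {
    uint8_t  order_key[KEY_LEN]; /* memcmp-sortable encoding of the order_by value */
    uint32_t record_id;          /* hypothetical handle for the fetched record */
} Candidate;

/* Encode a signed 64-bit value big-endian with the sign bit flipped,
 * so that byte-wise memcmp order equals numeric order. */
void encode_i64_sortable(int64_t v, uint8_t out[KEY_LEN]) {
    uint64_t u = (uint64_t)v ^ 0x8000000000000000ULL;
    for (int i = KEY_LEN - 1; i >= 0; i--) {
        out[i] = (uint8_t)(u & 0xff);
        u >>= 8;
    }
}

static int cmp_asc(const void *a, const void *b) {
    const Candidate *ca = a, *cb = b;
    return memcmp(ca->order_key, cb->order_key, KEY_LEN);
}

static int cmp_desc(const void *a, const void *b) {
    return cmp_asc(b, a);
}

/* Sort the K prefilter candidates in place and return how many to emit
 * once the limit is applied. */
size_t sort_candidates(Candidate *c, size_t k, int desc, size_t limit) {
    qsort(c, k, sizeof *c, desc ? cmp_desc : cmp_asc);
    return k < limit ? k : limit;
}
```

With ages 30, -5 and 41, a desc sort with limit 2 emits the records holding 41 and 30, in that order.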

Unknown-field queries silently returned [] over 38 seconds (#77)

compile_one accepted any field name; if the typed schema didn't resolve it, cc->tf was left NULL and criteria_match returned false for every record. The query went through a full ordered walk finding 0 matches, returning [] in 38 s on 25M rows. This surfaced when a bench DB was missing the category field that the bench source referenced.

Fix: introduce a validate_criteria_tree_fields helper called at the entry of cmd_count / cmd_find / cmd_aggregate. It walks every criterion leaf, checks c->field against the typed schema, and also validates the RHS of OP_EQ_FIELD / OP_NEQ_FIELD / OP_LT_FIELD / OP_GT_FIELD / OP_LTE_FIELD / OP_GTE_FIELD. Composite-field sub-names (e.g. name+missing) are also validated: each sub-token must resolve. Order_by, group_by, projection, and aggregate spec fields are validated too; aggregate order_by may name a typed field, an aggregate alias, or a group_by field.

Response now: {"error":"unknown field 'category' in criteria"} returned in microseconds instead of an empty array after a full scan.

cmd_drop_object took 3–20 s during warmup (#74)

slotcask_registry_get held g_reg_lock across the full slotcask_open call. For high-split objects (256 splits × 8 streams × 128 MiB mmap) the open takes seconds. While warmup workers concurrently opened hn objects under that lock, any unrelated cmd_drop_object calling slotcask_registry_invalidate waited for the open to finish.

Fix: classic open-without-lock pattern. Probe under lock, drop lock, slotcask_open outside, re-take lock + re-probe to install. Loser of an install race frees its own SlotcaskDb. Same shape as kfcache_acquire's miss path. Drops during warmup now 17–33 ms; 256-split create+drop after warmup is dominated by create cost.
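
The probe / unlock / open / re-lock / re-probe sequence described above looks roughly like this. It is a sketch under stated assumptions: Db, registry, slow_open and registry_get are hypothetical stand-ins for SlotcaskDb, the slotcask registry, slotcask_open and slotcask_registry_get, and the fixed-size array replaces whatever container the real registry uses.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define REG_MAX 16

typedef struct { char name[32]; } Db;   /* stand-in for SlotcaskDb */

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static Db *registry[REG_MAX];

/* Caller must hold reg_lock. */
static Db *probe_locked(const char *name) {
    for (int i = 0; i < REG_MAX; i++)
        if (registry[i] && strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;
}

/* Stand-in for the expensive open; the real one may take seconds. */
static Db *slow_open(const char *name) {
    Db *db = calloc(1, sizeof *db);
    if (db) strncpy(db->name, name, sizeof db->name - 1);
    return db;
}

Db *registry_get(const char *name) {
    pthread_mutex_lock(&reg_lock);
    Db *db = probe_locked(name);          /* 1. probe under lock */
    pthread_mutex_unlock(&reg_lock);      /* 2. drop lock */
    if (db) return db;

    Db *fresh = slow_open(name);          /* 3. open with the lock NOT held */
    if (!fresh) return NULL;

    pthread_mutex_lock(&reg_lock);        /* 4. re-take lock */
    db = probe_locked(name);              /* 5. re-probe: another thread may have installed */
    if (!db) {
        for (int i = 0; i < REG_MAX; i++) {
            if (!registry[i]) {
                registry[i] = db = fresh; /* we won the race: install */
                fresh = NULL;
                break;
            }
        }
    }
    pthread_mutex_unlock(&reg_lock);

    free(fresh);  /* loser of the race (or a full registry) frees its own copy */
    return db;
}
```

The trade-off is that two threads may both pay for the open on a simultaneous miss, but no unrelated caller waits on the lock while an open is in flight.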

cmd_bulk_insert SIGBUS on page-aligned payloads (#72)

When the JSON payload's size aligned to a page boundary, the memfd+mmap dance in cmd_bulk_insert produced a buffer where the byte one-past-the-end was on a non-mapped page. Any parser that read one byte past len (sentinel lookups in the JSON tokenizer) hit SIGBUS. Reproducible on payloads sized exactly 4096 × N bytes.

Fix: arena-allocate a len + 1 buffer with NUL terminator, copy the mmap contents in, parse from the arena. Eliminates the bug class for the server-side path. The CLI path keeps the original mmap fast path for genuine large-file loads — those payloads aren't crafted at page boundaries.
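
A minimal sketch of the len + 1 copy, assuming plain malloc in place of the arena allocator (whose API is not shown in these notes). copy_with_sentinel is a hypothetical name. The copy guarantees that a tokenizer reading the byte at index len sees a NUL inside the allocation, whatever the source mapping's page alignment was.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy len bytes into a fresh buffer of len + 1 bytes and NUL-terminate it.
 * Returns NULL on allocation failure or if len + 1 would overflow.
 * malloc stands in for the arena allocator here (assumption). */
char *copy_with_sentinel(const void *src, size_t len) {
    if (len == SIZE_MAX) return NULL;
    char *buf = malloc(len + 1);
    if (!buf) return NULL;
    if (len) memcpy(buf, src, len);
    buf[len] = '\0';   /* the one-past-the-end byte is now always readable */
    return buf;
}
```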

Composite-field validation gap (#77 follow-up commit)

Initial unknown-field validation rubber-stamped any name containing + because composite fields go through decode_field at runtime. But composite-field names only exist in the context of composite indexes; shard-db has no per-record dynamic-field semantics, so an unknown sub-name is the same bug class as a plain unknown field. Composite-field validation now splits on + and validates each subfield. Error message pinpoints the offending sub-name: unknown sub-field 'missing' in composite 'name+missing' (criteria).
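
The split-on-+ check can be sketched as below. The schema representation (an array of names) and the validate_field / schema_has helpers are hypothetical; only the two error-message shapes are taken from these notes.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical schema lookup: is name[0..len) one of the known fields? */
static int schema_has(const char *const *schema, size_t n,
                      const char *name, size_t len) {
    for (size_t i = 0; i < n; i++)
        if (strlen(schema[i]) == len && memcmp(schema[i], name, len) == 0)
            return 1;
    return 0;
}

/* Returns 1 if every '+'-separated sub-name resolves; otherwise returns 0
 * and writes a message naming the offending (sub-)field into err. */
int validate_field(const char *const *schema, size_t n, const char *field,
                   const char *ctx, char *err, size_t errsz) {
    int composite = strchr(field, '+') != NULL;
    const char *p = field;
    for (;;) {
        size_t len = strcspn(p, "+");   /* length of the current sub-name */
        if (len == 0 || !schema_has(schema, n, p, len)) {
            if (composite)
                snprintf(err, errsz,
                         "unknown sub-field '%.*s' in composite '%s' (%s)",
                         (int)len, p, field, ctx);
            else
                snprintf(err, errsz, "unknown field '%s' in %s", field, ctx);
            return 0;
        }
        if (p[len] == '\0') return 1;   /* last sub-name resolved */
        p += len + 1;                   /* skip past the '+' */
    }
}
```

A plain field is simply the one-token case of the same loop, so both error paths share one walk.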

Features

Logging framework reshape (#73)

  • Log line shape changed. Was 2026-05-24 21:11:54 [INFO] body; now 2026-05-24 21:11:54 INFO [subsystem] body. Subsystem is one of: server, slotcask, btree, bitmap, trigram, index, query, planner, vacuum, warmup, auth, tls, config, reindex, slow. Operators with log-parsing regexes need to update.
  • New per-level macros + subsystem registry in src/db/log.h: LOG_ERROR, LOG_WARN, LOG_INFO, LOG_DEBUG, LOG_AUDIT. Compiler enforces level + subsystem at every site (typo-safe). Legacy log_msg(int level, fmt, ...) removed — every internal caller migrated.
  • New audit log file. <date>-audit.log captures security/admin events: token add/remove, IP allow-list mutations, schema mutations (add/edit/remove/rename field), create-object, drop-object, add-index, remove-index, auth failures, admin-gate denials. Bypasses LOG_LEVEL filter so compliance review doesn't depend on operator config. Inherits LOG_RETAIN_DAYS retention.
  • Slow-query log reformatted. Keeps filename slow-YYYY-MM-DD.log but line shape aligns with the framework. Routes through the shared async ring buffer instead of direct fopen.
  • Audit-class events promoted from INFO/WARN to AUDIT. Every admin command + auth failure now lands in the audit file in addition to (or instead of) the regular per-level file.
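
For operators updating parsers, the new line shape can be matched with a POSIX extended regex along these lines. This is a sketch, not a supplied tool: parse_log_line is a hypothetical helper, and the set of level tokens (ERROR, WARN, INFO, DEBUG, AUDIT) is inferred from the macro names rather than confirmed by sample output. POSIX ERE has no \d, so the pattern spells out [0-9].

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills level/sub if line has the shape
 *   YYYY-MM-DD HH:MM:SS LEVEL [subsystem] body
 * and 0 otherwise (including the old "[LEVEL] body" shape). */
int parse_log_line(const char *line, char *level, size_t lsz,
                   char *sub, size_t ssz) {
    static const char *pat =
        "^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2} "
        "(ERROR|WARN|INFO|DEBUG|AUDIT) \\[([a-z]+)\\] ";
    regex_t re;
    regmatch_t m[3];

    if (regcomp(&re, pat, REG_EXTENDED) != 0) return 0;
    int ok = regexec(&re, line, 3, m, 0) == 0;
    if (ok) {
        /* m[1] is the level token, m[2] the subsystem name */
        snprintf(level, lsz, "%.*s",
                 (int)(m[1].rm_eo - m[1].rm_so), line + m[1].rm_so);
        snprintf(sub, ssz, "%.*s",
                 (int)(m[2].rm_eo - m[2].rm_so), line + m[2].rm_so);
    }
    regfree(&re);
    return ok;
}
```

A long-running shipper would compile the regex once rather than per line; it is recompiled here only to keep the sketch self-contained.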

KF_RESPLIT log demoted from raw stderr to LOG_INFO(LOG_SUB_SLOTCASK,...) (#78)

slotcask.c's resplit event used fprintf(stderr,...) directly, bypassing LOG_LEVEL filtering entirely. 256 lines × multiple resplits during a 25M bulk-insert spammed the bench terminal. Now routed through the log framework — lands in <date>-info.log, respects LOG_LEVEL.

Warmup drives the slotcask registry + kfcache via cache APIs (#71)

Pre-fix, warmup's pre-touch of kf and seg files happened OUTSIDE the cache layer — it ran posix_fadvise(WILLNEED) + a touch read on every shard's kf/seg files, but never installed them into the slotcask registry or kfcache. After warmup, the first user query still paid a cold registry insert.

Now warmup goes through slotcask_registry_get + kfcache_acquire so by the time it returns, the registry and kfcache have entries for every object — the first user query is genuinely warm.

posix_fadvise is gated behind __linux__ so macOS builds compile (#69 follow-up).

Performance

Bitmap cardinality probe before materializing prefilter (#76)

build_keyset_from_plan for an ordered-find prefilter on a bitmap-indexed field would walk the entire bitmap, materialize matching hash16s into a KeySet, then discard the KeySet upstream if it exceeded ORDERED_FIND_KEYSET_MAX (100k). For a ~5M-match eq category='books' criterion that was ~37 s of materialization thrown away every time.

Fix: probe the cardinality with bm_popcount_for_crit (~ms cold, no entries enumerated) before materializing. If popcount > cap, skip the build and return NULL — the per-record walk with limit short-circuit (which is what ordered-find already does when prefilter is NULL) wins on broad criteria anyway.
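
The probe-then-decide idea, sketched over a plain array of 64-bit bitmap words. The layout and the helper names are assumptions (the real bm_popcount_for_crit operates on shard-db's own bitmap format), and __builtin_popcountll is a GCC/Clang builtin rather than standard C.

```c
#include <stddef.h>
#include <stdint.h>

#define KEYSET_CAP 100000u  /* mirrors the 100k cap named above */

/* Count set bits without enumerating which entries they belong to. */
uint64_t bitmap_popcount(const uint64_t *words, size_t nwords) {
    uint64_t total = 0;
    for (size_t i = 0; i < nwords; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}

/* Returns 1 when the prefilter is small enough to be worth materializing,
 * 0 when the caller should skip the build and fall back to the
 * per-record walk with limit short-circuit. */
int should_materialize(const uint64_t *words, size_t nwords, uint64_t cap) {
    return bitmap_popcount(words, nwords) <= cap;
}
```

The count touches each word once and allocates nothing, which is why it is cheap relative to building a KeySet that would be discarded anyway.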

cmd_create_object parallel bitmap preallocation (in #73 bundle)

256-split × 3 bool-field create-object dropped from ~2.4 s to ~645 ms (~3.7×). Bitmap-shard preallocation now uses parallel_for; kf and seg were already parallel inside slotcask_open.

Server-side bulk-insert no longer routes through memfd+mmap (in #73 bundle)

Extracted the parser body into bulk_ins_run(data, len, ...). cmd_bulk_insert_string calls it directly with the in-memory JSON — drops ~5 syscalls per call and eliminates the page-aligned-mmap SIGBUS class by construction for the server path. CLI keeps the mmap path for genuine large-file loads.

pick_index_for_leaf unified dispatch (in #75)

Replaces the prior runtime cascade between bitmap → btree → trigram dispatchers with a single plan-time decision: per (field, op, pattern-length) tuple, pick exactly one index. No more silent "try one, fall through to another" — if the chosen index returns NULL, the caller drops to full scan. Fixes two latent silent-empty bugs (sub-3-char contains on trigram-only fields, bitmap empty-KeySet after the tiered allocation landed).
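
To make the "exactly one index per tuple" idea concrete, here is a hypothetical picker. The enum names, the FieldIndexes struct, and the preference order (bitmap before btree for equality) are all assumptions for illustration; the real pick_index_for_leaf(field, op, pattern) rules are not spelled out in these notes. The one behavior taken from the text is that a sub-3-char contains pattern gets no trigram index, so the caller drops to a full scan.

```c
#include <stddef.h>

typedef enum { IDX_NONE, IDX_BITMAP, IDX_BTREE, IDX_TRIGRAM } IndexKind;
typedef enum { OP_EQ, OP_LT, OP_GT, OP_CONTAINS } Op;

/* Which index types exist on the field (hypothetical representation). */
typedef struct { int has_bitmap, has_btree, has_trigram; } FieldIndexes;

/* Decide once, at plan time. IDX_NONE means "full scan"; there is no
 * fall-through from one index type to another at run time. */
IndexKind pick_index(const FieldIndexes *f, Op op, size_t pattern_len) {
    switch (op) {
    case OP_EQ:
        if (f->has_bitmap) return IDX_BITMAP;   /* assumed preference order */
        if (f->has_btree)  return IDX_BTREE;
        return IDX_NONE;
    case OP_LT:
    case OP_GT:
        return f->has_btree ? IDX_BTREE : IDX_NONE;
    case OP_CONTAINS:
        /* trigrams need at least 3 characters to produce a lookup key */
        return (f->has_trigram && pattern_len >= 3) ? IDX_TRIGRAM : IDX_NONE;
    }
    return IDX_NONE;
}
```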

Small-prefilter ordered-find shortcut (in #79)

Covered under Bug fixes above. K ≤ 1000 + order_by → fetch + sort in memory. eq username='alice.smith0' order_by age desc limit 10: 1737 ms → 0.36 ms.

Operator migration notes

  • Log-parsing scripts that match \[ERROR\] etc. need to switch to ^[\d-]+ [\d:]+ ERROR \[. New per-level shape is 2026-05-25 14:33:21 ERROR [subsystem] body.
  • New <date>-audit.log file in LOG_DIR. Add to log shipping if you forward shard-db logs externally. Bypasses LOG_LEVEL.
  • No ./migrate required. No on-disk format changes; no wire-format changes.
  • Bulk-loading at scale: docs/operations/bulk-loading.md is new. The crossover rule (R = N/200K, load-then-index for R ≥ 20) is documented with both patterns and a SHARD_BENCH_DROP_INDEXES_FIRST=1 knob in bench-queries. If you're seeding ≥ 4M records (R ≥ 20) on a fresh object, drop indexes first → bulk-insert → multi-field add-index (plural). Don't loop singular add-index per field: that's one full scan per indexed field.
  • Drift safety: bulk-insert and bulk-delete now emit LOG_ERROR("index.conf drift on object 'X': ...") if they ever encounter an indexed field that isn't in fields.conf. The audit confirmed this is unreachable today (all cmd_* schema mutations keep configs in sync), but future regressions will be loud instead of silent.
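
The bulk-loading crossover rule from the notes above reduces to a one-line check. The function names are hypothetical; the constants (200K divisor, threshold of 20) are the ones stated in the rule, and docs/operations/bulk-loading.md remains the authoritative description.

```c
#include <stdint.h>

#define CROSSOVER_UNIT 200000.0  /* R = N / 200K */
#define CROSSOVER_R    20.0      /* load-then-index when R >= 20 */

/* R for a planned seed of n_records. */
double crossover_ratio(uint64_t n_records) {
    return (double)n_records / CROSSOVER_UNIT;
}

/* 1: drop indexes, bulk-insert, then one multi-field add-index.
 * 0: keeping the existing indexes during the load is fine. */
int should_load_then_index(uint64_t n_records) {
    return crossover_ratio(n_records) >= CROSSOVER_R;
}
```

For the 25M-record bench this gives R = 125, well past the threshold; the boundary sits at 4M records.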

Testing

  • 3635/3635 C tests pass (was 3617 at 2026.05.7.1, +18 from new test_small_prefilter_orderby.c locking the shortcut's correctness across asc/desc/empty-match/range/offset/projection/dict-format).
  • New test_unknown_field_validation.c (12 assertions) locks the unknown-field error contract.
  • New bench_cache_pollution.c reproducer (not run by default; SHARD_BENCH_DB_ROOT=./db ./build/bin/shard-db-bench run bench-cache-pollution). Returned 0.55–1.57× ratios at 25M — within noise. Real test is at full-HN scale where working set + scan > RAM; backlog item documents the re-trigger.
  • New bench-queries env knob SHARD_BENCH_DROP_INDEXES_FIRST=1 to A/B the load-then-index seed pattern. Validated 25M insert at flat 0.62 ± 0.05 M/sec vs degrading 0.35 → 0.24 M/sec with pre-existing indexes.

What didn't change

  • Wire protocol — JSON-over-TCP, newline-delimited, \0\n-framed responses.
  • On-disk format — kf header, seg layout, btree BTRH, bitmap BM01, trigram .tg all unchanged.
  • CLI surface — same commands, same flags.
  • db.env — no new mandatory knobs; the WARMUP= knob added in 2026.05.7 still applies.