shard-db 2026.05.7¶
Feature release bundling three new index types (bitmap, enum, trigram), a bounded query concurrency cap, and a CPD-driven dedup sweep.
The headline is trigram: substring search (contains / i_contains) on varchar fields, opt-in per field. Combined with the existing btree, the planner picks per-query — btree-leaf scan for short common patterns (< 6 chars), trigram for selective longer patterns. The whole index-build pipeline now uses an external-merge sort with bounded per-worker memory, so reindex on multi-GB varchar corpora stays predictable instead of OOMing at scale.
The supporting story is memory predictability. With MAX_CONCURRENT_QUERIES capping in-flight queries at max(4, min(nproc, 32)) by default and QUERY_BUFFER_MB dropping from 500 to 256, the worst-case query-buffer RAM is now a predictable slots × per_query number that operators can size against — not unbounded as it was when N TCP threads could each take 500 MB simultaneously.
What's new¶
Trigram index (PR #60)¶
Opt-in via field:trigram on varchar:
- Storage:
<obj>/indexes/<field>/<NNN>.tg— one entry per (record × distinct trigram in lowercased value), BTRH btree format. Sharesbt_cache+ tooling with.idx. - Query path: extract trigrams from pattern, look up posting lists per trigram, intersect rarest-first, then per-record verify the substring (kills false positives from order-insensitive trigram matching).
- Planner heuristic: when a field carries both btree and trigram, short patterns (< 6 chars) route to btree-leaf scan (faster on common substrings like "ai", "the"), longer patterns route to trigram (faster on selective queries like full URLs or technical terms). Field-only trigram uses trigram regardless of length.
- Disk cost: ~5× the equivalent btree on the same field; use
estimate-indexto project before committing:
Returns projected_bytes from a 1024-record sample.
When trigram earns its keep: large-corpus substring search on selective patterns (e.g. searching a 30M-row title field for a specific URL or proper noun). For non-selective common patterns, the planner correctly skips trigram and uses btree-leaf scan.
Bitmap index (PR #56)¶
- Auto-default for
boolandenumfields. Bool fields declaredflag:boolindex as bitmap automatically; enums (new in this release) also auto-bitmap. - Opt-in for low-cardinality varchar:
field:bitmaporfield:bitmap(N)whereN ∈ [2, 65535]overrides the default 256-value cap. - Storage:
<obj>/indexes/<field>/<NNN>.bm— one bit per slot per distinct value, 1:1 with data shards (not the btree fan-out curve). Bitiof shards= "slotiin data shardshas this value". - Query path: popcount-style fast path for
eq/in/neq/not_in; per-shard dict-scan for every other op (≤ cap dict entries iterated, decoded, matched, then the matching value-bitmaps walked). All paths go through the index file, never the data file.
Enum field type (PR #57, fix #58)¶
New typed field declared as field:enum(a,b,c,...). Width-adaptive storage (1 byte for ≤256 values, 2 bytes for >256), auto-promote to bitmap index, full edit-field support, strict create-object default (unknown values rejected at insert time).
Bounded-memory index build pipeline (PR #60)¶
Index builds now use an external-merge sort: per-kf-shard parallel walks spill sorted runs to temp files; per-output-shard k-way merge feeds a streaming bt_stream_build_* into the final btree. Per-worker memory stays at a few MB regardless of input size.
Validated at 25M records with 12 indexed fields. Pre-fix the in-memory accumulation OOMed at this scale; post-fix the same workload completes in ~70s for trigram, ~8s each for btree/bitmap.
Bounded query concurrency (PR #61)¶
New MAX_CONCURRENT_QUERIES knob (db.env):
Implemented via sem_trywait semaphore at dispatch_json_query entry, released via __attribute__((cleanup)) on any exit path. When the cap is reached:
The client retries; the TCP thread isn't held under load. Worst-case query-buffer RAM is now MAX_CONCURRENT_QUERIES × QUERY_BUFFER_MB — predictable for sizing. Pair with cgroup / systemd MemoryMax as the final OS-level guard.
The QUERY_BUFFER_MB auto-tune still kicks in when the value is at the default, but now divides the process-wide query budget (min 25% RAM, 4 GB ceiling) by the resolved slot count instead of letting one query grab all of it.
Filter-first planner for find + order_by (PR #63)¶
A query like find icontains 'kubernetes' order_by time desc limit 25 over 789K comments used to take 14 seconds — the planner walked the entire time btree and ran criteria_match on every record to confirm 0 matches. The same count returned in 76 ms via trigram, exposing the gap.
The fix generalises across all index types: a new build_keyset_from_plan dispatcher wraps the existing builders (trigram, bitmap, btree-leaf, AND-intersect, OR-union) behind one entry point. cursor_find_cb gains a prefilter_ks field — when non-NULL, it skips fetch + criteria_match for entries not in the KeySet. Applies to both the cursor pagination path and the non-cursor ordered-walk fast path.
Threshold ORDERED_FIND_KEYSET_MAX = 100K — broader filters fall back to legacy walk-ordered (which early-exits at limit when matches are dense, preserving that path's win for broad-filter shapes).
Bench (warm cache, 789K comments):
| query | before | after | win |
|---|---|---|---|
kubernetes (0 matches) |
14179 ms | 73 ms | 194× |
docker (2 scattered) |
14131 ms | 655 ms | 22× |
peter (dense matches) |
82 ms | 16 ms | 5× |
postgres (25 matches) |
589 ms | 113 ms | 5× |
Bonus fix: empty-KeySet short-circuit previously emitted [] after the caller already emitted [, producing [[]. Fixed to emit envelope-close only.
Code quality (PR #59, PR #62)¶
- CPD-driven dedup: consolidated
bench_du_bytes/bench_fmt_bytesinto a shared bench header, extractedstorage.c multi_bucket_dispatchhelper (3 callers → 1), extractedquery.c emit_min_max_via_keysethelper (2 cmd_aggregate branches → 1). 130 LOC removed across the two files; behavior unchanged. - CodeQL #94: test bulk-insert builder no longer uses the unsafe
off += snprintf(buf+off, cap-off, ...)idiom. Replaced with an explicit underflow-guardedAPPENDmacro.
Migration¶
Wire format unchanged. On-disk format unchanged. No ./migrate required for upgrade from 2026.05.5 / 2026.05.6 → 2026.05.7.
Existing objects continue to work as-is — opt into the new index types via add-index on the fields where it makes sense. Bool fields added via add-field after the upgrade auto-bitmap; existing bool fields without an explicit index don't auto-promote (no schema mutation runs on upgrade).
Test coverage¶
85 cases / 3483 assertions, 0 failures. Bench validated at 25M-user scale on AMD Ryzen 7 7840U + NVMe ext4 cold cache:
- Trigram contains query (long pattern, 12 chars): ~260ms warm (selective; trigram path)
- Trigram contains query (short pattern, 5 chars): ~217ms warm (planner picks btree-leaf)
- Bitmap
eqon indexed bool (25M): ~8 seconds total reindex - Trigram reindex on 14-char username (25M): ~70 seconds, ~8 GB on-disk per field