Skip to content

shard-db 2026.05.7

Feature release bundling three new index types (bitmap, enum, trigram), a bounded query concurrency cap, and a CPD-driven dedup sweep.

The headline is trigram: substring search (contains / i_contains) on varchar fields, opt-in per field. Combined with the existing btree, the planner picks per-query — btree-leaf scan for short common patterns (< 6 chars), trigram for selective longer patterns. The whole index-build pipeline now uses an external-merge sort with bounded per-worker memory, so reindex on multi-GB varchar corpora stays predictable instead of OOMing at scale.

The supporting story is memory predictability. With MAX_CONCURRENT_QUERIES capping in-flight queries at max(4, min(nproc, 32)) by default and QUERY_BUFFER_MB dropping from 500 to 256, the worst-case query-buffer RAM is now a predictable slots × per_query number that operators can size against — not unbounded as it was when N TCP threads could each take 500 MB simultaneously.

What's new

Trigram index (PR #60)

Opt-in via field:trigram on varchar:

{"mode":"add-index","dir":"hn","object":"stories","field":"title:trigram"}
  • Storage: <obj>/indexes/<field>/<NNN>.tg — one entry per (record × distinct trigram in lowercased value), BTRH btree format. Shares bt_cache + tooling with .idx.
  • Query path: extract trigrams from pattern, look up posting lists per trigram, intersect rarest-first, then per-record verify the substring (kills false positives from order-insensitive trigram matching).
  • Planner heuristic: when a field carries both btree and trigram, short patterns (< 6 chars) route to btree-leaf scan (faster on common substrings like "ai", "the"), longer patterns route to trigram (faster on selective queries like full URLs or technical terms). Field-only trigram uses trigram regardless of length.
  • Disk cost: ~5× the equivalent btree on the same field; use estimate-index to project before committing:
{"mode":"estimate-index","dir":"hn","object":"stories","spec":"title:trigram"}

Returns projected_bytes from a 1024-record sample.

When trigram earns its keep: large-corpus substring search on selective patterns (e.g. searching a 30M-row title field for a specific URL or proper noun). For non-selective common patterns, the planner correctly skips trigram and uses btree-leaf scan.

Bitmap index (PR #56)

  • Auto-default for bool and enum fields. Bool fields declared flag:bool index as bitmap automatically; enums (new in this release) also auto-bitmap.
  • Opt-in for low-cardinality varchar: field:bitmap or field:bitmap(N) where N ∈ [2, 65535] overrides the default 256-value cap.
  • Storage: <obj>/indexes/<field>/<NNN>.bm — one bit per slot per distinct value, 1:1 with data shards (not the btree fan-out curve). Bit i of shard s = "slot i in data shard s has this value".
  • Query path: popcount-style fast path for eq / in / neq / not_in; per-shard dict-scan for every other op (≤ cap dict entries iterated, decoded, matched, then the matching value-bitmaps walked). All paths go through the index file, never the data file.

Enum field type (PR #57, fix #58)

New typed field declared as field:enum(a,b,c,...). Width-adaptive storage (1 byte for ≤256 values, 2 bytes for >256), auto-promote to bitmap index, full edit-field support, strict create-object default (unknown values rejected at insert time).

Bounded-memory index build pipeline (PR #60)

Index builds now use an external-merge sort: per-kf-shard parallel walks spill sorted runs to temp files; per-output-shard k-way merge feeds a streaming bt_stream_build_* into the final btree. Per-worker memory stays at a few MB regardless of input size.

Validated at 25M records with 12 indexed fields. Pre-fix the in-memory accumulation OOMed at this scale; post-fix the same workload completes in ~70s for trigram, ~8s each for btree/bitmap.

Bounded query concurrency (PR #61)

New MAX_CONCURRENT_QUERIES knob (db.env):

MAX_CONCURRENT_QUERIES=0           # 0 = auto = max(4, min(nproc, 32))
QUERY_BUFFER_MB=256                # lowered from 500

Implemented via sem_trywait semaphore at dispatch_json_query entry, released via __attribute__((cleanup)) on any exit path. When the cap is reached:

{"error":"server at capacity","max_concurrent_queries":16}

The client retries; the TCP thread isn't held under load. Worst-case query-buffer RAM is now MAX_CONCURRENT_QUERIES × QUERY_BUFFER_MB — predictable for sizing. Pair with cgroup / systemd MemoryMax as the final OS-level guard.

The QUERY_BUFFER_MB auto-tune still kicks in when the value is at the default, but now divides the process-wide query budget (min 25% RAM, 4 GB ceiling) by the resolved slot count instead of letting one query grab all of it.

Filter-first planner for find + order_by (PR #63)

A query like find icontains 'kubernetes' order_by time desc limit 25 over 789K comments used to take 14 seconds — the planner walked the entire time btree and ran criteria_match on every record to confirm 0 matches. The same count returned in 76 ms via trigram, exposing the gap.

The fix generalises across all index types: a new build_keyset_from_plan dispatcher wraps the existing builders (trigram, bitmap, btree-leaf, AND-intersect, OR-union) behind one entry point. cursor_find_cb gains a prefilter_ks field — when non-NULL, it skips fetch + criteria_match for entries not in the KeySet. Applies to both the cursor pagination path and the non-cursor ordered-walk fast path.

Threshold ORDERED_FIND_KEYSET_MAX = 100K — broader filters fall back to legacy walk-ordered (which early-exits at limit when matches are dense, preserving that path's win for broad-filter shapes).

Bench (warm cache, 789K comments):

query before after win
kubernetes (0 matches) 14179 ms 73 ms 194×
docker (2 scattered) 14131 ms 655 ms 22×
peter (dense matches) 82 ms 16 ms
postgres (25 matches) 589 ms 113 ms

Bonus fix: empty-KeySet short-circuit previously emitted [] after the caller already emitted [, producing [[]. Fixed to emit envelope-close only.

Code quality (PR #59, PR #62)

  • CPD-driven dedup: consolidated bench_du_bytes / bench_fmt_bytes into a shared bench header, extracted storage.c multi_bucket_dispatch helper (3 callers → 1), extracted query.c emit_min_max_via_keyset helper (2 cmd_aggregate branches → 1). 130 LOC removed across the two files; behavior unchanged.
  • CodeQL #94: test bulk-insert builder no longer uses the unsafe off += snprintf(buf+off, cap-off, ...) idiom. Replaced with an explicit underflow-guarded APPEND macro.

Migration

Wire format unchanged. On-disk format unchanged. No ./migrate required for upgrade from 2026.05.5 / 2026.05.6 → 2026.05.7.

Existing objects continue to work as-is — opt into the new index types via add-index on the fields where it makes sense. Bool fields added via add-field after the upgrade auto-bitmap; existing bool fields without an explicit index don't auto-promote (no schema mutation runs on upgrade).

Test coverage

85 cases / 3483 assertions, 0 failures. Bench validated at 25M-user scale on AMD Ryzen 7 7840U + NVMe ext4 cold cache:

  • Trigram contains query (long pattern, 12 chars): ~260ms warm (selective; trigram path)
  • Trigram contains query (short pattern, 5 chars): ~217ms warm (planner picks btree-leaf)
  • Bitmap eq on indexed bool (25M): ~8 seconds total reindex
  • Trigram reindex on 14-char username (25M): ~70 seconds, ~8 GB on-disk per field