Skip to content

shard-db 2026.05.4

Query performance, concurrency safety, and macOS support. No protocol changes, no schema changes, no migration step — drop in the new binaries and restart.

Highlights

macOS (Apple Silicon) — first-class platform

Linux-isms swapped for portable equivalents:

  • epoll_create1 / epoll_waitpoll() in the server accept loop (single listen fd; no epoll selectivity to lose).
  • memfd_create, posix_fallocate, MADV_HUGEPAGE, MAP_POPULATE, renameat2, CLOCK_MONOTONIC_COARSE — all #ifdef-guarded with macOS-appropriate fallbacks.
  • flock / LOCK_EX / LOCK_UN — require _DARWIN_C_SOURCE on macOS to escape _POSIX_C_SOURCE's suppression of BSD extensions.
  • <linux/limits.h><limits.h> + <sys/param.h> for portable PATH_MAX.
  • build.sh — new BUILD_MARCH env knob (removed hardcoded -march=native); -lncurses (not -lncursesw) on Darwin.
  • CI matrix expanded to macos-latest (Apple Silicon). Release workflow now builds + cosign-signs shard-db-2026.05.4-darwin-arm64.tar.gz alongside the two Linux tarballs.

Root cause of the daemon-spawn cascade that took five CI rounds to localise: macOS default thread stack is 512 KB; Linux is 8 MB. The db-dirs request handler does char dirs_copy[2048][256] — exactly 512 KB on the stack. First TCP request → worker stack overflow → ARM64 guard page → SIGBUS. New db_thread_create() wrapper pre-sets all daemon thread stacks to 8 MB (matching Linux default). All six production pthread_create sites route through it.

Query performance — five fast-path landings

25M-row cold-bench numbers (post sync && drop_caches):

Query shape Before After Speedup
sum X (single-spec, indexed numeric int/long/short/numeric/date) ~2 s ~200 ms ~10×
group by username, count limit 10 (high-card varchar idx) 5.6 s 3.6 ms ~1570×
group by email, sum(balance) limit 10 (varchar idx + indexed numeric agg) 7.5 s 4.1 ms ~1800×
Cold full-scan count starts bio 'Software' (non-idx varchar) 1.3 s ~800 ms ~1.6×
agg WHERE active=false (count+avg) 2.7 s 1.1 s ~2.5×

Implementation:

  • Leaf-only walker for single-spec SUM/AVG. New btree_walk_all_values in btree.c — a tight forward leaf-chain walk that bypasses BtRangeIter's per-entry overhead (no hash memcpy, no bound check, no yield-buffer copy). Per-entry CPU drops from ~145 ns to ~50 ns. SUM/AVG route through it; MIN/MAX keep the iter path (short-circuit on first leaf entry).
  • MADV_SEQUENTIAL on btree leaf walks. Per-btree mmap is MADV_RANDOM at acquire (correct for point lookups). Sum/avg walks switch to MADV_SEQUENTIAL for the walk duration — coalesces 4 KB faults into 128 KB+ readahead — then restore MADV_RANDOM at every exit path. POSIX_FADV_WILLNEED was tried first and didn't move cold sums; async prefetch races the walk's faults.
  • MADV_SEQUENTIAL on slotcask kf during full-shard walks. Same pattern on the kf walk that drives every PRIMARY_NONE full-scan query. Tried extending to seg files; reverted — shared-segment files plus MADV_SEQUENTIAL's "free after use" semantic caused cross-query cache eviction that regressed unrelated queries.
  • Streaming k-way merge for varchar group_by + count + limit. New path walks each idx_shard's btree leaves via BtRangeIter, runs a k-way merge to dedup the same varchar across shards (idx_shards are hash16-routed; same value can land in multiple shards), emits (key, count) into ctx.ht, stops at limit. Gates on single varchar group_by + COUNT-only + finite limit + no criteria/having/order_by.
  • Streaming-distinct extended to SUM/AVG/MIN/MAX. Same k-way merge, plus a per-emit VSStaged slot that aggregates via slotcask_lookup_by_hash of each contributing record's hash16. Gates on indexed non-varchar agg field; per-run cap (16384 records) aborts cleanly to the indexed-group-by path on low-cardinality data.

Concurrency audit — TSan + ASan clean (full suite)

Triggered by a TSan flake on one of the perf PRs. We ran a systematic pass under both sanitizers across all 77 test cases.

Real bugs fixed:

  • parallel_for help-drain race — caller's nested-path acquire-read of remaining could see 0 and proceed to pthread_mutex_destroy while the last worker was still between fetch_sub and the post-broadcast unlock. Real crash potential under nested parallel calls. Added _Atomic int finishing to PoolGroup; workers bump before fetch_sub, decrement after broadcast; caller waits for both remaining==0 AND finishing==0 before destroying.
  • 12 misc atomicity / memory-ordering fixes across server.c, config.c, objlock.c, types.h, query.c: _Atomic int for server_running, active_threads, in_flight_writes, g_log_running, g_scan_stop, QueryDeadline.timed_out; __atomic_* ops on the help-drain counters and worker-cfd table; localtimelocaltime_r × 4 sites.
  • 3 sites in agg_single_shard_worker were leaking ~12 MB per query via agg_ctx_merge not freeing the merged-from local context. agg_ctx_free_local added after each merge.

Suppressions added for known-safe lock-order false positives (bt_acquire, segcache_acquire, kfcache_acquire), seg_record_emit's intentional byte-races (correct under C11 since the flag is one byte), and one backlog item (slotcask_registry_invalidate UAF — real bug, queued separately; see slotcask_registry_uaf_backlog.md).

100 % zero-data-loss guarantee under the existing crash model is unchanged: every commit goes through the flag-flip + .new → activate → .old cleanup pattern with kernel-level mmap writeback. Sanitizer fixes were race-safety hardening, not durability fixes.

Upgrade notes

Drop-in. No data migration, no schema changes, no protocol changes.

Operators using ./migrate for prior releases don't need to run it again — 2026.05.4 is wire-compatible with 2026.05.1+.

Verification

cosign verify-blob \
  --certificate-identity-regexp 'github\.com/sayyiditow/shard-db' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  --bundle shard-db-2026.05.4-<platform>.tar.gz.bundle \
  shard-db-2026.05.4-<platform>.tar.gz

<platform>{linux-x86_64, linux-arm64, darwin-arm64}.

Full changelog

docs/reference/changelog.md#2026054