shard-db 2026.05.4
Query performance, concurrency safety, and macOS support. No protocol changes, no schema changes, no migration step — drop in the new binaries and restart.
Highlights
macOS (Apple Silicon): first-class platform
Linux-isms swapped for portable equivalents:
- `epoll_create1` / `epoll_wait` → `poll()` in the server accept loop (single listen fd; no `epoll` selectivity to lose).
- `memfd_create`, `posix_fallocate`, `MADV_HUGEPAGE`, `MAP_POPULATE`, `renameat2`, `CLOCK_MONOTONIC_COARSE`: all `#ifdef`-guarded with macOS-appropriate fallbacks.
- `flock` / `LOCK_EX` / `LOCK_UN`: require `_DARWIN_C_SOURCE` on macOS to escape `_POSIX_C_SOURCE`'s suppression of BSD extensions.
- `<linux/limits.h>` → `<limits.h>` + `<sys/param.h>` for portable `PATH_MAX`.
- `build.sh`: new `BUILD_MARCH` env knob (removed hardcoded `-march=native`); `-lncurses` (not `-lncursesw`) on Darwin.
- CI matrix expanded to `macos-latest` (Apple Silicon). Release workflow now builds + cosign-signs `shard-db-2026.05.4-darwin-arm64.tar.gz` alongside the two Linux tarballs.
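As an illustration of the `#ifdef`-guarded fallback pattern in the list above, here is a minimal sketch for the clock case. The names `DB_CLOCK_FAST` and `db_now_ns` are hypothetical and not taken from the shard-db source; the actual fallbacks may be structured differently.

```c
#define _POSIX_C_SOURCE 200809L  /* expose clock_gettime under strict -std modes */
#include <stdint.h>
#include <time.h>

/* Prefer the cheap coarse clock where the platform defines it (Linux);
 * fall back to plain CLOCK_MONOTONIC elsewhere (for example macOS).
 * DB_CLOCK_FAST is a hypothetical name used only for this sketch. */
#ifdef CLOCK_MONOTONIC_COARSE
#  define DB_CLOCK_FAST CLOCK_MONOTONIC_COARSE
#else
#  define DB_CLOCK_FAST CLOCK_MONOTONIC
#endif

/* Monotonic timestamp in nanoseconds, or -1 if the clock is unavailable. */
static int64_t db_now_ns(void) {
    struct timespec ts;
    if (clock_gettime(DB_CLOCK_FAST, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}
```

Callers only ever see `db_now_ns()`, so the platform difference stays confined to one macro.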
Root cause of the daemon-spawn cascade that took five CI rounds to localise: the default stack for non-main threads is 512 KB on macOS but 8 MB on Linux. The db-dirs request handler declares `char dirs_copy[2048][256]`, which is exactly 512 KB (2048 × 256 = 524,288 bytes) on the stack. The first TCP request therefore overflowed the worker stack, hit the ARM64 guard page, and raised SIGBUS. The new `db_thread_create()` wrapper pre-sets all daemon thread stacks to 8 MB (matching the Linux default). All six production `pthread_create` sites route through it.
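A wrapper of this kind can be sketched with the standard pthread attribute calls. The signature below is an assumption made for illustration; only the name `db_thread_create` and the 8 MB figure come from these notes, and `DB_THREAD_STACK_BYTES` and `big_frame_worker` are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* 8 MB, matching the Linux default mentioned above (hypothetical macro name). */
#define DB_THREAD_STACK_BYTES ((size_t)8 * 1024 * 1024)

/* The handler's buffer alone fills a 512 KB stack. */
static_assert(sizeof(char[2048][256]) == 512 * 1024, "dirs_copy is exactly 512 KB");

/* Assumed signature; the real db_thread_create() may differ. */
static int db_thread_create(pthread_t *tid, void *(*fn)(void *), void *arg) {
    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, DB_THREAD_STACK_BYTES);
    if (rc == 0)
        rc = pthread_create(tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;
}

/* Stand-in for a request handler with a 512 KB stack frame. */
static void *big_frame_worker(void *arg) {
    volatile char dirs_copy[2048][256];
    dirs_copy[0][0] = 1;        /* touch the first byte of the buffer */
    dirs_copy[2047][255] = 2;   /* and the last one */
    *(int *)arg = dirs_copy[0][0] + dirs_copy[2047][255];
    return NULL;
}
```

Setting the size explicitly at every creation site means the stack budget no longer depends on the platform default.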
Query performance: five fast-path landings
25M-row cold-bench numbers (post sync && drop_caches):
| Query shape | Before | After | Speedup |
|---|---|---|---|
| sum X (single-spec, indexed numeric int/long/short/numeric/date) | ~2 s | ~200 ms | ~10× |
| group by username, count limit 10 (high-card varchar idx) | 5.6 s | 3.6 ms | ~1570× |
| group by email, sum(balance) limit 10 (varchar idx + indexed numeric agg) | 7.5 s | 4.1 ms | ~1800× |
| Cold full-scan count starts bio 'Software' (non-idx varchar) | 1.3 s | ~800 ms | ~1.6× |
| agg WHERE active=false (count+avg) | 2.7 s | 1.1 s | ~2.5× |
Implementation:
- Leaf-only walker for single-spec SUM/AVG. New `btree_walk_all_values` in `btree.c`: a tight forward leaf-chain walk that bypasses `BtRangeIter`'s per-entry overhead (no hash memcpy, no bound check, no yield-buffer copy). Per-entry CPU drops from ~145 ns to ~50 ns. SUM/AVG route through it; MIN/MAX keep the iter path (short-circuit on first leaf entry).
- `MADV_SEQUENTIAL` on btree leaf walks. Per-btree mmap is `MADV_RANDOM` at acquire (correct for point lookups). Sum/avg walks switch to `MADV_SEQUENTIAL` for the walk duration, which coalesces 4 KB faults into 128 KB+ readahead, then restore `MADV_RANDOM` at every exit path. `POSIX_FADV_WILLNEED` was tried first and didn't move cold sums; async prefetch races the walk's faults.
- `MADV_SEQUENTIAL` on slotcask kf during full-shard walks. Same pattern on the kf walk that drives every `PRIMARY_NONE` full-scan query. Tried extending to seg files; reverted because shared-segment files plus `MADV_SEQUENTIAL`'s "free after use" semantic caused cross-query cache eviction that regressed unrelated queries.
- Streaming k-way merge for varchar `group_by` + count + limit. New path walks each idx_shard's btree leaves via `BtRangeIter`, runs a k-way merge to dedup the same varchar across shards (idx_shards are hash16-routed; same value can land in multiple shards), emits `(key, count)` into `ctx.ht`, stops at `limit`. Gates on single varchar group_by + COUNT-only + finite limit + no criteria/having/order_by.
- Streaming-distinct extended to SUM/AVG/MIN/MAX. Same k-way merge, plus a per-emit `VSStaged` slot that aggregates via `slotcask_lookup_by_hash` of each contributing record's hash16. Gates on indexed non-varchar agg field; per-run cap (16384 records) aborts cleanly to the indexed-group-by path on low-cardinality data.
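The streaming k-way merge with early stop can be sketched over plain sorted string arrays. This is an illustration of the technique only: `ShardCursor`, `GroupCount`, and `merge_group_count` are hypothetical names, the cursors stand in for per-shard leaf iterators, and a linear scan picks the minimum where a production path might use a heap.

```c
#include <stddef.h>
#include <string.h>

/* One cursor per index shard; keys must be sorted ascending (hypothetical type). */
typedef struct { const char **keys; size_t len, pos; } ShardCursor;
typedef struct { const char *key; size_t count; } GroupCount;

/* Merge k sorted cursors, collapse equal keys across and within shards into
 * one (key, count) pair, and stop after `limit` distinct keys.
 * Returns the number of pairs written to `out`. */
static size_t merge_group_count(ShardCursor *cur, size_t k,
                                GroupCount *out, size_t limit) {
    size_t n = 0;
    while (n < limit) {
        /* Find the smallest key at the head of any non-exhausted cursor. */
        const char *min = NULL;
        for (size_t i = 0; i < k; i++) {
            if (cur[i].pos == cur[i].len)
                continue;
            const char *head = cur[i].keys[cur[i].pos];
            if (min == NULL || strcmp(head, min) < 0)
                min = head;
        }
        if (min == NULL)
            break;  /* every cursor is exhausted */

        /* Consume that key from every cursor, counting occurrences. */
        size_t count = 0;
        for (size_t i = 0; i < k; i++) {
            while (cur[i].pos < cur[i].len &&
                   strcmp(cur[i].keys[cur[i].pos], min) == 0) {
                cur[i].pos++;
                count++;
            }
        }
        out[n].key = min;
        out[n].count = count;
        n++;
    }
    return n;
}
```

Because keys are emitted in sorted order and each is complete when emitted, the walk can stop at `limit` without reading the rest of any shard.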
Concurrency audit: TSan + ASan clean (full suite)
Triggered by a TSan flake on one of the perf PRs. We ran a systematic pass under both sanitizers across all 77 test cases.
Real bugs fixed:
- `parallel_for` help-drain race: caller's nested-path acquire-read of `remaining` could see 0 and proceed to `pthread_mutex_destroy` while the last worker was still between `fetch_sub` and the post-broadcast unlock. Real crash potential under nested parallel calls. Added `_Atomic int finishing` to `PoolGroup`; workers bump before `fetch_sub`, decrement after broadcast; caller waits for both `remaining==0` AND `finishing==0` before destroying.
- 12 misc atomicity / memory-ordering fixes across `server.c`, `config.c`, `objlock.c`, `types.h`, `query.c`: `_Atomic int` for `server_running`, `active_threads`, `in_flight_writes`, `g_log_running`, `g_scan_stop`, `QueryDeadline.timed_out`; `__atomic_*` ops on the help-drain counters and worker-cfd table; `localtime` → `localtime_r` × 4 sites.
- 3 sites in `agg_single_shard_worker` were leaking ~12 MB per query via `agg_ctx_merge` not freeing the merged-from local context. `agg_ctx_free_local` added after each merge.
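The shape of the `finishing` fix for the help-drain race can be sketched as follows. Only `PoolGroup`, `remaining`, and `finishing` are named in these notes; the other fields, the helper functions, and the polling wait are assumptions made to keep the example small.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    _Atomic int remaining;  /* tasks not yet completed */
    _Atomic int finishing;  /* workers still inside the completion epilogue */
} PoolGroup;

static void group_init(PoolGroup *g, int ntasks) {
    pthread_mutex_init(&g->mu, NULL);
    pthread_cond_init(&g->cv, NULL);
    atomic_init(&g->remaining, ntasks);
    atomic_init(&g->finishing, 0);
}

/* Called by a worker once it has finished a task. */
static void group_task_done(PoolGroup *g) {
    atomic_fetch_add(&g->finishing, 1);             /* bump before fetch_sub */
    if (atomic_fetch_sub(&g->remaining, 1) == 1) {  /* this was the last task */
        pthread_mutex_lock(&g->mu);
        pthread_cond_broadcast(&g->cv);
        pthread_mutex_unlock(&g->mu);
    }
    atomic_fetch_sub(&g->finishing, 1);             /* last access to *g */
}

/* Nested (help-drain) caller: polls the counters rather than sleeping on cv.
 * Waiting on remaining alone would allow destroying mu while the last
 * worker still holds it; finishing closes that window. */
static void group_wait_and_destroy(PoolGroup *g) {
    while (atomic_load(&g->remaining) != 0 || atomic_load(&g->finishing) != 0)
        sched_yield();
    int rc = pthread_mutex_destroy(&g->mu);
    assert(rc == 0);
    (void)rc;
    pthread_cond_destroy(&g->cv);
}

/* Trivial worker body for demonstration. */
static void *worker_main(void *arg) {
    group_task_done((PoolGroup *)arg);
    return NULL;
}
```

Every worker raises `finishing` before it lowers `remaining`, so a caller that sees both counters at zero knows no worker can still be touching the mutex.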
Suppressions added for known-safe lock-order false positives (`bt_acquire`, `segcache_acquire`, `kfcache_acquire`), `seg_record_emit`'s intentional byte-races (the flag is a single byte, so reads cannot tear, although C11 formally still classes unsynchronised non-atomic access as a data race), and one backlog item (`slotcask_registry_invalidate` UAF: a real bug, queued separately; see `slotcask_registry_uaf_backlog.md`).
100 % zero-data-loss guarantee under the existing crash model is unchanged: every commit goes through the flag-flip + .new → activate → .old cleanup pattern with kernel-level mmap writeback. Sanitizer fixes were race-safety hardening, not durability fixes.
Upgrade notes
Drop-in. No data migration, no schema changes, no protocol changes.
Operators using ./migrate for prior releases don't need to run it again — 2026.05.4 is wire-compatible with 2026.05.1+.
Verification
    cosign verify-blob \
      --certificate-identity-regexp 'github\.com/sayyiditow/shard-db' \
      --certificate-oidc-issuer https://token.actions.githubusercontent.com \
      --bundle shard-db-2026.05.4-<platform>.tar.gz.bundle \
      shard-db-2026.05.4-<platform>.tar.gz
<platform> ∈ {linux-x86_64, linux-arm64, darwin-arm64}.