Concurrency¶

shard-db is multi-threaded: a worker thread pool services TCP connections, scan paths parallelize across shards, and bulk-insert builds indexes in parallel. The locking model is fine-grained and avoids a global write lock.

Lock hierarchy (bottom up)¶

Scope	Lock type	Purpose
Per ucache entry (one shard mmap)	rwlock	Readers share; a writer takes exclusive.
Per bt_cache entry (one btree mmap)	rwlock	Same model, separate cache. One entry per per-shard idx file.
Per object (logical)	rwlock ("objlock")	Normal ops take read; schema mutations take write.
Global maps (schemas, indexes, dirs)	mutex	Short-held, protects cache-lookup structures.
Process-wide	atomic counters	`in_flight_writes`, `active_threads`: no locks, just atomics.

Per-ucache-entry rwlock¶

Every mmapped shard file has its own rwlock. This is the hot path for both reads and writes:

Reads (get, find, count, aggregate, scans) take shared (read) on the shard they're touching. Multiple readers can scan the same shard simultaneously.
Writes (insert, update, delete, index updates) take exclusive (write) on the shard they're modifying. A writer blocks readers only on that one shard.

Because records route by hash[0..1] % splits, an insert/update/delete touches exactly one shard. Other shards remain fully concurrent. Full scans parallelize across shards — one thread per shard group — and each thread locks only the shard it's reading.
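The routing and locking described above can be sketched as follows. This is a minimal illustration, not shard-db's actual code: `shard_t`, `object_t`, `shard_for_hash`, `insert_record`, `count_shard`, and `MAX_SPLITS` are hypothetical names, the big-endian combination of the two hash bytes is an assumption, and a plain counter stands in for the mmapped shard contents.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdint.h>

#define MAX_SPLITS 256  /* hypothetical upper bound for this sketch */

/* One rwlock per shard, standing in for the per-ucache-entry lock. */
typedef struct {
    pthread_rwlock_t lock;
    uint64_t         records;  /* toy payload: number of records in the shard */
} shard_t;

typedef struct {
    uint32_t splits;
    shard_t  shards[MAX_SPLITS];
} object_t;

/* Route by the first two hash bytes (byte order is an assumption). */
static uint32_t shard_for_hash(const uint8_t *hash, uint32_t splits)
{
    uint32_t prefix = ((uint32_t)hash[0] << 8) | hash[1];
    return prefix % splits;
}

static int object_init(object_t *o, uint32_t splits)
{
    if (splits == 0 || splits > MAX_SPLITS) return -1;
    o->splits = splits;
    for (uint32_t i = 0; i < splits; i++) {
        if (pthread_rwlock_init(&o->shards[i].lock, NULL) != 0) return -1;
        o->shards[i].records = 0;
    }
    return 0;
}

/* Writer: exclusive lock on exactly one shard; other shards stay available. */
static void insert_record(object_t *o, const uint8_t *hash)
{
    shard_t *s = &o->shards[shard_for_hash(hash, o->splits)];
    pthread_rwlock_wrlock(&s->lock);
    s->records++;
    pthread_rwlock_unlock(&s->lock);
}

/* Reader: shared lock on the one shard it touches. */
static uint64_t count_shard(object_t *o, uint32_t shard)
{
    shard_t *s = &o->shards[shard];
    pthread_rwlock_rdlock(&s->lock);
    uint64_t n = s->records;
    pthread_rwlock_unlock(&s->lock);
    return n;
}
```

With 16 splits, a hash starting with bytes 0x01 0x05 has prefix 261 and lands in shard 5; a concurrent reader of shard 4 never waits on that insert.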

Per-bt_cache-entry rwlock (per-shard btree, 2026.05.1+)¶

Each indexed field is sharded into splits/4 btree files (<obj>/indexes/<field>/<NNN>.idx). Every btree file has its own rwlock — same model as ucache, separate cache (BT_CACHE_MAX = FCACHE_MAX/4, derived). Writes route by record hash to a single idx-shard; reads fan out across all shards in parallel via the parallel_for worker pool.

This was the central reason for the per-shard layout. Pre-2026.05.1, a single <field>.idx file meant bulk_build (which truncates and rewrites the whole file) raced against in-flight readers holding an mmap of intermediate state. Per-file rwlocks give writers and readers proper isolation, and the parallel fan-out turns indexed lookups into N-way concurrent btree probes for free.
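As a rough sketch of the per-shard index layout, the helpers below compute the idx-shard count and build a file path. The function names are hypothetical, and two details are assumptions rather than documented behavior: that `<NNN>` is a zero-padded three-digit decimal number, and that an idx-shard is picked from the same two-byte hash prefix used for data shards.

```c
#include <stdint.h>
#include <stdio.h>

/* splits/4 btree files per indexed field, per the layout described above. */
static uint32_t idx_shard_count(uint32_t splits)
{
    uint32_t n = splits / 4;
    return n ? n : 1;  /* assumption: never fewer than one idx file */
}

/* Assumption: writes pick an idx-shard from the same two-byte hash prefix. */
static uint32_t idx_shard_for_hash(const uint8_t *hash, uint32_t splits)
{
    uint32_t prefix = ((uint32_t)hash[0] << 8) | hash[1];
    return prefix % idx_shard_count(splits);
}

/* Format <obj>/indexes/<field>/<NNN>.idx. Returns 0, or -1 if buf is too small. */
static int idx_path(char *buf, size_t len, const char *obj,
                    const char *field, uint32_t shard)
{
    int n = snprintf(buf, len, "%s/indexes/%s/%03u.idx",
                     obj, field, (unsigned)shard);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

A write touches one of these files under its exclusive lock, while a lookup would probe every file from 0 to `idx_shard_count(splits) - 1`, each under its own shared lock.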

Per-object rwlock ("objlock")¶

Layered on top of the per-shard locks. Every JSON request gets classified:

Normal ops (all CRUD, queries, bulk ops, index ops) → objlock_rdlock(). Many concurrent.
Schema mutations (add-field, remove-field, rename-field, vacuum --compact, vacuum --splits, truncate) → objlock_wrlock(). Blocks everyone; held only for the duration of the rebuild.

This serializes schema rebuilds against everything else without holding a long-lived lock during normal traffic.
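A minimal sketch of that classification, assuming requests are identified by an operation string. `classify_request`, `run_with_objlock`, and `demo_handler` are hypothetical names; only the list of schema mutations comes from the text above, and the real `objlock_rdlock()` / `objlock_wrlock()` signatures are not shown here.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stddef.h>
#include <string.h>

typedef enum { OBJLOCK_READ, OBJLOCK_WRITE } objlock_mode_t;

/* Operations that rebuild the object and therefore need the write lock. */
static const char *const SCHEMA_MUTATIONS[] = {
    "add-field", "remove-field", "rename-field",
    "vacuum --compact", "vacuum --splits", "truncate",
};

static objlock_mode_t classify_request(const char *op)
{
    size_t n = sizeof SCHEMA_MUTATIONS / sizeof SCHEMA_MUTATIONS[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(op, SCHEMA_MUTATIONS[i]) == 0)
            return OBJLOCK_WRITE;
    return OBJLOCK_READ;  /* everything else shares the object */
}

/* Take the object lock in the right mode, run the handler, release. */
static int run_with_objlock(pthread_rwlock_t *objlock, const char *op,
                            int (*handler)(const char *op))
{
    int rc = (classify_request(op) == OBJLOCK_WRITE)
           ? pthread_rwlock_wrlock(objlock)
           : pthread_rwlock_rdlock(objlock);
    if (rc != 0) return -1;
    rc = handler(op);  /* per-shard locks would be taken inside the handler */
    pthread_rwlock_unlock(objlock);
    return rc;
}

/* Placeholder handler for illustration: returns the length of the op name. */
static int demo_handler(const char *op)
{
    return (int)strlen(op);
}
```

Because ordinary traffic only ever takes the shared side, the object lock costs little until a schema mutation asks for the exclusive side.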

Write drain on shutdown¶

./shard-db stop sets server_running = 0 to refuse new connections, then waits up to 30 seconds for the in_flight_writes atomic to reach zero. Once the counter has drained, every write that entered the server before shutdown has either committed or returned an error, so no record is left half-written.

Reads are not drained; they're safe to abandon.
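The drain can be modeled with two atomics and a bounded polling loop. This is a sketch under assumptions: `write_begin`, `write_end`, and `drain_writes` are hypothetical helpers, the 10 ms poll interval is arbitrary, and the real server may wait differently. Only the `server_running` and `in_flight_writes` names come from the text.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

static atomic_int  server_running   = 1;
static atomic_long in_flight_writes = 0;

/* Called before a write starts. Returns false if shutdown has begun.
 * The counter is bumped before the flag is checked, so a drain that
 * observes zero cannot miss a write that is about to proceed. */
static bool write_begin(void)
{
    atomic_fetch_add(&in_flight_writes, 1);
    if (!atomic_load(&server_running)) {
        atomic_fetch_sub(&in_flight_writes, 1);
        return false;
    }
    return true;
}

static void write_end(void)
{
    atomic_fetch_sub(&in_flight_writes, 1);
}

/* Refuse new work, then poll until the counter is zero or the timeout
 * elapses. Returns true if every in-flight write finished in time. */
static bool drain_writes(int timeout_ms)
{
    atomic_store(&server_running, 0);
    const struct timespec nap = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };
    for (int waited = 0; ; waited += 10) {
        if (atomic_load(&in_flight_writes) == 0) return true;
        if (waited >= timeout_ms) return false;
        nanosleep(&nap, NULL);
    }
}
```

In this model a caller would use `drain_writes(30000)` for the 30-second window; a false return means some writes were still running when the window closed.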

Atomic write flag pattern¶

Each slot write is two steps:

1. Write the new record payload into Zone B with flag=0.
2. Atomically flip flag to 1 in Zone A.

If the process dies between steps 1 and 2, the slot stays invisible to readers (flag=0 = empty). On next shard growth, it gets overwritten.

Updates replacing an existing slot payload use the same pattern: write-new-then-flip. Deletes flip flag=1 → flag=2 (tombstoned).
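The write-then-flip ordering can be sketched with C11 atomics. This models in-memory visibility only, not on-disk persistence. `slot_t` and the `slot_*` functions are hypothetical, the fixed 64-byte payload is an assumption, and the flag and payload share one struct here even though the real layout keeps them in separate zones.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

enum { SLOT_EMPTY = 0, SLOT_LIVE = 1, SLOT_TOMBSTONE = 2 };

#define PAYLOAD_MAX 64  /* hypothetical fixed payload capacity */

typedef struct {
    _Atomic uint8_t flag;                  /* "Zone A": visibility flag */
    uint32_t        len;
    uint8_t         payload[PAYLOAD_MAX];  /* "Zone B": record bytes */
} slot_t;

/* Step 1: copy the payload. Step 2: publish it with a release store. */
static int slot_write(slot_t *s, const void *data, uint32_t len)
{
    if (len > PAYLOAD_MAX) return -1;
    memcpy(s->payload, data, len);
    s->len = len;
    atomic_store_explicit(&s->flag, SLOT_LIVE, memory_order_release);
    return 0;
}

/* Readers check the flag first; anything other than LIVE is invisible. */
static int slot_read(slot_t *s, void *out, uint32_t cap)
{
    if (atomic_load_explicit(&s->flag, memory_order_acquire) != SLOT_LIVE)
        return -1;
    if (s->len > cap) return -1;
    memcpy(out, s->payload, s->len);
    return (int)s->len;
}

/* Delete flips 1 -> 2 and fails if the slot was not live. */
static int slot_delete(slot_t *s)
{
    uint8_t expected = SLOT_LIVE;
    return atomic_compare_exchange_strong(&s->flag, &expected, SLOT_TOMBSTONE)
         ? 0 : -1;
}
```

If `slot_write` were interrupted after the `memcpy` but before the store, `slot_read` would still report the slot as absent, which is the property the two-step pattern relies on.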

Parallel scan workers¶

For any find/count/aggregate without an index, scan_shards() spawns THREADS parallel workers (default = number of online CPUs). Each worker:

Takes a shard group.
Opens each shard's ucache entry (read lock).
Walks Zone A linearly, loading Zone B only for candidates that match on metadata.

Aggregates accumulate into per-thread counters and fold at the end. No shared lock in the hot loop.
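The per-thread accumulation can be illustrated with a small pthreads counter. This is a sketch, not `scan_shards()` itself: `scan_job_t`, `scan_worker`, and `parallel_count` are hypothetical, a flat array of flag bytes stands in for Zone A, shard groups are modeled as contiguous slot ranges, and the per-shard read locks are left out for brevity.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WORKERS 64  /* hypothetical cap for this sketch */

typedef struct {
    const uint8_t *flags;  /* toy "Zone A": one flag byte per slot */
    size_t begin, end;     /* this worker's shard group, as a slot range */
    uint64_t count;        /* per-thread result, written once at the end */
} scan_job_t;

static void *scan_worker(void *arg)
{
    scan_job_t *job = arg;
    uint64_t local = 0;  /* hot loop touches only thread-local state */
    for (size_t i = job->begin; i < job->end; i++)
        if (job->flags[i] == 1) local++;
    job->count = local;
    return NULL;
}

/* Count live slots with nworkers threads. Returns -1 if a thread
 * could not be created. */
static int64_t parallel_count(const uint8_t *flags, size_t nslots, int nworkers)
{
    if (nworkers < 1) nworkers = 1;
    if (nworkers > MAX_WORKERS) nworkers = MAX_WORKERS;

    pthread_t  tids[MAX_WORKERS];
    scan_job_t jobs[MAX_WORKERS];
    size_t chunk = (nslots + (size_t)nworkers - 1) / (size_t)nworkers;
    int started = 0, failed = 0;

    for (int w = 0; w < nworkers; w++) {
        size_t b = (size_t)w * chunk, e = b + chunk;
        if (b > nslots) b = nslots;
        if (e > nslots) e = nslots;
        jobs[w] = (scan_job_t){ flags, b, e, 0 };
        if (pthread_create(&tids[w], NULL, scan_worker, &jobs[w]) != 0) {
            failed = 1;
            break;
        }
        started++;
    }

    uint64_t total = 0;  /* fold per-thread counters after the join */
    for (int w = 0; w < started; w++) {
        pthread_join(tids[w], NULL);
        total += jobs[w].count;
    }
    return failed ? -1 : (int64_t)total;
}
```

Each worker writes its result exactly once, so the only synchronization in the whole count is the join.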

For indexed multi-criteria queries, parallel_indexed_count / parallel_indexed_agg walk the primary index's hits in parallel, with each worker filtering the remaining criteria against its slice.

Parallel index build¶

cmd_add_indexes with multiple fields does a single shard scan and emits tuples to per-field sort buffers in parallel. It then runs one worker per indexed field (idx_build_field_worker); each worker buckets entries by idx-shard locally and merges them sequentially. This replaces the pre-2026.05.1 dispatch shape (for example, 14 fields × 16 idx-shards = 224 tasks) with a flat fan-out of one task per field (14 in that example): fewer queue contention points, same total work.

Statement timeout¶

Set TIMEOUT=<seconds> in db.env for a global default, or pass "timeout_ms":N per request to override it. Every scan loop checks the deadline through query_deadline_tick() once per 1024 iterations, a coarse monotonic-clock check. When the deadline has passed, the query returns {"error":"query_timeout"}. The deadline itself has millisecond resolution, but because of the check granularity a query finishes its current 1024-record chunk before it actually stops.
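One way to implement such a coarse check is sketched below. It is an illustration only: `query_deadline_t`, `deadline_start`, and `deadline_tick` are hypothetical stand-ins and do not reflect the real `query_deadline_tick()` signature. The sketch assumes the tick is called once per record and reads the clock on every 1024th call.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define DEADLINE_CHECK_INTERVAL 1024

typedef struct {
    int64_t  deadline_ns;  /* absolute CLOCK_MONOTONIC deadline; 0 = none */
    uint32_t ticks;
    bool     expired;
} query_deadline_t;

static int64_t monotonic_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void deadline_start(query_deadline_t *d, int64_t timeout_ms)
{
    d->deadline_ns = timeout_ms > 0
                   ? monotonic_ns() + timeout_ms * 1000000LL
                   : 0;
    d->ticks = 0;
    d->expired = false;
}

/* Call once per scanned record. The clock is read only on every 1024th
 * call, so the hot loop pays for a counter increment and a branch. */
static bool deadline_tick(query_deadline_t *d)
{
    if (d->expired) return true;
    if (++d->ticks % DEADLINE_CHECK_INTERVAL != 0) return false;
    if (d->deadline_ns != 0 && monotonic_ns() >= d->deadline_ns)
        d->expired = true;
    return d->expired;
}
```

A scan loop would break out and report the timeout error as soon as `deadline_tick` returns true, which is why a query can overrun its deadline by up to one chunk.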

Per-query memory cap¶

QUERY_BUFFER_MB (default 500) caps the intermediate buffers any single query can hold. It is checked at every collection site (ordered find buffer, aggregate hash tables, OR KeySet, AND-intersection KeySets, bulk-delete/update key list, btree hash collection). When the cap is exceeded, the query fails with {"error":"query memory buffer exceeded; narrow criteria, add limit/offset, or stream via fetch+cursor"}. This prevents one bad query from monopolising RAM. Pair it with whole-process containment (MemoryMax=, cgroup memory.max) as a backstop.
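A per-query budget of this kind can be as small as a running byte counter that every collection site charges before it grows a buffer. The sketch below is hypothetical (`query_budget_t` and the `budget_*` functions are not shard-db names) and assumes that one MB means 1024 × 1024 bytes.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t used;  /* bytes currently held by this query */
    size_t cap;   /* the configured cap, converted to bytes */
} query_budget_t;

static void budget_init(query_budget_t *b, size_t cap_mb)
{
    b->used = 0;
    b->cap  = cap_mb * 1024u * 1024u;
}

/* Charge nbytes before allocating. False means the query should fail
 * with the buffer-exceeded error instead of growing further. */
static bool budget_charge(query_budget_t *b, size_t nbytes)
{
    /* used <= cap always holds, so this comparison cannot overflow. */
    if (nbytes > b->cap - b->used) return false;
    b->used += nbytes;
    return true;
}

/* Return bytes when a buffer is freed or shrunk. */
static void budget_release(query_budget_t *b, size_t nbytes)
{
    b->used -= (nbytes <= b->used) ? nbytes : b->used;
}
```

Keeping the budget per query means one query hitting its cap does not affect others, which is why the process-level limit is still worth having as a backstop.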

Crash consistency¶

Writes — atomic flag flip (see above). Crash mid-write = invisible slot.
Rebuilds (vacuum, growth, schema mutations) — build .new file, rename atomically. Crash before rename = original intact; crash after = new file in place.
Startup sweep — any leftover .new / .old files under any object are removed on startup before accepting connections.
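The build-then-rename step for rebuilds can be sketched with POSIX calls. `atomic_replace` is a hypothetical helper. The `fsync` before the rename is an assumption about how durability might be handled, not something the text above states, and `EINTR` handling is omitted.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `path` by writing <path>.new and renaming it into place.
 * Returns 0 on success, -1 on failure with the original left untouched. */
static int atomic_replace(const char *path, const void *data, size_t len)
{
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) { close(fd); unlink(tmp); return -1; }
        p += w;
        left -= (size_t)w;
    }

    /* Assumption: flush the new file before it becomes visible. */
    if (fsync(fd) != 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) != 0) { unlink(tmp); return -1; }

    /* rename() swaps the name atomically: a reader opening `path` gets
     * either the old file or the new one, never a mix of both. */
    if (rename(tmp, path) != 0) { unlink(tmp); return -1; }
    return 0;
}
```

A crash before the `rename` leaves only a stray `.new` file behind, which is the kind of leftover the startup sweep removes.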

What's NOT transactional¶

shard-db is not a multi-statement ACID database. There are no transactions across records. CAS is per-record (insert if_not_exists, update if:{...}) and gives you optimistic concurrency control without locks.

If you need to update two records atomically, your options are:

Write-ahead log your intent into a third record as a staging step.
Use bulk-update with criteria: still per-record CAS, but batched.
Combine auto_update with a version:long:default=seq(...) field and compare-and-swap on version.
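To make the version-based option concrete, here is a single-threaded model of its semantics. It is not the shard-db request API: `record_t`, `update_if_version`, and `add_to_balance` are hypothetical, and the struct copy stands in for a read request issued by the client.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    int64_t version;  /* bumped on every successful update */
    int64_t balance;  /* toy payload */
} record_t;

/* Apply the update only if the caller saw the current version. */
static bool update_if_version(record_t *r, int64_t expected_version,
                              int64_t new_balance)
{
    if (r->version != expected_version) return false;  /* lost the race */
    r->balance = new_balance;
    r->version++;
    return true;
}

/* Optimistic read-modify-write: re-read and retry on conflict. */
static bool add_to_balance(record_t *r, int64_t delta, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        record_t seen = *r;  /* stands in for a get request */
        if (update_if_version(r, seen.version, seen.balance + delta))
            return true;
    }
    return false;
}
```

A caller holding a stale version gets a failed update rather than a silent overwrite, and decides for itself whether to re-read and retry.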

Connection scaling¶

Each connection runs on a worker thread (bounded by WORKERS, default = max(ncpu, 4)). The server uses epoll for accept and hands off ready sockets to the worker queue. A single connection is not a bottleneck: pipelining multiple JSON requests over one socket gets close to per-connection line rate.

Cache pressure: every connection allocates a MAX_REQUEST_SIZE-byte buffer. At the default of 32 MB, 100 connections reserve about 3.2 GB, so raise MAX_REQUEST_SIZE only deliberately.