Concurrency¶
shard-db is multi-threaded: a worker thread pool services TCP connections, scan paths parallelize across shards, and bulk-insert builds indexes in parallel. The locking model is fine-grained and avoids a global write lock.
Lock hierarchy (bottom up)¶
| Scope | Lock type | Purpose |
|---|---|---|
| Per ucache entry (one shard mmap) | rwlock | Readers share; a writer takes exclusive. |
| Per bt_cache entry (one btree mmap) | rwlock | Same model, separate cache. One entry per per-shard idx file. |
| Per object (logical) | rwlock ("objlock") | Normal ops take read; schema mutations take write. |
| Global maps (schemas, indexes, dirs) | mutex | Short-held, protects cache-lookup structures. |
| Process-wide | atomic counters | in_flight_writes, active_threads — no locks, just atomics. |
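As a rough orientation, the hierarchy maps onto C structures like the sketch below. The type and variable names are illustrative, not the actual shard-db definitions.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative only: one rwlock per mmapped shard file, one per mmapped
 * btree file, one per logical object, a short-held mutex for the global
 * lookup maps, and lock-free atomics for process-wide counters. */
struct ucache_entry   { void *map; size_t len; pthread_rwlock_t lk; };    /* per shard mmap */
struct bt_cache_entry { void *map; size_t len; pthread_rwlock_t lk; };    /* per btree mmap */
struct obj_meta       { pthread_rwlock_t objlock; /* schema, indexes */ };/* per object     */

static pthread_mutex_t global_maps_lk = PTHREAD_MUTEX_INITIALIZER;        /* schemas/indexes/dirs */
static _Atomic long in_flight_writes;                                     /* no locks, just atomics */
static _Atomic long active_threads;
```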
Per-ucache-entry rwlock¶
Every mmapped shard file has its own rwlock. This is the hot path for both reads and writes:
- Reads (`get`, `find`, `count`, `aggregate`, scans) take shared (read) on the shard they're touching. Multiple readers can scan the same shard simultaneously.
- Writes (`insert`, `update`, `delete`, index updates) take exclusive (write) on the shard they're modifying. A writer blocks readers only on that one shard.
Because records route by hash[0..1] % splits, an insert/update/delete touches exactly one shard. Other shards remain fully concurrent. Full scans parallelize across shards — one thread per shard group — and each thread locks only the shard it's reading.
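A minimal sketch of the write path under this scheme. The `struct shard` and function names are hypothetical, and the two-byte combination below is one plausible reading of `hash[0..1] % splits`:

```c
#include <pthread.h>

struct shard { pthread_rwlock_t lk; /* mmap, Zone A/B offsets, ... */ };

/* One plausible reading of "hash[0..1] % splits": the first two bytes of
 * the record hash pick the shard. */
static unsigned route_shard(const unsigned char *hash, unsigned splits) {
    return (((unsigned)hash[0] << 8) | hash[1]) % splits;
}

static void write_record(struct shard *shards, unsigned splits,
                         const unsigned char *hash) {
    struct shard *s = &shards[route_shard(hash, splits)];
    pthread_rwlock_wrlock(&s->lk);      /* exclusive, but only on this one shard */
    /* ... write payload into Zone B, then flip the flag in Zone A ... */
    pthread_rwlock_unlock(&s->lk);
}
```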
Per-bt_cache-entry rwlock (per-shard btree, 2026.05.1+)¶
Each indexed field is sharded into splits/4 btree files (<obj>/indexes/<field>/<NNN>.idx). Every btree file has its own rwlock — same model as ucache, separate cache (BT_CACHE_MAX = FCACHE_MAX/4, derived). Writes route by record hash to a single idx-shard; reads fan out across all shards in parallel via the parallel_for worker pool.
This was the central reason for the per-shard layout. Pre-2026.05.1, a single <field>.idx file meant bulk_build (which truncates and rewrites the whole file) raced against in-flight readers holding an mmap of intermediate state. Per-file rwlocks give writers and readers proper isolation, and the parallel fan-out turns indexed lookups into N-way concurrent btree probes for free.
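A sketch of the read fan-out, using raw pthreads in place of the actual parallel_for pool and hypothetical type names:

```c
#include <pthread.h>

#define IDX_SHARDS 4   /* stands in for splits/4; illustrative constant */

struct idx_shard { pthread_rwlock_t lk; /* mmapped btree file */ };
struct probe     { struct idx_shard *s; const char *key; /* hit list ... */ };

static void *probe_one(void *p) {
    struct probe *a = p;
    pthread_rwlock_rdlock(&a->s->lk);   /* shared: concurrent with other readers */
    /* ... btree lookup of a->key, collect matching record hashes ... */
    pthread_rwlock_unlock(&a->s->lk);
    return NULL;
}

static void indexed_lookup(struct idx_shard shards[IDX_SHARDS], const char *key) {
    pthread_t tid[IDX_SHARDS];
    struct probe args[IDX_SHARDS];
    for (int i = 0; i < IDX_SHARDS; i++) {
        args[i] = (struct probe){ &shards[i], key };
        pthread_create(&tid[i], NULL, probe_one, &args[i]);
    }
    for (int i = 0; i < IDX_SHARDS; i++)
        pthread_join(tid[i], NULL);     /* merge the per-shard hit lists after joining */
}
```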
Per-object rwlock ("objlock")¶
Layered on top of the per-shard locks. Every JSON request gets classified:
- Normal ops (all CRUD, queries, bulk ops, index ops) → `objlock_rdlock()`. Many concurrent.
- Schema mutations (`add-field`, `remove-field`, `rename-field`, `vacuum --compact`, `vacuum --splits`, `truncate`) → `objlock_wrlock()`. Blocks everyone; held only for the duration of the rebuild.
This serializes schema rebuilds against everything else without holding a long-lived lock during normal traffic.
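A sketch of that classification, with a hypothetical is_schema_mutation helper; the real dispatcher's names differ:

```c
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

static pthread_rwlock_t objlock = PTHREAD_RWLOCK_INITIALIZER;

static bool is_schema_mutation(const char *cmd) {
    /* Illustrative classification, matching the list above. */
    return strcmp(cmd, "add-field")    == 0 ||
           strcmp(cmd, "remove-field") == 0 ||
           strcmp(cmd, "rename-field") == 0 ||
           strcmp(cmd, "truncate")     == 0 ||
           strncmp(cmd, "vacuum --compact", 16) == 0 ||
           strncmp(cmd, "vacuum --splits", 15)  == 0;
}

static void handle_request(const char *cmd) {
    if (is_schema_mutation(cmd))
        pthread_rwlock_wrlock(&objlock);   /* exclusive: blocks everyone for the rebuild */
    else
        pthread_rwlock_rdlock(&objlock);   /* shared: many concurrent normal ops */
    /* ... dispatch and execute the request ... */
    pthread_rwlock_unlock(&objlock);
}
```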
Write drain on shutdown¶
./shard-db stop sets server_running = 0 to refuse new connections and waits up to 30 seconds for the in_flight_writes atomic to reach zero. This guarantees that every write that entered the server before shutdown either committed or returned an error — no half-written records.
Reads are not drained; they're safe to abandon.
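A sketch of that drain, assuming the counters are plain C11 atomics; the 100 ms poll interval is an assumption, only the 30-second bound comes from the behaviour above:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static _Atomic int  server_running   = 1;
static _Atomic long in_flight_writes = 0;

static void drain_writes_on_stop(void) {
    atomic_store(&server_running, 0);          /* refuse new connections */
    for (int waited_ms = 0; waited_ms < 30000; waited_ms += 100) {
        if (atomic_load(&in_flight_writes) == 0)
            return;                            /* every in-flight write committed or errored */
        usleep(100 * 1000);                    /* re-check every 100 ms, up to 30 s */
    }
    fprintf(stderr, "stop: drain timed out with writes still in flight\n");
}
```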
Atomic write flag pattern¶
Each slot write is two steps:
1. Write the new record payload into Zone B with `flag=0`.
2. Atomically flip `flag` to `1` in Zone A.
If the process dies between steps 1 and 2, the slot stays invisible to readers (flag=0 = empty). On the next shard growth, it gets overwritten.
Updates replacing an existing slot payload use the same pattern: write-new-then-flip. Deletes flip flag=1 → flag=2 (tombstoned).
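A sketch of the slot write, with an illustrative Zone A layout; the only ordering requirement is that the payload is fully written before the flag store makes it visible:

```c
#include <stdatomic.h>
#include <string.h>

enum { SLOT_EMPTY = 0, SLOT_LIVE = 1, SLOT_TOMBSTONE = 2 };

struct zone_a_slot { _Atomic unsigned char flag; /* + slot metadata ... */ };

static void slot_write(struct zone_a_slot *a, char *zone_b, size_t off,
                       const void *payload, size_t len) {
    memcpy(zone_b + off, payload, len);                      /* step 1: payload lands, flag still 0 */
    atomic_store(&a->flag, (unsigned char)SLOT_LIVE);        /* step 2: atomic flip makes it visible */
}

static void slot_delete(struct zone_a_slot *a) {
    atomic_store(&a->flag, (unsigned char)SLOT_TOMBSTONE);   /* delete: flag 1 -> 2 (tombstone) */
}
```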
Parallel scan workers¶
For any find/count/aggregate without an index, scan_shards() spawns THREADS parallel workers (default = number of online CPUs). Each worker:
- Takes a shard group.
- Opens each shard's ucache entry (read lock).
- Walks Zone A linearly, loading Zone B only for candidates that match on metadata.
Aggregates accumulate into per-thread counters and fold at the end. No shared lock in the hot loop.
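A sketch of that shape: per-thread counters in the workers, a single fold after the join. Names are illustrative, not the actual scan_shards internals:

```c
#include <pthread.h>

struct shard    { pthread_rwlock_t lk; /* mmap, Zone A/B pointers ... */ };
struct scan_arg { struct shard *group; int ngroup; long local_count; };

static void *scan_worker(void *p) {
    struct scan_arg *a = p;
    for (int i = 0; i < a->ngroup; i++) {
        pthread_rwlock_rdlock(&a->group[i].lk);   /* shared with other readers */
        /* walk Zone A linearly; load Zone B only for metadata matches,
           accumulating into a->local_count; no shared state touched here */
        pthread_rwlock_unlock(&a->group[i].lk);
    }
    return NULL;
}

static long parallel_count(struct shard *shards, int nshards, int nthreads) {
    pthread_t tid[nthreads];
    struct scan_arg args[nthreads];
    int per = (nshards + nthreads - 1) / nthreads;   /* shards per group */
    for (int i = 0; i < nthreads; i++) {
        int lo = i * per;
        int n  = lo >= nshards ? 0 : (nshards - lo < per ? nshards - lo : per);
        args[i] = (struct scan_arg){ n ? &shards[lo] : NULL, n, 0 };
        pthread_create(&tid[i], NULL, scan_worker, &args[i]);
    }
    long total = 0;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += args[i].local_count;   /* fold per-thread counters after the join */
    }
    return total;
}
```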
For indexed multi-criteria queries, parallel_indexed_count / parallel_indexed_agg walk the primary index's hits in parallel, with each worker filtering the remaining criteria against its slice.
Parallel index build¶
cmd_add_indexes with multiple fields does a single shard scan and emits tuples to per-field sort buffers in parallel. It then runs one worker per indexed field (idx_build_field_worker); each worker buckets entries by idx-shard locally and merges them sequentially. This replaces the pre-2026.05.1 dispatch shape (14 fields × 16 idx-shards = 224 tasks) with a flat 14-task fan-out — fewer queue contention points, same total work.
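A sketch of the flat fan-out: one thread per field, each bucketing and merging its own idx-shards. Names here are illustrative stand-ins for idx_build_field_worker and its state:

```c
#include <pthread.h>

struct field_build { const char *field; /* sort buffer, per-idx-shard buckets ... */ };

static void *field_worker(void *p) {
    struct field_build *fb = p;
    /* 1. bucket this field's (key, hash) tuples by idx-shard locally
       2. for each idx-shard: take its rwlock, merge the bucket, release */
    (void)fb;
    return NULL;
}

static void build_indexes(struct field_build *fields, int nfields) {
    pthread_t tid[nfields];
    for (int i = 0; i < nfields; i++)
        pthread_create(&tid[i], NULL, field_worker, &fields[i]);
    for (int i = 0; i < nfields; i++)
        pthread_join(tid[i], NULL);   /* e.g. 14 fields -> 14 tasks, not 14 x 16 */
}
```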
Statement timeout¶
Set TIMEOUT=<seconds> in db.env for a global default, or pass "timeout_ms":N per request to override. Every scan loop calls query_deadline_tick() every 1024 iterations — a coarse monotonic-clock check. When the deadline is exceeded, the query returns {"error":"query_timeout"}. Precision is millisecond-level for long scans; the check granularity means a query finishes its current 1024-record chunk before actually stopping.
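A sketch of a tick like that, assuming a monotonic-clock millisecond deadline; the function names are hypothetical, only the 1024-iteration granularity mirrors the description above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

static int64_t now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Returns true when the query should stop; does the clock read only on
 * every 1024th iteration, so a scan overruns by at most one chunk. */
static bool deadline_tick(uint64_t iter, int64_t deadline_ms) {
    if ((iter & 1023) != 0)
        return false;
    return deadline_ms > 0 && now_ms() > deadline_ms;
}
```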
Per-query memory cap¶
QUERY_BUFFER_MB (default 500) caps the intermediate buffers any single query can hold. Checked at every collection site (ordered find buffer, aggregate hash tables, OR KeySet, AND-intersection KeySets, bulk-delete/update key list, btree hash collection). When exceeded → {"error":"query memory buffer exceeded; narrow criteria, add limit/offset, or stream via fetch+cursor"}. Prevents one bad query from monopolising RAM. Pair with whole-process containment (MemoryMax=, cgroup memory.max) as a backstop.
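A sketch of the per-query budget check each collection site would call before growing its buffer; the struct and function names are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

struct query_ctx {
    size_t used_bytes;   /* total held by this query's intermediate buffers */
    size_t cap_bytes;    /* QUERY_BUFFER_MB converted to bytes */
};

/* Called before every buffer growth (ordered-find buffer, aggregate hash
 * tables, KeySets, bulk key lists, btree hash collection). */
static bool query_reserve(struct query_ctx *q, size_t more) {
    if (q->used_bytes + more > q->cap_bytes)
        return false;    /* caller fails the query with the buffer-exceeded error */
    q->used_bytes += more;
    return true;
}
```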
Crash consistency¶
- Writes — atomic flag flip (see above). Crash mid-write = invisible slot.
- Rebuilds (vacuum, growth, schema mutations) — build a `.new` file, then rename it atomically. Crash before rename = original intact; crash after = new file in place.
- Startup sweep — any leftover `.new`/`.old` files under any object are removed on startup before accepting connections.
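A sketch of the build-then-rename shape; paths and the helper name are illustrative:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int rebuild_file(const char *path) {
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.new", path);

    int fd = open(tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;
    /* ... write the fully rebuilt contents into the .new file ... */
    if (fsync(fd) != 0) { close(fd); return -1; }
    if (close(fd) != 0)
        return -1;

    /* Atomic swap: a crash before this leaves the original untouched,
     * a crash after leaves the new file in place. */
    return rename(tmp, path);
}
```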
What's NOT transactional¶
shard-db is not a multi-statement ACID database. There are no transactions across records. CAS is per-record (insert if_not_exists, update if:{...}) and gives you optimistic concurrency control without locks.
If you need to update two records atomically, your options are:
- Write-ahead log your intent into a third record as a staging step.
- Use bulk-update with criteria — still per-record CAS, but batched.
- Combine auto_update with a version:long:default=seq(...) field and compare-and-swap on version.
Connection scaling¶
Each connection runs on a worker thread (bounded by WORKERS, default = max(ncpu, 4)). The server uses epoll for accept and hands off ready sockets to the worker queue. A single connection is not a bottleneck — pipelining multiple JSON requests over one socket gets close to per-connection line rate.
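A sketch of that accept/hand-off split; work_queue_push stands in for the actual worker-queue API, which isn't documented here:

```c
#include <sys/epoll.h>
#include <sys/socket.h>

void work_queue_push(int client_fd);   /* hypothetical: enqueue for a worker thread */

static void accept_loop(int listen_fd, int epfd) {
    struct epoll_event evs[64];
    for (;;) {
        int n = epoll_wait(epfd, evs, 64, -1);
        for (int i = 0; i < n; i++) {
            if (evs[i].data.fd == listen_fd) {
                int c = accept(listen_fd, NULL, NULL);
                if (c < 0)
                    continue;
                struct epoll_event e = { .events = EPOLLIN, .data.fd = c };
                epoll_ctl(epfd, EPOLL_CTL_ADD, c, &e);   /* watch for incoming requests */
            } else {
                work_queue_push(evs[i].data.fd);          /* ready socket -> worker queue */
            }
        }
    }
}
```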
Cache pressure: every connection allocates a MAX_REQUEST_SIZE-byte buffer. At the default 32 MB, 100 connections = 3.2 GB. Raise MAX_REQUEST_SIZE deliberately.