Schema mutations¶
Evolve object schemas without downtime. All mutations are atomic at the object level — a per-object write-lock blocks normal ops briefly during the rebuild.
create-object¶
Create a new typed object. See Quick start for a full example.
{
"mode": "create-object",
"dir": "<dir>",
"object": "<obj>",
"splits": 16,
"max_key": 128,
"fields": [
"name:varchar:100",
"age:int",
"balance:numeric:12,2",
"active:bool"
],
"indexes": ["name","age"]
}
| Param | Required | Default | Meaning |
|---|---|---|---|
| dir | yes | - | Tenant dir. Auto-registered if new. |
| object | yes | - | Object name. |
| splits | no | 8 (DEFAULT_SPLITS) | Initial shard count. Must be a power of 2 in [8, 4096] (MIN_SPLITS–MAX_SPLITS). The default is tuned for sub-1M-row objects; pass splits explicitly for larger workloads. |
| max_key | no | 64 | Max key length in bytes. Hard ceiling 1024 (MAX_KEY_CEILING). |
| fields | yes | - | Array of typed field specs. See Concepts → Typed records. |
| indexes | no | [] | Fields to index at creation. Single or composite ("a+b"). |
| auto_key | no | (none) | Opt into server-generated keys at insert time. "uuid" → 16-byte UUIDv4 binary, rendered as 36-char dashed string on read (requires max_key >= 16). "seq(<name>)" → 8-byte int64 BE from a named sequence, rendered as decimal string on read (requires max_key >= 8; sequence is pre-initialised to 0 if absent, first next returns 1). Set once at create-object and immutable for the object's life; there is no set-auto-key mutation. See Auto-generated keys below. |
value_size (the per-record payload size, stored in segment files) is always computed as the sum of typed-field sizes — not user-configurable. Returned in create-object and describe-object responses; recorded internally for slot-size accounting.
Response: {"status":"created","object":"...","splits":N,"max_key":N,"value_size":N,"fields":N}.
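The constraints in the parameter table can be checked client-side before a request is sent. The following is a minimal sketch, not the server's validation code: `check_create_object` is a hypothetical helper, the lower bound of 1 for max_key is an assumption, and the sequence-name pattern only approximates the valid_field_name() rules described on this page.

```python
import re

MIN_SPLITS, MAX_SPLITS = 8, 4096   # bounds quoted in the parameter table
MAX_KEY_CEILING = 1024             # hard ceiling for max_key, in bytes
# Approximation of the documented name rules: no ':', '+', '/', spaces, parens.
_SEQ_RE = re.compile(r"^seq\(([^:+/\s()]+)\)$")


def check_create_object(req: dict) -> list:
    """Return a list of problems found in a create-object request (empty if none)."""
    problems = []
    splits = req.get("splits", 8)      # documented default
    max_key = req.get("max_key", 64)   # documented default
    # A power of two has exactly one bit set, so n & (n - 1) is zero.
    if not (MIN_SPLITS <= splits <= MAX_SPLITS) or splits & (splits - 1):
        problems.append("splits must be a power of 2 in [8, 4096]")
    # Lower bound of 1 is an assumption; the docs only state the ceiling.
    if not (1 <= max_key <= MAX_KEY_CEILING):
        problems.append("max_key must be in [1, 1024]")
    auto_key = req.get("auto_key")
    if auto_key == "uuid":
        if max_key < 16:
            problems.append("auto_key uuid requires max_key >= 16")
    elif auto_key is not None:
        if not _SEQ_RE.match(auto_key):
            problems.append("auto_key must be 'uuid' or 'seq(<name>)'")
        elif max_key < 8:
            problems.append("auto_key seq requires max_key >= 8")
    return problems
```

An empty list means the request passes these checks; the server remains the authority on what it accepts.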
Auto-generated keys¶
Declare "auto_key":"uuid" or "auto_key":"seq(<name>)" at create-object to turn on server-generated keys.
{
"mode":"create-object","dir":"<d>","object":"users",
"splits":16,"max_key":16,
"fields":["name:varchar:64"],
"auto_key":"uuid"
}
Insert with the key omitted → server generates per the object's mode, returns the rendered key:
// request
{"mode":"insert","dir":"<d>","object":"users","value":{"name":"Alice"}}
// response
{"status":"inserted","key":"7a8c2f12-9d31-4abc-9c4a-1a2b3c4d5e6f"}
Insert with a key provided → upsert (exists → update, else → insert). The provided key must parse in the rendered form (36-char dashed UUID or decimal int).
// request
{"mode":"insert","dir":"<d>","object":"orders","key":"42","value":{"amount":99}}
// response
{"status":"inserted","key":"42"}
CAS modifiers (if_not_exists, if) apply to provided-key inserts as usual. Omit-key + if:{...} is rejected at parse time (the predicate applies to a specific record; auto-gen doesn't compose).
Storage shape — keys are stored in their on-disk binary form: 16 bytes for uuid, 8 bytes BE int64 for seq. Wire I/O always renders as the canonical string form (UUID dashed, seq decimal). get / delete / find / keys / fetch all accept and emit the rendered form.
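The mapping between the stored binary form and the rendered wire form can be illustrated in a few lines. This is a sketch of the conversion described above, not the server's code: `render_key` and `parse_key` are hypothetical names, and treating the seq value as a signed int64 is an assumption.

```python
import uuid


def render_key(raw: bytes, mode: str) -> str:
    """Convert a stored binary key to its wire form."""
    if mode == "uuid":
        return str(uuid.UUID(bytes=raw))                 # 16 bytes -> 36-char dashed string
    return str(int.from_bytes(raw, "big", signed=True))  # 8-byte BE int64 -> decimal string


def parse_key(text: str, mode: str) -> bytes:
    """Convert a wire-form key to the binary form that would be stored."""
    if mode == "uuid":
        if len(text) != 36:                              # only the dashed form is accepted
            raise ValueError("expected a 36-char dashed UUID")
        return uuid.UUID(text).bytes
    return int(text).to_bytes(8, "big", signed=True)
```

For example, `parse_key("42", "seq")` yields eight bytes ending in `0x2a`, and `render_key` turns them back into `"42"`.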
bulk-insert — per-record omit-key gets auto-generated; per-record provided-key upserts. The whole batch is refused up front if any provided key is malformed. Generated keys for the batch are allocated in one shot (single /dev/urandom read for UUID; single seq flock for seq) so per-record overhead stays low. Per-record CAS is enforced: omit-key records take the strict-insert path (collision → that single record is condition_not_met and counted in skipped, the other records still write) while provided-key records remain upsert. The check piggybacks on the existing kf-lookup pass — zero extra lookups, zero added latency for non-auto-key bulk-insert.
// request
{"mode":"bulk-insert","dir":"<d>","object":"orders",
"records":[{"value":{"amount":1}},
{"key":"500","value":{"amount":500}},
{"value":{"amount":2}}]}
// response — keys[] preserves input order
{"status":"bulk-inserted","count":3,"skipped":0,"keys":["4","500","5"]}
The dict form ({"k1":{...},"k2":{...}}) has keys baked into the wire shape — every entry is treated as provided-key.
bulk-insert-delimited (CSV / TSV / pipe / etc.) also supports auto-key: per row, an empty first column means "auto-generate" and a non-empty first column is parsed as a wire-form key (upsert). Response shape matches the JSON form ({"status":"bulk-inserted","count":N,"skipped":M,"keys":[...]} for auto-key objects). When using inline data in a JSON request, the standard JSON escapes (\n, \r, \t, \", \\, \uXXXX) are decoded before parsing — so newline-separated records work as expected.
{"mode":"bulk-insert-delimited","dir":"<d>","object":"orders",
"delimiter":",",
"data":",100\n42,42\n,200\n"}
// response (seq watermark was at 5):
{"status":"bulk-inserted","count":3,"skipped":0,"keys":["6","42","7"]}
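When the request is built programmatically, a JSON serializer produces the `\n` escapes automatically. The sketch below rebuilds the request above with Python's standard `json` module. The naive `",".join` assumes no field contains the delimiter, a quote, or a newline; how the server handles quoted fields is not covered here.

```python
import json

# Each row is (key, amount); an empty key column asks the server to auto-generate.
rows = [("", "100"), ("42", "42"), ("", "200")]
data = "".join(",".join(row) + "\n" for row in rows)

request = {
    "mode": "bulk-insert-delimited",
    "dir": "<d>",
    "object": "orders",
    "delimiter": ",",
    "data": data,
}

# json.dumps escapes the real newlines in `data` as the two-character sequence \n.
body = json.dumps(request)
```

The resulting `body` carries the same `data` string as the hand-written request, with each newline encoded as `\n`.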
update / delete require a key as today. auto_key only fires on insert; update with no key errors with the usual "Missing key" message.
Constraints:
- uuid mode → max_key must be at least 16.
- seq(<name>) mode → max_key must be at least 8.
- The sequence name must be valid (valid_field_name() rules: no :, +, /, spaces, parens). The sequence file lives at <obj>/metadata/sequences/<name> and is shared with any field that also uses :default=seq(<name>).
- auto_key is persisted as the trailing token on the schema.conf line: dir:object:splits:max_key:2:streams:auto_key=....
- No retroactive enable: auto_key can only be set at create-object. There is no schema mutation to add or change auto_key later. Revisit only if customers need it.
- Seq collisions: for seq mode, if you manually insert records with numeric keys at or above the current sequence value, the next auto-generated insert can collide. Single insert returns {"error":"condition_not_met"} for that record; bulk-insert (JSON + delimited) skips just the colliding record (skipped:N in the response) and inserts every other auto-gen normally, so the manual record's data is never silently overwritten. UUID collisions are effectively impossible at any realistic scale.
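The bulk-insert rules for seq mode (one block reservation, strict insert for omitted keys, upsert for provided keys, skip on collision) can be modelled with a toy in-memory store. This is an illustrative sketch only: `bulk_insert`, the dict-based store, and the `seq["value"]` counter are hypothetical. Two details are assumptions because this page does not specify them: `count` is taken to mean records written, and a skipped record's generated key is still listed in `keys`.

```python
def bulk_insert(store: dict, seq: dict, records: list) -> dict:
    """Toy model: omitted key -> strict insert with a generated key; provided key -> upsert."""
    n_auto = sum(1 for r in records if "key" not in r)
    next_val = seq["value"] + 1        # first value of the reserved block
    seq["value"] += n_auto             # whole block reserved in one step
    keys, count, skipped = [], 0, 0
    for rec in records:
        if "key" in rec:
            key = rec["key"]
            store[key] = rec["value"]  # provided key: upsert
            count += 1
        else:
            key = str(next_val)
            next_val += 1
            if key in store:           # collision: skip only this record
                skipped += 1
            else:
                store[key] = rec["value"]
                count += 1
        keys.append(key)               # assumption: skipped keys are listed too
    return {"status": "bulk-inserted", "count": count, "skipped": skipped, "keys": keys}
```

With the sequence at 3 and the three-record batch from the bulk-insert example, the model returns keys `["4","500","5"]`. With a manually inserted key `"6"` and the sequence at 5, a batch of two omitted-key records skips the first and writes the second as `"7"`, leaving the manual record untouched.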
add-field¶
Append new fields to an existing object.
{
"mode": "add-field",
"dir": "<dir>",
"object": "<obj>",
"fields": [
"phone:varchar:20",
"verified:bool:default=false",
"rowid:long:default=seq(rowid)",
"trace_id:varchar:36:default=uuid()",
"nonce:varchar:18:default=random(8)"
]
}
What happens¶
- Takes the object's write lock (objlock_wrlock).
- Builds a new shard layout with the extra fields appended.
- Backfill pass: re-encodes every existing record. Prior field values are preserved; new fields are stamped with their computed default (see table below).
- Atomically swaps (.new → original rename).
- Rebuilds indexes (none referencing the new field yet; existing indexes are preserved).
- Releases the write lock.
Computed defaults on backfill¶
When the new field's spec includes a default modifier, the rebuild walk applies it to every existing record:
| Modifier | Backfill behaviour |
|---|---|
| :default=<literal> | Stamped verbatim on every existing record. Goes through the same type-aware encoder used at insert time (int BE, varchar length prefix, numeric scaling, etc.). |
| :default=seq(<name>) | The server reserves a contiguous range [start, start+live_count) from the named sequence in one flock, then assigns the values sequentially as the walk progresses. After the rebuild, the next insert-time seq(<name>) call resumes from start+live_count. |
| :default=uuid() | Fresh UUIDv4 generated per record (/dev/urandom). For varchar:36 fields the canonical 36-char dashed string is stored; for the native uuid type the raw 16 bytes are stored. |
| :default=random(<N>) | Fresh N random bytes per record (/dev/urandom), hex-encoded to 2N characters. The request is refused pre-flight if 2N exceeds the field's storage cap; no rebuild is started in that case. |
| :auto_create / :auto_update | Inert during backfill. These are insert/update-time generators: the original record's creation timestamp is unknown, so stamping now() on every row would lie about history. Existing records keep zero bytes for this field; future inserts/updates fire the generator as normal. |
| (no modifier) | Existing records' bytes for the new field are zero. Decoders render that as the type's "absent" form (empty string for varchar, 0 for int, etc.). |
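Two of the rows above involve a small amount of arithmetic: the seq range reservation and the random(N) size check. A minimal sketch, assuming a toy sequence stored as a dict whose `next` entry is the next value to hand out (the real on-disk representation is not shown on this page), with hypothetical helper names:

```python
import secrets


def reserve_seq_range(seq: dict, live_count: int) -> range:
    """Reserve [start, start + live_count) in one step and advance the sequence."""
    start = seq["next"]
    seq["next"] = start + live_count   # later callers resume from start + live_count
    return range(start, start + live_count)


def random_default(n_bytes: int, field_cap: int) -> str:
    """N random bytes, hex-encoded to 2N characters; refuse if that cannot fit."""
    if 2 * n_bytes > field_cap:
        raise ValueError("random(N) needs 2N characters, which exceeds the field size")
    return secrets.token_hex(n_bytes)  # 2 * n_bytes hex characters
```

With the `nonce:varchar:18:default=random(8)` spec from the request above, `random_default(8, 18)` returns 16 hex characters, which fits; `random(10)` would need 20 and would be refused before any rebuild starts.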
Notes¶
- Existing record count and hash routing are preserved.
- Full object rebuild — scales with object size. Not instantaneous on millions of records.
- Sequence allocation is exact at walk start: the live count is read from the kf-header summary on the legacy slotcask handle, which is the authoritative source of record counts in v2.
edit-field¶
Edit one or more existing fields in place — same-type only. Used to grow/shrink a varchar, widen/narrow an integer family field, change a numeric's scale, widen float → double, or append / rename / widen an enum.
{
"mode": "edit-field",
"dir": "<dir>",
"object": "<obj>",
"fields": [
"name:varchar:200",
"age:long",
"balance:numeric:18,4"
]
}
For enum renames, set "allow_rename": true at the top level — without it, any change at an existing enum position is rejected (renames are easy to typo and would silently relabel every existing record):
{
"mode": "edit-field",
"dir": "acme", "object": "items",
"fields": ["color:enum(crimson,green,blue,yellow)"],
"allow_rename": true
}
CLI shortcut (single-field — JSON form covers batch):
What changes are allowed¶
| Edit | Rule |
|---|---|
| varchar:N → varchar:M | Grow always allowed. Shrink refused pre-flight if any live record's content length exceeds M. |
| Integer family (short ↔ int ↔ long) | Widen always allowed (sign-extension preserves negatives). Narrow refused pre-flight if any live record's value falls outside the new type's [-2^(N×8-1), 2^(N×8-1) − 1] range. |
| numeric:P,S1 → numeric:P,S2 | Scale-up multiplies the stored int64 by 10^(S2−S1); refused pre-flight if any value would overflow int64. Scale-down divides and truncates toward zero (matches Postgres). |
| float → double | Always allowed; IEEE 754 widen, no validation needed. |
| enum(a,b,c) → enum(a,b,c,d,…) (append) | Always allowed. Existing records keep their byte index; new value gets the next index. No rebuild needed; only fields.conf is updated. |
| enum(a,b,c) → enum(a,b,c,…257+ values) (auto-widen 1B → 2B) | Allowed. Triggers a full record rebuild (zero-extends each record's byte index) and rewrites the bitmap with the wider encoding. |
| enum(a,b,c) → enum(x,b,c) (rename at position) | Requires "allow_rename": true in the request body. Existing records keep their byte index; the displayed value changes. Without the flag, refused. |
| enum(a,b,c) → enum(a,b) (remove) | Always refused. Records reference values by position; removing would corrupt every record using the dropped value. |
| enum(a,b,c) → enum(c,a,b) (reorder) | Refused (caught by the position-by-position diff; same shape as remove). |
| enum(…) → enum(…) narrow 2B → 1B | Refused. Records that hold an index ≥ 256 would lose data. |
Other cross-type edits are hard-refused with the hint: use add-field <new> + remove-field <old> + bulk-update and migrate the data explicitly.
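The enum rows of the table amount to a position-by-position comparison of the old and new value lists. The sketch below is one way to express that comparison; `classify_enum_edit` is a hypothetical helper, not the server's function. How a reorder is treated when allow_rename is set is not stated on this page, so the sketch conservatively refuses any change that moves an existing label to a different position.

```python
def classify_enum_edit(old: list, new: list, allow_rename: bool = False) -> str:
    """Classify an enum edit by comparing value lists position by position."""
    if len(new) < len(old):
        return "refused: remove"
    changed = [i for i in range(len(old)) if old[i] != new[i]]
    # Assumption: an existing label showing up at a changed position is a reorder.
    if any(new[i] in old for i in changed):
        return "refused: reorder"
    if changed and not allow_rename:
        return "refused: rename without allow_rename"
    if len(old) <= 256 < len(new):     # index no longer fits in one byte
        return "allowed: widen 1B -> 2B (full rebuild)"
    if changed:
        return "allowed: rename"
    return "allowed: append" if len(new) > len(old) else "no-op"
```

For instance, `classify_enum_edit(["a","b","c"], ["x","b","c"])` is refused, while the same call with `allow_rename=True` is classified as a rename.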
What happens¶
- Takes the object's write lock (objlock_wrlock).
- Refuses immediately if storage_version != 2 (point to ./migrate).
- Parses every edit spec; refuses on unknown field name, tombstoned field, duplicate edit in the same request, invalid type, or cross-type change.
- Builds a new TypedSchema by overlaying each edited field onto a clone of the old schema (positions unchanged, only size/offset/numeric_scale move).
- No-op fast path: if no field's encoding actually changed (a varchar staying the same size with only its :default=... modifier changing, say), skips the data rebuild and rewrites fields.conf only.
- Otherwise pre-flight scan: walks every live record across all keyfile shards and verifies each edited field's value fits the new shape. The first violation aborts with {"error":"Pre-flight failed on field [<name>]: <reason>"}, with no data or schema change.
- Rebuild: runs the same v2 rebuild path used by add-field / vacuum --compact, but with transform_field_value() re-encoding the edited fields per record. Atomic: the legacy data tree is preserved until the rebuild succeeds.
- Rewrites fields.conf to lock in the new spec.
- Smart reindex: walks index.conf and only rebuilds indexes whose referenced fields actually changed encoding. Indexes referencing untouched fields are skipped; the response carries indexes_skipped:N alongside indexes_rebuilt:N so operators can verify.
- Releases the write lock.
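The pre-flight scan in the steps above comes down to per-value range checks. A minimal sketch of the integer-narrowing and numeric-rescale checks follows, with hypothetical helper names; the byte width is passed in explicitly because the sizes of short, int, and long are not stated on this page.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def fits_int(value: int, n_bytes: int) -> bool:
    """True if value fits a signed integer that is n_bytes wide."""
    bits = n_bytes * 8
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1


def rescale_numeric(stored: int, old_scale: int, new_scale: int) -> int:
    """Re-express a scaled int64 at a new scale; truncate toward zero on scale-down."""
    if new_scale >= old_scale:
        result = stored * 10 ** (new_scale - old_scale)
        if not INT64_MIN <= result <= INT64_MAX:
            raise OverflowError("value would overflow int64")
        return result
    divisor = 10 ** (old_scale - new_scale)
    magnitude = abs(stored) // divisor   # flooring the magnitude truncates toward zero
    return -magnitude if stored < 0 else magnitude
```

Python's `//` floors toward negative infinity, so the sketch divides the magnitude and restores the sign to get truncation toward zero: a stored -12399 at scale 2 becomes -123 at scale 0, not -124.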
Response¶
No-op fast path returns {"status":"edited","fields":N,"rebuilt":false} — fields.conf updated but no data rebuild ran.
dry_run¶
Pass "dry_run":true to run every validation step (cross-type refusal, FT_ENUM prefix check, per-record varchar overflow + integer narrowing pre-flight) without writing anything.
would_rebuild:false means every edit is encoding-equivalent — the change would only touch fields.conf (e.g. carrying through a new default modifier on an unchanged type). Useful before running a same-type narrow on a large object.
Notes¶
- Default modifier carry-through: when a new edit spec omits :default=... / :auto_create / :auto_update, the modifier from the old fields.conf line is preserved. For example, edit-field age:long against an existing age:int:default=42 keeps :default=42.
- Changing a default: include the new modifier on the edit spec, e.g. edit-field age:int:default=99. Supported forms: :default=<literal>, :default=seq(<name>), :default=uuid(), :default=random(N), :auto_create, :auto_update. The new modifier affects future inserts only; existing records keep their stored values. For a one-shot backfill of existing records, use add-field <new-name> with the modifier, then bulk-update and remove-field <old-name>.
- Indexed fields: a varchar grow that doesn't shrink content, an integer widen, or any encoding-changing edit on an indexed field rebuilds only that index (smart reindex). Queries on the indexed field continue to resolve to the same records post-edit.
- Full object rebuild: scales with object size, not slot count. Not instantaneous on millions of records. Holds the wrlock for the duration.
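The carry-through rule can be pictured as a merge of two field specs. The sketch below assumes specs are colon-separated tokens and that a default literal contains no colon; `split_spec` and `carry_through` are hypothetical helpers, and the real fields.conf handling may differ.

```python
MODIFIER_FLAGS = ("auto_create", "auto_update")


def split_spec(spec: str):
    """Split 'name:type[:size][:modifier...]' into (base tokens, modifier tokens)."""
    base, mods = [], []
    for tok in spec.split(":"):
        is_mod = tok.startswith("default=") or tok in MODIFIER_FLAGS
        (mods if is_mod else base).append(tok)
    return base, mods


def carry_through(old_spec: str, new_spec: str) -> str:
    """Keep the old modifiers when the new spec does not name any."""
    new_base, new_mods = split_spec(new_spec)
    if not new_mods:
        _, new_mods = split_spec(old_spec)
    return ":".join(new_base + new_mods)
```

Under these assumptions, editing `age:int:default=42` with `age:long` yields `age:long:default=42`, while an edit spec that names its own modifier replaces the old one.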
rename-field¶
Metadata-only; no data rewrite.
What happens¶
- Rewrites fields.conf with the new name.
- Rewrites indexes/index.conf and renames indexes/<old>/ → indexes/<new>/ (per-shard directory rename: all NNN.idx files travel with the rename, no rebuild required).
- For composite indexes, rewrites parts that reference the old name and renames the composite directory accordingly.
- Invalidates caches.
Constraints¶
- Both names must pass valid_field_name() (no :, +, /, spaces).
- new must not collide with an existing field.
Response: {"status":"renamed","old":"...","new":"..."}.
remove-field¶
Tombstones one or more fields. Bytes stay reserved in every record's payload until vacuum --compact runs.
{
"mode": "remove-field",
"dir": "<dir>",
"object": "<obj>",
"fields": ["legacy_status","deprecated_flag"]
}
What happens¶
- Appends :removed to each field's line in fields.conf.
- Auto-drops any index referencing a removed field (including composites).
- Queries that reference those fields return empty values; writes silently ignore them.
Why tombstone vs delete¶
- Cheap: no data rewrite until vacuum.
- Reversible (soft): re-adding the same field via add-field creates a new column; the old tombstoned bytes stay reserved until compact.
Reclaiming the bytes¶
Response from remove-field: {"status":"removed","fields":N,"indexes_dropped":N}.
vacuum¶
Maintenance — reclaim deleted-record slots, drop tombstoned fields, or reshard.
Flavors¶
| Call | What it does |
|---|---|
| {"mode":"vacuum",...} | Direction-C seg compaction: sparse non-active seg files are pair-merged into denser ones via kf-repoint, then unlinked. Active seg of each stream is never touched. Also: if the host's CPU count has changed since create-object and slotcask_streams_for_nproc() no longer matches schema.streams, the call automatically promotes to a full rebuild that re-routes records into the new stream layout. Idempotent. |
| {"mode":"vacuum","compact":true} | Full rebuild. Drops tombstoned fields, shrinks slot_size. Indexes preserved. |
| {"mode":"vacuum","splits":N} | Full rebuild with a new shard count. Re-hashes data; hash routing identity is preserved. Triggers a full reindex (see below). Also folds in the streams-mismatch check on the same rebuild, so you never need a second call. |
| {"mode":"vacuum","compact":true,"splits":N} | Both: compact schema and reshard in one pass. |
Why splits triggers reindex (2026.05.1+)¶
Each indexed field is sharded into index_splits_for(splits) btree files (<obj>/indexes/<field>/<NNN>.idx). Changing splits changes the per-field shard count, so the on-disk NNN.idx files for the old layout become unreachable orphans. vacuum --splits calls reindex_object() after the data rebuild, which:
- Wipes every per-field idx directory (indexes/<field>/).
- Rebuilds each indexed field at the new index_splits_for(splits) shard count.
- Preserves the index list (index.conf): same indexes, fresh layout.
Plain vacuum --compact (no splits) leaves indexes alone — the per-field shard count doesn't change.
What triggers the need¶
- Many deletes → vacuum-check flags objects where tombstoned ≥10% and total ≥1000.
- Removed fields → compact to reclaim their bytes.
- Shard skew from growth → splits:N to even out load. See shard-stats.
- CPU upgrade or container resize (v2) → the next default vacuum self-heals: it detects slotcask_streams_for_nproc() ≠ schema.streams and dispatches to the rebuild path, after which records route by the new stream count. The auto-vacuum thread picks this up automatically once any object also crosses the deletion-pct threshold; for an explicit fix, run ./shard-db vacuum <dir> <obj>.
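The vacuum-check threshold from the first bullet can be written as a one-line predicate. This sketch assumes the 10% figure is the ratio of tombstoned slots to total slots; `needs_vacuum` is a hypothetical name, not part of shard-db.

```python
def needs_vacuum(total: int, tombstoned: int) -> bool:
    """Flag an object when it has at least 1000 slots and at least 10% are tombstoned."""
    # Integer comparison avoids floating-point rounding at the 10% boundary.
    return total >= 1000 and tombstoned * 10 >= total
```

Under that reading, an object with 1000 slots and 100 tombstones is flagged, while one with 999 slots is not, regardless of how many are tombstoned.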
Locks¶
All vacuum flavors take the object's write lock. Normal ops block for the duration of the rebuild.
Response (v2 light path): {"status":"vacuumed","cleaned":<seg-files-dropped>}.
Response (heavy path — compact, splits, or streams-mismatch): {"status":"rebuilt","live":N,"splits":N,"streams":N,"slot_size":N,"compact":true|false,"indexes_rebuilt":N}.
truncate¶
Delete all records; schema and indexes survive.
- Fast: zeroes out every kf shard (slot array + header counters) and drops every seg file; resets total/deleted to 0. Schema and fields.conf stay intact.
- Indexes are emptied.
- Field schema (including tombstones) stays intact.
Response: {"status":"truncated","object":"..."}.
recount¶
Rescans every shard, counts live/tombstoned slots, and rewrites metadata/counts. Use when size returns numbers that look wrong (e.g., after a crash or manual disk edit).
Response: {"count":N,"orphaned":M}.
backup¶
Copies the object's data/, indexes/, metadata/, and files/ directories into a timestamped snapshot under the same root.
Response: {"status":"backed_up","path":"<dir>/<obj>.backup-20260418T153012"}.
Snapshot is a point-in-time copy — in-flight writes after the copy starts may or may not be included. For consistent production backups, pause writes first (or accept best-effort consistency).
Lock model summary¶
| Mutation | Lock | Blocks |
|---|---|---|
| add-field, edit-field, remove-field, vacuum --compact, vacuum --splits | objlock_wrlock | All other ops on this object (reads + writes). |
| rename-field | objlock_wrlock | Same. |
| truncate | objlock_wrlock | Same. |
| backup, recount | objlock_rdlock | Only schema mutations. |
| Normal CRUD / queries | objlock_rdlock | Only schema mutations. |
Each lock is held only for the duration of the rebuild. During multi-second rebuilds, clients see queries block temporarily; consider running these mutations in a maintenance window.
See Concepts → Concurrency for the full locking story.