Backup¶
Three overlapping strategies, each covering a different failure mode. Pick at least two.
1. Per-object logical backup¶
Use the built-in backup command.
Or via JSON:
What it copies¶
Everything for that object: data/, indexes/, metadata/, files/, and fields.conf. Destination is a timestamped sibling directory under $DB_ROOT/<dir>/:
When to use¶
- Pre-schema-migration checkpoint (before
add-field/remove-fieldon production data). - Before a risky
vacuum --splits Nor manual shard edit. - One-off snapshots of a single object for export.
Consistency¶
Copy happens under a read lock, so concurrent reads see a consistent view. Writes during the copy may or may not be included — it's "recent" but not a fenced snapshot. For strict point-in-time, pause writes first or use filesystem snapshots.
Restore¶
No automatic restore command. To restore:
- Stop the server.
- Remove or rename the current
<obj>/directory. mv <obj>.backup-TIMESTAMP <obj>.- Start the server.
The timestamped directory is a drop-in replacement — same layout.
2. Filesystem snapshots¶
The strongest option for whole-database backup. Use ZFS, LVM, btrfs, or a SAN snapshot.
ZFS example¶
# Hourly snapshot retained for 24 hours
zfs snapshot tank/shard-db@hourly-$(date +%Y%m%dT%H)
zfs list -t snapshot tank/shard-db
# Prune snapshots older than a day
zfs list -H -o name -t snapshot tank/shard-db \
| while read snap; do
ts=$(echo "$snap" | awk -F@ '{print $2}' | sed 's/hourly-//')
if [[ "$ts" < "$(date -d '24 hours ago' +%Y%m%dT%H)" ]]; then
zfs destroy "$snap"
fi
done
Why it's the gold standard¶
- Crash-consistent — the snapshot captures whatever state the kernel had flushed at the moment, which is exactly what shard-db recovers from on crash anyway.
- Instant — COW, no copy cost.
- Cross-object — captures every tenant + object + log state atomically.
Restore¶
Revert via zfs rollback (destroys later snapshots) or clone to a new filesystem for selective restore:
zfs clone tank/shard-db@hourly-20260418T14 tank/shard-db-restore
# mount tank/shard-db-restore and copy the bits you want
3. Offsite replication (cold)¶
For disaster recovery beyond a single host.
Option A — rsync the whole root¶
Cheap, simple, good enough when a few minutes of data loss is acceptable:
Run nightly from cron. -aAX preserves permissions, ACLs, xattrs. --delete prunes removed objects on the destination.
Caveat: rsync during active writes can race with shard growth (.new files mid-rename). Either pause writes, run on a filesystem snapshot, or accept occasional .new artifacts on the destination (they get swept on server startup).
Option B — block-level replication¶
DRBD, GlusterFS replicated volume, or a SAN replication product. Higher complexity, near-zero RPO.
shard-db has no native replication — there's no leader/follower streaming protocol. If you need that, layer it at the block or filesystem level.
Schedule¶
Suggested baseline:
| Layer | Cadence | Retention |
|---|---|---|
Per-object backup |
Ad-hoc, pre-migration | Keep until migration confirmed |
| Filesystem snapshot | Hourly | 24 h |
| Filesystem snapshot | Daily | 14 d |
Offsite rsync |
Daily | 30 d |
| Full cold archive | Weekly | 52 w |
Tune to your RPO and budget.
Testing restores¶
A backup you haven't restored is not a backup.
- Once a month, spin up a scratch instance on a restore of yesterday's snapshot.
- Run a handful of reads to verify object integrity.
- Run
vacuum-check— orphan counts should match what the live system reports.
What shard-db already does on its own¶
- Atomic writes — every slot flip is two-phase (write payload → flip flag).
- Atomic rebuilds —
.new+rename()for growth, vacuum, schema mutations. - Startup sweep — stale
.new/.oldartifacts are removed before accepting connections.
These protect against single-process crashes. They do not protect against disk failure, data center loss, or administrator error — hence the need for explicit backups above.