Skip to content

shard-db 2026.05.6

Hotfix for two latent JSON-escape bugs in the typed-record path plus the addition of a new timestamp field type. The JSON-escape bugs surface whenever varchar field bytes contain ", \, or control characters (< 0x20); any caller round-tripping such data through insertfind / get saw silent corruption pre-fix. HN comment text is a typical example, which is how this got caught while building the public showcase. The timestamp type addresses the gap between datetime (calendar-packed) and long (untyped int64) — Unix epoch ms in 8 bytes with auto-time generators.

What changed

Output side. decode_field_to_buf (config.c) and buf_field_value (query.c) used to write varchar content into JSON output using raw "%.*s", so a stored ", \, or newline silently corrupted the response stream — clients hit Expected ',' or '}' mid-response. A new json_escape_into() helper in util.c does RFC 8259-compliant escaping (worst-case 6× expansion as \u00XX); the two FT_VARCHAR emit sites now route through it. typed_decode and typed_decode_stream switched from a fixed 512-byte stack vbuf to a heap allocation sized to the field's worst case so large varchars (e.g. varchar:8192) don't truncate when escapes expand them.

Input side. typed_encode and typed_encode_defaults used to strip surrounding quotes from a JSON string value and then copy the inner bytes verbatim, preserving wire-form escapes (\", \n, \\, \uXXXX…) as literal byte sequences on disk. Now both functions route FT_VARCHAR values through the pre-existing json_unescape_string() so stored bytes match what the wire intended. Other typed fields (int / long / bool / date / datetime / numeric / float / double / byte) are unchanged — their literal forms have no escapes to decode.

CSV path is untouched. cmd_bulk_insert_delimited reads spans from the mmap'd CSV buffer and stores them as raw bytes; CSV has no escape semantics that align with JSON. Behaviour is unchanged there.

Outer-buffer sizing in typed_decode. Pre-2026.05.6 the outer per-record JSON buffer was sized as a flat nfields * 300 heuristic, which silently truncated mid-value on records carrying a varchar field whose content approached its declared capacity (e.g. comments.text:varchar:8192 with multi-KB HN comments). SB_APPEND doesn't fail-loud — it just stops writing. Now sized per-field by type: 6 * (size - 2) + 2 for varchar (worst-case 6x escape expansion), 64 bytes for everything else.

New: timestamp field type

Filed in ~/shard-db-issue-drafts/02-timestamp.md 2026-04-30, implemented here. Eight bytes, signed int64 big-endian, semantic "Unix epoch milliseconds". On-disk layout, comparison, and index key are identical to FT_LONG. What's new is the auto-time generator path:

created_at:timestamp                  # explicit ms-since-epoch on insert
created_at:timestamp:auto_create      # server stamps Date.now() on INSERT only
updated_at:timestamp:auto_update      # server stamps Date.now() on every INSERT + UPDATE

Wire shape on read is a plain integer:

{"created_at": 1779159122551}

Why not just use long? long works (and many users already do), but the type carries no semantic — clients can't tell from describe-object whether a long field is a timestamp or just a number, and there's no :auto_create / :auto_update shorthand. timestamp closes both gaps.

Why not datetime? datetime is calendar-packed (yyyyMMddHHmmss as a 6-byte composite) — fine for "user picked 2026-05-19 14:30 on a form", but can't represent pre-1970 or post-9999 dates and forces a format conversion every time the client's wire format is Unix-epoch (which is most JSON event shapes, log pipelines, telemetry, and the HN API).

The two coexist: datetime for human-calendar timestamps, timestamp for machine-event timestamps. Both have legitimate use cases.

Compatibility

  • Wire format unchanged.
  • On-disk schema unchanged. No reindex, no migrate.
  • Existing varchar records that contain JSON metacharacters are still on disk in their pre-fix shape. They continue to read as the same corrupted-but-stable form they did before — the encode-side fix only affects records written after the upgrade. If you need clean data, re-ingest through your own pipeline.

Upgrading

./shard-db stop
# replace shard-db / shard-cli / migrate with the 2026.05.6 binaries
./shard-db start

No ./migrate step is required for this release. Existing v2 data files load unchanged.

If you run shard-db under systemd / supervisord / docker, swap the ./shard-db stop and ./shard-db start lines for your service manager's equivalents.

Test coverage

  • New test-json-escape (19 assertions) — ", \, raw \n, raw \t round-trips across single get, multi-get dict, and find array response shapes; plus a 4 KB varchar value to lock in the outer-buffer sizing fix.
  • New test-timestamp (14 assertions) — explicit insert, auto_create on insert, auto_update advancing on update, indexed range queries, describe-object reporting timestamp.
  • Full suite: 81 cases / 3091 assertions, 0 failures.