perf/correctness: emit hm (HMAC-256) at sv element level, not b3 (option 1)

## Summary

Post-PR cipherstash/encrypt-query-language#196, `eql_v2.hash_encrypted` is hmac-only and `eql_v2."="` is `eql_v2.hmac_256(a) = eql_v2.hmac_256(b)`. This is correct for root-level operations — root payloads carry `hm`, not `b3`. But `eql_v2."->"` and `eql_v2.jsonb_path_query_first` extract an sv element and re-present it as a root-level `eql_v2_encrypted` value. sv elements today carry `b3` (Blake3 selector-scoped equality) but no `hm`, so calling `=`, `GROUP BY`, `DISTINCT`, or any hash-strategy operation on the extracted value raises:

```
ERROR: Cannot hash eql_v2_encrypted value: no hmac_256 index term found.
       Configure a `unique` index on the column for hash operations.
```

The `docs/reference/sql-support.md` matrix at the time of writing claims object / array / boolean / null paths support equality (including `GROUP BY`) via `b3` — that documentation no longer matches reality and the gap is exactly this issue.

## Why patching `=` / `hash_encrypted` to fall back on `b3` is not the answer

Tested four shapes that add a `b3` fallback inside `eql_v2."="` and/or `eql_v2.hash_encrypted` — `CASE WHEN has_*()`, `coalesce(hmac eq, b3 eq)`, NULL-safe inlinable `hmac_256` + `coalesce`, plpgsql NULL-safe `hmac_256` + `coalesce`. All four fix correctness but every one breaks at least one of the two PostgreSQL planner optimisations that PR cipherstash/encrypt-query-language#196's body shape unlocks:

* **Functional index match** on `eql_v2.hmac_256(col)` — the structural pattern that bench_text_hmac_idx and equivalents rely on.
* **Merge Join sort-key hoist** — the planner extracts `eql_v2.hmac_256(e)::text` as the Sort key so it sorts on 32-byte hmac strings rather than the full 1.7 KB encrypted JSONB.

Concretely on a 10K-row JSON column with a GIN index, the patched shape v4 (NULL-safe inlinable `hmac_256` + `coalesce(...)` in `=`) takes self-JOIN from 344 ms → **85,421 ms** (250× regression). The other shapes are 30–130× regressions. There's no way to add a `b3` fallback to `=` without inserting a wrapper that defeats the planner's structural pattern matching.

## Proposal: emit `hm` (HMAC-256) instead of `b3` (Blake3) at the sv element level

Treat this as a crypto-layer change in `@cipherstash/protect` / proxy: stop emitting `b3` for ste_vec elements and emit `hm` in its place (or both, during transition).

### Why this works

* Every sv element carries `hm` after the change. After `eql_v2."->"` or `eql_v2.jsonb_path_query_first` extracts an sv element, the resulting `eql_v2_encrypted` has `hm` at root. PR cipherstash/encrypt-query-language#196's hmac-only `=` / `hash_encrypted` then work for field-level GROUP BY, DISTINCT, joins, etc. — no EQL code change needed.
* The planner's index-match and Merge Join sort-key hoist optimisations stay intact for the hot path.
* It makes the documentation matrix true again — all JSON node types support equality once a single deterministic index term (`hm`) is present.

### Why the HMAC-vs-Blake3 perf gap is acceptable

On ARM/NEON CPUs HMAC-SHA256 throughput is comparable to Blake3 — see [crypto-benches MACs results](<https://github.com/cipherstash/crypto-benches/blob/main/results/macbook-air-m4-15.md#macs>). On x86 with hardware SHA acceleration the gap is also narrow. The encryption-time cost of switching is small for typical workloads.

### Concrete bench result

Built a 10K-row table `bench_hm_in_sv` with the option-1 shape (sv element carries `hm`), against a clean PR cipherstash/encrypt-query-language#196 baseline (no EQL patches):


| Operation | bench_b3_only (today, b3 in sv) | **bench_hm_in_sv (option 1)** |
| -- | -- | -- |
| `@>` via GIN | n/a (works at root) | 0.24 ms |
| `WHERE e = $1` single-row | works at root | 13.57 ms |
| `GROUP BY e` root | works | 124.97 ms |
| `GROUP BY jsonb_path_query_first(e, '<sel>')` field | **RAISES** | **391 ms** |
| `DISTINCT e` | works | 122.42 ms |
| self-JOIN on root e | 344 ms | 470 ms |

The field-level GROUP BY that raises today runs at 391 ms — same ballpark as field-hm GROUP BY on bench_json. Self-JOIN stays in the 300–500 ms range (Merge Join + hmac sort-key hoist preserved); no catastrophic plan flip.

## EQL-side changes (small)

If the protect / proxy side starts emitting `hm` at sv element level, EQL needs:

1. `eql_v2.ste_vec_contains` element comparison: the fix in cipherstash/encrypt-query-language#196's review added the `has_blake3 → compare_blake3` guard. Replace with `has_hmac_256 → compare_hmac_256` (or keep both, preferring `hm` and falling through to `b3` for legacy data). Either way the contract — "selector-scoped equality on encrypted plaintext" — is unchanged.
2. **Backward-compat reads**: during the transition window (old data has `b3`, new data has `hm`), `ste_vec_contains` and any other consumer of sv-element equality should accept either term. A `coalesce(hm-compare, b3-compare)` shape works *here* (it's an internal call, not the hot-path `=` operator — no planner regression).
3. **Documentation refresh**: update `docs/reference/sql-support.md` matrix to describe `hm` as the equality term for all JSON node types (Object, Array, Boolean, Null, String, Number — the latter two already had it via the unique-index pathway). Remove the "b3 supports GROUP BY only for object/array/etc" caveat. Add an upgrade note explaining the field-level GROUP BY contract change.

## Migration plan (concrete)

1. Protect / proxy: start dual-emitting (`b3` + `hm`) at sv elements. Existing readers (ste_vec_contains via b3) keep working; new readers (anything going through `=` / `hash_encrypted` post-extraction) prefer `hm`.
2. EQL: land the `ste_vec_contains` change above with the `coalesce(hm, b3)` shape so it reads either.
3. After a transition window, protect/proxy stops emitting `b3` for new data. Existing data still has b3; that's fine — the EQL coalesce reads it.
4. (Optional, longer term) drop `b3` entirely once all customer data is migrated. Not required.

## Verification

In-repo bench coverage for this scenario is already drafted in [#203](<https://github.com/cipherstash/encrypt-query-language/issues/203>) (companion PR adding GROUP BY / JOIN / DISTINCT bench tests). The `bench_json_data.sql` fixture overlays `hm` at the `$.hello` selector — exactly the option-1 shape. The field-level GROUP BY plan + regression tests pass with PR cipherstash/encrypt-query-language#196's untouched `=`, no patches needed.

When this lands, the regression tests' `#[ignore = "#202: ..."]` markers on the hash-strategy timing assertions stay relevant for the root-level `hash_encrypted` fast-path (a separate concern), but the field-level test (`group_by_jsonb_field_under_threshold`) becomes about end-to-end JSON field aggregation perf — which is what we actually want it to measure.

## Related

* cipherstash/encrypt-query-language#189 / [#191](<https://github.com/cipherstash/encrypt-query-language/issues/191>) — `like`/`ilike` IMMUTABLE flip (the original Phase 1 of this theme).
* cipherstash/encrypt-query-language#196 — Phase 1 operator inlining + payload scheme tightening (where root-hmac-only was introduced).
* [#201](<https://github.com/cipherstash/encrypt-query-language/issues/201>) — `~~` two-level inlining for bare LIKE.
* cipherstash/encrypt-query-language#202 — `hash_encrypted` fast-path on root `hm` (separate perf concern, root-level).
* [#203](<https://github.com/cipherstash/encrypt-query-language/issues/203>) — bench coverage (companion).
* cipherstash/encrypt-query-language#204 — JSONB field extractor inlining (separate Phase 3, for `->` / `jsonb_path_query_first` per-row cost).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf/correctness: emit hm (HMAC-256) at sv element level, not b3 (option 1) #205

Summary

Why patching `=` / `hash_encrypted` to fall back on `b3` is not the answer

Proposal: emit `hm` (HMAC-256) instead of `b3` (Blake3) at the sv element level

Why this works

Why the HMAC-vs-Blake3 perf gap is acceptable

Concrete bench result

EQL-side changes (small)

Migration plan (concrete)

Verification

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Operation	bench_b3_only (today, b3 in sv)	bench_hm_in_sv (option 1)
`@>` via GIN	n/a (works at root)	0.24 ms
`WHERE e = $1` single-row	works at root	13.57 ms
`GROUP BY e` root	works	124.97 ms
`GROUP BY jsonb_path_query_first(e, '<sel>')` field	RAISES	391 ms
`DISTINCT e`	works	122.42 ms
self-JOIN on root e	344 ms	470 ms

perf/correctness: emit hm (HMAC-256) at sv element level, not b3 (option 1) #205

Description

Summary

Why patching = / hash_encrypted to fall back on b3 is not the answer

Proposal: emit hm (HMAC-256) instead of b3 (Blake3) at the sv element level

Why this works

Why the HMAC-vs-Blake3 perf gap is acceptable

Concrete bench result

EQL-side changes (small)

Migration plan (concrete)

Verification

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Why patching `=` / `hash_encrypted` to fall back on `b3` is not the answer

Proposal: emit `hm` (HMAC-256) instead of `b3` (Blake3) at the sv element level