(improvement) perf: Vector float fast paths (subset of https://github.com/scylladb/gocql/pull/770) #744

mykaul wants to merge 2 commits into scylladb:master
Conversation
Pull request overview
This PR improves vector marshaling/unmarshaling performance for common float vector types in the GoCQL driver by adding specialized fast paths, backed by new unit tests to validate correctness and key behavioral expectations.
Changes:
- Add fast-path marshal/unmarshal implementations for vector<double> ([]float64) and vector<float> ([]float32) that bypass reflect and per-element dispatch.
- Add unit tests covering round-trips, byte-compatibility, nil handling, dimension mismatches, and slice-reuse behavior for float vectors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| marshal.go | Introduces float32/float64 vector fast paths and helper encode/decode functions using big-endian IEEE 754 layout. |
| marshal_vector_test.go | Adds unit tests validating correctness, edge cases, and allocation behavior for float vector fast paths. |
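The big-endian IEEE 754 layout mentioned above can be illustrated with a standalone snippet. This is an illustration of the wire layout only, not the driver's code; the function name is hypothetical:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeFloat32Vector writes each element as a 4-byte big-endian IEEE 754
// value, the fixed-width layout the fast paths rely on.
func encodeFloat32Vector(v []float32) []byte {
	out := make([]byte, 4*len(v))
	for i, f := range v {
		binary.BigEndian.PutUint32(out[i*4:], math.Float32bits(f))
	}
	return out
}

func main() {
	b := encodeFloat32Vector([]float32{1.0})
	fmt.Printf("% x\n", b) // 1.0 encodes as 3f 80 00 00
}
```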
```go
// makeDoubleVectorType creates a VectorType for vector<double> with the given dimension.
func makeDoubleVectorType(dim int) VectorType {
	return VectorType{
		NativeType: NativeType{proto: protoVersion4, typ: TypeCustom},
		SubType:    NativeType{proto: protoVersion4, typ: TypeDouble},
		Dimensions: dim,
	}
}

// makeFloat32VectorType creates a VectorType for vector<float> with the given dimension.
func makeFloat32VectorType(dim int) VectorType {
	return VectorType{
		NativeType: NativeType{proto: protoVersion4, typ: TypeCustom},
		SubType:    NativeType{proto: protoVersion4, typ: TypeFloat},
		Dimensions: dim,
	}
}
```
The makeDoubleVectorType / makeFloat32VectorType helpers construct a VectorType with NativeType{typ: TypeCustom} but leave custom empty. Other vector-related tests/benchmarks in this repo set custom to the Cassandra vector type prefix/spec, which makes error messages and String() output much more representative. Consider setting custom (e.g., at least apacheCassandraTypePrefix+"VectorType", or the full VectorType(<subtype>, <dim>) spec) to better match real parsed schema types.
```go
// Fast paths for []float64/[]float32 — bypass reflect and per-element Marshal dispatch.
switch info.SubType.Type() {
case TypeDouble:
	switch v := value.(type) {
	case []float64:
		if v == nil {
			return nil, nil
		}
		return marshalVectorFloat64(info.Dimensions, v)
	case *[]float64:
		if v == nil || *v == nil {
			return nil, nil
		}
		return marshalVectorFloat64(info.Dimensions, *v)
	}
case TypeFloat:
	switch v := value.(type) {
	case []float32:
		if v == nil {
			return nil, nil
		}
		return marshalVectorFloat32(info.Dimensions, v)
	case *[]float32:
		if v == nil || *v == nil {
			return nil, nil
		}
		return marshalVectorFloat32(info.Dimensions, *v)
	}
}
```
marshalVector now accepts *[]float64 / *[]float32 only via the float fast paths, while the generic vector path still rejects pointer-to-slice values. Since Marshal() already dereferences pointers before calling marshalVector, this special-casing is a bit inconsistent and may be surprising for any internal callers that use marshalVector directly. Consider either removing the pointer cases (and testing pointer behavior via Marshal()), or adding a small pointer-dereference step at the start of marshalVector so pointer inputs behave consistently for all vector subtypes.
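The reviewer's second suggestion, a single pointer-dereference step at the top of marshalVector, could be sketched as follows. This is a hypothetical standalone helper, not the repo's code; a reflect-based dereference keeps the behavior uniform across all vector subtypes:

```go
package main

import (
	"fmt"
	"reflect"
)

// derefSlicePtr dereferences a non-nil pointer once so every subtype path
// downstream sees a plain value; non-pointer values pass through unchanged.
func derefSlicePtr(value interface{}) interface{} {
	rv := reflect.ValueOf(value)
	if rv.Kind() == reflect.Ptr && !rv.IsNil() {
		return rv.Elem().Interface()
	}
	return value
}

func main() {
	v := []float64{1, 2}
	got := derefSlicePtr(&v).([]float64)
	fmt.Println(len(got)) // 2
}
```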
Force-pushed 350506b to aecae6c.
Add direct type-assertion fast paths in marshalVector and unmarshalVector for []float64 and []float32. These bypass the reflect-based generic path and the per-element Marshal/Unmarshal dispatch chain, using tight binary.BigEndian loops instead.

Pointer-to-slice (*[]float64/*[]float32) is not handled in marshalVector because the top-level Marshal() function already dereferences pointers before dispatch. This keeps pointer handling in one place.

Marshal: pre-allocate the exact output buffer, encode via PutUint64/PutUint32.
Unmarshal: reuse the destination slice when capacity suffices (zero allocs).

Note: the generic reflect-based path always allocates a new slice; this intentional divergence is what delivers the zero-alloc steady state for repeated unmarshal calls.

768-dim float32 results (benchstat, n=5):
Marshal: 30.3µs → 1.4µs (-95.3%, 1544 → 2 allocs)
Unmarshal: 25.1µs → 631ns (-97.5%, 2 → 0 allocs)

Throughput exceeds 1.5 GB/s marshal and 4.5 GB/s unmarshal for float32 vectors, roughly 15-35x faster than the generic reflect path.
Add comprehensive unit tests in marshal_vector_test.go covering all code paths in the vector float fast-path implementation:

- Round-trip marshal/unmarshal for []float64 and []float32
- Byte-level wire format compatibility (big-endian IEEE 754)
- Slice reuse on repeated unmarshal (cap == dim and cap > dim)
- Nil data unmarshal with both nil and non-nil destination slices
- Nil slice and nil pointer marshal (typed nil, *[]T nil, *T→nil slice)
- Dimension mismatch errors on marshal and data length errors on unmarshal
- Empty vector (dim=0) round-trip
- Pointer-to-slice (*[]float64/*[]float32) marshal via the Marshal() API
- Special IEEE 754 values: ±Inf, NaN, MaxFloat, SmallestNonzero, -0.0

Test helpers (makeDoubleVectorType, makeFloat32VectorType) now set the custom field on NativeType for consistency with makeFloatVectorType in vector_bench_test.go.

Pointer-to-slice tests now go through the public Marshal() API rather than calling marshalVector() directly, validating the real code path where Marshal() dereferences pointers before dispatch.

10 test functions with 38 subtests in total.
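The slice-reuse behavior described above (zero allocations when the destination already has sufficient capacity) can be sketched as a standalone decode function; the name and error wording here are illustrative:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// unmarshalVectorFloat32Sketch decodes big-endian IEEE 754 data into *dst,
// reusing the existing backing array when cap(*dst) >= dim, so repeated
// calls on the same destination allocate nothing.
func unmarshalVectorFloat32Sketch(dim int, data []byte, dst *[]float32) error {
	if len(data) != 4*dim {
		return fmt.Errorf("expected %d bytes, got %d", 4*dim, len(data))
	}
	if cap(*dst) >= dim {
		*dst = (*dst)[:dim] // reuse backing array: zero allocations
	} else {
		*dst = make([]float32, dim)
	}
	for i := range *dst {
		(*dst)[i] = math.Float32frombits(binary.BigEndian.Uint32(data[i*4:]))
	}
	return nil
}

func main() {
	data := []byte{0x3f, 0x80, 0x00, 0x00} // 1.0
	dst := make([]float32, 0, 8)
	_ = unmarshalVectorFloat32Sketch(1, data, &dst)
	fmt.Println(dst[0]) // 1
}
```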
Force-pushed aecae6c to 5814a13.
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.
Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
dedicated functions for []float32, []float64, []int32, []int64
before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
for zero-alloc steady state when callers return buffers after
the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.
Benchmark results for vector<float, 1536> (typical embedding dimension):
Marshal (baseline -> optimized):
86.4 us/op -> 3.4 us/op (25x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 6172 B/op (78% less memory)
Marshal with pool return (steady state):
86.4 us/op -> 1.6 us/op (54x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 48 B/op (99.8% less memory)
Unmarshal (baseline -> optimized):
60.2 us/op -> 1.5 us/op (41x faster)
2 allocs -> 0 allocs (100% fewer)
6168 B/op -> 0 B/op (100% less memory)
Round-trip (baseline -> optimized, pooled):
147.8 us/op -> 3.1 us/op (48x faster)
3083 allocs -> 2 allocs (99.94% fewer)
34800 B/op -> 48 B/op (99.9% less memory)
Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)
New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
(round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
empty-vector, pointer-to-slice, special-values, pool-concurrency,
oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
int32/int64 marshal/unmarshal.
Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.
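The pooling scheme described in the commit message could look roughly like the sketch below. The names (vectorBufPool, getVectorBuf, putVectorBuf) and the 64 KiB cap come from the message; the internals are an assumption, not the repo's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// maxPooledVectorBuf caps pooled buffers at 64 KiB, per the commit message,
// so one huge vector cannot permanently bloat the pool.
const maxPooledVectorBuf = 64 * 1024

var vectorBufPool = sync.Pool{
	New: func() interface{} { return []byte(nil) },
}

// getVectorBuf returns a length-n buffer, reusing a pooled backing array
// when its capacity suffices.
func getVectorBuf(n int) []byte {
	b := vectorBufPool.Get().([]byte)
	if cap(b) >= n {
		return b[:n]
	}
	return make([]byte, n)
}

// putVectorBuf returns a buffer to the pool unless it exceeds the size cap.
func putVectorBuf(b []byte) {
	if cap(b) <= maxPooledVectorBuf {
		vectorBufPool.Put(b[:0])
	}
}

func main() {
	b := getVectorBuf(16)
	putVectorBuf(b)
	fmt.Println(len(getVectorBuf(16))) // 16
}
```

This yields the zero-alloc steady state only when callers return buffers after the framer copies them, as the commit message notes.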
This is a major improvement to the float-type marshal/unmarshal code paths, aimed at improving vector performance in the driver.
It begins with a commit that adds unit tests, then adds specializations for specific common vector element types.
Performance is immensely improved.
There are additional, less impactful performance improvements to the overall execution of the marshal/unmarshal paths for this and other types; I'll submit them separately.
Perf numbers:
768-dim float32 results (benchstat, n=5):
Marshal: 30.3µs → 1.4µs (-95.3%, 1544 → 2 allocs)
Unmarshal: 25.1µs → 631ns (-97.5%, 2 → 0 allocs)
Throughput exceeds 1.5 GB/s marshal, 4.5 GB/s unmarshal for float32
vectors — roughly 15-35x faster than the generic reflect path.