(improvement) perf: Vector float fast paths (subset of https://github.com/scylladb/gocql/pull/770) #744

mykaul wants to merge 2 commits into scylladb:master
Conversation
Pull request overview
This PR improves vector marshaling/unmarshaling performance for common float vector types in the GoCQL driver by adding specialized fast paths, backed by new unit tests to validate correctness and key behavioral expectations.
Changes:
- Add fast-path marshal/unmarshal implementations for vector<double> ([]float64) and vector<float> ([]float32) that bypass reflect and per-element dispatch.
- Add unit tests covering round-trips, byte-compatibility, nil handling, dimension mismatches, and slice-reuse behavior for float vectors.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| marshal.go | Introduces float32/float64 vector fast paths and helper encode/decode functions using big-endian IEEE 754 layout. |
| marshal_vector_test.go | Adds unit tests validating correctness, edge cases, and allocation behavior for float vector fast paths. |
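The big-endian IEEE 754 layout mentioned above can be illustrated with a standalone snippet. This is an illustration of the wire layout only, not the driver's code; the function name is hypothetical:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeFloat32Vector writes each element as a 4-byte big-endian IEEE 754
// value, the fixed-width layout the fast paths rely on.
func encodeFloat32Vector(v []float32) []byte {
	out := make([]byte, 4*len(v))
	for i, f := range v {
		binary.BigEndian.PutUint32(out[i*4:], math.Float32bits(f))
	}
	return out
}

func main() {
	b := encodeFloat32Vector([]float32{1.0})
	fmt.Printf("% x\n", b) // 1.0 encodes as 3f 80 00 00
}
```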
```go
// makeDoubleVectorType creates a VectorType for vector<double> with the given dimension.
func makeDoubleVectorType(dim int) VectorType {
	return VectorType{
		NativeType: NativeType{proto: protoVersion4, typ: TypeCustom},
		SubType:    NativeType{proto: protoVersion4, typ: TypeDouble},
		Dimensions: dim,
	}
}

// makeFloat32VectorType creates a VectorType for vector<float> with the given dimension.
func makeFloat32VectorType(dim int) VectorType {
	return VectorType{
		NativeType: NativeType{proto: protoVersion4, typ: TypeCustom},
		SubType:    NativeType{proto: protoVersion4, typ: TypeFloat},
		Dimensions: dim,
	}
}
```
The makeDoubleVectorType / makeFloat32VectorType helpers construct a VectorType with NativeType{typ: TypeCustom} but leave custom empty. Other vector-related tests/benchmarks in this repo set custom to the Cassandra vector type prefix/spec, which makes error messages and String() output much more representative. Consider setting custom (e.g., at least apacheCassandraTypePrefix+"VectorType", or the full VectorType(<subtype>, <dim>) spec) to better match real parsed schema types.
```go
// Fast paths for []float64/[]float32 — bypass reflect and per-element Marshal dispatch.
switch info.SubType.Type() {
case TypeDouble:
	switch v := value.(type) {
	case []float64:
		if v == nil {
			return nil, nil
		}
		return marshalVectorFloat64(info.Dimensions, v)
	case *[]float64:
		if v == nil || *v == nil {
			return nil, nil
		}
		return marshalVectorFloat64(info.Dimensions, *v)
	}
case TypeFloat:
	switch v := value.(type) {
	case []float32:
		if v == nil {
			return nil, nil
		}
		return marshalVectorFloat32(info.Dimensions, v)
	case *[]float32:
		if v == nil || *v == nil {
			return nil, nil
		}
		return marshalVectorFloat32(info.Dimensions, *v)
	}
}
```
marshalVector now accepts *[]float64 / *[]float32 only via the float fast paths, while the generic vector path still rejects pointer-to-slice values. Since Marshal() already dereferences pointers before calling marshalVector, this special-casing is a bit inconsistent and may be surprising for any internal callers that use marshalVector directly. Consider either removing the pointer cases (and testing pointer behavior via Marshal()), or adding a small pointer-dereference step at the start of marshalVector so pointer inputs behave consistently for all vector subtypes.
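The reviewer's second suggestion, a single pointer-dereference step at the top of marshalVector, could be sketched as follows. This is a hypothetical standalone helper, not the repo's code; a reflect-based dereference keeps the behavior uniform across all vector subtypes:

```go
package main

import (
	"fmt"
	"reflect"
)

// derefSlicePtr dereferences a non-nil pointer once so every subtype path
// downstream sees a plain value; non-pointer values pass through unchanged.
func derefSlicePtr(value interface{}) interface{} {
	rv := reflect.ValueOf(value)
	if rv.Kind() == reflect.Ptr && !rv.IsNil() {
		return rv.Elem().Interface()
	}
	return value
}

func main() {
	v := []float64{1, 2}
	got := derefSlicePtr(&v).([]float64)
	fmt.Println(len(got)) // 2
}
```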
Force-pushed 350506b to aecae6c.
Add direct type-assertion fast paths in marshalVector and unmarshalVector for []float64 and []float32. These bypass the reflect-based generic path and the per-element Marshal/Unmarshal dispatch chain, using tight binary.BigEndian loops instead.

Pointer-to-slice (*[]float64/*[]float32) is not handled in marshalVector because the top-level Marshal() function already dereferences pointers before dispatch. This keeps pointer handling in one place.

Marshal: pre-allocate the exact output buffer, encode via PutUint64/PutUint32.
Unmarshal: reuse the destination slice when capacity suffices (zero allocs).

Note: the generic reflect-based path always allocates a new slice; this intentional divergence is what delivers the zero-alloc steady state for repeated unmarshal calls.

768-dim float32 results (benchstat, n=5):
Marshal: 30.3µs → 1.4µs (-95.3%, 1544 → 2 allocs)
Unmarshal: 25.1µs → 631ns (-97.5%, 2 → 0 allocs)

Throughput exceeds 1.5 GB/s marshal and 4.5 GB/s unmarshal for float32 vectors, roughly 15-35x faster than the generic reflect path.
Add comprehensive unit tests in marshal_vector_test.go covering all code paths in the vector float fast-path implementation:

- Round-trip marshal/unmarshal for []float64 and []float32
- Byte-level wire format compatibility (big-endian IEEE 754)
- Slice reuse on repeated unmarshal (cap == dim and cap > dim)
- Nil data unmarshal with both nil and non-nil destination slices
- Nil slice and nil pointer marshal (typed nil, *[]T nil, *T→nil slice)
- Dimension mismatch errors on marshal and data length errors on unmarshal
- Empty vector (dim=0) round-trip
- Pointer-to-slice (*[]float64/*[]float32) marshal via the Marshal() API
- Special IEEE 754 values: ±Inf, NaN, MaxFloat, SmallestNonzero, -0.0

Test helpers (makeDoubleVectorType, makeFloat32VectorType) now set the custom field on NativeType for consistency with makeFloatVectorType in vector_bench_test.go.

Pointer-to-slice tests now go through the public Marshal() API rather than calling marshalVector() directly, validating the real code path where Marshal() dereferences pointers before dispatch.

10 test functions with 38 subtests in total.
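The slice-reuse behavior described above (zero allocations when the destination already has sufficient capacity) can be sketched as a standalone decode function; the name and error wording here are illustrative:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// unmarshalVectorFloat32Sketch decodes big-endian IEEE 754 data into *dst,
// reusing the existing backing array when cap(*dst) >= dim, so repeated
// calls on the same destination allocate nothing.
func unmarshalVectorFloat32Sketch(dim int, data []byte, dst *[]float32) error {
	if len(data) != 4*dim {
		return fmt.Errorf("expected %d bytes, got %d", 4*dim, len(data))
	}
	if cap(*dst) >= dim {
		*dst = (*dst)[:dim] // reuse backing array: zero allocations
	} else {
		*dst = make([]float32, dim)
	}
	for i := range *dst {
		(*dst)[i] = math.Float32frombits(binary.BigEndian.Uint32(data[i*4:]))
	}
	return nil
}

func main() {
	data := []byte{0x3f, 0x80, 0x00, 0x00} // 1.0
	dst := make([]float32, 0, 8)
	_ = unmarshalVectorFloat32Sketch(1, data, &dst)
	fmt.Println(dst[0]) // 1
}
```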
Force-pushed aecae6c to 5814a13.
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.
Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
dedicated functions for []float32, []float64, []int32, []int64
before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
for zero-alloc steady state when callers return buffers after
the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.
Benchmark results for vector<float, 1536> (typical embedding dimension):
Marshal (baseline -> optimized):
86.4 us/op -> 3.4 us/op (25x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 6172 B/op (78% less memory)
Marshal with pool return (steady state):
86.4 us/op -> 1.6 us/op (54x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 48 B/op (99.8% less memory)
Unmarshal (baseline -> optimized):
60.2 us/op -> 1.5 us/op (41x faster)
2 allocs -> 0 allocs (100% fewer)
6168 B/op -> 0 B/op (100% less memory)
Round-trip (baseline -> optimized, pooled):
147.8 us/op -> 3.1 us/op (48x faster)
3083 allocs -> 2 allocs (99.94% fewer)
34800 B/op -> 48 B/op (99.9% less memory)
Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)
New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
(round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
empty-vector, pointer-to-slice, special-values, pool-concurrency,
oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
int32/int64 marshal/unmarshal.
Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.
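The pooling scheme described in the commit message could look roughly like the sketch below. The names (vectorBufPool, getVectorBuf, putVectorBuf) and the 64 KiB cap come from the message; the internals are an assumption, not the repo's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// maxPooledVectorBuf caps pooled buffers at 64 KiB, per the commit message,
// so one huge vector cannot permanently bloat the pool.
const maxPooledVectorBuf = 64 * 1024

var vectorBufPool = sync.Pool{
	New: func() interface{} { return []byte(nil) },
}

// getVectorBuf returns a length-n buffer, reusing a pooled backing array
// when its capacity suffices.
func getVectorBuf(n int) []byte {
	b := vectorBufPool.Get().([]byte)
	if cap(b) >= n {
		return b[:n]
	}
	return make([]byte, n)
}

// putVectorBuf returns a buffer to the pool unless it exceeds the size cap.
func putVectorBuf(b []byte) {
	if cap(b) <= maxPooledVectorBuf {
		vectorBufPool.Put(b[:0])
	}
}

func main() {
	b := getVectorBuf(16)
	putVectorBuf(b)
	fmt.Println(len(getVectorBuf(16))) // 16
}
```

This yields the zero-alloc steady state only when callers return buffers after the framer copies them, as the commit message notes.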
This is a major improvement to the float-type marshal/unmarshal code paths, aimed at improving vector performance in the driver.
It begins with a commit that adds unit tests, then adds specializations for specific common vector element types.
Performance is immensely improved.
There are additional, less impactful performance improvements to the overall execution of the marshal/unmarshal paths for this and other types; I'll submit them separately.
Perf numbers:
768-dim float32 results (benchstat, n=5):
Marshal: 30.3µs → 1.4µs (-95.3%, 1544 → 2 allocs)
Unmarshal: 25.1µs → 631ns (-97.5%, 2 → 0 allocs)
Throughput exceeds 1.5 GB/s marshal, 4.5 GB/s unmarshal for float32
vectors — roughly 15-35x faster than the generic reflect path.