(improvement) perf: Vector float fast paths (subset of https://github.com/scylladb/gocql/pull/770 !)#744

Draft
mykaul wants to merge 2 commits into scylladb:master from mykaul:vector_float_fast_paths

Conversation


@mykaul mykaul commented Mar 7, 2026

This is a major improvement to the float-type marshal/unmarshal code paths, aimed at improving vector array performance in the driver.
It starts with a commit that adds unit tests, then adds specializations for common vector element types.

Performance is immensely improved.
There are additional, less impactful improvements to the overall marshal/unmarshal paths of this and other types; I'll submit them separately.

Perf numbers:
768-dim float32 results (benchstat, n=5):
Marshal: 30.3µs → 1.4µs (-95.3%, 1544 → 2 allocs)
Unmarshal: 25.1µs → 631ns (-97.5%, 2 → 0 allocs)

Throughput exceeds 1.5 GB/s marshal, 4.5 GB/s unmarshal for float32
vectors — roughly 15-35x faster than the generic reflect path.


Copilot AI left a comment


Pull request overview

This PR improves vector marshaling/unmarshaling performance for common float vector types in the GoCQL driver by adding specialized fast paths, backed by new unit tests to validate correctness and key behavioral expectations.

Changes:

  • Add fast-path marshal/unmarshal implementations for vector<double> ([]float64) and vector<float> ([]float32) that bypass reflect and per-element dispatch.
  • Add unit tests covering round-trips, byte-compatibility, nil handling, dimension mismatches, and slice-reuse behavior for float vectors.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File                     Description
marshal.go               Introduces float32/float64 vector fast paths and helper encode/decode functions using big-endian IEEE 754 layout.
marshal_vector_test.go   Adds unit tests validating correctness, edge cases, and allocation behavior for float vector fast paths.

Comment thread marshal_vector_test.go
Comment on lines +31 to +46
```go
// makeDoubleVectorType creates a VectorType for vector<double> with the given dimension.
func makeDoubleVectorType(dim int) VectorType {
	return VectorType{
		NativeType: NativeType{proto: protoVersion4, typ: TypeCustom},
		SubType:    NativeType{proto: protoVersion4, typ: TypeDouble},
		Dimensions: dim,
	}
}

// makeFloat32VectorType creates a VectorType for vector<float> with the given dimension.
func makeFloat32VectorType(dim int) VectorType {
	return VectorType{
		NativeType: NativeType{proto: protoVersion4, typ: TypeCustom},
		SubType:    NativeType{proto: protoVersion4, typ: TypeFloat},
		Dimensions: dim,
	}
}
```

Copilot AI Mar 7, 2026


The makeDoubleVectorType / makeFloat32VectorType helpers construct a VectorType with NativeType{typ: TypeCustom} but leave custom empty. Other vector-related tests/benchmarks in this repo set custom to the Cassandra vector type prefix/spec, which makes error messages and String() output much more representative. Consider setting custom (e.g., at least apacheCassandraTypePrefix+"VectorType", or the full VectorType(<subtype>, <dim>) spec) to better match real parsed schema types.

Comment thread marshal.go
Comment on lines +843 to +870
```go
// Fast paths for []float64/[]float32 — bypass reflect and per-element Marshal dispatch.
switch info.SubType.Type() {
case TypeDouble:
	switch v := value.(type) {
	case []float64:
		if v == nil {
			return nil, nil
		}
		return marshalVectorFloat64(info.Dimensions, v)
	case *[]float64:
		if v == nil || *v == nil {
			return nil, nil
		}
		return marshalVectorFloat64(info.Dimensions, *v)
	}
case TypeFloat:
	switch v := value.(type) {
	case []float32:
		if v == nil {
			return nil, nil
		}
		return marshalVectorFloat32(info.Dimensions, v)
	case *[]float32:
		if v == nil || *v == nil {
			return nil, nil
		}
		return marshalVectorFloat32(info.Dimensions, *v)
	}
}
```

Copilot AI Mar 7, 2026


marshalVector now accepts *[]float64 / *[]float32 only via the float fast paths, while the generic vector path still rejects pointer-to-slice values. Since Marshal() already dereferences pointers before calling marshalVector, this special-casing is a bit inconsistent and may be surprising for any internal callers that use marshalVector directly. Consider either removing the pointer cases (and testing pointer behavior via Marshal()), or adding a small pointer-dereference step at the start of marshalVector so pointer inputs behave consistently for all vector subtypes.
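The dereference step this comment suggests could look like the following sketch. The function name `derefSlice` and its placement are assumptions for illustration, not the driver's actual code; the point is that pointer-to-slice inputs get unwrapped once, up front, so every vector subtype sees the same value kinds.

```go
package main

import "fmt"

// derefSlice sketches the reviewer's suggestion: unwrap pointer-to-slice
// inputs before dispatch so *[]T behaves consistently for all vector
// subtypes. (Name and placement are hypothetical.)
func derefSlice(value interface{}) interface{} {
	switch v := value.(type) {
	case *[]float64:
		if v == nil {
			return []float64(nil) // typed nil pointer maps to nil slice
		}
		return *v
	case *[]float32:
		if v == nil {
			return []float32(nil)
		}
		return *v
	}
	return value // non-pointer values pass through unchanged
}

func main() {
	s := []float32{1, 2}
	fmt.Printf("%T\n", derefSlice(&s)) // []float32
	fmt.Printf("%T\n", derefSlice((*[]float64)(nil)))
}
```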

@mykaul mykaul force-pushed the vector_float_fast_paths branch from 350506b to aecae6c on March 7, 2026 at 16:43
@mykaul mykaul marked this pull request as draft March 8, 2026 07:32
mykaul added 2 commits March 12, 2026 10:25
Add direct type-assertion fast paths in marshalVector and unmarshalVector
for []float64 and []float32. These bypass the reflect-based generic path
and per-element Marshal/Unmarshal dispatch chain, using tight
binary.BigEndian loops instead.

Pointer-to-slice (*[]float64/*[]float32) is not handled in marshalVector
because the top-level Marshal() function already dereferences pointers
before dispatch. This keeps pointer handling in one place.

Marshal: pre-allocate exact output buffer, encode via PutUint64/PutUint32.
Unmarshal: reuse destination slice when capacity suffices (zero allocs).
Note: the generic reflect-based path always allocates a new slice; this
intentional divergence is what delivers the zero-alloc steady-state for
repeated unmarshal calls.
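The slice-reuse behavior described above can be sketched as follows. This is a simplified illustration under assumed names, not the commit's exact code: when the destination slice already has enough capacity, the decoder reslices and fills it in place instead of allocating.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// unmarshalVectorFloat32 sketches the zero-alloc read path: reuse the
// destination's backing array when capacity suffices, otherwise allocate.
// (Illustrative, simplified from the actual implementation.)
func unmarshalVectorFloat32(data []byte, dst *[]float32) error {
	if len(data)%4 != 0 {
		return fmt.Errorf("data length %d is not a multiple of 4", len(data))
	}
	n := len(data) / 4
	if cap(*dst) >= n {
		*dst = (*dst)[:n] // reuse backing array: zero allocations
	} else {
		*dst = make([]float32, n)
	}
	for i := range *dst {
		(*dst)[i] = math.Float32frombits(binary.BigEndian.Uint32(data[4*i:]))
	}
	return nil
}

func main() {
	buf := make([]float32, 0, 4) // pre-sized destination
	data := []byte{0x3f, 0x80, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00}
	_ = unmarshalVectorFloat32(data, &buf)
	fmt.Println(buf) // [1 2], decoded into the existing capacity
}
```

A caller that reuses the same destination slice across rows therefore pays no allocation after the first call, which is where the 0-allocs steady state in the benchmark comes from.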

768-dim float32 results (benchstat, n=5):
  Marshal:   30.3µs → 1.4µs   (-95.3%, 1544 → 2 allocs)
  Unmarshal: 25.1µs → 631ns   (-97.5%, 2 → 0 allocs)

Throughput exceeds 1.5 GB/s marshal, 4.5 GB/s unmarshal for float32
vectors — roughly 15-35x faster than the generic reflect path.
Add comprehensive unit tests in marshal_vector_test.go covering all
code paths in the vector float fast-path implementation:

- Round-trip marshal/unmarshal for []float64 and []float32
- Byte-level wire format compatibility (big-endian IEEE 754)
- Slice reuse on repeated unmarshal (cap == dim and cap > dim)
- Nil data unmarshal with both nil and non-nil destination slices
- Nil slice and nil pointer marshal (typed nil, *[]T nil, *T→nil slice)
- Dimension mismatch errors on marshal and data length errors on unmarshal
- Empty vector (dim=0) round-trip
- Pointer-to-slice (*[]float64/*[]float32) marshal via Marshal() API
- Special IEEE 754 values: ±Inf, NaN, MaxFloat, SmallestNonzero, -0.0

Test helpers (makeDoubleVectorType, makeFloat32VectorType) now set the
custom field on NativeType for consistency with makeFloatVectorType in
vector_bench_test.go.

Pointer-to-slice tests now go through the public Marshal() API rather
than calling marshalVector() directly, validating the real code path
where Marshal() dereferences pointers before dispatch.

10 test functions with 38 subtests total.
@mykaul mykaul force-pushed the vector_float_fast_paths branch from aecae6c to 5814a13 on March 12, 2026 at 10:57
mykaul added a commit to mykaul/gocql that referenced this pull request Mar 13, 2026
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.

Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
  dedicated functions for []float32, []float64, []int32, []int64
  before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
  unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
  marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
  for zero-alloc steady state when callers return buffers after
  the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
  capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
  for non-fast-path fixed-size types (e.g. UUID, timestamp).
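The pooling scheme described above can be sketched like this. The helper names mirror the commit message (vectorBufPool/getVectorBuf/putVectorBuf), but the bodies here are assumptions for illustration; the key points are the capacity check on Get and the 64KiB cap on Put.

```go
package main

import (
	"fmt"
	"sync"
)

const maxPooledBuf = 64 << 10 // 64KiB cap prevents pool bloat

var vectorBufPool = sync.Pool{
	New: func() any { return make([]byte, 0, 1024) },
}

// getVectorBuf returns a buffer of length n, reusing a pooled buffer
// when its capacity suffices. (Sketch; bodies are assumptions.)
func getVectorBuf(n int) []byte {
	b := vectorBufPool.Get().([]byte)
	if cap(b) < n {
		return make([]byte, n) // pooled buffer too small; allocate fresh
	}
	return b[:n]
}

// putVectorBuf returns a buffer to the pool unless it exceeds the cap,
// in which case it is dropped and left to the GC.
func putVectorBuf(b []byte) {
	if cap(b) > maxPooledBuf {
		return // oversized buffers are not pooled
	}
	vectorBufPool.Put(b[:0])
}

func main() {
	b := getVectorBuf(512)
	fmt.Println(len(b), cap(b) >= 512)
	putVectorBuf(b)
}
```

This pattern only yields the zero-alloc steady state when callers actually return buffers after the framer copies them, which is why the commit benchmarks the pooled and non-pooled marshal cases separately.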

Benchmark results for vector<float, 1536> (typical embedding dimension):

  Marshal (baseline -> optimized):
    86.4 us/op  ->  3.4 us/op  (25x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  6172 B/op   (78% less memory)

  Marshal with pool return (steady state):
    86.4 us/op  ->  1.6 us/op  (54x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  48 B/op     (99.8% less memory)

  Unmarshal (baseline -> optimized):
    60.2 us/op  ->  1.5 us/op  (41x faster)
    2 allocs    ->  0 allocs    (100% fewer)
    6168 B/op   ->  0 B/op      (100% less memory)

  Round-trip (baseline -> optimized, pooled):
    147.8 us/op ->  3.1 us/op  (48x faster)
    3083 allocs ->  2 allocs    (99.94% fewer)
    34800 B/op  ->  48 B/op     (99.9% less memory)

  Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)

New test files:
- marshal_vector_test.go: 58 unit subtests across 13 categories
  (round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
  empty-vector, pointer-to-slice, special-values, pool-concurrency,
  oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
  int32/int64 marshal/unmarshal.

Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.
@mykaul mykaul changed the title from "(improvement) perf: Vector float fast paths" to "(improvement) perf: Vector float fast paths (subset of https://github.com/scylladb/gocql/pull/770 !)" on Mar 15, 2026
mykaul added a commit to mykaul/gocql that referenced this pull request Mar 24, 2026
mykaul added a commit to mykaul/gocql that referenced this pull request Apr 3, 2026
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.

Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
  dedicated functions for []float32, []float64, []int32, []int64
  before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
  unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
  marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
  for zero-alloc steady state when callers return buffers after
  the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
  capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
  for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
  platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
  for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
  in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.

Benchmark results for vector<float, 1536> (typical embedding dimension):

  Marshal (baseline -> optimized):
    86.4 us/op  ->  3.4 us/op  (25x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  6172 B/op   (78% less memory)

  Marshal with pool return (steady state):
    86.4 us/op  ->  1.6 us/op  (54x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  48 B/op     (99.8% less memory)

  Unmarshal (baseline -> optimized):
    60.2 us/op  ->  1.5 us/op  (41x faster)
    2 allocs    ->  0 allocs    (100% fewer)
    6168 B/op   ->  0 B/op      (100% less memory)

  Round-trip (baseline -> optimized, pooled):
    147.8 us/op ->  3.1 us/op  (48x faster)
    3083 allocs ->  2 allocs    (99.94% fewer)
    34800 B/op  ->  48 B/op     (99.9% less memory)

  Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)

New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
  (round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
  empty-vector, pointer-to-slice, special-values, pool-concurrency,
  oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
  int32/int64 marshal/unmarshal.

Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.
mykaul added a commit that referenced this pull request Apr 10, 2026