[PRE-TEST] Critical Tensor Geometry Failures and Metric Inconsistencies in cloud_VLA_finetune Benchmark #369

@satyam-coder07

Description

@satyam-coder07

1. Background

Introduction
The examples/cloud_VLA_finetune benchmark is a cutting-edge addition to KubeEdge-Ianvs, designed to evaluate Vision-Language-Action (VLA) models in robotics scenarios. It leverages the singletasklearning paradigm to simulate how foundation models (like OpenVLA or RT-2) can be fine-tuned and deployed across the cloud-edge continuum.

The Problem
While the inference phase completes successfully, the benchmark suffers from a fundamental pipeline crash during evaluation. After a deep dive into the testcasecontroller, I identified that the current implementation assumes a static output shape, which is incompatible with the variable-length nature of VLA action sequences.

The example doesn't just fail due to dependencies; it fails due to a logical mismatch in how Ianvs handles multimodal sequence data.

Debug Process and Logs
I initiated the benchmark and traced the failure to the metric computation layer. The crash results from two sequential failures:

Failure 1: Tensor Broadcasting Mismatch
The model predicts a sequence of actions (e.g., a trajectory of 16 steps), but due to edge-simulated data truncation or variable-length ground truths, the evaluation tensor comparison fails.

RuntimeError: The size of tensor a (12) must match the size of tensor b (16) at non-singleton dimension 1
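The mismatch above can be reproduced and resolved with `torch.nn.utils.rnn.pad_sequence`, which is the fix proposed below. This is a minimal sketch; the shapes (16 vs. 12 steps, a 7-dimensional action vector) are illustrative, not taken from the benchmark's actual data:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Illustrative shapes: the model predicts a 16-step trajectory while the
# (truncated) ground truth only has 12 steps. Comparing them directly
# raises the RuntimeError shown above.
action_dim = 7  # e.g. 6-DoF pose + gripper; assumed for this sketch
y_pred = torch.randn(16, action_dim)
y_true = torch.randn(12, action_dim)

# pad_sequence stacks variable-length tensors along a new batch
# dimension, zero-padding the shorter tensor to the longest length.
padded = pad_sequence([y_pred, y_true], batch_first=True)
print(padded.shape)  # torch.Size([2, 16, 7])
```

Note that the padded positions are filled with zeros, which is exactly why Failure 2 below matters: those zeros must be masked out before any metric is computed.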

Failure 2: Lack of Sequence Masking
Even when shapes are coerced, the Mean Squared Error (MSE) calculation counts "padding zeros" as real data. Zero-padded predictions trivially match zero-padded targets, which artificially inflates model accuracy and invalidates the benchmark's scientific integrity.
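A length-aware mask fixes this. The sketch below shows one way to compute MSE over valid timesteps only; the function name and tensor layout (`batch, max_len, action_dim`) are assumptions for illustration, not an existing Ianvs interface:

```python
import torch

def masked_mse(y_pred, y_true, lengths):
    """MSE over valid timesteps only; padded positions are excluded.

    y_pred, y_true: (batch, max_len, action_dim) padded tensors.
    lengths: (batch,) tensor of true sequence lengths.
    """
    batch, max_len, action_dim = y_pred.shape
    # mask[b, t] is True where t < lengths[b], i.e. the step is real data
    mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)
    sq_err = (y_pred - y_true) ** 2
    # Zero out padded steps, then average over valid elements only
    sq_err = sq_err * mask.unsqueeze(-1)
    return sq_err.sum() / (mask.sum() * action_dim)
```

Dividing by the count of valid elements (rather than the padded tensor size) is what keeps padded zeros from diluting the error.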

2. Goals
What I Would Contribute

  • Implement Dynamic Padding: Integrate torch.nn.utils.rnn.pad_sequence into the evaluation algorithm to handle variable-length VLA outputs gracefully.

  • Standardise Metric Interfaces: Modify the local algorithm script to ensure it passes sequence-length metadata to the ianvs/core metrics module.

  • Configurable Sequence Constraints: Update testenv.yaml to allow users to define max_sequence_length and padding_strategy directly from the YAML.

  • Hardware-Agnostic Validation: Fix the hardcoded CUDA assumptions in the VLA fine-tuning script to allow developers on Mac/CPU environments to validate their pipelines.
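For the hardware-agnostic goal, the hardcoded `.cuda()` calls can be replaced with runtime device detection. A minimal sketch, assuming standard PyTorch backends (`mps` covers Apple Silicon Macs):

```python
import torch

def select_device() -> torch.device:
    """Pick the best available backend instead of assuming CUDA."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()
# Replace hardcoded `tensor.cuda()` calls with `.to(device)`
model_input = torch.randn(1, 3).to(device)
```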

3. Scope
Expected Users

Robotics Researchers: Who need to benchmark foundation models (VLAs) for real-world manipulation tasks.

Edge AI Engineers: Testing how model quantisation or pruning affects the trajectory precision of action sequences.

Uniqueness of This Issue:
Unlike most pre-test submissions that focus on surface-level bugs (missing requirements.txt or broken paths), this proposal addresses a core ML pipeline failure. It demonstrates an understanding of tensor geometry and distributed benchmarking logic, positioning this as a high-impact restoration rather than a simple patch.

4. Detailed Design
Architecture

The project utilizes the singletasklearning paradigm. The fix will be isolated to the example directory to prevent breaking the core Ianvs framework:
ianvs/examples/cloud_VLA_finetune/

Module Details

  • Metric Wrapper: I will introduce a SequenceAwareMetric class within the example's algorithm folder. This wrapper will apply a boolean mask to ignore padded indices before calculating the success rate.

  • Core Interface: I will ensure the TestCaseController receives a dictionary containing y_pred and y_true as padded tensors, keeping the existing evaluate() call compatible.

  • YAML Refactor: testenv.yaml will be updated to include a dataset_params block for handling sequence truncation.
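A rough sketch of the proposed SequenceAwareMetric wrapper follows. The class name comes from this proposal (it is not an existing Ianvs API), and the success threshold is an illustrative placeholder:

```python
import torch

class SequenceAwareMetric:
    """Proposed wrapper: masks padded indices before computing the
    success rate. Sketch only; interface and threshold are assumptions."""

    def __init__(self, threshold: float = 0.05):
        # A trajectory step counts as a success when its mean absolute
        # error falls below this threshold (value is illustrative)
        self.threshold = threshold

    def __call__(self, y_pred, y_true, lengths):
        max_len = y_pred.shape[1]
        # Boolean mask: True only for real (non-padded) timesteps
        mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)
        step_err = (y_pred - y_true).abs().mean(dim=-1)  # (batch, max_len)
        hits = (step_err < self.threshold) & mask
        return hits.sum().item() / mask.sum().item()
```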

5. Road Map

Month 1: Stabilisation & Padding Fix (Weeks 1-4)

  • Reproduce the shape mismatch on a controlled dataset.

  • Implement the pad_sequence logic and validate that the benchmark runs end-to-end without a RuntimeError.

Month 2: Metric Integrity & Configuration (Weeks 5-8)

  • Implement sequence masking to ensure zeros/padding do not skew the MSE results.

  • Refactor testenv.yaml to support dynamic sequence parameters.
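The testenv.yaml refactor could look roughly like the fragment below. The dataset_params block and its keys are this proposal's suggestion, not an existing Ianvs schema:

```yaml
# Hypothetical sketch of the proposed testenv.yaml additions
testenv:
  dataset_params:
    max_sequence_length: 16     # upper bound on action-sequence length
    padding_strategy: "zero"    # e.g. "zero" or "truncate"
```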

Month 3: CI/CD & Documentation (Weeks 9-12)

  • Create a GitHub Action that uses a "toy" VLA model to test the evaluation pipeline on every PR.

  • Write a "VLA Benchmarking Guide" for the KubeEdge website to help new contributors.
