[PRE-TEST] Critical Tensor Geometry Failures and Metric Inconsistencies in cloud_VLA_finetune Benchmark #369

@satyam-coder07

Description

@satyam-coder07

1. Background

Introduction
The examples/cloud_VLA_finetune benchmark is a cutting-edge addition to KubeEdge-Ianvs, designed to evaluate Vision-Language-Action (VLA) models in robotics scenarios. It leverages the singletasklearning paradigm to simulate how foundation models (like OpenVLA or RT-2) can be fine-tuned and deployed across the cloud-edge continuum.

The Problem
While the inference phase completes successfully, the benchmark suffers from a fundamental pipeline crash during evaluation. After a deep dive into the testcasecontroller, I identified that the current implementation assumes a static output shape, which is incompatible with the variable-length nature of VLA action sequences.

The example doesn't just fail due to dependencies; it fails due to a logical mismatch in how Ianvs handles multimodal sequence data.

Debug Process and Logs
I initiated the benchmark and traced the failure to the metric computation layer. The crash results from two sequential failures:

Failure 1: Tensor Broadcasting Mismatch
The model predicts a sequence of actions (e.g., a trajectory of 16 steps), but due to edge-simulated data truncation or variable-length ground truths, the evaluation tensor comparison fails.

RuntimeError: The size of tensor a (12) must match the size of tensor b (16) at non-singleton dimension 1
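The mismatch above can be reproduced and resolved with `torch.nn.utils.rnn.pad_sequence`, which is the fix proposed below. This is a minimal sketch; the shapes (16 vs. 12 steps, a 7-dimensional action vector) are illustrative, not taken from the benchmark's actual data:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Illustrative shapes: the model predicts a 16-step trajectory while the
# (truncated) ground truth only has 12 steps. Comparing them directly
# raises the RuntimeError shown above.
action_dim = 7  # e.g. 6-DoF pose + gripper; assumed for this sketch
y_pred = torch.randn(16, action_dim)
y_true = torch.randn(12, action_dim)

# pad_sequence stacks variable-length tensors along a new batch
# dimension, zero-padding the shorter tensor to the longest length.
padded = pad_sequence([y_pred, y_true], batch_first=True)
print(padded.shape)  # torch.Size([2, 16, 7])
```

Note that the padded positions are filled with zeros, which is exactly why Failure 2 below matters: those zeros must be masked out before any metric is computed.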

Failure 2: Lack of Sequence Masking
Even when shapes are coerced, the Mean Squared Error (MSE) calculation counts "padding zeros" as real data. Zero-padded predictions trivially match zero-padded targets, which artificially inflates model accuracy and invalidates the benchmark's scientific integrity.
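A length-aware mask fixes this. The sketch below shows one way to compute MSE over valid timesteps only; the function name and tensor layout (`batch, max_len, action_dim`) are assumptions for illustration, not an existing Ianvs interface:

```python
import torch

def masked_mse(y_pred, y_true, lengths):
    """MSE over valid timesteps only; padded positions are excluded.

    y_pred, y_true: (batch, max_len, action_dim) padded tensors.
    lengths: (batch,) tensor of true sequence lengths.
    """
    batch, max_len, action_dim = y_pred.shape
    # mask[b, t] is True where t < lengths[b], i.e. the step is real data
    mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)
    sq_err = (y_pred - y_true) ** 2
    # Zero out padded steps, then average over valid elements only
    sq_err = sq_err * mask.unsqueeze(-1)
    return sq_err.sum() / (mask.sum() * action_dim)
```

Dividing by the count of valid elements (rather than the padded tensor size) is what keeps padded zeros from diluting the error.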

2. Goals
What I Would Contribute

  • Implement Dynamic Padding: Integrate torch.nn.utils.rnn.pad_sequence into the evaluation algorithm to handle variable-length VLA outputs gracefully.

  • Standardise Metric Interfaces: Modify the local algorithm script to ensure it passes sequence-length metadata to the ianvs/core metrics module.

  • Configurable Sequence Constraints: Update testenv.yaml to allow users to define max_sequence_length and padding_strategy directly from the YAML.

  • Hardware-Agnostic Validation: Fix the hardcoded CUDA assumptions in the VLA fine-tuning script to allow developers on Mac/CPU environments to validate their pipelines.
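For the hardware-agnostic goal, the hardcoded `.cuda()` calls can be replaced with runtime device detection. A minimal sketch, assuming standard PyTorch backends (`mps` covers Apple Silicon Macs):

```python
import torch

def select_device() -> torch.device:
    """Pick the best available backend instead of assuming CUDA."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = select_device()
# Replace hardcoded `tensor.cuda()` calls with `.to(device)`
model_input = torch.randn(1, 3).to(device)
```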

3. Scope
Expected Users

Robotics Researchers: Who need to benchmark foundation models (VLAs) for real-world manipulation tasks.

Edge AI Engineers: Testing how model quantisation or pruning affects the trajectory precision of action sequences.

Uniqueness of This Issue:
Unlike most pre-test submissions that focus on surface-level bugs (missing requirements.txt or broken paths), this proposal addresses a core ML pipeline failure. It demonstrates an understanding of tensor geometry and distributed benchmarking logic, positioning this as a high-impact restoration rather than a simple patch.

4. Detailed Design
Architecture

The project utilizes the singletasklearning paradigm. The fix will be isolated to the example directory to prevent breaking the core Ianvs framework:
ianvs/examples/cloud_VLA_finetune/

Module Details

  • Metric Wrapper: I will introduce a SequenceAwareMetric class within the example's algorithm folder. This wrapper will apply a boolean mask to ignore padded indices before calculating the success rate.

  • Core Interface: I will ensure the TestCaseController receives a dictionary containing y_pred and y_true as padded tensors, keeping the existing evaluate() call compatible.

  • YAML Refactor: testenv.yaml will be updated to include a dataset_params block for handling sequence truncation.
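A rough sketch of the proposed SequenceAwareMetric wrapper follows. The class name comes from this proposal (it is not an existing Ianvs API), and the success threshold is an illustrative placeholder:

```python
import torch

class SequenceAwareMetric:
    """Proposed wrapper: masks padded indices before computing the
    success rate. Sketch only; interface and threshold are assumptions."""

    def __init__(self, threshold: float = 0.05):
        # A trajectory step counts as a success when its mean absolute
        # error falls below this threshold (value is illustrative)
        self.threshold = threshold

    def __call__(self, y_pred, y_true, lengths):
        max_len = y_pred.shape[1]
        # Boolean mask: True only for real (non-padded) timesteps
        mask = torch.arange(max_len).unsqueeze(0) < lengths.unsqueeze(1)
        step_err = (y_pred - y_true).abs().mean(dim=-1)  # (batch, max_len)
        hits = (step_err < self.threshold) & mask
        return hits.sum().item() / mask.sum().item()
```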

5. Road Map

Month 1: Stabilisation & Padding Fix (Weeks 1-4)

  • Reproduce the shape mismatch on a controlled dataset.

  • Implement the pad_sequence logic and validate that the benchmark runs end-to-end without a RuntimeError.

Month 2: Metric Integrity & Configuration (Weeks 5-8)

  • Implement sequence masking to ensure zeros/padding do not skew the MSE results.

  • Refactor testenv.yaml to support dynamic sequence parameters.
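The testenv.yaml refactor could look roughly like the fragment below. The dataset_params block and its keys are this proposal's suggestion, not an existing Ianvs schema:

```yaml
# Hypothetical sketch of the proposed testenv.yaml additions
testenv:
  dataset_params:
    max_sequence_length: 16     # upper bound on action-sequence length
    padding_strategy: "zero"    # e.g. "zero" or "truncate"
```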

Month 3: CI/CD & Documentation (Weeks 9-12)

  • Create a GitHub Action that uses a "toy" VLA model to test the evaluation pipeline on every PR.

  • Write a "VLA Benchmarking Guide" for the KubeEdge website to help new contributors.
