Speaker Diarization - TypeError: iteration over a 0-d tensor in audio_to_label.py #15566

@msp752

Description

Describe the bug

Speaker diarization consistently fails during the embedding extraction phase with a TypeError: iteration over a 0-d tensor error in the _fixed_seq_collate_fn function. The error occurs after successful VAD (Voice Activity Detection) and speech segmentation, specifically when the speaker embedding model attempts to process the first batch of subsegments.

The error occurs in audio_to_label.py at line 127 where zip(*batch) is called, but the batch variable contains a 0-dimensional tensor instead of the expected list/tuple structure.
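The failure can be reproduced in isolation; a minimal sketch (the 4-tuple layout mirrors the fields unpacked at line 127, with placeholder values rather than NeMo's actual samples):

```python
import torch

# What _fixed_seq_collate_fn expects: a list of
# (audio, audio_len, token, token_len) tuples.
batch = [(torch.zeros(16000), torch.tensor(16000), torch.tensor(0), torch.tensor(1))]
_, audio_lengths, _, tokens_lengths = zip(*batch)  # unpacks fine

# What it received here instead: a 0-d tensor. The * unpacking forces
# iteration, which PyTorch rejects for 0-d tensors.
bad_batch = torch.tensor(0.0)
try:
    zip(*bad_batch)
except TypeError as e:
    print(e)  # iteration over a 0-d tensor
```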

Steps/Code to reproduce bug

  1. Create a manifest file with audio chunks (WAV format, 16kHz, mono):
{"audio_filepath": "/path/to/chunk.wav", "offset": 0, "duration": 300.0, "label": "infer", "text": "-", "num_speakers": 6, "rttm_filepath": null, "uem_filepath": null}
  2. Run the following minimal code:
from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

config = {
    'diarizer': {
        'manifest_filepath': '/path/to/manifest.json',
        'out_dir': '/output/dir',
        'oracle_vad': False,
        'oracle_num_speakers': False,
        'max_num_speakers': 6,
        'device': 'cpu',
        'batch_size': 1,

        'msdd_model': {
            'model_path': 'diar_msdd_telephonic',
            'parameters': {
                'use_speaker_model_from_ckpt': True,
                'infer_batch_size': 1,
                'sigmoid_threshold': [0.7],
            }
        },

        'speaker_embeddings': {
            'model_path': 'ecapa_tdnn',
            'parameters': {
                'window_length_in_sec': [1.5],
                'shift_length_in_sec': [0.75],
                'multiscale_weights': [1],
                'save_embeddings': False,
                'batch_size': 1,
                'device': 'cpu',
                'num_workers': 0,
            }
        },

        'clustering': {
            'parameters': {
                'oracle_num_speakers': False,
                'max_num_speakers': 6,
                'enhanced_count_thres': 80,
                'max_rp_threshold': 0.25,
                'sparse_search_volume': 30,
            }
        },

        'vad': {
            'model_path': 'vad_multilingual_marblenet',
            'parameters': {
                'window_length_in_sec': 0.15,
                'shift_length_in_sec': 0.01,
                'smoothing': 'median',
                'overlap': 0.875,
                'onset': 0.8,
                'offset': 0.6,
                'pad_onset': 0.05,
                'pad_offset': -0.05,
                'min_duration_on': 0.2,
                'min_duration_off': 0.2,
                'filter_speech_first': True,
            }
        }
    }
}

config_omega = OmegaConf.create(config)
OmegaConf.set_struct(config_omega, False)
config_omega.device = 'cpu'
config_omega.num_workers = 0
config_omega.sample_rate = 16000
config_omega.verbose = True

msdd_model = ClusteringDiarizer(cfg=config_omega)
msdd_model.diarize()  # Fails here
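The one-line manifest from step 1 can be generated programmatically; a minimal sketch (the audio path is a placeholder for the real 16 kHz mono WAV chunk):

```python
import json

# Minimal sketch for producing the manifest entry from step 1.
# audio_filepath is a placeholder; the other fields match the bug report.
entry = {
    "audio_filepath": "/path/to/chunk.wav",
    "offset": 0,
    "duration": 300.0,
    "label": "infer",
    "text": "-",
    "num_speakers": 6,
    "rttm_filepath": None,
    "uem_filepath": None,
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")  # NeMo manifests are one JSON object per line
```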

Expected behavior

Speaker diarization should complete successfully, producing RTTM files with speaker labels and timestamps.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install 'nemo_toolkit[asr]'
  • Not using Docker

Environment details

  • OS version: macOS 15.4 (Darwin 25.4.0) on Apple Silicon (ARM64)
  • PyTorch version: Tested with 2.6.0, 2.7.0, 2.8.0, 2.11.0 (all fail)
  • Python version: Tested with 3.10.20, 3.12.13, 3.13.12 (all fail)
  • NeMo version: 2.7.2

Additional context

Error traceback:

[NeMo I 2026-03-30 17:45:03 clustering_diarizer:434] Extracting embeddings for Diarization
[NeMo I 2026-03-30 17:45:03 collections:803] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2026-03-30 17:45:03 collections:803] Dataset successfully loaded with 1379 items and total duration provided from manifest is  0.38 hours.
[NeMo I 2026-03-30 17:45:03 collections:803] # 1379 files loaded accounting to # 1 labels
[1/1] extract embeddings:   0%|          | 0/1379 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/tmp/nemo_speaker_diarization.py", line 157, in diarize_chunks
    msdd_model.diarize()
  File "/path/to/nemo/collections/asr/models/clustering_diarizer.py", line 434, in diarize
    self._extract_embeddings(self.subsegments_manifest_path, scale_idx, len(scales))
  File "/path/to/nemo/collections/asr/models/clustering_diarizer.py", line 344, in _extract_embeddings
    for test_batch in tqdm(
  File "/path/to/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/path/to/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
  File "/path/to/torch/utils/data/dataloader.py", line 764, in _next_data
    data = self._dataset_fetcher.fetch(index)
  File "/path/to/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/path/to/nemo/collections/asr/data/audio_to_label.py", line 461, in fixed_seq_collate_fn
    return _fixed_seq_collate_fn(self, batch)
  File "/path/to/nemo/collections/asr/data/audio_to_label.py", line 127, in _fixed_seq_collate_fn
    _, audio_lengths, _, tokens_lengths = zip(*batch)
  File "/path/to/torch/_tensor.py", line 1154, in __iter__
    raise TypeError("iteration over a 0-d tensor")
TypeError: iteration over a 0-d tensor

What works successfully:

  • VAD (Voice Activity Detection) completes successfully
  • Speech segment generation completes successfully (creates 1379 subsegments)
  • Models load successfully (vad_multilingual_marblenet, ecapa_tdnn, titanet_large)
  • NeMo ASR transcription works perfectly with the same audio files

Tested configurations (all fail with identical error):

  • Python 3.10.20 + PyTorch 2.6.0 (NeMo's officially supported versions)
  • Python 3.12.13 + PyTorch 2.7.0
  • Python 3.12.13 + PyTorch 2.11.0
  • Python 3.13.12 + PyTorch 2.8.0
  • Multiple batch_size settings (1, 4, 8, 25)
  • Different speaker embedding models (titanet_large, ecapa_tdnn)
  • Single-scale vs multi-scale embedding extraction
  • Different infer_batch_size values (1, 10, 25)
  • num_workers=0 to avoid multiprocessing issues

Platform-specific notes:

  • The bug appears to be specific to macOS ARM64
  • The issue is reproducible with multiple different audio files
  • No configuration workaround has been found despite extensive testing

Root cause analysis:
The batch variable in _fixed_seq_collate_fn contains a 0-dimensional tensor instead of the expected list/tuple structure that can be unpacked with zip(*batch). This suggests the dataloader's collate function is receiving malformed data from the dataset iterator during speaker embedding extraction. The successful completion of VAD and segmentation indicates the problem is isolated to the speaker model's test dataloader configuration.
