Speaker Diarization - TypeError: iteration over a 0-d tensor in audio_to_label.py #15566

@msp752

Description

Describe the bug

Speaker diarization consistently fails during the embedding extraction phase with a TypeError: iteration over a 0-d tensor error in the _fixed_seq_collate_fn function. The error occurs after successful VAD (Voice Activity Detection) and speech segmentation, specifically when the speaker embedding model attempts to process the first batch of subsegments.

The error occurs in audio_to_label.py at line 127 where zip(*batch) is called, but the batch variable contains a 0-dimensional tensor instead of the expected list/tuple structure.
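The failure can be reproduced in isolation; a minimal sketch (the 4-tuple layout mirrors the fields unpacked at line 127, with placeholder values rather than NeMo's actual samples):

```python
import torch

# What _fixed_seq_collate_fn expects: a list of
# (audio, audio_len, token, token_len) tuples.
batch = [(torch.zeros(16000), torch.tensor(16000), torch.tensor(0), torch.tensor(1))]
_, audio_lengths, _, tokens_lengths = zip(*batch)  # unpacks fine

# What it received here instead: a 0-d tensor. The * unpacking forces
# iteration, which PyTorch rejects for 0-d tensors.
bad_batch = torch.tensor(0.0)
try:
    zip(*bad_batch)
except TypeError as e:
    print(e)  # iteration over a 0-d tensor
```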

Steps/Code to reproduce bug

  1. Create a manifest file with audio chunks (WAV format, 16kHz, mono):
{"audio_filepath": "/path/to/chunk.wav", "offset": 0, "duration": 300.0, "label": "infer", "text": "-", "num_speakers": 6, "rttm_filepath": null, "uem_filepath": null}
  2. Run the following minimal code:
from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

config = {
    'diarizer': {
        'manifest_filepath': '/path/to/manifest.json',
        'out_dir': '/output/dir',
        'oracle_vad': False,
        'oracle_num_speakers': False,
        'max_num_speakers': 6,
        'device': 'cpu',
        'batch_size': 1,

        'msdd_model': {
            'model_path': 'diar_msdd_telephonic',
            'parameters': {
                'use_speaker_model_from_ckpt': True,
                'infer_batch_size': 1,
                'sigmoid_threshold': [0.7],
            }
        },

        'speaker_embeddings': {
            'model_path': 'ecapa_tdnn',
            'parameters': {
                'window_length_in_sec': [1.5],
                'shift_length_in_sec': [0.75],
                'multiscale_weights': [1],
                'save_embeddings': False,
                'batch_size': 1,
                'device': 'cpu',
                'num_workers': 0,
            }
        },

        'clustering': {
            'parameters': {
                'oracle_num_speakers': False,
                'max_num_speakers': 6,
                'enhanced_count_thres': 80,
                'max_rp_threshold': 0.25,
                'sparse_search_volume': 30,
            }
        },

        'vad': {
            'model_path': 'vad_multilingual_marblenet',
            'parameters': {
                'window_length_in_sec': 0.15,
                'shift_length_in_sec': 0.01,
                'smoothing': 'median',
                'overlap': 0.875,
                'onset': 0.8,
                'offset': 0.6,
                'pad_onset': 0.05,
                'pad_offset': -0.05,
                'min_duration_on': 0.2,
                'min_duration_off': 0.2,
                'filter_speech_first': True,
            }
        }
    }
}

config_omega = OmegaConf.create(config)
OmegaConf.set_struct(config_omega, False)
config_omega.device = 'cpu'
config_omega.num_workers = 0
config_omega.sample_rate = 16000
config_omega.verbose = True

msdd_model = ClusteringDiarizer(cfg=config_omega)
msdd_model.diarize()  # Fails here
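The one-line manifest from step 1 can be generated programmatically; a minimal sketch (the audio path is a placeholder for the real 16 kHz mono WAV chunk):

```python
import json

# Minimal sketch for producing the manifest entry from step 1.
# audio_filepath is a placeholder; the other fields match the bug report.
entry = {
    "audio_filepath": "/path/to/chunk.wav",
    "offset": 0,
    "duration": 300.0,
    "label": "infer",
    "text": "-",
    "num_speakers": 6,
    "rttm_filepath": None,
    "uem_filepath": None,
}
with open("manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")  # NeMo manifests are one JSON object per line
```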

Expected behavior

Speaker diarization should complete successfully, producing RTTM files with speaker labels and timestamps.

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of NeMo install: pip install 'nemo_toolkit[asr]'
  • Not using Docker

Environment details

  • OS version: macOS 15.4 (Darwin 25.4.0) on Apple Silicon (ARM64)
  • PyTorch version: Tested with 2.6.0, 2.7.0, 2.8.0, 2.11.0 (all fail)
  • Python version: Tested with 3.10.20, 3.12.13, 3.13.12 (all fail)
  • NeMo version: 2.7.2

Additional context

Error traceback:

[NeMo I 2026-03-30 17:45:03 clustering_diarizer:434] Extracting embeddings for Diarization
[NeMo I 2026-03-30 17:45:03 collections:803] Filtered duration for loading collection is  0.00 hours.
[NeMo I 2026-03-30 17:45:03 collections:803] Dataset successfully loaded with 1379 items and total duration provided from manifest is  0.38 hours.
[NeMo I 2026-03-30 17:45:03 collections:803] # 1379 files loaded accounting to # 1 labels
[1/1] extract embeddings:   0%|          | 0/1379 [00:00<?, ?it/s]

Traceback (most recent call last):
  File "/tmp/nemo_speaker_diarization.py", line 157, in diarize_chunks
    msdd_model.diarize()
  File "/path/to/nemo/collections/asr/models/clustering_diarizer.py", line 434, in diarize
    self._extract_embeddings(self.subsegments_manifest_path, scale_idx, len(scales))
  File "/path/to/nemo/collections/asr/models/clustering_diarizer.py", line 344, in _extract_embeddings
    for test_batch in tqdm(
  File "/path/to/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/path/to/torch/utils/data/dataloader.py", line 708, in __next__
    data = self._next_data()
  File "/path/to/torch/utils/data/dataloader.py", line 764, in _next_data
    data = self._dataset_fetcher.fetch(index)
  File "/path/to/torch/utils/data/_utils/fetch.py", line 55, in fetch
    return self.collate_fn(data)
  File "/path/to/nemo/collections/asr/data/audio_to_label.py", line 461, in fixed_seq_collate_fn
    return _fixed_seq_collate_fn(self, batch)
  File "/path/to/nemo/collections/asr/data/audio_to_label.py", line 127, in _fixed_seq_collate_fn
    _, audio_lengths, _, tokens_lengths = zip(*batch)
  File "/path/to/torch/_tensor.py", line 1154, in __iter__
    raise TypeError("iteration over a 0-d tensor")
TypeError: iteration over a 0-d tensor

What works successfully:

  • VAD (Voice Activity Detection) completes successfully
  • Speech segment generation completes successfully (creates 1379 subsegments)
  • Models load successfully (vad_multilingual_marblenet, ecapa_tdnn, titanet_large)
  • NeMo ASR transcription works perfectly with the same audio files

Tested configurations (all fail with identical error):

  • Python 3.10.20 + PyTorch 2.6.0 (NeMo's officially supported versions)
  • Python 3.12.13 + PyTorch 2.7.0
  • Python 3.12.13 + PyTorch 2.11.0
  • Python 3.13.12 + PyTorch 2.8.0
  • Multiple batch_size settings (1, 4, 8, 25)
  • Different speaker embedding models (titanet_large, ecapa_tdnn)
  • Single-scale vs multi-scale embedding extraction
  • Different infer_batch_size values (1, 10, 25)
  • num_workers=0 to avoid multiprocessing issues

Platform-specific notes:

  • The bug appears to be specific to macOS ARM64
  • The issue is reproducible with multiple different audio files
  • No configuration workaround has been found despite extensive testing

Root cause analysis:
The batch variable in _fixed_seq_collate_fn contains a 0-dimensional tensor instead of the expected list/tuple structure that can be unpacked with zip(*batch). This suggests the dataloader's collate function is receiving malformed data from the dataset iterator during speaker embedding extraction. The successful completion of VAD and segmentation indicates the problem is isolated to the speaker model's test dataloader configuration.
