Describe the bug
Speaker diarization consistently fails during the embedding extraction phase with "TypeError: iteration over a 0-d tensor", raised in the _fixed_seq_collate_fn function. The error occurs after VAD (Voice Activity Detection) and speech segmentation have completed successfully, when the speaker embedding model attempts to process the first batch of subsegments.
The error occurs in audio_to_label.py at line 127 where zip(*batch) is called, but the batch variable contains a 0-dimensional tensor instead of the expected list/tuple structure.
Steps/Code to reproduce bug
- Create a manifest file with audio chunks (WAV format, 16kHz, mono):
{"audio_filepath": "/path/to/chunk.wav", "offset": 0, "duration": 300.0, "label": "infer", "text": "-", "num_speakers": 6, "rttm_filepath": null, "uem_filepath": null}
- Run the following minimal code:
import nemo.collections.asr as nemo_asr
from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

config = {
    'diarizer': {
        'manifest_filepath': '/path/to/manifest.json',
        'out_dir': '/output/dir',
        'oracle_vad': False,
        'oracle_num_speakers': False,
        'max_num_speakers': 6,
        'device': 'cpu',
        'batch_size': 1,
        'msdd_model': {
            'model_path': 'diar_msdd_telephonic',
            'parameters': {
                'use_speaker_model_from_ckpt': True,
                'infer_batch_size': 1,
                'sigmoid_threshold': [0.7],
            }
        },
        'speaker_embeddings': {
            'model_path': 'ecapa_tdnn',
            'parameters': {
                'window_length_in_sec': [1.5],
                'shift_length_in_sec': [0.75],
                'multiscale_weights': [1],
                'save_embeddings': False,
                'batch_size': 1,
                'device': 'cpu',
                'num_workers': 0,
            }
        },
        'clustering': {
            'parameters': {
                'oracle_num_speakers': False,
                'max_num_speakers': 6,
                'enhanced_count_thres': 80,
                'max_rp_threshold': 0.25,
                'sparse_search_volume': 30,
            }
        },
        'vad': {
            'model_path': 'vad_multilingual_marblenet',
            'parameters': {
                'window_length_in_sec': 0.15,
                'shift_length_in_sec': 0.01,
                'smoothing': 'median',
                'overlap': 0.875,
                'onset': 0.8,
                'offset': 0.6,
                'pad_onset': 0.05,
                'pad_offset': -0.05,
                'min_duration_on': 0.2,
                'min_duration_off': 0.2,
                'filter_speech_first': True,
            }
        }
    }
}

config_omega = OmegaConf.create(config)
OmegaConf.set_struct(config_omega, False)
config_omega.device = 'cpu'
config_omega.num_workers = 0
config_omega.sample_rate = 16000
config_omega.verbose = True

msdd_model = ClusteringDiarizer(cfg=config_omega)
msdd_model.diarize()  # Fails here
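For anyone trying to reproduce this, the manifest from the first step can be generated with a small helper along these lines. This is only a sketch: write_manifest and its parameters are hypothetical names introduced here, and it assumes every chunk uses the same duration and num_speakers values as the example entry above.

```python
import json
from pathlib import Path


def write_manifest(manifest_path, audio_paths, duration=300.0, num_speakers=6):
    """Write one JSON object per line, one line per audio chunk.

    Hypothetical helper: the field values mirror the example manifest
    entry from the report and are assumed to be the same for all chunks.
    """
    manifest_path = Path(manifest_path)
    with manifest_path.open("w", encoding="utf-8") as f:
        for audio_path in audio_paths:
            entry = {
                "audio_filepath": str(audio_path),
                "offset": 0,
                "duration": duration,
                "label": "infer",
                "text": "-",
                "num_speakers": num_speakers,
                "rttm_filepath": None,  # serialized as JSON null
                "uem_filepath": None,   # serialized as JSON null
            }
            f.write(json.dumps(entry) + "\n")
    return manifest_path
```

Calling write_manifest('/path/to/manifest.json', ['/path/to/chunk.wav']) would produce a single-line manifest matching the entry shown in the first step.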
Expected behavior
Speaker diarization should complete successfully, producing RTTM files with speaker labels and timestamps.
Environment overview (please complete the following information)
- Environment location: Bare-metal
- Method of NeMo install: pip install 'nemo_toolkit[asr]'
- Not using Docker
Environment details
- OS version: macOS 15.4 (Darwin 25.4.0) on Apple Silicon (ARM64)
- PyTorch version: Tested with 2.6.0, 2.7.0, 2.8.0, 2.11.0 (all fail)
- Python version: Tested with 3.10.20, 3.12.13, 3.13.12 (all fail)
- NeMo version: 2.7.2
Additional context
Error traceback:
[NeMo I 2026-03-30 17:45:03 clustering_diarizer:434] Extracting embeddings for Diarization
[NeMo I 2026-03-30 17:45:03 collections:803] Filtered duration for loading collection is 0.00 hours.
[NeMo I 2026-03-30 17:45:03 collections:803] Dataset successfully loaded with 1379 items and total duration provided from manifest is 0.38 hours.
[NeMo I 2026-03-30 17:45:03 collections:803] # 1379 files loaded accounting to # 1 labels
[1/1] extract embeddings: 0%| | 0/1379 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/tmp/nemo_speaker_diarization.py", line 157, in diarize_chunks
msdd_model.diarize()
File "/path/to/nemo/collections/asr/models/clustering_diarizer.py", line 434, in diarize
self._extract_embeddings(self.subsegments_manifest_path, scale_idx, len(scales))
File "/path/to/nemo/collections/asr/models/clustering_diarizer.py", line 344, in _extract_embeddings
for test_batch in tqdm(
File "/path/to/tqdm/std.py", line 1181, in __iter__
for obj in iterable:
File "/path/to/torch/utils/data/dataloader.py", line 708, in __next__
data = self._next_data()
File "/path/to/torch/utils/data/dataloader.py", line 764, in _next_data
data = self._dataset_fetcher.fetch(index)
File "/path/to/torch/utils/data/_utils/fetch.py", line 55, in fetch
return self.collate_fn(data)
File "/path/to/nemo/collections/asr/data/audio_to_label.py", line 461, in fixed_seq_collate_fn
return _fixed_seq_collate_fn(self, batch)
File "/path/to/nemo/collections/asr/data/audio_to_label.py", line 127, in _fixed_seq_collate_fn
_, audio_lengths, _, tokens_lengths = zip(*batch)
File "/path/to/torch/_tensor.py", line 1154, in __iter__
raise TypeError("iteration over a 0-d tensor")
TypeError: iteration over a 0-d tensor
What works successfully:
- VAD (Voice Activity Detection) completes successfully
- Speech segment generation completes successfully (creates 1379 subsegments)
- Models load successfully (vad_multilingual_marblenet, ecapa_tdnn, titanet_large)
- NeMo ASR transcription works perfectly with the same audio files
Tested configurations (all fail with identical error):
- Python 3.10.20 + PyTorch 2.6.0 (NeMo's officially supported versions)
- Python 3.12.13 + PyTorch 2.7.0
- Python 3.12.13 + PyTorch 2.11.0
- Python 3.13.12 + PyTorch 2.8.0
- Multiple batch_size settings (1, 4, 8, 25)
- Different speaker embedding models (titanet_large, ecapa_tdnn)
- Single-scale vs multi-scale embedding extraction
- Different infer_batch_size values (1, 10, 25)
- num_workers=0 to avoid multiprocessing issues
Platform-specific notes:
- The bug appears to be specific to macOS ARM64
- The issue is reproducible with multiple different audio files
- No configuration workaround has been found despite extensive testing
Root cause analysis:
The traceback shows that zip(*batch) in _fixed_seq_collate_fn ends up iterating over a 0-dimensional tensor. That happens if batch itself, or any of its top-level elements, is a 0-d tensor rather than the expected list of per-sample tuples. This suggests the dataloader's collate function is receiving malformed data from the dataset iterator during speaker embedding extraction. Because VAD and segmentation complete successfully, the problem appears to be isolated to the speaker model's test dataloader configuration.
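To narrow this down, it may help to inspect what the collate function actually receives. The sketch below is a duck-typed debugging aid, not NeMo code: describe, zip_star_blockers and FakeTensor are hypothetical names introduced here, and FakeTensor is only a stand-in so the snippet runs without PyTorch. The four-field sample layout (audio, audio length, label, label length) is an assumption inferred from the unpacking on the failing line. The helpers only read the shape and ndim attributes, so they should also work on real torch.Tensor objects.

```python
class FakeTensor:
    """Stand-in exposing only .shape and .ndim, like a tensor would."""

    def __init__(self, *shape):
        self.shape = shape
        self.ndim = len(shape)


def describe(obj, depth=0, max_depth=3):
    """Return a short nested description of a batch-like object."""
    shape = getattr(obj, "shape", None)
    if shape is not None:
        # Tensor-like object: report its type and shape.
        return f"{type(obj).__name__}(shape={tuple(shape)})"
    if isinstance(obj, (list, tuple)) and depth < max_depth:
        inner = ", ".join(describe(x, depth + 1, max_depth) for x in obj)
        return f"{type(obj).__name__}[{inner}]"
    return type(obj).__name__


def zip_star_blockers(batch):
    """Name the 0-d objects that zip(*batch) would try to iterate."""
    if getattr(batch, "ndim", None) == 0:
        return ["batch"]
    return [
        f"batch[{i}]"
        for i, item in enumerate(batch)
        if getattr(item, "ndim", None) == 0
    ]


# Assumed sample layout: (audio, audio_length, label, label_length).
sample = (FakeTensor(16000), FakeTensor(), FakeTensor(), FakeTensor())

good_batch = [sample]  # a list of samples: zip(*batch) can unpack it
bad_batch = sample     # a bare sample: zip(*batch) hits the 0-d fields

print(describe(good_batch), zip_star_blockers(good_batch))
print(describe(bad_batch), zip_star_blockers(bad_batch))
```

In this sketch the bare sample reports three blocking elements while the list of samples reports none. If the real batch looks like the second case, that would point to the collate function being handed a single sample instead of a list of samples, though this has not been confirmed against NeMo itself.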