Layer-aware TDNN: Speaker Recognition Using Multi-Layer Features from Pre-Trained Models (2026 ICAIIC)
We share a PyTorch implementation of our experiments here.
The figure above overviews the proposed framework (a & c), which pools speaker embeddings from the multi-layer features of pre-trained models.
- We recommend visiting Previous Versions (v1.12.0) for the PyTorch installation, including torchaudio==0.12.0.
[2025.12.17 update] A new code implementation is powered by PyTorch v2.6.0.
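The core idea, pooling a single speaker embedding from the hidden states of every layer of a frozen pre-trained model, can be sketched as below. This is only an illustration (learnable layer weights followed by statistics pooling), not the repository's UniPool / Layer-aware TDNN module, and it assumes the facebook/wav2vec2-base backbone loaded through Hugging Face transformers.

```python
# Minimal sketch of pooling multi-layer features from a frozen pre-trained model.
# Illustrative only -- not the repository's UniPool / L-TDNN implementation.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class LayerWeightedStatsPooling(nn.Module):
    def __init__(self, model_name: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(model_name)
        self.backbone.requires_grad_(False)                     # frozen pre-trained backbone
        n_layers = self.backbone.config.num_hidden_layers + 1   # +1 for the CNN feature output
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        out = self.backbone(waveform, output_hidden_states=True)
        hidden = torch.stack(out.hidden_states, dim=0)          # (layers, batch, time, dim)
        weights = torch.softmax(self.layer_weights, dim=0)
        fused = (weights[:, None, None, None] * hidden).sum(0)  # (batch, time, dim)
        mean, std = fused.mean(dim=1), fused.std(dim=1)         # statistics pooling over time
        return torch.cat([mean, std], dim=-1)                   # utterance-level embedding

embedder = LayerWeightedStatsPooling()
embedding = embedder(torch.randn(2, 16000))                     # -> shape (2, 1536)
```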
Use the requirements.txt to install the rest of the Python dependencies.
The Ubuntu soundfile and conda ffmpeg packages are required for downloading and preprocessing the data; you can install them as below.
$ pip install -r requirements.txt
$ apt-get install python3-soundfile
$ conda install -c conda-forge ffmpeg
The datasets can be downloaded from here:
- VoxCeleb 1 & 2
We use clovaai/voxceleb_trainer (released under the MIT licence) to download the VoxCeleb datasets.
Follow the data preparation script until you convert the VoxCeleb 2 audio format from aac (.m4a) into .wav files.
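The m4a-to-wav conversion is handled by the voxceleb_trainer preparation script; if you prefer to run it yourself, a rough sketch that drives ffmpeg from Python could look like the following. The source directory ./data/VoxCeleb/voxceleb2 and the 16 kHz mono target are illustrative assumptions, not the exact settings of the official script.

```python
# Hedged sketch: convert a tree of VoxCeleb2 .m4a files to 16 kHz mono .wav via ffmpeg.
import pathlib
import subprocess

def convert_m4a_tree(src_root: str) -> None:
    for m4a in pathlib.Path(src_root).rglob("*.m4a"):
        wav = m4a.with_suffix(".wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(m4a), "-ac", "1", "-ar", "16000", str(wav)],
            check=True,
        )

convert_m4a_tree("./data/VoxCeleb/voxceleb2")  # hypothetical source path
```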
The following scripts preprocess the audio data and build the evaluation trials for each dataset.
- However, you can skip the "# set split" part, since we have uploaded the ready-made splits and trials used for our experiments in the file tree.
Please first check the contents of data/VCTK-Corpus/, data/LibriSpeech/, and data/VoxCeleb/ (speakers/ and trials/ subfolders).
# preprocessing
$ python ./src/preprocess/process-VCTK.py --read_path SRC_PATH
Remove speakers [p280, p315] due to known technical issues.
Drop samples no.000~no.024, where the same transcript is recorded under each number.
Resample the audio to the common sample rate (48 kHz → 16 kHz); a rough sketch follows below.
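A minimal sketch of the filtering and resampling steps, assuming torchaudio and a VCTK-style pXXX_YYY file naming; the actual logic lives in process-VCTK.py.

```python
# Hedged sketch: drop excluded speakers/utterances and resample VCTK audio to 16 kHz.
import pathlib
import torchaudio

EXCLUDED_SPEAKERS = {"p280", "p315"}                 # removed for technical issues
EXCLUDED_NUMBERS = {f"{i:03d}" for i in range(25)}   # no.000-no.024 share transcripts

def resample_vctk(src_root: str, dst_root: str) -> None:
    for wav_path in pathlib.Path(src_root).rglob("*.wav"):
        speaker, number = wav_path.stem.split("_")[:2]
        if speaker in EXCLUDED_SPEAKERS or number in EXCLUDED_NUMBERS:
            continue
        waveform, sr = torchaudio.load(str(wav_path))
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)
        out_path = pathlib.Path(dst_root) / wav_path.relative_to(src_root)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        torchaudio.save(str(out_path), waveform, 16000)
```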
# set split
$ python ./src/preprocess/split-VCTK-0-speakers.py
$ python ./src/preprocess/split-VCTK-1-rawtrials.py
$ python ./src/preprocess/split-VCTK-2-balancedtrials.py
Subset the total speaker pool into train, validation, and test speaker subsets.
Check the speaker meta-info matches (Gender | Age | Accents | Region | Label) over all pair combinations.
Sample the trials so that the label distribution and the meta-info matches stay balanced (see the sketch below).
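For illustration only, a hedged sketch of what balanced trial sampling could look like: same-speaker pairs as targets, same-gender cross-speaker pairs as non-targets, and an equal count per class. The actual procedure is implemented in the split-VCTK-*.py scripts.

```python
# Hedged sketch of balanced verification-trial sampling; not the repository's script.
import itertools
import random

def sample_balanced_trials(utts_by_speaker, gender_by_speaker, n_per_class, seed=42):
    rng = random.Random(seed)
    speakers = sorted(utts_by_speaker)

    # Target (same-speaker) trials.
    targets = []
    for spk in speakers:
        targets += [(1, a, b) for a, b in itertools.combinations(utts_by_speaker[spk], 2)]

    # Non-target trials restricted to same-gender speaker pairs (meta-info match).
    nontargets = []
    for s1, s2 in itertools.combinations(speakers, 2):
        if gender_by_speaker[s1] != gender_by_speaker[s2]:
            continue
        nontargets += [(0, a, b) for a in utts_by_speaker[s1] for b in utts_by_speaker[s2]]

    # Balance the label distribution by sampling the same count from each class.
    return rng.sample(targets, n_per_class) + rng.sample(nontargets, n_per_class)
```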
# preprocessing
$ python ./src/preprocess/process-LibriSpeech.py --read_path SRC_PATH
Convert the audio format from .flac to .wav (see the sketch below).
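A minimal sketch of the flac-to-wav conversion using soundfile; the actual logic lives in process-LibriSpeech.py, and the source path below is an illustrative assumption.

```python
# Hedged sketch: convert LibriSpeech .flac files to .wav in place with soundfile.
import pathlib
import soundfile as sf

def flac_to_wav(src_root: str) -> None:
    for flac_path in pathlib.Path(src_root).rglob("*.flac"):
        data, sr = sf.read(str(flac_path))
        sf.write(str(flac_path.with_suffix(".wav")), data, sr)

flac_to_wav("./data/LibriSpeech")  # hypothetical source path
```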
# set split
$ python ./src/preprocess/split-LibriSpeech-1-rawtrials.py
$ python ./src/preprocess/split-LibriSpeech-2-balancedtrials.py
Check the speaker meta-info matches (Gender (SEX) | Label) over all sample combinations.
Sample the trials so that the label distribution and the meta-info matches stay balanced.
$ mv ./data/VoxCeleb/*_wav/ ./data/VoxCeleb/preprocess/
No special data preprocessing is required.
# set split
$ python ./src/preprocess/split-VoxCeleb-0-speakers.py
$ python ./src/preprocess/split-VoxCeleb-2-balancedtrials.py
List the speakers in each subset, and convert the 'Vox1-O' evaluation trial file format from .txt to .csv (see the sketch below).
Sample the trials so that the label distribution stays balanced.
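For illustration, a hedged sketch of the .txt-to-.csv trial conversion, assuming the usual Vox1-O line format of "label enroll-path test-path"; the file names below are placeholders, and the actual logic lives in the split-VoxCeleb-*.py scripts.

```python
# Hedged sketch: rewrite a whitespace-separated trial list as a .csv file.
import csv

def trials_txt_to_csv(txt_path: str, csv_path: str) -> None:
    with open(txt_path) as fin, open(csv_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["label", "enrollment", "test"])
        for line in fin:
            label, enroll, test = line.split()
            writer.writerow([label, enroll, test])

trials_txt_to_csv("veri_test.txt", "./data/VoxCeleb/trials/Vox1-O.csv")  # placeholder paths
```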
Logs, weights, and training configurations will be saved under the res/ directory.
The result folder is named in the local-YYYYMMDD-HHmmss format by default.
To use neptune.ai logging, set your configuration in src/configs/neptune/neptune-logger-config.yaml and add --neptune on the command line.
The experiment ID created in your neptune.ai project will be used as the name of the output directory.
- General usage examples:
# Running directly through the command line
$ CUDA_VISIBLE_DEVICES=0 python ./src/main.py train VCTK UniPool --use_pretrain --frz_pretrain --batch_size 128 --seed 9973 --backbone_cfg facebook/wav2vec2-base --nb_total_step 10000 --nb_steps_eval 1000;
# Or you can use a shell file for the multiple commands.
$ ./src/run.sh
- Hyperparameters can be adjusted directly from the command line.
$ python ./src/main.py -h
usage: main.py [1-action] [2-data] [3-model folder] [-h]
positional arguments (required):
[1] action: {train,eval}
[2] data : {VCTK,LibriSpeech,Vox1,Vox2}
[3] model : {X-vector,ECAPA-TDNN,SincNet,ExploreWV2,FinetuneWV2,L-TDNN_ECAPA}
optional arguments in general:
-h, --help show this help message and exit
--quickrun quick check for the running experiment on the modification, set as True if given
--skiptest skip evaluation for the testset during training, set as True if given
--neptune log experiment with neptune logger, set as True if given
--workers WORKERS the number of cpu-cores to use from dataloader (per each device), defaults to: 4
--device [DEVICE0,] specify the list of index numbers to use the cuda devices, defaults to: [0]
--seed SEED integer value of random number initiation, defaults to: 42
--eval_path EVAL_PATH result path to load model on the "action" given as {eval}, defaults to: None
--description DESCRIPTION user parameter for specifying certain version, Defaults to "Untitled".
keyword arguments:
--kwarg KWARG dynamically modifies any of the hyperparameters declared in ../configs/.../...*.yaml or ./benchmarks/...*.yaml
(e.g.) --lr 0.001 --batch_size 64 --nb_total_step 25000 ...
- The following command line runs the test evaluation with the best-validated model parameters from the configuration saved under DIR_NAME.
$ CUDA_VISIBLE_DEVICES=0 python ./src/main.py eval _ _ --eval_path DIR_NAME;
- You can also conduct a cross-dataset evaluation by modifying the command like this:
$ CUDA_VISIBLE_DEVICES=0 python ./src/main.py eval Vox1 _ --eval_path DIR_NAME;
@article{kim2024layer,
title={Layer-aware TDNN: Speaker Recognition Using Multi-Layer Features from Pre-Trained Models},
author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Yun, Juan and Han, Sung Won},
journal={arXiv preprint arXiv:2409.07770v2},
year={2024}
}
@article{kim2024universal,
title={Universal Pooling Method of Multi-layer Features from Pretrained Models for Speaker Verification},
author={Kim, Jin Sob and Park, Hyun Joon and Shin, Wooseok and Han, Sung Won},
journal={arXiv preprint arXiv:2409.07770},
year={2024}
}
This repository is released under the MIT license.
As noted in the comments of each comparison model under src/benchmarks/, the following projects were referenced to reproduce the implementations.
"Official" means that the author of the paper released the project.
Open to the public:
- X-vector/: cvqluu/TDNN (released under the MIT license)
- SincNet/: mravanelli/SincNet (Official)
- ECAPA-TDNN/: TaoRuijie/ECAPA-TDNN
- FinetuneWV2/: nikvaessen/w2v2-speaker (Official)
Some of the files in src/utils/ are adapted from the sources below.
- metrics.py: speechbrain/utils/metric_stats.py
- sampler.py: SeungjunNah/DeepDeblur-PyTorch/src/data/sampler.py
