This repository contains the official implementation of
"Benchmarking Extreme Multi-Label Classification for Semantic Annotation with Multi-Taxonomy Datasets"
Pietro Caforio*, Christophe Broillet*, Philippe Cudré-Mauroux, Julien Audiffren (University of Fribourg)
*Equal contribution
Extreme Multi-Label Classification (XMLC) is the task of predicting relevant labels from massive tag sets. While many large label collections are organized into taxonomies (hierarchical relationships), and recent state-of-the-art methods have shown that leveraging this structure significantly boosts performance, comprehensive evaluation of XMLC remains a challenge. This makes it difficult to properly compare state-of-the-art XMLC methods, highlight their strengths and limitations, and isolate the influence of taxonomy from other dataset characteristics.
For these reasons, we introduce OAXMLC, a comprehensive benchmark designed to evaluate how XMLC algorithms leverage taxonomic information across multiple tasks:
- Classification
- Sub-category analysis
- Completion
- Few-shot learning
To achieve this, we produced two new large XMLC datasets, each featuring two distinct sets of labels and taxonomies. By benchmarking a wide range of recent XMLC methods (both taxonomy-aware and taxonomy-agnostic), we analyze how tasks, datasets, and taxonomic properties impact model performance.
A comprehensive collection of metrics and detailed experimental results is available at:
https://exascaleinfolab.github.io/oaxmlc-bench/
This framework allows benchmarking of the main state-of-the-art XMLC methods on classic XMLC datasets (MAG-CS, EURLex, ...) as well as multi-taxonomy datasets (OAXMLC, OAMED-XMLC), using both taxonomy-aware and taxonomy-agnostic algorithms.
Currently implemented methods are:
| Method | Venue / Publication | Year | Algorithm Type |
|---|---|---|---|
| MATCH [1] | WWW | 2021 | Deep learning, taxonomy-aware (Transformer) |
| XML-CNN [2] | SIGIR | 2017 | Deep learning (CNN-based) |
| AttentionXML [3] | NeurIPS | 2019 | Deep learning, label-tree attention |
| FastXML [4] | KDD | 2016 | Tree-based, non-deep learning |
| HECTOR [5] | WWW | 2024 | Deep learning, taxonomy-aware (Seq2Seq) |
| TAMLEC [6] | CIKM | 2025 | Deep learning, taxonomy-aware (parallel / path-based) |
| LightXML [7] | AAAI | 2021 | Deep learning (Transformer, negative sampling) |
| CascadeXML [8] | NeurIPS | 2022 | Deep learning (multi-resolution Transformer) |
| Parabel [9] | WWW | 2018 | Tree-based, embedding-based |
| NGAME [10] | WSDM | 2023 | Deep learning, Siamese / metric learning |
| DEXA [11] | KDD | 2023 | Deep learning, Siamese with auxiliary parameters |
| ICXML [12] | NAACL | 2024 | LLM, in-context candidate generation and re-ranking |
Create the Conda environment:
conda env create -f environment.yml
conda activate xmlc
pip install -r requirements.txtFastXML requires Cython compilation:
cd models/FastXML
python -m pip install -e . --no-build-isolationHECTOR and TAMLEC require pretrained GloVe embeddings. We use the GloVe.840B.300d version, available from the official website: https://nlp.stanford.edu/projects/glove/.
The downloaded file must be placed in the .vector_cache directory at the root of the repository. The path to the pretrained embeddings (parameter path_to_glove) can be modified in algorithms/hector.py and algorithms/tamlec.py.
exascaleinfolab-xmlc-fewshot/
├── algorithms/ # Algorithm wrappers (one file per method)
├── models/ # Original or adapted model implementations
├── datahandler/ # Datasets, taxonomies, samplers
├── configs/ # Experiment configuration files
├── misc/ # Metrics, utilities, experiment driver
├── environment.yml
├── environment_macos.yml
└── README.md
The two new multi-taxonomy datasets introduced by the benchmark, OAXMLC and OAMED-XMLC, can be downloaded from:
For each dataset, a documents.json file is provided, along with concepts.zip and topics.zip archives. These contain:
ontology.json: label titles and descriptionstaxonomy.txt: taxonomy structure
Files must be placed in the datasets/ directory as follows:
datasets/
├── oamedtopics/
│ ├── documents.json
│ ├── taxonomy.txt
│ └── ontology.json
├── oamedconcepts/
│ ├── documents.json
│ ├── taxonomy.txt
│ └── ontology.json
├── oaxmlc_topics/
│ ├── documents.json
│ ├── taxonomy.txt
│ └── ontology.json
└── oaxmlc_concepts/
├── documents.json
├── taxonomy.txt
└── ontology.json
Additional datasets can be downloaded from:
- MAG-CS: https://drive.google.com/file/d/1P_MWGSy0JVq-nPpNQNtD6VbnF3gQfd3C/view
- PubMed: https://drive.google.com/file/d/1I-PnPhTF81G5lXi_GhEXe2gJJ-zawKqT/view
- EURLex: https://drive.google.com/file/d/1uBH-o_kZbKJxXhtFdWnzutuGHUGgcK9U/view
The entry point for all experiments is the configuration file corresponding to the desired dataset and algorithm.
python configs/{dataset}_{algorithm}.py --train --seed <seed> --device <device>Example:
python configs/oamedconcepts_tamlec.py --train --seed 42 --device cuda:0Results are saved in output/{dataset}_{algorithm}{seed}/.
python configs/{dataset}_{algorithm}.py --train --fewshot --seed <seed> --device <device>Example:
python configs/oamedconcepts_tamlec.py --train --fewshot --seed 42 --device cuda:0Results are saved in output/{dataset}_{algorithm}_fewshot{seed}/.
After training a standard classification model, completion experiments can be run as follows:
python configs/{dataset}_{algorithm}.py --completion --seed <seed> --device <device>Example:
python configs/oamedconcepts_tamlec.py --completion --seed 42 --device cuda:0Completion metrics are appended to the corresponding classification output folder.
Further details on configuration files are available in configs/readme.md.
New methods can be added by creating a new file in the algorithms/ directory. The recommended approach is to inherit from algorithms/base_algorithm.py and override the following methods:
__init__(self, config, ...)run_init(self)optimization_loop(self, input_data)inference_eval(self, input_data)load_model(self)
See algorithms/xmlcnn.py for a minimal reference implementation.
[1] Zhang, Y., Shen, Z., Dong, Y., Wang, K., & Han, J. (2021, April). MATCH: Metadata-aware text classification in a large hierarchy. In Proceedings of the Web Conference 2021 (pp. 3246-3257).
[2] Liu, J., Chang, W. C., Wu, Y., & Yang, Y. (2017, August). Deep learning for extreme multi-label text classification. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval (pp. 115-124).
[3] You, R., Zhang, Z., Wang, Z., Dai, S., Mamitsuka, H., & Zhu, S. (2019). Attentionxml: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. Advances in neural information processing systems, 32.
[4] Jain, H., Prabhu, Y., & Varma, M. (2016, August). Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 935-944).
[5] Ostapuk, N., Audiffren, J., Dolamic, L., Mermoud, A., & Cudré-Mauroux, P. (2024, May). Follow the Path: Hierarchy-Aware Extreme Multi-Label Completion for Semantic Text Tagging. In Proceedings of the ACM on Web Conference 2024 (pp. 2094-2105).
[6] Audiffren, J., Broillet, C., Dolamic, L., & Cudré-Mauroux, P. (2024). Extreme Multi-label Completion for Semantic Document Labelling with Taxonomy-Aware Parallel Learning. arXiv preprint arXiv:2412.13809.
[7] Jiang, T., Wang, D., Sun, L., Yang, H., Zhao, Z., & Zhuang, F. (2021, May). Lightxml: Transformer with dynamic negative sampling for high-performance extreme multi-label text classification. In Proceedings of the AAAI conference on artificial intelligence (Vol. 35, No. 9, pp. 7987-7994).
[8] Kharbanda, S., Banerjee, A., Schultheis, E., & Babbar, R. (2022). Cascadexml: Rethinking transformers for end-to-end multi-resolution training in extreme multi-label classification. Advances in neural information processing systems, 35, 2074-2087.
[9] Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., & Varma, M. (2018, April). Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference (pp. 993-1002).
[10] Dahiya, K. and Gupta, N. and Saini, D. and Soni, A. and Wang, Y. and Dave, K. and Jiao, J. and Gururaj, K. and Dey, P. and Singh, A. and Hada, D. and Jain, V. and Paliwal, B. and Mittal, A. and Mehta, S. and Ramjee, R. and Agarwal, S. and Kar, P. and Varma, M. (2023, March). NGAME: Negative mining-aware mini-batching for extreme classification. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (pp. 258–266)
[11] Kunal Dahiya, Sachin Yadav, Sushant Sondhi, Deepak Saini, Sonu Mehta, Jian Jiao, Sumeet Agarwal, Purushottam Kar, and Manik Varma. 2023. Deep Encoders with Auxiliary Parameters for Extreme Classification. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '23). Association for Computing Machinery, New York, NY, USA, 358–367
[12] Zhu, Y., & Zamani, H. (2024, June). ICXML: An In-Context Learning Framework for Zero-Shot Extreme Multi-Label Classification. In Findings of the Association for Computational Linguistics: NAACL 2024 (pp. 2086-2098).