
Computer Vision, University of Freiburg
paper — arXiv — project page — 🤗 models — 🤗 dataset

EntityNet

Using Knowledge Graphs to harvest datasets for efficient CLIP model training

We use knowledge graphs and web image search to build a diverse dataset of 33M images paired with 46M texts. We show that this dataset can be used to train a generic CLIP model in a short amount of time.

Using a 10M-image subset focused on living organisms, we train domain expert models that excel at fine-grained classification of animals, plants, and fungi.

Dataset and training code are released!

News

Release roadmap

  • Publish preprint
  • Upload CLIP models on huggingface
  • Add evaluation code
  • Add training dataset
    • Add code to download images from URLs and create webdataset
    • Add dataloader for the webdataset
  • Add model training code
  • Code to download images using image search
  • Code to download entities from WikiData

Models

Our CLIP models are available on 🤗 Hugging Face.

Models are named as follows:

  • Architecture
  • Trained on EntityNet-33M (all images) or on LivingThings-10M (only images of living organisms)
  • Pretrained from scratch or finetuned
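
For example, CLIP-ViT-B-16-EntityNet-33M (the model id used in the snippet below) denotes a ViT-B/16 trained on the full EntityNet-33M dataset.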

Usage with open_clip

Models can be used with open_clip as follows:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:lmb-freiburg/CLIP-ViT-B-16-EntityNet-33M")
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-B-16', context_length=32)

image = preprocess(Image.open("assets/images_cc0/rabbit.png")).unsqueeze(0)
texts = ["a dog", "a cat", "a rabbit"]
tokens = tokenizer(texts)

with torch.no_grad(), torch.autocast("cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(tokens)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    logits = model.logit_scale.exp() * image_features @ text_features.T
    pred_class = logits.argmax(-1).item()

print(texts[pred_class])  # prints: a rabbit

Repository structure

All commands should be run from the repository root.

configs/  yaml files for the experiments:
  configs/projects/group/name.yaml
src/
  clip_benchmark/  # benchmark datasets from https://github.com/LAION-AI/CLIP_benchmark
  entitynet/  # EntityNet-related code
  open_clip/  # CLIP model, loss, etc. from https://github.com/mlfoundations/open_clip

Setup

  • Currently running with python=3.12, torch=2.6, cuda=12.4
  • Set up the paths with the environment variables ENTITYNET_DATA_DIR and ENTITYNET_OUTPUT_DIR (see the example below)
# 0. setup conda environment
conda update conda -n base -y

_ENV=entitynet
conda create -n "${_ENV}" python=3.12 -y
conda activate "${_ENV}"
pip install torch torchvision
pip install -U -r requirements.txt
pip install -e .

python -m entitynet.cli.print_paths
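
As a minimal sketch, the two paths could be set before running the commands above like this (the directories are placeholders; pick any locations with enough disk space):

# set the data and output directories used by all entitynet scripts
export ENTITYNET_DATA_DIR=/path/to/entitynet_data
export ENTITYNET_OUTPUT_DIR=/path/to/entitynet_output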

Build EntityNet dataset from scratch

  • See the kgraph directory on how to query the knowledge graph for the entities, and how to build search queries using the entities.
  • Next, feed the search queries into an image search engine, download images and alt texts, deduplicate images, and remove contamination with downstream test sets (code to be released).

Setup EntityNet dataset from URLs

Note that the scripts below do not download the full training set by default; the default is to download only the minitrain and val splits.

Use --splits val,test,minitrain,train to download all splits (see the example after the commands below).

python -m entitynet.cli.urlbuild_step1_metadata
python -m entitynet.cli.urlbuild_step2_images
python -m entitynet.cli.urlbuild_step3_tars
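
For example, to download everything including the full train split, the same flag would be passed to each step (a sketch assuming all three scripts accept --splits, as described above):

python -m entitynet.cli.urlbuild_step1_metadata --splits val,test,minitrain,train
python -m entitynet.cli.urlbuild_step2_images --splits val,test,minitrain,train
python -m entitynet.cli.urlbuild_step3_tars --splits val,test,minitrain,train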

Prepare evaluation datasets

  • Put all data into the path defined by environment variable ENTITYNET_DATA_DIR
  • Dataset building code is in src/clip_benchmark/datasets/builder.py and src/entitynet/dataset_factory.py
# ----- ImageNet ILSVRC2012 and robustness benchmarks
# set up the files for ImageNet ILSVRC2012 manually as follows:
# imagenet1k/
#   train/n########/*.JPEG
#   val/n########/*.JPEG

# run the tests for the evaluation tasks; this will download the remaining datasets.
python -m pytest tests/entitynet/test_eval_tasks_imagenet.py -sx

# # resulting file structure
# clip_benchmark/imagenetv2/ImageNetV2-matched-frequency/*/*.jpeg
# clip_benchmark/imagenet-r/n********/*.jpg
# clip_benchmark/objectnet/
#   objectnet-1.0/images/<ClassName>/*.png
#   folder_to_objectnet_label.json
#   imagenet_to_label_2012_v2
#   objectnet_to_imagenet_1k.json
#   pytorch_to_imagenet_2012_id.json
# clip_benchmark/imagenet_sketch/n********/*.JPEG
# clip_benchmark/imagenet-a/n********/*.jpg

# ----- iNaturalist
./bash_scripts/dataset/setup_inaturalist.sh
# create the traindev splits
python -m entitynet.cli.datasets.inat19_generate_splits
python -m entitynet.cli.datasets.inat21_generate_splits

# # resulting file structure
# iNat/2019/
#   train_val2019/<Category>/<ClassId>/*.jpg
#   categories.json
#   train2019.json
#   val2019.json
#   inat2019_traindev_split.json
#   inat2019_trainnodev_split.json
# iNat/2021
#   train/<ClassId>_<FullName>/*.jpg
#   val/<ClassId>_<FullName>/*.jpg
#   train.json
#   val.json
#   inat2021_traindev_split.json
#   inat2021_trainnodev_split.json

python -m pytest tests/entitynet/test_eval_tasks_inat.py -sx

# ----- CUB
./bash_scripts/dataset/setup_cub.sh

# # resulting file structure
# cub200/CUB_200_2011/
#   images/<Nr>.<ClassName>/*.jpg
#   train_test_split.txt

python -m pytest tests/entitynet/test_eval_tasks_cub.py -sx

# ----- RareSpecies
# run the test, it will set up the dataset
python -m pytest tests/entitynet/test_eval_tasks_rarespecies.py -sx

# # resulting file structure
# rare-species/
#   images/<FullName>/*.jpg

# ----- Coco
./bash_scripts/dataset/setup_coco.sh

# # resulting file structure
# coco/
#   splits_karpathy/
#     coco_karpathy_test.json
#     coco_karpathy_train.json
#     coco_karpathy_val.json

# run test
python -m pytest tests/entitynet/test_eval_tasks_coco.py -sx

# ----- Flickr30K
# run the test, it will set up the dataset
python -m pytest tests/entitynet/test_eval_tasks_flickr.py -sx

# # resulting file structure
# clip_benchmark/flickr30k/
#   images/*.jpg
#   flickr30k_<Split>_karpathy.txt

# ----- XM3600
# run the test, it will set up the dataset
python -m pytest tests/entitynet/test_eval_tasks_xm3600.py -sx

# # resulting file structure
# clip_benchmark/crossmodal3600/
#   xm3600/
#     captions.jsonl
#     images/*.jpg
#   crossmodal3600_captions-en.json

Usage

Evaluation

# start one of the configs. you can either modify the yaml or modify the config on the fly with -o.
# by default it will run ALL tasks listed in eval_list_all.yaml
# to run only on one task or a smaller task list, use an argument like:
-o trainer.test_task_keys=imgn_1k_val
# or:
-o trainer.test_task_keys=task_list::eval_list_objcls_imgn
# use comma to separate multiple task keys.

python -m entitynet.cli.run configs/projects/eval_clip_entitynet_zs/eval_clip_vitb32_entitynet33m.yaml --run_id e1 --test_only

# view results:
python -m entitynet.cli.view_results -s eval_clip_entitynet_zs
# use script entitynet.cli.view_results to show the metrics.
# note that in our paper we test with different prompts and report the best result over all prompts.
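
Putting the flags together, a zero-shot evaluation restricted to a single task could look like this (a sketch combining the options above; imgn_1k_val is the example task key from above):

python -m entitynet.cli.run configs/projects/eval_clip_entitynet_zs/eval_clip_vitb32_entitynet33m.yaml \
    --run_id e1 --test_only \
    -o trainer.test_task_keys=imgn_1k_val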

Training

This assumes you have built EntityNet from the given URLs as described above.

# start the experiment and give it a unique run id
# the run id determines the output subfolder and the run name in the online logger
python -m entitynet.cli.run configs/projects/subdirectory/config_name.yaml --run_id myrun123 --vislogger wandb

# result will be in path
cd ${ENTITYNET_OUTPUT_DIR}/experiments/subdirectory/config_name/myrun123

# Use `-o` to change the config on the fly
-o trainer.devices=1  # modify the number of gpus to 1
-o trainer.num_sanity_val_steps=0  # disable the validation sanity check before training

# useful for debugging
-o train_task.dataset.max_shards=32  # smaller train set, minimum is n_gpus * n_dataloader_workers
-o train_task.test_task_keys=imgn_1k_val,  # only test on imagenet. note the , at the end to make it a list

--vislogger csv  # to log locally instead of wandb
--run_val  # evaluate epoch 0 and exit, useful for finetuning
--test_only  # disable creating the train dataset and just run the test
  • Logger: By default, results are logged to CSV. To use wandb, set the environment variables WANDB_API_KEY=afe6... and WANDB_PROJECT=theproject and pass the flag --vislogger wandb. (We deprecated neptune.ai because it can lead to crashes in case of internet problems, but the code still supports it.)
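
As a rough example, a small local debugging run combining the flags above might look like this (the run id is a placeholder, and we assume multiple -o overrides can be combined on one command line):

python -m entitynet.cli.run configs/projects/subdirectory/config_name.yaml \
    --run_id debug123 --vislogger csv \
    -o trainer.devices=1 \
    -o train_task.dataset.max_shards=32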

Development

Code style

Acknowledgements

Please refer to our paper's acknowledgement section. Additionally, we would like to acknowledge:

Citation

Citation will be updated after the conference proceedings are released.

@inproceedings{ging2025entitynet,
  author    = {Simon Ging and Sebastian Walter and Jelena Bratulić and Johannes Dienert and Hannah Bast and Thomas Brox},
  title     = {Using Knowledge Graphs to Harvest Datasets for Efficient CLIP Model Training},
  booktitle = {Pattern Recognition - 46th {DAGM} German Conference, {DAGM} {GCPR} 2025, Freiburg, Germany, September 23-26, 2025, Proceedings},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  year      = {2025},
  note      = {To appear},
}
