Mingyang Wu1 Ashirbad Mishra2 Soumik Dey2 Shuo Xing1 Naveen Ravipati2
Hansi Wu2 Binbin Li2 Zhengzhong Tu1†
1Texas A&M University 2eBay Inc.
†Corresponding Author
- 2026.03: Our code and dataset are under internal review.
- 2026.02: Our paper has been accepted to CVPR 2026.
We will release resources in stages after internal review. Please stay tuned.
- ConsID-Gen inference/training code
- ConsIDVid dataset: https://huggingface.co/datasets/mingyang-wu/ConsIDVid
- Model checkpoints
```shell
conda create -n considgen python=3.10
conda activate considgen
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu126
pip install -r requirements.txt
```

Use this section to download the model weights required by `run_inference_considgen.py`.
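Before downloading any weights, you can verify that the core packages installed above are importable. A minimal stdlib-only sketch (it checks importability only, not CUDA support):

```python
import importlib.util

# Check that each package can be found without actually importing it.
def check_deps(names=("torch", "torchvision")):
    return {name: importlib.util.find_spec(name) is not None for name in names}

# A missing package maps to False, e.g. {'torch': True, 'torchvision': False}.
print(check_deps())
```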
```shell
# 1) Install the Hugging Face CLI
pip install "huggingface_hub[cli]"

# 2) Download Wan2.1-Fun-1.3B-InP
mkdir -p ./models/PAI/Wan2.1-Fun-1.3B-InP
hf download "alibaba-pai/Wan2.1-Fun-1.3B-InP" --local-dir "./models/PAI/Wan2.1-Fun-1.3B-InP"

# 3) Download VGGT-1B
mkdir -p ./models/VGGT-1B
hf download "facebook/VGGT-1B" --local-dir "./models/VGGT-1B"

# 4) Download ConsID-Gen
hf download "mingyang-wu/ConsID-Gen" --local-dir "./models/ConsID-Gen/checkpoints"
```

A Google Drive mirror for the ConsID-Gen checkpoints is also available.
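Once the downloads finish, a quick way to confirm the expected local layout is in place. `missing_dirs` is a hypothetical helper (not part of the released code), and the paths assume the `--local-dir` values used above:

```python
from pathlib import Path

# Directories created by the download commands above.
EXPECTED = [
    "models/PAI/Wan2.1-Fun-1.3B-InP",
    "models/VGGT-1B",
    "models/ConsID-Gen/checkpoints",
]

def missing_dirs(root="."):
    """Return the expected model directories that do not exist under `root`."""
    base = Path(root)
    return [p for p in EXPECTED if not (base / p).is_dir()]

# An empty list means all three downloads are in place.
print(missing_dirs())
```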
Run single-image-conditioned generation with a fine-tuned checkpoint:

```shell
python run_inference_considgen.py \
    --input_image_path /path/to/input_image.jpg \
    --image_dir /path/to/multi_view_images_dir \
    --prompt "A product-style close-up video with stable lighting and clean background." \
    --output_dir ./tmp \
    --checkpoint_path models/train/ConsID-Gen/model.safetensors
```

Prepare the training dataset.
The training metadata should be a JSON list. Each sample needs:
- `video`: path to the training video
- `prompt`: text description for the video
- `image_list`: list of multi-view reference image paths
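A minimal stdlib-only sketch that checks these fields before launching training; `validate_metadata` is a hypothetical helper, not part of the released code:

```python
REQUIRED_KEYS = {"video", "prompt", "image_list"}

def validate_metadata(entries):
    """Return (index, problem) pairs; an empty list means the metadata looks well-formed."""
    if not isinstance(entries, list):
        return [(-1, "metadata must be a JSON list")]
    problems = []
    for i, item in enumerate(entries):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif not isinstance(item["image_list"], list) or not item["image_list"]:
            problems.append((i, "image_list must be a non-empty list"))
    return problems

sample = [{"video": "/path/to/video.mp4",
           "prompt": "A short description.",
           "image_list": ["/path/to/view_1.jpg"]}]
print(validate_metadata(sample))  # -> []
```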
Example metadata structure:
```json
[
    {
        "dataset": "example_dataset_name",
        "video": "/path/to/video.mp4",
        "prompt": "A short text description of the target video.",
        "image_list": [
            "/path/to/view_1.jpg",
            "/path/to/view_2.jpg"
        ]
    }
]
```

Launch training (this example pins the run to GPU 3 and disables Weights & Biases logging):

```shell
CUDA_VISIBLE_DEVICES=3 python run_train_considgen.py \
    --dataset_metadata_path ./example_metadata.json \
    --model_paths '["models/PAI/Wan2.1-Fun-1.3B-InP/diffusion_pytorch_model.safetensors","models/PAI/Wan2.1-Fun-1.3B-InP/models_t5_umt5-xxl-enc-bf16.pth","models/PAI/Wan2.1-Fun-1.3B-InP/Wan2.1_VAE.pth","models/PAI/Wan2.1-Fun-1.3B-InP/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth"]' \
    --vggt_model_path models/VGGT-1B \
    --tokenizer_path models/PAI/Wan2.1-Fun-1.3B-InP/google/umt5-xxl \
    --trainable_models dit,considgen_adapter \
    --output_path models/ConsID-Gen/example \
    --num_epochs 1 \
    --dataset_num_workers 0 \
    --wandb_mode disabled
```

If you find our work useful, please cite:

```bibtex
@misc{wu2026considgenviewconsistentidentitypreservingimagetovideo,
    title={ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation},
    author={Mingyang Wu and Ashirbad Mishra and Soumik Dey and Shuo Xing and Naveen Ravipati and Hansi Wu and Binbin Li and Zhengzhong Tu},
    year={2026},
    eprint={2602.10113},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2602.10113},
}
```