🐞 Default MVTecAD fails to build datasets #3581

@rogergheser

Describe the bug

With the default MVTecAD datamodule, I construct it, call prepare_data() and setup(), and then call train_dataloader(). The call fails because the training dataset is empty. This appears to be caused by a bug in how the Split enum interacts with pandas.

Error statement

Zero subset length encountered during splitting. This means one of your subsets
            might be empty or devoid of either normal or anomalous images.
Traceback (most recent call last):
  File "/Users/amirgheser/SSA-IAD/bug.py", line 9, in <module>
    for i, data in enumerate(datamodule.train_dataloader()):
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/amirgheser/SSA-IAD/.venv/lib/python3.12/site-packages/anomalib/data/datamodules/base/image.py", line 373, in train_dataloader
    return DataLoader(
           ^^^^^^^^^^^
  File "/Users/amirgheser/SSA-IAD/.venv/lib/python3.12/site-packages/torch/utils/data/dataloader.py", line 394, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/amirgheser/SSA-IAD/.venv/lib/python3.12/site-packages/torch/utils/data/sampler.py", line 149, in __init__
    raise ValueError(

Proposed fix

In the make_mvtec_ad_dataset function, add the following line (or anything with equivalent logic) before the split is used to filter the dataframe:

    split = split.value if isinstance(split, Split) else split

The dataframe is built and handled correctly up to the last if statement, if split:. At that point the dataframe is emptied, because the equality comparison between the enum and the string values in the split column is always False, so no rows survive the filter.
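To illustrate the proposed normalization in isolation, here is a minimal sketch. It does not reproduce anomalib's code: the Split class is a hypothetical stand-in for anomalib's enum (its member values are assumed), normalize_split and filter_rows are illustrative names, and a list of dicts stands in for the pandas DataFrame that make_mvtec_ad_dataset filters.

```python
from __future__ import annotations

from enum import Enum


class Split(str, Enum):
    """Hypothetical stand-in for anomalib's Split enum (member values assumed)."""

    TRAIN = "train"
    VAL = "val"
    TEST = "test"


def normalize_split(split: str | Split | None) -> str | None:
    """Return the plain string behind a Split member; pass str and None through."""
    return split.value if isinstance(split, Split) else split


def filter_rows(
    rows: list[dict[str, str]],
    split: str | Split | None = None,
) -> list[dict[str, str]]:
    """Keep rows whose "split" field matches (illustrative, not anomalib's code)."""
    # Normalize first, so the comparison below is always str against str.
    split = normalize_split(split)
    if split:
        rows = [row for row in rows if row["split"] == split]
    return rows


# Two made-up sample rows: one training image, one test image.
rows = [
    {"image_path": "train/good/000.png", "split": "train"},
    {"image_path": "test/broken_large/000.png", "split": "test"},
]
train_rows = filter_rows(rows, Split.TRAIN)
print(len(train_rows))
```

Normalizing once at the top of the function keeps the later filter independent of how any particular pandas version compares enum members against a string column.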

Dataset

MVTecAD

Model

N/A

Steps to reproduce the behavior

Running this minimal script, which only checks that the train dataloader yields a batch, fails with the error above:

if __name__ == "__main__":
    from anomalib.data import MVTecAD
    
    datamodule = MVTecAD(category="bottle")
    datamodule.prepare_data()
    datamodule.setup()
    
    for i, data in enumerate(datamodule.train_dataloader()):
        print(data.keys())
        print(data["image"].shape)
        break

OS information

  • OS: [e.g. Ubuntu 20.04]
  • Python version: [e.g. 3.10.0]
  • Anomalib version: [e.g. 0.3.6]
  • PyTorch version: [e.g. 1.9.0]
  • CUDA/cuDNN version: [e.g. 11.1]
  • GPU models and configuration: [e.g. 2x GeForce RTX 3090]
  • Any other relevant information: [e.g. I'm using a custom dataset]

Expected behavior

No error message :)

Screenshots

No response

Pip/GitHub

pip

What version/branch did you use?

2.4.0

Configuration YAML

.

Logs

.

Code of Conduct

  • I agree to follow this project's Code of Conduct
