Confusing error when multiple dataframe columns have the same name #11876

@Yura52

Description

The following snippet is a minimal example demonstrating the issue:

import numpy as np
import pandas as pd
from xgboost import DMatrix

df_numeric = pd.DataFrame(np.random.randn(10, 2))
df_categorical = pd.DataFrame(np.random.randint(0, 2, (10, 2))).astype('category')

df = pd.concat(
    [df_numeric, df_categorical],
    axis=1,
    # ignore_index=True  # <-- Uncomment to fix the issue
)
DMatrix(df, enable_categorical=True)

The above code triggers the following exception:

...
AttributeError: 'DataFrame' object has no attribute 'dtype'

This error message does not point to the root cause of the exception. Passing ignore_index=True to pd.concat fixes the issue, so the problem seems to be duplicate column names: without ignore_index, both frames keep their default column labels 0 and 1, and the concatenated frame ends up with each label twice. My suggestion is to raise a more user-friendly exception in this scenario.
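
For illustration, here is a minimal pandas-only sketch of the likely mechanism. It assumes, without having checked the xgboost source, that the data-frame handling selects columns by label and reads `.dtype` on the result. With duplicated labels, pandas returns a DataFrame rather than a Series for such a selection, and a DataFrame has `.dtypes` but no `.dtype`. The helper `check_unique_columns` is hypothetical and not part of xgboost; it only shows the kind of up-front check and message that would be clearer.

```python
import pandas as pd

# Two small frames that both carry the default column labels 0 and 1.
df_numeric = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]])
df_categorical = pd.DataFrame([[0, 1], [1, 0]]).astype("category")

# Without ignore_index=True the labels are kept, so 0 and 1 each appear twice.
df = pd.concat([df_numeric, df_categorical], axis=1)
duplicated = df.columns[df.columns.duplicated()].unique().tolist()

# Selecting a duplicated label yields a DataFrame, not a Series.
selected = df[0]
selected_is_frame = isinstance(selected, pd.DataFrame)
has_dtype = hasattr(selected, "dtype")


def check_unique_columns(frame: pd.DataFrame) -> None:
    """Hypothetical helper: fail early with a clear message on duplicate names."""
    dupes = frame.columns[frame.columns.duplicated()].unique().tolist()
    if dupes:
        raise ValueError(f"DataFrame has duplicate column names: {dupes}")
```

Calling `check_unique_columns(df)` before constructing the DMatrix would raise a ValueError that names the offending labels. Until something like this exists upstream, passing ignore_index=True to pd.concat or assigning unique names to df.columns avoids the problem.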

Software

python==3.12.9
xgboost==3.1.2
