Cannot pass feature_groups to the fit method when using GridSearchCV + GRCCA #202

@JohannesWiesner

Hi James, this might be related to #150. I would like to use GridSearchCV together with GRCCA, but I cannot find a way to pass the feature groups through to the .fit() method of GRCCA.

Currently I am getting:

ValueError: 
All the 40 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 886, in _fit_and_score
    estimator.fit(X_train, **fit_params)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/pipeline.py", line 468, in fit
    routed_params = self._check_method_params(method="fit", props=params)
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/pipeline.py", line 374, in _check_method_params
    fit_params_steps[step]["fit"][param] = pval
  File "/zi/home/johannes.wiesner/micromamba/envs/csp_wiesner_johannes/lib/python3.9/site-packages/sklearn/utils/_bunch.py", line 39, in __getitem__
    return super().__getitem__(key)
KeyError: 'grcca'
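
To make the traceback easier to read, here is a minimal pure-Python sketch of the kind of routing that the failing line (`fit_params_steps[step]["fit"][param] = pval`) appears to perform. It is a simplified stand-in, not scikit-learn's actual code: the function name `route_fit_params` and the toy step lists are hypothetical. It only illustrates that a `step__param` keyword is split on the first `__` and that a prefix which matches no step name produces a `KeyError` on that prefix.

```python
def route_fit_params(step_names, fit_params):
    """Split 'step__param' keyword arguments into one dict per pipeline step.

    Hypothetical helper for illustration only; it mimics the behaviour
    suggested by the traceback, not scikit-learn's implementation.
    """
    routed = {name: {} for name in step_names}
    for key, value in fit_params.items():
        if "__" not in key:
            raise ValueError(f"{key!r} must be written as 'step__param'")
        step, param = key.split("__", 1)
        # Raises KeyError(step) when `step` is not one of `step_names`.
        routed[step][param] = value
    return routed


groups = [[0, 0, 1], [0]]  # toy feature groups, one list per view

# The prefix matches a step name, so the argument reaches that step.
routed = route_fit_params(["preprocessing", "grcca"],
                          {"grcca__feature_groups": groups})

# The prefix matches no step name: a KeyError on the prefix is raised.
try:
    route_fit_params(["preprocessing"], {"grcca__feature_groups": groups})
except KeyError as exc:
    missing_step = exc.args[0]
```

Read this way, `KeyError: 'grcca'` suggests that the pipeline receiving the fit parameters did not expose a step called `grcca` at that point. Why that would be the case here is not established by the traceback alone.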

Here's some example code:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from cca_zoo.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from cca_zoo.preprocessing import MultiViewPreprocessing
from sklearn.preprocessing import StandardScaler
from cca_zoo.linear import GRCCA

###############################################################################
## Simulate Data: Not part of the question
###############################################################################

# Random state for reproducibility
rng = np.random.RandomState(42)

# Parameters
n_samples = 100
n_features_X = 100
n_features_Y = 10
latent_correlation = 0.6

# Generate a latent variable
latent_dim = 1
latent_variable = rng.randn(n_samples, latent_dim)

# Generate X with structured covariance
# Define groups
group_sizes = [50, 25, 25]
group_correlations = [0.8, 0.7, 0.6]
X = np.zeros((n_samples, n_features_X))
current_feature = 0

for group_size, group_corr in zip(group_sizes, group_correlations):
    
    # Generate a group latent variable
    group_latent = latent_variable + rng.randn(n_samples, 1) * (1 - group_corr)
    
    # Generate group features
    group_features = group_latent @ rng.randn(1, group_size) + rng.randn(n_samples, group_size) * (1 - group_corr)
    X[:, current_feature:current_feature + group_size] = group_features
    current_feature += group_size

# Generate Y based on the latent variable
Y = latent_variable @ rng.randn(1, n_features_Y) + rng.randn(n_samples, n_features_Y) * (1 - latent_correlation)

###############################################################################
## Bring data in nice format: Not part of the question
###############################################################################

subject_ids = [f"subject_{i+1}" for i in range(n_samples)]

# get df_brain
df_brain = pd.DataFrame(X)
df_brain.index = subject_ids
df_brain.index.name = 'subject_id'
X_columns = pd.MultiIndex.from_arrays(
    [
        [f"area_{i+1}" for i in range(100)],  # area_label_idx
        ["network_1"] * 50 + ["network_2"] * 25 + ["network_3"] * 25  # brain_network_idx
    ],
    names=["brain_area","brain_network"]
)
df_brain.columns = X_columns

# get df_behavior
df_behavior = pd.DataFrame(Y)
df_behavior.index = subject_ids
df_behavior.index.name = 'subject_id'
df_behavior.columns = [f"behavioral_variable_{idx+1}" for idx in range(len(df_behavior.columns))]

###############################################################################
## Prepare Analysis: Somehow part of the question?
###############################################################################

# get feature groups: features in df_brain belong to 3 groups; features in df_behavior
# have no groups, so every feature gets the same label (all features belong to one group)
groups_brain = df_brain.columns.get_level_values('brain_network').astype('category').codes.astype('int64')
groups_behavior = np.zeros(len(df_behavior.columns), dtype='int64')
feature_groups = [groups_brain, groups_behavior]

# define latent dimensions
latent_dimensions = 1

# define folds
cv = KFold(5)

# just get numpy arrays
X1 = df_brain.values
X2 = df_behavior.values

###############################################################################
## Actual Question: Run GridSearch with Pipeline that includes Standardization 
## and GRCCA
###############################################################################

# define an estimator
estimator = Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(), StandardScaler()))),
    ('grcca', GRCCA(latent_dimensions=latent_dimensions, random_state=rng)),
])

# define grid (one list of candidate values per view)
param_grid = {
    'grcca__c': [[10**x for x in range(-1, 1)], [10**x for x in range(-1, 1)]],
    'grcca__mu': [[10**x for x in range(-1, 1)], [0]],
}

# run gridsearch
grid = GridSearchCV(estimator,param_grid,cv=cv)
grid.fit([X1,X2],grcca__feature_groups=feature_groups)
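
As a sanity check on the "40 fits failed" count in the error message, the sketch below counts the candidates in the grid above. It assumes that the per-view lists are combined as a Cartesian product (one value per view per candidate); that is an assumption about how the grid is expanded, not something confirmed here. Under that assumption every single fit failed, which points at the fit call itself rather than at particular parameter values.

```python
from itertools import product

# Per-view candidate values, copied from param_grid above.
c_per_view = [[10**x for x in range(-1, 1)], [10**x for x in range(-1, 1)]]
mu_per_view = [[10**x for x in range(-1, 1)], [0]]

# Assumption: one candidate picks one value per view (Cartesian product).
c_candidates = list(product(*c_per_view))    # 2 * 2 = 4 combinations
mu_candidates = list(product(*mu_per_view))  # 2 * 1 = 2 combinations

n_candidates = len(c_candidates) * len(mu_candidates)
n_folds = 5  # matches KFold(5)
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)  # prints "8 40"
```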
