SCCA_IPLS not converging when latent dimension >1 ("Error during model fitting: Input y contains NaN") #196

@psycholinguistics2125

Description

Hello! This is Robin.

Thank you for the amazing work. I am a researcher working on linguistic features related to clinical symptoms, and I ran into issues with the stability of the results produced by SCCA_IPLS. I was able to reproduce the issue even with random data; see below:

I see very strange behavior when using the SCCA_IPLS model with 2 or more latent dimensions. In many cases the fit fails because NaN values appear in the variable "target = np.sum(scores[mask], axis=0)" (line 151 of "linear/_iterative/_elastic.py").

I think it is due to numerical instability that leads to fitting errors. The failure occurs when no constant (or only a tiny constant) is added to the data, and it disappears when a large constant is added or when the latent dimension is set to one.
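For what it's worth, here is a minimal sketch of the failure mode I suspect (this is an assumption on my side, not cca-zoo's actual internals): if an iterative update normalizes a weight vector that has been shrunk to zero by the sparsity penalty, the 0/0 division produces NaN weights, which then poison any downstream sum like the one on line 151:

```python
import numpy as np

# Hypothetical illustration of the suspected failure mode (NOT cca-zoo's code):
# normalizing a weight vector that collapsed to zero yields 0/0 = NaN,
# and the NaN then propagates through every later computation.
w = np.zeros(5)                        # weight vector shrunk to zero by the penalty
with np.errstate(invalid="ignore"):    # silence the 0/0 warning for the demo
    w_norm = w / np.linalg.norm(w)     # 0 / 0 -> array of NaN
scores = np.ones((3, 5)) * w_norm      # scores computed from the NaN weights
target = np.sum(scores, axis=0)        # the aggregated target is now all NaN
print(np.isnan(target).all())
```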

Code to reproduce:

import numpy as np
import pandas as pd
from cca_zoo.linear import SCCA_IPLS
from sklearn.preprocessing import StandardScaler, RobustScaler
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from tqdm import tqdm
from sklearn.metrics import make_scorer

# Initialize random data; a constant can optionally be added
# (0 or a tiny value triggers the instability, a large value such as 10 does not)
np.random.seed(10)
small_constant = 0  # try 0 or 1e-6 (fails) vs 10 (works)
X = np.random.rand(100, 5) + small_constant
Y = np.random.rand(100, 5) + small_constant

assert not np.isnan(X).any(), "X contains NaN values"
assert not np.isnan(Y).any(), "Y contains NaN values"

# Dummy best parameters
best_overall_params = {
    'alpha': 0.1,
    'l1_ratio': 0.1,
    'latent_dimensions': 2
}

# Initialize the SCCA_IPLS model
model = SCCA_IPLS(random_state=2, 
                  alpha=best_overall_params['alpha'],
                  l1_ratio=best_overall_params['l1_ratio'], 
                  latent_dimensions=best_overall_params['latent_dimensions'])


def fit_model(model, X, Y):
    try:
        model.fit((X, Y))
        U, V = model.transform((X, Y))
        return U, V
    except Exception as e:
        print(f"Error during model fitting: {e}")
        return None, None

U, V = fit_model(model, X, Y)

# Check for NaNs in the original fit
if U is None or V is None:
    raise ValueError("Model fitting failed for the original data.")

# Calculate the original canonical correlation for the first dimension
original_corr = np.corrcoef(U[:, 0], V[:, 0])[0, 1]

# Permutation testing
num_permutations = 100
null_distributions = []

for _ in tqdm(range(num_permutations), desc="Permutations"):
    # Shuffle Y values
    Y_permuted = shuffle(Y, random_state=np.random.randint(0, 10000))
    
    # Ensure no NaN values after shuffling
    assert not np.isnan(Y_permuted).any(), "Y_permuted contains NaN values before fitting"

    U_perm, V_perm = fit_model(model, X, Y_permuted)

    # Check for NaNs in the permuted fit
    if U_perm is None or V_perm is None:
        print("Skipping permutation due to fitting error")
        continue  # Skip this permutation if fitting failed

    perm_corr = np.corrcoef(U_perm[:, 0], V_perm[:, 0])[0, 1]
    null_distributions.append(perm_corr)

# Check if null_distributions is empty
if not null_distributions:
    raise ValueError("All permutations failed.")

null_distributions = np.array(null_distributions)
# Use the number of successful permutations in the denominator, since some may be skipped
p_value = (np.sum(null_distributions >= original_corr) + 1.0) / (len(null_distributions) + 1)

# Create a DataFrame to display the significant results if any
significant_results = pd.DataFrame({
    'Correlation': [original_corr],
    'p-value': [p_value]
})

# Display the results
print(significant_results)

# Plot the null distribution and original correlation for visualization
plt.hist(null_distributions, bins=50, alpha=0.7, label='Null Distribution')
plt.axvline(original_corr, color='red', linestyle='dashed', linewidth=2, label='Original Correlation')
plt.legend()
plt.title('Null Distribution of Canonical Correlations')
plt.xlabel('Correlation')
plt.ylabel('Frequency')
plt.show()

Observations:

  • With latent_dimensions=2 and small_constant=0 or 1e-5: the code fails with numerical instability errors during model fitting.
  • With latent_dimensions=2 and small_constant=10: the code runs successfully.
  • With latent_dimensions=1: the code runs successfully regardless of the constant added.
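As a possible workaround (my own assumption, not a confirmed fix), standardizing each view before fitting removes any constant offset, so the result can no longer depend on which constant was added to the raw data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(10)
small_constant = 10
X = rng.random((100, 5)) + small_constant

# After standardization every column has mean 0 and std 1, so a constant
# offset in the raw data is removed before the model ever sees it.
X_std = StandardScaler().fit_transform(X)
print(np.allclose(X_std.mean(axis=0), 0.0))
print(np.allclose(X_std.std(axis=0), 1.0))
```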

Expected Behaviour

The model should be able to handle multiple latent dimensions without numerical instability.
Adding a constant should not affect the result, since SCCA maximizes correlations and should therefore be invariant to translating the data.
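To make the invariance claim concrete: the quantity CCA maximizes is a Pearson correlation, which is unchanged when a constant is added to the inputs. A small self-contained check, independent of cca-zoo:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(100)
v = rng.random(100)

# Pearson correlation is invariant to translation, so adding small_constant
# to X and Y should not change the CCA objective.
r_raw = np.corrcoef(u, v)[0, 1]
r_shifted = np.corrcoef(u + 10.0, v + 10.0)[0, 1]
print(np.isclose(r_raw, r_shifted))
```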

Working environment

Ubuntu 22.04.4 LTS
Python 3.9.19
cca-zoo==2.6.0
scikit-learn==1.5.1
scipy==1.11.3
numpy==1.26.0

Thank you for your help,

Robin
