-
Notifications
You must be signed in to change notification settings - Fork 45
Description
Hello! This is Robin.
Thank you for the amazing work. I am a researcher working on linguistic features related to clinical symptoms and I had issues concerning the stability of the result produce by SCCA_IPLS. I was able to reproduce the issue even with random data. See below :
I have very strange behavior when using the SCCA_IPLS model with 2 or more latent dimensions. In many cases it failed to fit because NaN values appear in the variable "target = np.sum(scores[mask], axis=0) " line 151 in file "linear/_iterative/_elastic.py."
I think it is due to numerical instability that leads to fitting errors. This issue is particularly noticeable when a small constant is added to the data, but it resolves when the latent dimension is set to one.
Code to reproduce:
import numpy as np
import pandas as pd
from cca_zoo.linear import SCCA_IPLS
from sklearn.preprocessing import StandardScaler, RobustScaler
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from tqdm import tqdm
from sklearn.metrics import make_scorer
# Initialize random data with a small constant added to avoid numerical instability
np.random.seed(10)
small_constant = 0#0.000001 # 10
X = np.random.rand(100, 5) + small_constant
Y = np.random.rand(100, 5) + small_constant
assert not np.isnan(X).any(), "X contains NaN values"
assert not np.isnan(Y).any(), "Y contains NaN values"
# Dummy best parameters
best_overall_params = {
'alpha': 0.1,
'l1_ratio': 0.1,
'latent_dimensions': 2
}
# Initialize the SCCA_IPLS model
model = SCCA_IPLS(random_state=2,
alpha=best_overall_params['alpha'],
l1_ratio=best_overall_params['l1_ratio'],
latent_dimensions=best_overall_params['latent_dimensions'])
def cca_score(model, X_Y):
X, Y = X_Y
U, V = model.transform((X, Y))
canonical_correlations = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(U.shape[1])]
return np.max(canonical_correlations)
cca_scorer = make_scorer(cca_score, greater_is_better=True, needs_proba=False)
def fit_model(model, X, Y):
try:
model.fit((X, Y))
U, V = model.transform((X, Y))
return U, V
except Exception as e:
print(f"Error during model fitting: {e}")
return None, None
U, V = fit_model(model, X, Y)
# Check for NaNs in the original fit
if U is None or V is None:
raise ValueError("Model fitting failed for the original data.")
# Calculate the original canonical correlation for the first dimension
original_corr = np.corrcoef(U[:, 0], V[:, 0])[0, 1]
# Permutation testing
num_permutations = 100
null_distributions = []
for _ in tqdm(range(num_permutations), desc="Permutations"):
# Shuffle Y values
Y_permuted = shuffle(Y, random_state=np.random.randint(0, 10000))
# Ensure no NaN values after shuffling
assert not np.isnan(Y_permuted).any(), "Y_permuted contains NaN values before fitting"
U_perm, V_perm = fit_model(model, X, Y_permuted)
# Check for NaNs in the permuted fit
if U_perm is None or V_perm is None:
print("Skipping permutation due to fitting error")
continue # Skip this permutation if fitting failed
perm_corr = np.corrcoef(U_perm[:, 0], V_perm[:, 0])[0, 1]
null_distributions.append(perm_corr)
# Check if null_distributions is empty
if not null_distributions:
raise ValueError("All permutations failed.")
null_distributions = np.array(null_distributions)
p_value = (np.sum(null_distributions >= original_corr) + 1.0) / (num_permutations + 1)
# Create a DataFrame to display the significant results if any
significant_results = pd.DataFrame({
'Correlation': [original_corr],
'p-value': [p_value]
})
# Display the results
print(significant_results)
# Plot the null distribution and original correlation for visualization
plt.hist(null_distributions, bins=50, alpha=0.7, label='Null Distribution')
plt.axvline(original_corr, color='red', linestyle='dashed', linewidth=2, label='Original Correlation')
plt.legend()
plt.title('Null Distribution of Canonical Correlations')
plt.xlabel('Correlation')
plt.ylabel('Frequency')
plt.show()Observations:
- With latent_dimensions=2 and small_constant=0 or 1.0e10-5: The code fails with numerical instability errors during model fitting.
- With latent_dimensions=2 and small_constant=10: the code runs succefully
- With latent_dimensions=1: The code runs successfully regardless of the small constant added.
Expected Behaviour
The model should be able to handle multiple latent dimensions without numerical instability.
The addition of constants should not affect the result as it SCCA should be linear invariant.
Working environment
Ubuntu 22.04.4 LTS
Python 3.9.19
cca-zoo==2.6.0
scikit-learn==1.5.1
scipy==1.11.3
numpy==1.26.0
Thank you for your help,
Robin