Describe the bug
Description
We observed poor performance of TabPFN v2.6 on the ScoringBench in the log score and other proper scoring rules such as CRPS, CRLS, CDE, and beta energy scores, which allow judging the quality of distributional forecasts; for further discussion see https://arxiv.org/abs/2603.08206
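For context, a small self-contained illustration of what the log score measures (this is generic, not TabPFN code): it is the negative log density of the predictive distribution at the observed value, so a sharp, well-calibrated forecast scores lower (better) than a diffuse one.

```python
import numpy as np

# Illustration only (not TabPFN code): log score of a Gaussian
# predictive distribution N(mu, sigma^2) at observation y, i.e. the
# negative log density. Lower is better under this proper scoring rule.
def gaussian_log_score(y, mu, sigma):
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

y_obs = 1.0
sharp = gaussian_log_score(y_obs, mu=1.0, sigma=0.5)  # confident, accurate forecast
vague = gaussian_log_score(y_obs, mu=1.0, sigma=5.0)  # diffuse forecast
assert sharp < vague  # the sharper calibrated forecast scores better
```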
In our benchmarks, v2.6 ranked below v2.5 in log score. One example where the problem occurs is OpenML Dataset 44056 https://www.openml.org/search?type=data&status=active&id=44056
MWE
We unit-tested our log score implementation against TabPFN's internal implementation and found numerical equivalence between our implementation and criterion.compute_scaled_log_probs(logits), ruling out a log-score-specific problem on our side.
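The shape of that equivalence test can be sketched as follows, with stand-in functions (TabPFN's criterion.compute_scaled_log_probs is not reimplemented here): two independent log-prob implementations over the discretized target bins should agree to floating-point tolerance on the same logits.

```python
import numpy as np

# Sketch of the unit test described above; both functions here are
# hypothetical stand-ins, not TabPFN internals.
def stable_log_probs(logits):
    # numerically stable log-softmax over the bin dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def naive_log_probs(logits):
    # direct log(softmax); adequate for moderate logits
    p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return np.log(p)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 16))  # 8 samples, 16 target bins
assert np.allclose(stable_log_probs(logits), naive_log_probs(logits), atol=1e-8)
```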
Below is a minimal working example demonstrating the difference in performance between TabPFN v2.5 and TabPFN v2.6:
Steps/Code to Reproduce
import numpy as np
import openml
import torch
from tabpfn import TabPFNRegressor
# --- SETUP ---
DATASET_ID = 44056
V25, V26 = "v2.5", "v2.6"
MODEL_PATHS = {
    V25: "tabpfn-v2.5-regressor-v2.5_real.ckpt",
    V26: "tabpfn-v2.6-regressor-v2.6_default.ckpt",
}
print(f"Loading dataset {DATASET_ID}...")
dataset = openml.datasets.get_dataset(DATASET_ID)
X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=dataset.default_target_attribute)
# Shuffle before splitting
np.random.seed(42)
perm = np.random.permutation(len(X))
X, y = X.iloc[perm].reset_index(drop=True), np.asarray(y)[perm]
# Keep only numeric columns
numeric_cols = X.select_dtypes(include=[np.number]).columns
X = X[numeric_cols]
# --- DATA QUALITY CHECKS ---
print("\n=== DATA QUALITY CHECKS ===")
print(f"X shape: {X.shape}, y shape: {y.shape}")
print(f"X dtype: {X.dtypes.unique()}, y dtype: {y.dtype}")
print(f"y min: {np.nanmin(y):.6f}, max: {np.nanmax(y):.6f}, mean: {np.nanmean(y):.6f}, std: {np.nanstd(y):.6f}")
print(f"y NaN count: {np.isnan(y).sum()}, Inf count: {np.isinf(y).sum()}")
for col in X.columns:
    x_col = X[col]
    x_arr = x_col.to_numpy(dtype=float)
    nan_count = np.isnan(x_arr).sum()
    print(f"  {col}: min={np.nanmin(x_arr):.6f}, max={np.nanmax(x_arr):.6f}, NaN={nan_count}")
n_train = 2400
X_test_size = 600
X_train, X_test = X[:n_train], X[n_train:n_train + X_test_size]
y_train, y_test = np.asarray(y[:n_train]), np.asarray(y[n_train:n_train + X_test_size])
results_raw = {}
for name, path in MODEL_PATHS.items():
    print(f"\nEvaluating {name}...")
    model = TabPFNRegressor(device="cuda", model_path=path)
    model.fit(X_train, y_train)
    with torch.no_grad():
        out = model.predict(X_test, output_type="full")
        logits, criterion = out["logits"].cuda(), out["criterion"]
        y_t = torch.as_tensor(y_test, device="cuda", dtype=torch.float32).unsqueeze(-1)
        # Compute NLL loss using criterion.forward()
        nll_vals = criterion.forward(logits, y_t).cpu().numpy().flatten()
    inf_indices = np.where(np.isinf(nll_vals) | np.isnan(nll_vals))[0]
    y_pred = model.predict(X_test)
    residuals = y_test - y_pred
    results_raw[name] = {
        'nll_raw': nll_vals,
        'residuals': residuals, 'inf_indices': inf_indices
    }
def print_table(title, data_dict, nll_key):
    print(f"\n{title}")
    print("-" * 80)
    print(f"{'Metric':<25} | {V25:<12} | {V26:<12} | Change")
    print("-" * 80)
    metrics = [
        ("Mean NLL", lambda d: np.mean(d[nll_key])),
        ("Median NLL", lambda d: np.median(d[nll_key])),
        ("Max NLL", lambda d: np.max(d[nll_key])),
        ("MAE", lambda d: np.mean(np.abs(d['residuals']))),
        ("Inf/NaN Count", lambda d: len(d['inf_indices'])),
    ]
    for label, fn in metrics:
        a, b = fn(data_dict[V25]), fn(data_dict[V26])
        print(f"{label:<25} | {a:<12.4f} | {b:<12.4f} | {b - a:+.4f}")
# --- REPORT 1: ALL SAMPLES ---
print_table("REPORT: ALL SAMPLES", results_raw, 'nll_raw')
# --- REPORT 2: EXCLUDING INFS (UNION MASK) ---
common_mask = np.ones(X_test_size, dtype=bool)
for name in MODEL_PATHS.keys():
    common_mask[results_raw[name]['inf_indices']] = False
results_filtered = {}
for name in MODEL_PATHS.keys():
    results_filtered[name] = {
        k: (v[common_mask] if isinstance(v, np.ndarray) and len(v) == X_test_size else v)
        for k, v in results_raw[name].items()
    }
    results_filtered[name]['inf_indices'] = np.array([])  # Reset for table display
print_table(f"REPORT: EXCLUDING INFS ({np.sum(~common_mask)} samples removed)", results_filtered, 'nll_raw')
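The union-mask filtering used for the second report can be illustrated in isolation (toy values, not from the dataset above): samples whose NLL is inf or NaN under either model are dropped from both before the means are compared.

```python
import numpy as np

# Toy illustration of the union-mask filtering: drop samples that are
# non-finite under *either* model, then compare means on the remainder.
nll_a = np.array([0.5, 1.0, 2.0, 0.8])
nll_b = np.array([0.7, np.inf, 1.5, np.nan])
mask = np.isfinite(nll_a) & np.isfinite(nll_b)  # keeps samples 0 and 2
print(nll_a[mask].mean())  # 1.25
print(nll_b[mask].mean())  # 1.1
```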
Expected Results
Roughly similar NLL values (or better)
Actual Results
python debug_tabpfn_versions.py
Loading dataset 44056...
=== DATA QUALITY CHECKS ===
X shape: (8641, 3), y shape: (8641,)
X dtype: [dtype('float64')], y dtype: uint8
y min: 1.000000, max: 40.000000, mean: 16.863326, std: 12.384738
y NaN count: 0, Inf count: 0
northing: min=-0.010000, max=3.806000, NaN=0
easting: min=-0.004000, max=1.560000, NaN=0
resistivity: min=0.890000, max=166.010000, NaN=0
Evaluating v2.5...
Evaluating v2.6...
REPORT: ALL SAMPLES
--------------------------------------------------------------------------------
Metric | v2.5 | v2.6 | Change
--------------------------------------------------------------------------------
Mean NLL | -1.0728 | inf | +inf
Median NLL | -1.2498 | 12.6195 | +13.8693
Max NLL | 4.3556 | inf | +inf
MAE | 1.2312 | 0.8073 | -0.4239
Inf/NaN Count | 0.0000 | 217.0000 | +217.0000
REPORT: EXCLUDING INFS (217 samples removed)
--------------------------------------------------------------------------------
Metric | v2.5 | v2.6 | Change
--------------------------------------------------------------------------------
Mean NLL | -1.0275 | 6.6914 | +7.7189
Median NLL | -1.2132 | 7.8900 | +9.1032
Max NLL | 3.5921 | 16.1006 | +12.5085
MAE | 1.6193 | 1.0045 | -0.6148
Inf/NaN Count | 0.0000 | 0.0000 | +0.0000
Versions
PyTorch version: 2.9.1+cu128
CUDA used to build PyTorch: 12.8
Dependency Versions:
--------------------
tabpfn: 7.0.1
torch: 2.9.1
numpy: 2.1.3
scipy: 1.15.3
pandas: 2.3.3
scikit-learn: 1.6.1
typing_extensions: 4.15.0
einops: 0.8.2
huggingface-hub: 0.36.2