
[Regression] TabPFN v2.6 vs. v2.5 has worse log score and other proper scoring rules #873

@jonaslandsgesell

Describe the bug

We observed degraded performance of TabPFN v2.6 on the ScoringBench in the log score and other proper scoring rules such as CRPS, CRLS, CDE, and beta energy scores, which allow judging the quality of distributional forecasts; for further discussion see https://arxiv.org/abs/2603.08206

In our benchmarks, v2.6 ranked below v2.5 in log score. One dataset where the problem occurs is OpenML dataset 44056: https://www.openml.org/search?type=data&status=active&id=44056

MWE

We unit-tested our log score implementation against the TabPFN-internal implementation and found numerical equivalence between our implementation and criterion.compute_scaled_log_probs(logits), ruling out a log-score-specific problem on our side.
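For reference, such an equivalence check can be sketched in a self-contained form. The histogram ("bar") distribution below is only a stand-in for TabPFN's internal criterion, not its actual implementation, and `log_score_naive`/`log_score_stable` are hypothetical helpers:

```python
import numpy as np

def log_score_naive(logits, borders, y):
    """Log density of y under a piecewise-constant ("bar") distribution.

    borders has len(logits) + 1 edges; bucket probabilities come from a
    softmax over logits; the density inside bucket k is p_k / width_k.
    """
    probs = np.exp(logits) / np.exp(logits).sum()
    k = np.searchsorted(borders, y, side="right") - 1
    return np.log(probs[k] / (borders[k + 1] - borders[k]))

def log_score_stable(logits, borders, y):
    """Same quantity via a log-sum-exp, robust to large logits."""
    m = np.max(logits)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
    k = np.searchsorted(borders, y, side="right") - 1
    return log_probs[k] - np.log(borders[k + 1] - borders[k])

rng = np.random.default_rng(0)
logits, borders = rng.normal(size=10), np.linspace(0.0, 5.0, 11)
assert np.isclose(log_score_naive(logits, borders, 2.3),
                  log_score_stable(logits, borders, 2.3))
```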
Here is a minimal working example demonstrating the performance difference between TabPFN v2.5 and v2.6:

Steps/Code to Reproduce

import numpy as np
import openml
import torch
from tabpfn import TabPFNRegressor

# --- SETUP ---
DATASET_ID = 44056
V25, V26 = "v2.5", "v2.6"
MODEL_PATHS = {
    V25: "tabpfn-v2.5-regressor-v2.5_real.ckpt",
    V26: "tabpfn-v2.6-regressor-v2.6_default.ckpt",
}

print(f"Loading dataset {DATASET_ID}...")
dataset = openml.datasets.get_dataset(DATASET_ID)
X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=dataset.default_target_attribute)

# Shuffle before splitting
np.random.seed(42)
perm = np.random.permutation(len(X))
X, y = X.iloc[perm].reset_index(drop=True), np.asarray(y)[perm]

# Keep only numeric columns
numeric_cols = X.select_dtypes(include=[np.number]).columns
X = X[numeric_cols]

# --- DATA QUALITY CHECKS ---
print("\n=== DATA QUALITY CHECKS ===")
print(f"X shape: {X.shape}, y shape: {y.shape}")
print(f"X dtype: {X.dtypes.unique()}, y dtype: {y.dtype}")
print(f"y min: {np.nanmin(y):.6f}, max: {np.nanmax(y):.6f}, mean: {np.nanmean(y):.6f}, std: {np.nanstd(y):.6f}")
print(f"y NaN count: {np.isnan(y).sum()}, Inf count: {np.isinf(y).sum()}")
for col in X.columns:
    x_col = X[col]
    x_arr = x_col.to_numpy(dtype=float)
    nan_count = np.isnan(x_arr).sum()
    print(f"  {col}: min={np.nanmin(x_arr):.6f}, max={np.nanmax(x_arr):.6f}, NaN={nan_count}")


n_train = 2400
X_test_size = 600
X_train, X_test = X[:n_train], X[n_train:n_train + X_test_size]
y_train, y_test = np.asarray(y[:n_train]), np.asarray(y[n_train:n_train + X_test_size])
results_raw = {}

for name, path in MODEL_PATHS.items():
    print(f"\nEvaluating {name}...")
    model = TabPFNRegressor(device="cuda", model_path=path)
    model.fit(X_train, y_train)

    with torch.no_grad():
        out = model.predict(X_test, output_type="full")
        logits, criterion = out["logits"].cuda(), out["criterion"]
        y_t = torch.as_tensor(y_test, device="cuda", dtype=torch.float32).unsqueeze(-1)

        # Per-sample NLL via the model's internal criterion
        nll_vals = criterion(logits, y_t).cpu().numpy().flatten()
        
        inf_indices = np.where(np.isinf(nll_vals) | np.isnan(nll_vals))[0]
        
        y_pred = model.predict(X_test)
        residuals = y_test - y_pred

    results_raw[name] = {
        'nll_raw': nll_vals, 
        'residuals': residuals, 'inf_indices': inf_indices
    }

def print_table(title, data_dict, nll_key):
    print(f"\n{title}")
    print("-" * 80)
    print(f"{'Metric':<25} | {V25:<12} | {V26:<12} | Change")
    print("-" * 80)
    metrics = [
        ("Mean NLL",           lambda d: np.mean(d[nll_key])),
        ("Median NLL",         lambda d: np.median(d[nll_key])),
        ("Max NLL",            lambda d: np.max(d[nll_key])),
        ("MAE",                lambda d: np.mean(np.abs(d['residuals']))),
        ("Inf/NaN Count",      lambda d: len(d['inf_indices'])),
    ]
    for label, fn in metrics:
        a, b = fn(data_dict[V25]), fn(data_dict[V26])
        print(f"{label:<25} | {a:<12.4f} | {b:<12.4f} | {b - a:+.4f}")

# --- REPORT 1: ALL SAMPLES ---
print_table("REPORT: ALL SAMPLES", results_raw, 'nll_raw')

# --- REPORT 2: EXCLUDING INFS (UNION MASK) ---
common_mask = np.ones(X_test_size, dtype=bool)
for name in MODEL_PATHS.keys():
    common_mask[results_raw[name]['inf_indices']] = False

results_filtered = {}
for name in MODEL_PATHS.keys():
    results_filtered[name] = {
        k: (v[common_mask] if isinstance(v, np.ndarray) and len(v) == X_test_size else v)
        for k, v in results_raw[name].items()
    }
    results_filtered[name]['inf_indices'] = np.array([]) # Reset for table display

print_table(f"REPORT: EXCLUDING INFS ({np.sum(~common_mask)} samples removed)", results_filtered, 'nll_raw')
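Since the regression also shows up in CRPS, a self-contained numerical CRPS for a histogram forecast can corroborate the scoring-rule comparison. This is a hypothetical helper (pure NumPy, not TabPFN's API), evaluating CRPS(F, y) = ∫ (F(x) - 1{x ≥ y})² dx on a grid:

```python
import numpy as np

def crps_histogram(probs, borders, y, n_grid=20001):
    """CRPS of a piecewise-constant (histogram) forecast, computed numerically.

    probs: bucket probabilities summing to 1; borders: len(probs) + 1 edges.
    """
    x = np.linspace(borders[0], borders[-1], n_grid)
    # The CDF of a histogram forecast is piecewise linear between borders.
    cdf_at_borders = np.concatenate([[0.0], np.cumsum(probs)])
    F = np.interp(x, borders, cdf_at_borders)
    step = (x >= y).astype(float)
    integrand = (F - step) ** 2
    dx = x[1] - x[0]
    # Trapezoid rule (avoids np.trapz, which NumPy 2.x removed).
    return float(np.sum((integrand[:-1] + integrand[1:]) * 0.5 * dx))

# Uniform forecast on [0, 1], observation at 0.5: analytic CRPS is 1/12.
print(crps_histogram(np.array([1.0]), np.array([0.0, 1.0]), 0.5))
```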

Expected Results

Roughly similar NLL values (or better)

Actual Results

python debug_tabpfn_versions.py
Loading dataset 44056...

=== DATA QUALITY CHECKS ===
X shape: (8641, 3), y shape: (8641,)
X dtype: [dtype('float64')], y dtype: uint8
y min: 1.000000, max: 40.000000, mean: 16.863326, std: 12.384738
y NaN count: 0, Inf count: 0
  northing: min=-0.010000, max=3.806000, NaN=0
  easting: min=-0.004000, max=1.560000, NaN=0
  resistivity: min=0.890000, max=166.010000, NaN=0

Evaluating v2.5...

Evaluating v2.6...

REPORT: ALL SAMPLES
--------------------------------------------------------------------------------
Metric                    | v2.5         | v2.6         | Change
--------------------------------------------------------------------------------
Mean NLL                  | -1.0728      | inf          | +inf
Median NLL                | -1.2498      | 12.6195      | +13.8693
Max NLL                   | 4.3556       | inf          | +inf
MAE                       | 1.2312       | 0.8073       | -0.4239
Inf/NaN Count             | 0.0000       | 217.0000     | +217.0000

REPORT: EXCLUDING INFS (217 samples removed)
--------------------------------------------------------------------------------
Metric                    | v2.5         | v2.6         | Change
--------------------------------------------------------------------------------
Mean NLL                  | -1.0275      | 6.6914       | +7.7189
Median NLL                | -1.2132      | 7.8900       | +9.1032
Max NLL                   | 3.5921       | 16.1006      | +12.5085
MAE                       | 1.6193       | 1.0045       | -0.6148
Inf/NaN Count             | 0.0000       | 0.0000       | +0.0000
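The 217 Inf/NaN samples suggest the v2.6 model assigns zero predicted density at those targets. As intuition only, here is a toy sketch of how infinite NLL arises with a finite-support bucketed density; this is not TabPFN's actual bar-distribution code, which may handle tail mass differently:

```python
import numpy as np

def nll_histogram(probs, borders, y):
    """NLL of y under a toy piecewise-constant density with finite support."""
    if y < borders[0] or y >= borders[-1]:
        return np.inf  # zero density outside the outermost borders
    k = np.searchsorted(borders, y, side="right") - 1
    width = borders[k + 1] - borders[k]
    return -np.log(probs[k] / width)

borders = np.array([0.0, 1.0, 2.0])
probs = np.array([0.5, 0.5])
print(nll_histogram(probs, borders, 0.5))  # finite
print(nll_histogram(probs, borders, 5.0))  # inf: target outside the support
```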

Versions

PyTorch version: 2.9.1+cu128
CUDA used to build PyTorch: 12.8

Dependency Versions:
--------------------
tabpfn: 7.0.1
torch: 2.9.1
numpy: 2.1.3
scipy: 1.15.3
pandas: 2.3.3
scikit-learn: 1.6.1
typing_extensions: 4.15.0
einops: 0.8.2
huggingface-hub: 0.36.2
