Describe the bug
Description
We observed poor performance of TabPFN v2.6 on the ScoringBench in the log score and other proper scoring rules such as CRPS, CRLS, CDE, and beta energy scores, which allow judging the quality of distributional forecasts; for further discussion see https://arxiv.org/abs/2603.08206
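For context, a small self-contained illustration of what the log score measures (this is generic, not TabPFN code): it is the negative log density of the predictive distribution at the observed value, so a sharp, well-calibrated forecast scores lower (better) than a diffuse one.

```python
import numpy as np

# Illustration only (not TabPFN code): log score of a Gaussian
# predictive distribution N(mu, sigma^2) at observation y, i.e. the
# negative log density. Lower is better under this proper scoring rule.
def gaussian_log_score(y, mu, sigma):
    return 0.5 * np.log(2 * np.pi * sigma**2) + (y - mu) ** 2 / (2 * sigma**2)

y_obs = 1.0
sharp = gaussian_log_score(y_obs, mu=1.0, sigma=0.5)  # confident, accurate forecast
vague = gaussian_log_score(y_obs, mu=1.0, sigma=5.0)  # diffuse forecast
assert sharp < vague  # the sharper calibrated forecast scores better
```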
In our benchmarks, v2.6 ranked below v2.5 in log score. One example where the problem occurs is OpenML Dataset 44056 https://www.openml.org/search?type=data&status=active&id=44056
MWE
We unit-tested our log score implementation against TabPFN's internal implementation and found numerical equivalence between our implementation and criterion.compute_scaled_log_probs(logits), ruling out a log-score-specific problem on our side.
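The shape of that equivalence test can be sketched as follows, with stand-in functions (TabPFN's criterion.compute_scaled_log_probs is not reimplemented here): two independent log-prob implementations over the discretized target bins should agree to floating-point tolerance on the same logits.

```python
import numpy as np

# Sketch of the unit test described above; both functions here are
# hypothetical stand-ins, not TabPFN internals.
def stable_log_probs(logits):
    # numerically stable log-softmax over the bin dimension
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def naive_log_probs(logits):
    # direct log(softmax); adequate for moderate logits
    p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    return np.log(p)

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 16))  # 8 samples, 16 target bins
assert np.allclose(stable_log_probs(logits), naive_log_probs(logits), atol=1e-8)
```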
Below is a minimal working example demonstrating the difference in performance between TabPFN v2.5 and TabPFN v2.6:
Steps/Code to Reproduce
import numpy as np
import openml
import torch
from tabpfn import TabPFNRegressor
# --- SETUP ---
DATASET_ID = 44056
V25, V26 = "v2.5", "v2.6"
MODEL_PATHS = {
    V25: "tabpfn-v2.5-regressor-v2.5_real.ckpt",
    V26: "tabpfn-v2.6-regressor-v2.6_default.ckpt",
}
print(f"Loading dataset {DATASET_ID}...")
dataset = openml.datasets.get_dataset(DATASET_ID)
X, y, _, _ = dataset.get_data(dataset_format="dataframe", target=dataset.default_target_attribute)
# Shuffle before splitting
np.random.seed(42)
perm = np.random.permutation(len(X))
X, y = X.iloc[perm].reset_index(drop=True), np.asarray(y)[perm]
# Keep only numeric columns
numeric_cols = X.select_dtypes(include=[np.number]).columns
X = X[numeric_cols]
# --- DATA QUALITY CHECKS ---
print("\n=== DATA QUALITY CHECKS ===")
print(f"X shape: {X.shape}, y shape: {y.shape}")
print(f"X dtype: {X.dtypes.unique()}, y dtype: {y.dtype}")
print(f"y min: {np.nanmin(y):.6f}, max: {np.nanmax(y):.6f}, mean: {np.nanmean(y):.6f}, std: {np.nanstd(y):.6f}")
print(f"y NaN count: {np.isnan(y).sum()}, Inf count: {np.isinf(y).sum()}")
for col in X.columns:
    x_col = X[col]
    x_arr = x_col.to_numpy(dtype=float)
    nan_count = np.isnan(x_arr).sum()
    print(f"  {col}: min={np.nanmin(x_arr):.6f}, max={np.nanmax(x_arr):.6f}, NaN={nan_count}")
n_train = 2400
X_test_size = 600
X_train, X_test = X[:n_train], X[n_train:n_train + X_test_size]
y_train, y_test = np.asarray(y[:n_train]), np.asarray(y[n_train:n_train + X_test_size])
results_raw = {}
for name, path in MODEL_PATHS.items():
    print(f"\nEvaluating {name}...")
    model = TabPFNRegressor(device="cuda", model_path=path)
    model.fit(X_train, y_train)
    with torch.no_grad():
        out = model.predict(X_test, output_type="full")
        logits, criterion = out["logits"].cuda(), out["criterion"]
        y_t = torch.as_tensor(y_test, device="cuda", dtype=torch.float32).unsqueeze(-1)
        # Compute NLL loss using criterion.forward()
        nll_vals = criterion.forward(logits, y_t).cpu().numpy().flatten()
    inf_indices = np.where(np.isinf(nll_vals) | np.isnan(nll_vals))[0]
    y_pred = model.predict(X_test)
    residuals = y_test - y_pred
    results_raw[name] = {
        'nll_raw': nll_vals,
        'residuals': residuals, 'inf_indices': inf_indices
    }
def print_table(title, data_dict, nll_key):
    print(f"\n{title}")
    print("-" * 80)
    print(f"{'Metric':<25} | {V25:<12} | {V26:<12} | Change")
    print("-" * 80)
    metrics = [
        ("Mean NLL", lambda d: np.mean(d[nll_key])),
        ("Median NLL", lambda d: np.median(d[nll_key])),
        ("Max NLL", lambda d: np.max(d[nll_key])),
        ("MAE", lambda d: np.mean(np.abs(d['residuals']))),
        ("Inf/NaN Count", lambda d: len(d['inf_indices'])),
    ]
    for label, fn in metrics:
        a, b = fn(data_dict[V25]), fn(data_dict[V26])
        print(f"{label:<25} | {a:<12.4f} | {b:<12.4f} | {b - a:+.4f}")
# --- REPORT 1: ALL SAMPLES ---
print_table("REPORT: ALL SAMPLES", results_raw, 'nll_raw')
# --- REPORT 2: EXCLUDING INFS (UNION MASK) ---
common_mask = np.ones(X_test_size, dtype=bool)
for name in MODEL_PATHS.keys():
    common_mask[results_raw[name]['inf_indices']] = False
results_filtered = {}
for name in MODEL_PATHS.keys():
    results_filtered[name] = {
        k: (v[common_mask] if isinstance(v, np.ndarray) and len(v) == X_test_size else v)
        for k, v in results_raw[name].items()
    }
    results_filtered[name]['inf_indices'] = np.array([])  # Reset for table display
print_table(f"REPORT: EXCLUDING INFS ({np.sum(~common_mask)} samples removed)", results_filtered, 'nll_raw')
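The union-mask filtering used for the second report can be illustrated in isolation (toy values, not from the dataset above): samples whose NLL is inf or NaN under either model are dropped from both before the means are compared.

```python
import numpy as np

# Toy illustration of the union-mask filtering: drop samples that are
# non-finite under *either* model, then compare means on the remainder.
nll_a = np.array([0.5, 1.0, 2.0, 0.8])
nll_b = np.array([0.7, np.inf, 1.5, np.nan])
mask = np.isfinite(nll_a) & np.isfinite(nll_b)  # keeps samples 0 and 2
print(nll_a[mask].mean())  # 1.25
print(nll_b[mask].mean())  # 1.1
```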
Expected Results
Roughly similar NLL values (or better)
Actual Results
python debug_tabpfn_versions.py
Loading dataset 44056...
=== DATA QUALITY CHECKS ===
X shape: (8641, 3), y shape: (8641,)
X dtype: [dtype('float64')], y dtype: uint8
y min: 1.000000, max: 40.000000, mean: 16.863326, std: 12.384738
y NaN count: 0, Inf count: 0
northing: min=-0.010000, max=3.806000, NaN=0
easting: min=-0.004000, max=1.560000, NaN=0
resistivity: min=0.890000, max=166.010000, NaN=0
Evaluating v2.5...
Evaluating v2.6...
REPORT: ALL SAMPLES
--------------------------------------------------------------------------------
Metric | v2.5 | v2.6 | Change
--------------------------------------------------------------------------------
Mean NLL | -1.0728 | inf | +inf
Median NLL | -1.2498 | 12.6195 | +13.8693
Max NLL | 4.3556 | inf | +inf
MAE | 1.2312 | 0.8073 | -0.4239
Inf/NaN Count | 0.0000 | 217.0000 | +217.0000
REPORT: EXCLUDING INFS (217 samples removed)
--------------------------------------------------------------------------------
Metric | v2.5 | v2.6 | Change
--------------------------------------------------------------------------------
Mean NLL | -1.0275 | 6.6914 | +7.7189
Median NLL | -1.2132 | 7.8900 | +9.1032
Max NLL | 3.5921 | 16.1006 | +12.5085
MAE | 1.6193 | 1.0045 | -0.6148
Inf/NaN Count | 0.0000 | 0.0000 | +0.0000
Versions
PyTorch version: 2.9.1+cu128
CUDA used to build PyTorch: 12.8
Dependency Versions:
--------------------
tabpfn: 7.0.1
torch: 2.9.1
numpy: 2.1.3
scipy: 1.15.3
pandas: 2.3.3
scikit-learn: 1.6.1
typing_extensions: 4.15.0
einops: 0.8.2
huggingface-hub: 0.36.2