Complete API documentation for RFX-Fuse: Breiman and Cutler's Unified ML Engine.
- RandomForestClassifier
- RandomForestRegressor
- RandomForestUnsupervised
- Model Persistence
- Imputation
- Visualization
- Utility Functions
- Data Loading
```python
rfx.RandomForestClassifier(
    ntree=100,
    mtry=0,
    maxcat=10,
    maxnode=0,
    minndsize=1,
    nodesize=5,
    iseed=12345,
    compute_proximity=False,
    compute_importance=True,
    compute_local_importance=False,
    use_gpu=False,
    use_qlora=False,
    quant_mode="nf4",
    rank=32,
    batch_size=0,
    use_casewise=False,
    use_rfgap=False,
    n_threads_cpu=0,
    show_progress=True
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| ntree | int | 100 | Number of trees in the forest |
| mtry | int | 0 | Features to consider at each split. 0 = auto (sqrt(n_features)) |
| maxcat | int | 10 | Maximum categories for categorical variables |
| maxnode | int | 0 | Maximum nodes per tree. 0 = unlimited |
| minndsize | int | 1 | Minimum node size for splitting |
| nodesize | int | 5 | Minimum terminal node size |
| iseed | int | 12345 | Random seed for reproducibility |
| compute_proximity | bool | False | Compute sample proximity matrix |
| compute_importance | bool | True | Compute overall feature importance |
| compute_local_importance | bool | False | Compute per-sample feature importance |
| use_gpu | bool | False | Enable CUDA GPU acceleration |
| use_qlora | bool | False | Enable QLoRA low-rank proximity compression |
| quant_mode | str | "nf4" | Quantization mode: "int8", "nf4", "fp16", "fp32" |
| rank | int | 32 | Low-rank dimension for QLoRA compression |
| batch_size | int | 0 | GPU batch size. 0 = auto |
| use_casewise | bool | False | Use case-wise (bootstrap frequency) weighting |
| use_rfgap | bool | False | Use RF-GAP proximity normalization |
| n_threads_cpu | int | 0 | CPU threads. 0 = auto |
| show_progress | bool | True | Show training progress bar |
Train the random forest classifier.
```python
model.fit(X, y)
```
Returns: self
Predict class labels for samples.
```python
predictions = model.predict(X)
```
Returns: ndarray of shape (n_samples,) with predicted class labels
Predict class probabilities for samples.
```python
probabilities = model.predict_proba(X)
```
Returns: ndarray of shape (n_samples, n_classes) with class probabilities
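In a random forest, class probabilities are typically the fraction of trees voting for each class. A minimal pure-Python sketch of that aggregation (the per-tree votes here are hypothetical stand-ins for the ensemble's predictions, not the library's internals):

```python
from collections import Counter

def vote_probabilities(tree_votes, n_classes):
    """Convert per-tree class votes for one sample into class probabilities."""
    counts = Counter(tree_votes)
    return [counts.get(c, 0) / len(tree_votes) for c in range(n_classes)]

# 10 trees voting on one sample: 7 vote class 0, 2 vote class 1, 1 votes class 2
probs = vote_probabilities([0] * 7 + [1] * 2 + [2], n_classes=3)
print(probs)  # [0.7, 0.2, 0.1]
```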
Get out-of-bag error rate.
```python
error = model.get_oob_error()
```
Returns: float, OOB error rate (0.0 to 1.0)
Get overall feature importance scores (permutation importance).
```python
importance = model.feature_importances_()
```
Returns: ndarray of shape (n_features,)
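Permutation importance measures how much accuracy drops when one feature's values are shuffled, breaking its relationship to the target. A self-contained sketch with a toy scoring function (illustrative only, not the library's OOB-based implementation):

```python
import random

def permutation_importance(predict, X, y, n_features, seed=0):
    """Importance of feature j = baseline accuracy - accuracy after shuffling column j."""
    rng = random.Random(seed)
    baseline = sum(predict(row) == label for row, label in zip(X, y)) / len(y)
    importances = []
    for j in range(n_features):
        shuffled_col = [row[j] for row in X]
        rng.shuffle(shuffled_col)
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, shuffled_col)]
        acc = sum(predict(row) == label for row, label in zip(X_perm, y)) / len(y)
        importances.append(baseline - acc)
    return importances

# Toy model that only looks at feature 0, so feature 1 should score zero.
X = [[0, 5], [1, 3], [0, 8], [1, 1]]
y = [0, 1, 0, 1]
imp = permutation_importance(lambda row: row[0], X, y, n_features=2)
```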
Get per-sample feature importance matrix.
```python
local_imp = model.get_local_importance()
# local_imp[i, j] = importance of feature j for sample i
```
Returns: ndarray of shape (n_samples, n_features)
Note: Requires compute_local_importance=True during training.
Get local proximity importance matrix.
```python
prox_imp = model.get_proximity_importance()
# prox_imp[i, k] = feature k's contribution to sample i's similarity
```
Returns: ndarray of shape (n_samples, n_features)
Note: Requires compute_proximity=True during training.
Alias for get_proximity_importance().
Get overall proximity importance vector (mean across all samples).
```python
overall_prox_imp = model.get_overall_proximity_importance()
```
Returns: ndarray of shape (n_features,)
Get top-K most similar training samples.
```python
indices, scores = model.get_top_k_similar(query_idx, k=10)
```
Returns: Tuple of (indices, similarity_scores) arrays
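Given any proximity matrix, top-K retrieval amounts to sorting the query sample's proximity row. A sketch of that idea (not the library's internals, which may avoid materializing the full matrix):

```python
def top_k_similar(prox_row, query_idx, k):
    """Return indices and scores of the k samples most similar to the query, excluding itself."""
    ranked = sorted(
        (i for i in range(len(prox_row)) if i != query_idx),
        key=lambda i: prox_row[i],
        reverse=True,
    )[:k]
    return ranked, [prox_row[i] for i in ranked]

row = [1.0, 0.2, 0.9, 0.5]  # proximities of sample 0 to all samples
indices, scores = top_k_similar(row, query_idx=0, k=2)
print(indices, scores)  # [2, 3] [0.9, 0.5]
```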
Get top-K similar samples with feature explanations.
```python
result = model.get_top_k_similar_with_explanations(sample_idx, k=5, n_explanations=3)
indices, scores, per_sample_scores, feat_idx, feat_imp = result
```
Returns: Tuple of:
- indices: Top-K similar sample indices
- scores: Similarity scores
- per_sample_scores: Per-sample proximity importance for the query
- feat_idx: Top feature indices explaining similarity
- feat_imp: Feature importance values for explanations
Compute Breiman-Cutler outlier scores for all training samples.
```python
scores = model.compute_outlier_scores()
# Scores > 10 typically indicate outliers
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| mode | str | "full" | "full" (exact) or "greedy" (approximate) |
| n_anchors | int | 100 | Anchor points for greedy mode |
Returns: ndarray of normalized outlier scores
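The Breiman-Cutler outlier measure defines raw outlierness as n divided by the sum of squared within-class proximities, then normalizes by the class median and median absolute deviation. A pure-Python sketch of that recipe (the library's exact normalization and handling of the self-proximity term may differ):

```python
def outlier_scores(prox, labels):
    """Breiman-Cutler style outlier measure from a proximity matrix.

    raw[i] = n / sum of prox(i, j)^2 over same-class j (excluding i itself),
    then normalized per class by (raw - median) / median_absolute_deviation.
    """
    n = len(labels)
    raw = []
    for i in range(n):
        s = sum(prox[i][j] ** 2 for j in range(n) if labels[j] == labels[i] and j != i)
        raw.append(n / s if s > 0 else float("inf"))
    scores = []
    for i in range(n):
        cls = sorted(raw[j] for j in range(n) if labels[j] == labels[i])
        med = cls[len(cls) // 2]
        mad = sorted(abs(v - med) for v in cls)[len(cls) // 2] or 1.0
        scores.append((raw[i] - med) / mad)
    return scores

# Sample 3 has low proximity to everything else, so it scores as an outlier.
prox = [[1.0, 0.9, 0.8, 0.1],
        [0.9, 1.0, 0.85, 0.1],
        [0.8, 0.85, 1.0, 0.1],
        [0.1, 0.1, 0.1, 1.0]]
scores = outlier_scores(prox, [0, 0, 0, 0])
```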
Get top-K outliers.
```python
top_indices, top_scores = model.compute_outliers(k=10)
```
Returns: Tuple of (indices, scores) for top-K outliers
Get full proximity matrix (CPU only, not for QLoRA).
```python
prox = model.get_proximity_matrix()
```
Returns: ndarray of shape (n_samples, n_samples)
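Random forest proximity between two samples is the fraction of trees in which they land in the same terminal node. A minimal sketch computing it from leaf assignments of shape (ntree, nsample), as returned by get_leaf_assignments() (a naive O(ntree * n^2) version for illustration):

```python
def proximity_matrix(leaves):
    """leaves[t][i] = leaf id of sample i in tree t.
    Proximity(i, j) = fraction of trees where i and j share a leaf."""
    ntree, nsample = len(leaves), len(leaves[0])
    prox = [[0.0] * nsample for _ in range(nsample)]
    for i in range(nsample):
        for j in range(nsample):
            same = sum(leaves[t][i] == leaves[t][j] for t in range(ntree))
            prox[i][j] = same / ntree
    return prox

leaves = [[0, 0, 1],   # tree 0: samples 0 and 1 share a leaf
          [2, 3, 2]]   # tree 1: samples 0 and 2 share a leaf
prox = proximity_matrix(leaves)
print(prox[0][1], prox[0][2])  # 0.5 0.5
```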
Get low-rank proximity factors A and B where P ≈ A @ B.T.
```python
A, B, rank = model.get_lowrank_factors()
```
Returns: tuple (A, B, rank)
Compute MDS coordinates from low-rank factors.
```python
mds = model.compute_mds_from_factors(k=3)
```
Returns: ndarray of shape (n_samples, k)
Get most representative samples in proximity space.
```python
prototypes = model.get_prototypes(n_prototypes=5)
# Returns: [(sample_idx, prototype_score), ...]
```
Get leaf node assignments for all samples in all trees.
```python
leaves = model.get_leaf_assignments()
# Shape: (ntree, nsample)
```
Save trained model to file.
```python
model.save("model.rfx")
```

```python
rfx.RandomForestRegressor(
    ntree=100,
    mtry=0,
    maxnode=0,
    minndsize=5,
    nodesize=5,
    iseed=12345,
    compute_proximity=False,
    compute_importance=True,
    compute_local_importance=False,
    use_gpu=False,
    use_qlora=False,
    quant_mode="nf4",
    rank=32,
    batch_size=0,
    n_threads_cpu=0,
    show_progress=True
)
```
Same as RandomForestClassifier, except:
- No nclass parameter
- minndsize default is 5 (regression typically needs larger nodes)
All methods from RandomForestClassifier are available, except:
- predict_proba(): not applicable for regression
For regression, returns OOB Mean Squared Error.
```python
mse = model.get_oob_error()
```
Breiman and Cutler's unsupervised learning: classify real vs. synthetic permuted data.
```python
rfx.RandomForestUnsupervised(
    ntree=100,
    mtry=0,
    maxcat=10,
    maxnode=0,
    minndsize=1,
    nodesize=5,
    iseed=12345,
    compute_proximity=False,
    compute_importance=True,
    compute_local_importance=False,
    use_gpu=False,
    use_qlora=False,
    quant_mode="nf4",
    rank=32,
    batch_size=0,
    n_threads_cpu=0,
    show_progress=True
)
```
Train on unlabeled data. Internally creates a synthetic class by permuting features.
```python
model.fit(X)  # No labels needed
```
Predict probability of being "real" vs "synthetic".
```python
proba = model.predict_proba(X)
# proba[:, 0] = P(synthetic), proba[:, 1] = P(real)
```
OOB error indicates the ability to distinguish real from synthetic data.
- Low error (< 0.3): Strong feature dependencies detected
- High error (~ 0.5): Features are nearly independent
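The synthetic class described above is built by permuting each feature column independently, which preserves each feature's marginal distribution but destroys dependencies between features. A sketch of that construction (illustrative, not the package's code):

```python
import random

def make_synthetic(X, seed=0):
    """Permute each column of X independently, breaking cross-feature dependencies."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    cols = []
    for j in range(n_cols):
        col = [X[i][j] for i in range(n_rows)]
        rng.shuffle(col)
        cols.append(col)
    # Reassemble rows from the independently shuffled columns
    return [[cols[j][i] for j in range(n_cols)] for i in range(n_rows)]

X = [[1, 10], [2, 20], [3, 30]]
X_syn = make_synthetic(X)
# Each column still holds the same values, just reordered
```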
Same as RandomForestClassifier:
- feature_importances_(), get_local_importance()
- get_proximity_importance(), get_overall_proximity_importance()
- get_top_k_similar(), get_top_k_similar_with_explanations()
- compute_outlier_scores(), compute_outliers()
- get_proximity_matrix(), get_lowrank_factors(), compute_mds_from_factors()
- save()
Save a trained model.
```python
model.save("model.rfx")
```
Load a saved model.
```python
model = rfx.load("model.rfx")
# Works for Classifier, Regressor, or Unsupervised
```
Young-Cutler (2017) RF-based imputation method.
```python
from rfx_impute import rfx_impute_rough

X_imputed, n_iterations = rfx_impute_rough(
    X_missing,
    n_trees=100,
    n_iterations=5,
    use_gpu=True
)
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| X | ndarray | required | Data with NaN values |
| n_trees | int | 100 | Trees per iteration |
| n_iterations | int | 5 | Refinement iterations |
| use_gpu | bool | True | GPU acceleration |
| verbose | bool | False | Print progress |
| seed | int | 42 | Random seed |

Returns: Tuple of (X_imputed, n_iterations)
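This style of RF imputation starts from a rough fill (column medians for numeric features) and then refines the missing entries over several iterations using proximities from a forest fit on the current fill. The initial rough-fill step can be sketched as follows (a simplified numeric-only version, not the package's implementation):

```python
import math

def rough_fill(X):
    """Initial imputation: replace each NaN with its column's median of observed values."""
    n_rows, n_cols = len(X), len(X[0])
    filled = [row[:] for row in X]
    for j in range(n_cols):
        observed = sorted(X[i][j] for i in range(n_rows) if not math.isnan(X[i][j]))
        median = observed[len(observed) // 2]
        for i in range(n_rows):
            if math.isnan(filled[i][j]):
                filled[i][j] = median
    return filled

nan = float("nan")
X = [[1.0, 5.0], [2.0, nan], [3.0, 7.0]]
print(rough_fill(X))  # [[1.0, 5.0], [2.0, 7.0], [3.0, 7.0]]
```

Later iterations would replace each rough-filled value with a proximity-weighted average over samples where the value was observed.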
Generate interactive RFViz visualization with linked brushing.
```python
rfx.rfviz(
    rf_model,
    X,
    y,
    feature_names=None,
    class_names=None,
    n_clusters=3,
    title="RFViz",
    output_file="rfviz.html",
    show_in_browser=True,
    save_html=True,
    mds_k=3
)
```
Features:
- 2x2 dashboard layout
- Input features parallel coordinates
- Local importance parallel coordinates
- 3D MDS proximity plot
- Class votes heatmap
- Linked brushing across all plots
Check if CUDA GPU is available.
```python
if rfx.cuda_is_available():
    print("GPU acceleration available")
```
Clear GPU memory cache.
```python
rfx.clear_gpu_cache()
```
Get GPU memory information.
```python
info = rfx.get_gpu_memory_info()
print(f"Free: {info['free'] / 1e9:.1f} GB")
```
Load the UCI Wine dataset.
```python
X, y = rfx.load_wine()
# X: (178, 13), y: (178,) with labels 0, 1, 2
```
Load the UCI Iris dataset.
```python
X, y = rfx.load_iris()
# X: (150, 4), y: (150,) with labels 0, 1, 2
```

| Mode | Bits | Memory | Use Case |
|---|---|---|---|
| fp32 | 32 | 1x | Debugging |
| fp16 | 16 | 2x reduction | Default |
| int8 | 8 | 4x reduction | Large datasets |
| nf4 | 4 | 8x reduction | Very large datasets |
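The memory ratios in the table follow directly from the bit widths: a dense n x n proximity matrix at b bits needs n² * b / 8 bytes (ignoring any quantization metadata or block-scale overhead). A quick check of the ratios:

```python
def prox_matrix_bytes(n_samples, bits):
    """Storage for a dense n x n proximity matrix at the given bit width."""
    return n_samples * n_samples * bits // 8

n = 100_000
fp32 = prox_matrix_bytes(n, 32)
nf4 = prox_matrix_bytes(n, 4)
print(fp32 / 1e9, fp32 // nf4)  # 40.0 GB at fp32; nf4 is 8x smaller
```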