
RFX-Fuse API Reference

Complete API documentation for RFX-Fuse: Breiman and Cutler's Unified ML Engine.


Table of Contents

  • RandomForestClassifier
  • RandomForestRegressor
  • RandomForestUnsupervised
  • Model Persistence
  • Imputation
  • Visualization
  • Utility Functions
  • Data Loading
  • Quantization Modes (QLoRA)

RandomForestClassifier

rfx.RandomForestClassifier(
    ntree=100,
    mtry=0,
    maxcat=10,
    maxnode=0,
    minndsize=1,
    nodesize=5,
    iseed=12345,
    compute_proximity=False,
    compute_importance=True,
    compute_local_importance=False,
    use_gpu=False,
    use_qlora=False,
    quant_mode="nf4",
    rank=32,
    batch_size=0,
    use_casewise=False,
    use_rfgap=False,
    n_threads_cpu=0,
    show_progress=True
)

Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| ntree | int | 100 | Number of trees in the forest |
| mtry | int | 0 | Features to consider at each split; 0 = auto (sqrt(n_features)) |
| maxcat | int | 10 | Maximum categories for categorical variables |
| maxnode | int | 0 | Maximum nodes per tree; 0 = unlimited |
| minndsize | int | 1 | Minimum node size for splitting |
| nodesize | int | 5 | Minimum terminal node size |
| iseed | int | 12345 | Random seed for reproducibility |
| compute_proximity | bool | False | Compute sample proximity matrix |
| compute_importance | bool | True | Compute overall feature importance |
| compute_local_importance | bool | False | Compute per-sample feature importance |
| use_gpu | bool | False | Enable CUDA GPU acceleration |
| use_qlora | bool | False | Enable QLoRA low-rank proximity compression |
| quant_mode | str | "nf4" | Quantization mode: "int8", "nf4", "fp16", "fp32" |
| rank | int | 32 | Low-rank dimension for QLoRA compression |
| batch_size | int | 0 | GPU batch size; 0 = auto |
| use_casewise | bool | False | Use case-wise (bootstrap frequency) weighting |
| use_rfgap | bool | False | Use RF-GAP proximity normalization |
| n_threads_cpu | int | 0 | CPU threads; 0 = auto |
| show_progress | bool | True | Show training progress bar |
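
A minimal end-to-end sketch, assuming the package imports as rfx and using the bundled Wine dataset (see Data Loading); the non-default values are illustrative:

import rfx

X, y = rfx.load_wine()  # bundled UCI Wine dataset

model = rfx.RandomForestClassifier(
    ntree=200,               # more trees than the default 100
    compute_proximity=True,  # needed for the proximity/outlier methods below
    iseed=42,
)
model.fit(X, y)

print("OOB error:", model.get_oob_error())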

Core Methods

fit(X, y)

Train the random forest classifier.

model.fit(X, y)

Returns: self

predict(X)

Predict class labels for samples.

predictions = model.predict(X)

Returns: ndarray of shape (n_samples,) with predicted class labels

predict_proba(X)

Predict class probabilities for samples.

probabilities = model.predict_proba(X)

Returns: ndarray of shape (n_samples, n_classes) with class probabilities

get_oob_error()

Get out-of-bag error rate.

error = model.get_oob_error()

Returns: float, OOB error rate (0.0 to 1.0)

Variable Importance Methods

feature_importances_()

Get overall feature importance scores (permutation importance).

importance = model.feature_importances_()

Returns: ndarray of shape (n_features,)

get_local_importance()

Get per-sample feature importance matrix.

local_imp = model.get_local_importance()
# local_imp[i, j] = importance of feature j for sample i

Returns: ndarray of shape (n_samples, n_features)

Note: Requires compute_local_importance=True during training.
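
A sketch that ranks each sample's most influential features from the local importance matrix; numpy is assumed available:

import numpy as np

local_imp = model.get_local_importance()  # (n_samples, n_features)

# Top-3 features for sample 0, highest local importance first
top3 = np.argsort(local_imp[0])[::-1][:3]
print("Most important features for sample 0:", top3)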

Proximity Importance Methods (Novel)

get_proximity_importance()

Get local proximity importance matrix.

prox_imp = model.get_proximity_importance()
# prox_imp[i, k] = feature k's contribution to sample i's similarity

Returns: ndarray of shape (n_samples, n_features)

Note: Requires compute_proximity=True during training.

get_local_proximity_importance()

Alias for get_proximity_importance().

get_overall_proximity_importance()

Get overall proximity importance vector (mean across all samples).

overall_prox_imp = model.get_overall_proximity_importance()

Returns: ndarray of shape (n_features,)
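
A sketch comparing the permutation-based and proximity-based rankings; features where the two diverge drive similarity structure more than predictive accuracy (or vice versa):

import numpy as np

perm_imp = model.feature_importances_()
prox_imp = model.get_overall_proximity_importance()

print("Top-5 by permutation importance:", np.argsort(perm_imp)[::-1][:5])
print("Top-5 by proximity importance: ", np.argsort(prox_imp)[::-1][:5])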

Similarity Search Methods

get_top_k_similar(query_idx, k=10, exclude_self=True)

Get top-K most similar training samples.

indices, scores = model.get_top_k_similar(query_idx, k=10)

Returns: Tuple of (indices, similarity_scores) arrays

get_top_k_similar_with_explanations(sample_idx, k=5, n_explanations=3)

Get top-K similar samples with feature explanations.

result = model.get_top_k_similar_with_explanations(sample_idx, k=5, n_explanations=3)
indices, scores, per_sample_scores, feat_idx, feat_imp = result

Returns: Tuple of:

  • indices: Top-K similar sample indices
  • scores: Similarity scores
  • per_sample_scores: Per-sample proximity importance for query
  • feat_idx: Top feature indices explaining similarity
  • feat_imp: Feature importance values for explanations
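
A sketch that unpacks the tuple and prints one line per explanatory feature; the exact shapes of feat_idx and feat_imp are assumptions here:

result = model.get_top_k_similar_with_explanations(0, k=5, n_explanations=3)
indices, scores, per_sample_scores, feat_idx, feat_imp = result

print("Nearest neighbors of sample 0:", indices, scores)
for f, w in zip(feat_idx, feat_imp):
    print(f"  feature {f} contributes {w:.3f} to the similarity")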

Outlier Detection Methods

compute_outlier_scores(mode="full", n_anchors=100)

Compute Breiman-Cutler outlier scores for all training samples.

scores = model.compute_outlier_scores()
# Scores > 10 typically indicate outliers

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| mode | str | "full" | "full" (exact) or "greedy" (approximate) |
| n_anchors | int | 100 | Anchor points for greedy mode |

Returns: ndarray of normalized outlier scores

compute_outliers(k=10, mode="full", n_anchors=100)

Get top-K outliers.

top_indices, top_scores = model.compute_outliers(k=10)

Returns: Tuple of (indices, scores) for top-K outliers
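
A sketch that flags outliers with the rule-of-thumb threshold noted above; for large datasets, the greedy mode trades exactness for speed:

import numpy as np

scores = model.compute_outlier_scores()  # exact, mode="full"
suspects = np.where(scores > 10)[0]      # heuristic threshold from above
print(f"{len(suspects)} potential outliers:", suspects)

# Approximate variant for large n_samples
scores_fast = model.compute_outlier_scores(mode="greedy", n_anchors=200)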

Proximity Matrix Methods

get_proximity_matrix()

Get full proximity matrix (CPU only, not for QLoRA).

prox = model.get_proximity_matrix()

Returns: ndarray of shape (n_samples, n_samples)

get_lowrank_factors()

Get low-rank proximity factors A and B where P ≈ A @ B.T.

A, B, rank = model.get_lowrank_factors()

Returns: tuple (A, B, rank)

compute_mds_from_factors(k=3)

Compute MDS coordinates from low-rank factors.

mds = model.compute_mds_from_factors(k=3)

Returns: ndarray of shape (n_samples, k)
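
A sketch that plots the 3D embedding colored by class label; matplotlib is an assumption here, not a package dependency:

import matplotlib.pyplot as plt

mds = model.compute_mds_from_factors(k=3)  # (n_samples, 3)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(mds[:, 0], mds[:, 1], mds[:, 2], c=y, s=15)
ax.set_title("Proximity MDS (low-rank factors)")
plt.show()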

Other Methods

get_prototypes(n_prototypes=3)

Get most representative samples in proximity space.

prototypes = model.get_prototypes(n_prototypes=5)
# Returns: [(sample_idx, prototype_score), ...]

get_leaf_assignments()

Get leaf node assignments for all samples in all trees.

leaves = model.get_leaf_assignments()
# Shape: (ntree, nsample)
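
The leaf assignments are enough to recompute a proximity by hand. This sketch uses the textbook Breiman definition, which may differ from the library's case-wise or RF-GAP variants:

leaves = model.get_leaf_assignments()  # (ntree, nsample)

# Classic Breiman proximity of samples i and j: the fraction of trees
# in which both land in the same terminal node
i, j = 0, 1
prox_ij = (leaves[:, i] == leaves[:, j]).mean()
print(f"prox({i}, {j}) = {prox_ij:.3f}")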

save(filepath)

Save trained model to file.

model.save("model.rfx")

RandomForestRegressor

rfx.RandomForestRegressor(
    ntree=100,
    mtry=0,
    maxnode=0,
    minndsize=5,
    nodesize=5,
    iseed=12345,
    compute_proximity=False,
    compute_importance=True,
    compute_local_importance=False,
    use_gpu=False,
    use_qlora=False,
    quant_mode="nf4",
    rank=32,
    batch_size=0,
    n_threads_cpu=0,
    show_progress=True
)

Parameters

Same as RandomForestClassifier, except:

  • No maxcat, use_casewise, or use_rfgap parameters
  • minndsize default is 5 (regression typically needs larger terminal nodes)

Methods

All methods from RandomForestClassifier are available, except:

  • predict_proba() - not applicable for regression

get_oob_error()

For regression, returns OOB Mean Squared Error.

mse = model.get_oob_error()
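
A sketch on synthetic data; the generating function and parameter values are illustrative:

import numpy as np
import rfx

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=500)

reg = rfx.RandomForestRegressor(ntree=200, iseed=42)
reg.fit(X, y)
print("OOB MSE:", reg.get_oob_error())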

RandomForestUnsupervised

Breiman and Cutler's unsupervised mode: train a forest to separate the real data from a synthetic copy built by permuting its features.

rfx.RandomForestUnsupervised(
    ntree=100,
    mtry=0,
    maxcat=10,
    maxnode=0,
    minndsize=1,
    nodesize=5,
    iseed=12345,
    compute_proximity=False,
    compute_importance=True,
    compute_local_importance=False,
    use_gpu=False,
    use_qlora=False,
    quant_mode="nf4",
    rank=32,
    batch_size=0,
    n_threads_cpu=0,
    show_progress=True
)

Core Methods

fit(X)

Train on unlabeled data. Internally creates a synthetic class by permuting features.

model.fit(X)  # No labels needed

predict_proba(X)

Predict probability of being "real" vs "synthetic".

proba = model.predict_proba(X)
# proba[:, 0] = P(synthetic), proba[:, 1] = P(real)

get_oob_error()

The OOB error measures how well the forest distinguishes real from synthetic data:

  • Low error (< 0.3): strong feature dependencies detected
  • High error (≈ 0.5): features are nearly independent
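
A sketch of the typical workflow, using the thresholds above; X is any unlabeled (n_samples, n_features) array:

model = rfx.RandomForestUnsupervised(ntree=500, compute_proximity=True, iseed=42)
model.fit(X)

err = model.get_oob_error()
if err < 0.3:
    print(f"OOB error {err:.2f}: structure detected; proximities are informative")
else:
    print(f"OOB error {err:.2f}: features look nearly independent")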

All Other Methods

Same as RandomForestClassifier:

  • feature_importances_(), get_local_importance()
  • get_proximity_importance(), get_overall_proximity_importance()
  • get_top_k_similar(), get_top_k_similar_with_explanations()
  • compute_outlier_scores(), compute_outliers()
  • get_proximity_matrix(), get_lowrank_factors(), compute_mds_from_factors()
  • save()

Model Persistence

save(filepath)

Save a trained model.

model.save("model.rfx")

rfx.load(filepath)

Load a saved model.

model = rfx.load("model.rfx")
# Works for Classifier, Regressor, or Unsupervised
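
A round-trip sketch; the equality check assumes prediction is deterministic for a fixed trained forest:

model.save("model.rfx")
restored = rfx.load("model.rfx")

# Predictions from the restored model should match the original
assert (restored.predict(X) == model.predict(X)).all()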

Imputation

rfx_impute_rough(X, n_trees=100, n_iterations=5, use_gpu=True, verbose=False, seed=42)

Young-Cutler (2017) RF-based imputation method.

from rfx_impute import rfx_impute_rough

X_imputed, n_iterations = rfx_impute_rough(
    X_missing,
    n_trees=100,
    n_iterations=5,
    use_gpu=True
)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| X | ndarray | required | Data with NaN values |
| n_trees | int | 100 | Trees per iteration |
| n_iterations | int | 5 | Refinement iterations |
| use_gpu | bool | True | GPU acceleration |
| verbose | bool | False | Print progress |
| seed | int | 42 | Random seed |

Returns: Tuple of (X_imputed, n_iterations)
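
A self-contained sketch that hides 10% of the Wine entries and measures reconstruction error on the held-out cells; use_gpu=False keeps it runnable on CPU-only machines:

import numpy as np
import rfx
from rfx_impute import rfx_impute_rough

X, _ = rfx.load_wine()
rng = np.random.default_rng(42)

X_missing = X.astype(float)
mask = rng.random(X_missing.shape) < 0.10  # hide 10% of entries
X_missing[mask] = np.nan

X_imputed, n_iter = rfx_impute_rough(X_missing, n_trees=100, use_gpu=False)
rmse = np.sqrt(np.mean((X_imputed[mask] - X[mask]) ** 2))
print(f"Finished after {n_iter} iterations, held-out RMSE = {rmse:.3f}")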


Visualization

rfx.rfviz()

Generate interactive RFViz visualization with linked brushing.

rfx.rfviz(
    rf_model,
    X,
    y,
    feature_names=None,
    class_names=None,
    n_clusters=3,
    title="RFViz",
    output_file="rfviz.html",
    show_in_browser=True,
    save_html=True,
    mds_k=3
)

Features:

  • 2x2 dashboard layout
  • Input features parallel coordinates
  • Local importance parallel coordinates
  • 3D MDS proximity plot
  • Class votes heatmap
  • Linked brushing across all plots
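
A sketch wiring a trained classifier into the dashboard; proximity and local importance are enabled at fit time since the MDS and local-importance panels draw on them (an assumption based on the feature list above):

import rfx

X, y = rfx.load_wine()
model = rfx.RandomForestClassifier(
    ntree=200,
    compute_proximity=True,
    compute_local_importance=True,
)
model.fit(X, y)

rfx.rfviz(model, X, y,
          class_names=["class_0", "class_1", "class_2"],
          output_file="wine_rfviz.html")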

Utility Functions

rfx.cuda_is_available()

Check if CUDA GPU is available.

if rfx.cuda_is_available():
    print("GPU acceleration available")

rfx.clear_gpu_cache()

Clear GPU memory cache.

rfx.clear_gpu_cache()

rfx.get_gpu_memory_info()

Get GPU memory information.

info = rfx.get_gpu_memory_info()
print(f"Free: {info['free'] / 1e9:.1f} GB")

Data Loading

rfx.load_wine()

Load the UCI Wine dataset.

X, y = rfx.load_wine()
# X: (178, 13), y: (178,) with labels 0, 1, 2

rfx.load_iris()

Load the UCI Iris dataset.

X, y = rfx.load_iris()
# X: (150, 4), y: (150,) with labels 0, 1, 2

Quantization Modes (QLoRA)

| Mode | Bits | Memory | Use Case |
|------|------|--------|----------|
| fp32 | 32 | 1× (baseline) | Debugging |
| fp16 | 16 | 2× reduction | General use |
| int8 | 8 | 4× reduction | Large datasets |
| nf4 | 4 | 8× reduction | Very large datasets |

Note: the constructors default to quant_mode="nf4".
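
A sketch of a QLoRA-compressed proximity run; parameter values are illustrative, and the proximity is read through its low-rank factors since get_proximity_matrix() is unavailable under QLoRA:

model = rfx.RandomForestClassifier(
    ntree=500,
    compute_proximity=True,
    use_gpu=True,
    use_qlora=True,
    quant_mode="nf4",  # 8× memory reduction per the table above
    rank=32,
)
model.fit(X, y)

A, B, rank = model.get_lowrank_factors()  # P ≈ A @ B.T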