PhenoCluster

A flexible data-driven framework for identifying clinical phenotypes using latent class and profile analysis

Overview

PhenoCluster is a Python framework for unsupervised discovery of clinical phenotypes from heterogeneous patient data. It implements an end-to-end pipeline: from data preprocessing and latent class identification to outcome association analysis, survival modelling, and multistate transition modelling.

The framework is domain-agnostic and can be applied to any clinical cohort study where the goal is to identify latent patient subgroups and characterise their relationship with clinical outcomes. Users supply a dataset and a YAML configuration file; PhenoCluster handles model selection, phenotype assignment, and downstream inference automatically.

Key capabilities

Latent Class / Profile Analysis via the StepMix framework with native support for mixed continuous/categorical data and missing values
Automatic model selection using information criteria (BIC, AIC, ICL, CAIC, SABIC) with configurable cluster-size constraints
Classification quality assessment with per-phenotype Average Posterior Probability (AvePP) and assignment confidence metrics
Outcome association analysis with logistic regression yielding odds ratios, confidence intervals, and FDR-corrected p-values
Survival analysis with Cox proportional hazards models producing hazard ratios and log-rank tests
Multistate modelling with transition-specific Cox PH analysis, Monte Carlo simulation for state occupation probabilities with confidence interval bands, and clinical pathway enumeration
Comprehensive output including an interactive HTML report, forest plots with confidence intervals, Kaplan-Meier and Nelson-Aalen curves, heatmaps, and JSON/CSV data exports
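
As an illustration of the classification quality metric mentioned above, the following is a minimal sketch of how a per-phenotype Average Posterior Probability can be computed from a matrix of posterior class probabilities. It uses the usual definition (mean of the highest posterior among patients assigned to each class); the function name and the example values are hypothetical, and this is not taken from PhenoCluster's own code.

```python
from collections import defaultdict


def average_posterior_probabilities(posteriors):
    """Return {class_index: AvePP} from rows of posterior class probabilities.

    Each row holds one patient's posterior probabilities over the K classes.
    A patient is assigned to the class with the highest posterior (modal
    assignment), and AvePP for a class is the mean of that highest posterior
    over the patients assigned to it.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in posteriors:
        assigned = max(range(len(row)), key=row.__getitem__)
        totals[assigned] += row[assigned]
        counts[assigned] += 1
    return {k: totals[k] / counts[k] for k in sorted(counts)}


# Hypothetical posteriors for four patients and two classes.
posteriors = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.30, 0.70],
    [0.45, 0.55],
]
avepp = average_posterior_probabilities(posteriors)
# Class 0 averages roughly 0.85 and class 1 roughly 0.625.
```

Values close to 1 indicate that patients are assigned to that phenotype with little ambiguity.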

Installation

Requires Python ≥ 3.11

From PyPI

pip install phenocluster

From source

git clone https://github.com/EttoreRocchi/phenocluster.git
cd phenocluster
pip install -e ".[dev]"

Quick start

1. Generate a configuration file

phenocluster create-config -p complete -o config.yaml

2. Edit the configuration

Open config.yaml and fill in your dataset-specific parameters:

global:
  project_name: "My Study"
  output_dir: "results"
  random_state: 42

data:
  continuous_columns:
    - age
    - bmi
    - lab_value_1
  categorical_columns:
    - sex
    - smoking_status
    - disease_stage
  split:
    test_size: 0.2

outcome:
  enabled: true
  outcome_columns:
    - mortality_30d
    - readmission_30d

survival:
  enabled: true
  targets:
    - name: "overall_survival"
      time_column: "time_to_death"
      event_column: "death_indicator"
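
Before running the pipeline it can help to confirm that every column named in the configuration exists in the dataset. The sketch below does this with the standard library only. It is independent of `phenocluster validate-config`; the helper names are hypothetical, the configuration is written as a plain dict rather than parsed from YAML, and the CSV header is made up for the example.

```python
import csv
import io


def referenced_columns(config):
    """Collect every column name the configuration refers to."""
    data = config.get("data", {})
    columns = list(data.get("continuous_columns", []))
    columns += data.get("categorical_columns", [])
    outcome = config.get("outcome", {})
    if outcome.get("enabled"):
        columns += outcome.get("outcome_columns", [])
    survival = config.get("survival", {})
    if survival.get("enabled"):
        for target in survival.get("targets", []):
            columns += [target["time_column"], target["event_column"]]
    return columns


def missing_columns(config, csv_text):
    """Return configured columns that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    present = set(header)
    return [name for name in referenced_columns(config) if name not in present]


# The configuration shown above, written as a plain dict.
config = {
    "data": {
        "continuous_columns": ["age", "bmi", "lab_value_1"],
        "categorical_columns": ["sex", "smoking_status", "disease_stage"],
    },
    "outcome": {
        "enabled": True,
        "outcome_columns": ["mortality_30d", "readmission_30d"],
    },
    "survival": {
        "enabled": True,
        "targets": [
            {
                "name": "overall_survival",
                "time_column": "time_to_death",
                "event_column": "death_indicator",
            }
        ],
    },
}

# Hypothetical CSV header that lacks two of the configured columns.
csv_text = (
    "age,bmi,lab_value_1,sex,smoking_status,"
    "mortality_30d,readmission_30d,time_to_death\n"
)
missing = missing_columns(config, csv_text)
print(missing)
```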

3. Run the pipeline

phenocluster run -d data.csv -c config.yaml
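
When the pipeline is launched from a script rather than a shell, the same command can be assembled as an argument list. This is a small sketch using only the flags listed in this README (`-d`, `-c`, `--force-rerun`); the helper name is hypothetical and the exact effect of `--force-rerun` is not described here, so consult the documentation before relying on it.

```python
import shlex


def build_run_command(data_path, config_path, force_rerun=False):
    """Build the argument list for `phenocluster run`."""
    command = ["phenocluster", "run", "-d", str(data_path), "-c", str(config_path)]
    if force_rerun:
        command.append("--force-rerun")
    return command


command = build_run_command("data.csv", "config.yaml", force_rerun=True)
print(shlex.join(command))
# To execute it, pass the list to subprocess.run(command, check=True).
```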

4. Inspect results

Results are written to the output directory (default: results/):

| File | Description |
| --- | --- |
| `analysis_report.html` | Comprehensive HTML report with all results and visualisations |
| `cluster_statistics.json` | Phenotype sizes, feature distributions, and classification quality |
| `outcome_results.json` | Odds ratios with confidence intervals and p-values |
| `survival_results.json` | Kaplan-Meier estimates and Cox PH hazard ratios |
| `multistate_results.json` | Transition-specific hazard ratios, pathways, and state occupation |
| `data/model_fit_metrics.csv` | Information criteria, entropy, and average posterior probabilities |
| `data/phenotypes_data.csv` | Original data augmented with phenotype assignments |
| `data/posterior_probabilities.csv` | Posterior class membership probabilities |
| `results/model_selection_summary.json` | Model selection comparison table and best model info |
| `results/feature_importance.json` | Feature characterisation per phenotype |
| `results/validation_report.json` | Internal validation metrics (train/test comparison) |
| `results/stability_results.json` | Consensus clustering stability metrics |
| `results/split_info.json` | Train/test split details |
| `results/external_validation_results.json` | External validation results (when enabled) |
| `phenocluster.log` | Pipeline execution log |
| `artifacts/` | Cached intermediate results for incremental re-runs |
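
A quick way to see what a run produced is to check which of the listed files exist. The sketch below covers a subset of the file names from the table above; the function name is hypothetical, and since some analyses are optional, a missing file does not necessarily indicate a failure.

```python
from pathlib import Path

# A subset of the file names listed above; optional analyses may not produce all of them.
EXPECTED_OUTPUTS = [
    "analysis_report.html",
    "cluster_statistics.json",
    "outcome_results.json",
    "survival_results.json",
    "multistate_results.json",
    "data/model_fit_metrics.csv",
    "data/phenotypes_data.csv",
    "data/posterior_probabilities.csv",
    "phenocluster.log",
]


def summarise_outputs(output_dir):
    """Map each expected relative path to True if it exists under output_dir."""
    root = Path(output_dir)
    return {name: (root / name).is_file() for name in EXPECTED_OUTPUTS}


if __name__ == "__main__":
    # "results" is the default output directory named in this README.
    for name, found in summarise_outputs("results").items():
        print(f"{'found  ' if found else 'missing'}  {name}")
```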

Pipeline overview

PhenoCluster executes the following stages in order:

Data quality assessment. Missingness patterns, correlations, variance, and MCAR testing.
Train/test split. Stratified splitting with configurable test size, performed before preprocessing to prevent data leakage.
Preprocessing. Imputation, outlier handling, categorical encoding, standardisation, and feature selection, all fitted on the training data only and then applied to the test set.
Model selection. Cross-validated information criterion search over cluster counts (training set only).
Full-cohort refit. Once K is selected, the preprocessing steps and the LCA/LPA model are refitted on the entire cohort, and phenotypes are reordered by size (largest = Phenotype 0).
Stability analysis. Consensus clustering over subsampled runs.
Internal validation. Train/test log-likelihood comparison, cluster proportion stability, and outcome OR consistency.
Outcome association. Logistic regression for binary outcomes with FDR-corrected p-values (optional).
Survival analysis. Kaplan-Meier curves, Nelson-Aalen estimators, log-rank tests, and Cox PH hazard ratios (optional).
Multistate modelling. Transition-specific Cox PH models, transition hazard ratios, and Monte Carlo simulation (optional).
Report generation. Interactive HTML report with all figures and tables.
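
The ordering of the split and preprocessing stages matters: if scaling parameters were estimated on the full cohort, information from the test set would leak into the training data. The following is a minimal sketch of the idea for a single continuous feature. It uses a plain random split rather than the stratified split described above, the function names and example values are hypothetical, and it is not PhenoCluster's implementation.

```python
import random
import statistics


def split_train_test(rows, test_size, seed):
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


def fit_scaler(values):
    """Estimate mean and sample standard deviation from the given values."""
    return statistics.fmean(values), statistics.stdev(values)


def apply_scaler(values, mean, std):
    """Standardise values with parameters estimated elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical ages for ten patients.
ages = [34.0, 51.0, 47.0, 62.0, 29.0, 75.0, 58.0, 41.0, 66.0, 53.0]
train, test = split_train_test(ages, test_size=0.2, seed=42)

# Parameters come from the training set only; the test set never influences them.
mean, std = fit_scaler(train)
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

The test values are transformed with the training mean and standard deviation, so their scaled mean is generally not exactly zero, which is the expected behaviour.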

CLI reference

| Command | Description |
| --- | --- |
| `phenocluster run -d DATA -c CONFIG [--force-rerun]` | Run the full pipeline |
| `phenocluster create-config [-p PROFILE] [-o OUTPUT]` | Generate a config YAML from a profile template |
| `phenocluster validate-config -c CONFIG [-d DATA]` | Validate config structure; cross-check columns against data |
| `phenocluster version` | Show version, repository link, and documentation link |

Configuration profiles

Profiles set sensible defaults for common use cases. Generate one with `phenocluster create-config -p <profile>`:

| Profile | Description | Inference | Stability | Multistate |
| --- | --- | --- | --- | --- |
| `descriptive` | Phenotype discovery only, no statistical inference | off | on | off |
| `complete` | All analyses enabled (outcomes, survival, multistate) | on | on | on |
| `quick` | Fast iteration for development | on | off | off |

Configuration reference

See the full Configuration Reference in the documentation.

Documentation

Full documentation (statistical methods, configuration reference, output descriptions) is available at ettorerocchi.github.io/phenocluster.

Testing

pip install -e ".[dev]"
pytest tests/ -v

License

This project is licensed under the MIT License.

Citation

If you use PhenoCluster in your research, please cite:

Acknowledgment

This project relies on StepMix, a Python package for pseudo-likelihood estimation of generalized mixture models with external variables. We thank the authors for making their work openly available.

If you use this framework, please also cite:

Morin, S., Legault, R., Laliberté, F., Bakk, Z., Giguère, C.-É., de la Sablonnière, R., & Lacourse, É. (2025). StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables. Journal of Statistical Software, 113(8), 1-39. doi: 10.18637/jss.v113.i08
