Practical Phase Laszlo Lang

Subject: Robustness of machine learning against batch effects in RNA-seq data
Supervised by Dr. Maximilien Sprang - The Mayer Lab Mainz
TH-Bingen - Bachelor's Degree Program "Angewandte Bioinformatik" - 2025

RNA sequencing (RNA-seq) enables the systematic study of gene expression and provides insights into biological processes at the molecular level. However, RNA-seq data are often affected by batch effects. These arise from systematic technical variation, for example, due to sequencing runs, sample preparation, or laboratory conditions. Such variation can obscure true biological signals and complicate downstream analyses. Machine learning (ML) methods are powerful tools for detecting complex patterns in high-dimensional data. Yet, in the context of RNA-seq, it remains unclear whether ML models capture genuine biological variation or adapt to technical artifacts.
In this project, we investigate the robustness of selected ML methods against batch effects in RNA-seq data. Classical differential expression analysis is compared with ML-based approaches, and the resulting gene lists are interpreted in biological context using over-representation analysis (ORA) and correlation analysis. This allows us to highlight differences between methods and assess which biological pathways remain consistently detectable despite batch effects.

Python

Packages

os
time
pandas
seaborn
matplotlib.pyplot
numpy
plotly.express
functools → reduce
sklearn.decomposition → PCA
sklearn.preprocessing → StandardScaler
sklearn.linear_model → LogisticRegression
sklearn.ensemble → RandomForestClassifier
sklearn.model_selection → train_test_split
sklearn.metrics → confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score, f1_score
sklearn.inspection → permutation_importance
sklearn.tree → DecisionTreeClassifier, plot_tree
typing → Literal
scipy.stats → zscore
collections → Counter

Functions to prepare the DataFrame

prepare_dataframe

Input: The dataframe to be transformed
Output: A dataframe containing log2-transformed expression data
Code: Click
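
As an illustration of the log2 transformation described above, here is a minimal sketch. It is not the repository's implementation: the function name `log2_transform_sketch`, the pseudocount of 1 (used to avoid `log2(0)`), and the genes-as-rows layout are all assumptions.

```python
import numpy as np
import pandas as pd


def log2_transform_sketch(df: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Return log2(x + pseudocount) for all numeric columns.

    Hypothetical helper: the pseudocount is an assumption, added so that
    zero counts do not produce -inf.
    """
    numeric = df.select_dtypes(include="number")
    return np.log2(numeric + pseudocount)


# Toy count matrix (assumed layout: rows = genes, columns = samples).
counts = pd.DataFrame(
    {"sample_A": [0, 3, 7], "sample_B": [1, 15, 31]},
    index=["gene1", "gene2", "gene3"],
)
log_counts = log2_transform_sketch(counts)
print(log_counts)
```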

read_excel_file

Reads Excel files from each subdirectory in the specified directory.
Input: The path to the directory containing folders with Excel files.
Output: A list of DataFrames read from the Excel files, or None if an error occurs while reading the files.
Code: Click

merge_dfs

Merges a list of DataFrames into a single DataFrame.
Input: A list of DataFrames to be merged.
Output: A single DataFrame containing all merged data, or None if no DataFrames are provided or an error occurs.
Code: Click
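
One way to merge a list of DataFrames is to fold them pairwise with `functools.reduce`, which appears in the package list. The sketch below assumes a shared key column named `gene_id` and an inner join; both are illustrative choices, and `merge_frames_sketch` is a hypothetical name rather than the repository's function.

```python
from functools import reduce

import pandas as pd


def merge_frames_sketch(frames, on="gene_id"):
    """Merge a list of DataFrames on a shared key column.

    Hypothetical helper: the key name and the inner join are assumptions.
    Returns None for an empty list, mirroring the documented behaviour.
    """
    if not frames:
        return None
    return reduce(
        lambda left, right: pd.merge(left, right, on=on, how="inner"),
        frames,
    )


# Three toy tables that only partly share genes.
a = pd.DataFrame({"gene_id": ["g1", "g2", "g3"], "s1": [1, 2, 3]})
b = pd.DataFrame({"gene_id": ["g2", "g3", "g4"], "s2": [4, 5, 6]})
c = pd.DataFrame({"gene_id": ["g3", "g2"], "s3": [7, 8]})

merged = merge_frames_sketch([a, b, c])
print(merged)
```

With an inner join, only genes present in every table survive; an outer join would keep all genes and introduce missing values instead.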

label_group

Labels the group based on keywords indicating tumor or normal status.
Input: The group to label.
Output: The label for the group ("Normal", "Tumor", or "Other").
Code: Click
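
Keyword-based labeling could look roughly like the following. The keyword lists are assumptions made for illustration, not the ones used in the repository, and `label_group_sketch` is a hypothetical name. Normal keywords are checked first so that a string such as "non-tumor" is not mislabeled as "Tumor".

```python
# Hypothetical keyword lists; the real ones may differ.
NORMAL_KEYWORDS = ("normal", "healthy", "control", "non-tumor")
TUMOR_KEYWORDS = ("tumor", "tumour", "cancer", "carcinoma")


def label_group_sketch(group) -> str:
    """Map a free-text group description to "Normal", "Tumor", or "Other"."""
    text = str(group).lower()
    # Check normal keywords first: "non-tumor" also contains "tumor".
    if any(keyword in text for keyword in NORMAL_KEYWORDS):
        return "Normal"
    if any(keyword in text for keyword in TUMOR_KEYWORDS):
        return "Tumor"
    return "Other"


for example in ["Primary Tumor", "adjacent normal tissue", "non-tumor liver", "cell line"]:
    print(example, "->", label_group_sketch(example))
```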

calc_log2fc

Calculates the log2 fold change (Log2FC) between normal and tumor samples in the given DataFrame.
Input: dataframe (DataFrame) - The DataFrame containing gene expression data.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output (bool) - Whether to save the log2 fold change results to a CSV file (default is False).
Output: A Series containing the log2 fold change (Log2FC) values
Code: Click
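
For expression values that are already log2-transformed, the log2 fold change can be computed as the difference of group means. The sketch below assumes genes as rows, samples as columns, a metadata table indexed by sample name with a `Label` column, and the direction tumor minus normal. All of these, including the name `log2fc_sketch`, are assumptions for illustration.

```python
import pandas as pd


def log2fc_sketch(log_expr: pd.DataFrame, metadata: pd.DataFrame,
                  label_col: str = "Label") -> pd.Series:
    """Log2FC per gene as mean(tumor) - mean(normal) on log2-scale data.

    Hypothetical helper: the column name and the direction of the
    comparison are assumptions.
    """
    tumor = metadata.index[metadata[label_col] == "Tumor"]
    normal = metadata.index[metadata[label_col] == "Normal"]
    diff = log_expr[tumor].mean(axis=1) - log_expr[normal].mean(axis=1)
    return diff.rename("Log2FC")


log_expr = pd.DataFrame(
    {"t1": [5.0, 1.0], "t2": [7.0, 1.0], "n1": [2.0, 3.0], "n2": [4.0, 5.0]},
    index=["g1", "g2"],
)
metadata = pd.DataFrame(
    {"Label": ["Tumor", "Tumor", "Normal", "Normal"]},
    index=["t1", "t2", "n1", "n2"],
)

log2fc = log2fc_sketch(log_expr, metadata)
print(log2fc)
```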

Functions for basic analysis and visualization

heatmap

Creates and displays a heatmap of the given DataFrame.
Input: dataframe (DataFrame) - The DataFrame to create a heatmap from.
Input: output_path (str) - The path where the heatmap will be saved.
Input: output_name (str) - The name of the output heatmap file (default is "heatmap").
Input: method (str) - The method to use for creating the heatmap ('seaborn' or 'plotly').
Input: top_var_genes_num (int) - The number of most variable genes to display in the heatmap (default is 250).
Input: mean_value (float) - The minimum mean value for filtering genes (default is 0).
Input: show (bool) - Whether to display the heatmap (default is True).
Output: None - Displays a heatmap of the DataFrame.
Code: Click
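
The `top_var_genes_num` and `mean_value` parameters suggest a two-step gene selection before plotting: filter by mean expression, then keep the most variable genes. A minimal sketch of that selection, assuming genes as rows and a strict "greater than" mean filter (the repository may filter differently; `top_variable_genes_sketch` is a hypothetical name):

```python
import pandas as pd


def top_variable_genes_sketch(df: pd.DataFrame, top_n: int = 250,
                              min_mean: float = 0.0) -> pd.DataFrame:
    """Keep genes with mean > min_mean, then the top_n by variance."""
    filtered = df[df.mean(axis=1) > min_mean]
    top_index = filtered.var(axis=1).nlargest(top_n).index
    return filtered.loc[top_index]


expr = pd.DataFrame(
    {"s1": [1, 0, 2, 0], "s2": [1, 5, 4, 0], "s3": [1, 10, 6, 0]},
    index=["g1", "g2", "g3", "g4"],
)

selected = top_variable_genes_sketch(expr, top_n=2)
print(selected)
```

The resulting table can then be passed to a plotting call such as `seaborn.heatmap` or `plotly.express.imshow`.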

pca

Performs PCA on the given DataFrame and displays a scatter plot of the results.
Input: dataframe (DataFrame) - The DataFrame to perform PCA on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the PCA plot will be saved.
Input: output_name (str) - The name of the output PCA plot file (default is "pca").
Input: group (str) - The column name in metadata to use for grouping samples.
Input: show (bool) - Whether to display the PCA plot (default is True).
Output: None - Displays a PCA plot of the DataFrame.
Code: Click
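
The numerical part of such a PCA can be sketched with `StandardScaler` and `PCA` from scikit-learn, both of which appear in the package list. The example assumes a genes-by-samples matrix that is transposed so that each sample becomes one point; the function name `pca_scores_sketch` and the random toy data are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_scores_sketch(expr: pd.DataFrame, n_components: int = 2):
    """Return per-sample PCA scores and the explained variance ratios.

    Hypothetical helper: assumes rows = genes, columns = samples.
    """
    samples_by_genes = expr.T
    scaled = StandardScaler().fit_transform(samples_by_genes)
    model = PCA(n_components=n_components)
    scores = model.fit_transform(scaled)
    columns = [f"PC{i + 1}" for i in range(n_components)]
    scores_df = pd.DataFrame(scores, index=samples_by_genes.index, columns=columns)
    return scores_df, model.explained_variance_ratio_


# Random toy data: 50 genes, 6 samples.
rng = np.random.default_rng(0)
expr = pd.DataFrame(
    rng.normal(size=(50, 6)),
    index=[f"gene_{i}" for i in range(50)],
    columns=[f"sample_{j}" for j in range(6)],
)

scores, explained = pca_scores_sketch(expr)
print(scores)
print(explained)
```

Joining `scores` with the metadata on the sample index gives a table that can be colored by any grouping column, for example the batch or the tumor/normal label.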

distribution_plot

Creates and displays a distribution plot of the specified column in the DataFrame.
Input: dataframe (DataFrame) - The DataFrame to create a distribution plot from.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the distribution plot will be saved.
Input: output_name (str) - The name of the output distribution plot file (default is "distribution_plot").
Input: show (bool) - Whether to display the distribution plot (default is True).
Output: None - Displays a distribution plot of the specified column.
Code: Click

mean_expression

Calculates the mean expression values for each group in the specified column of the metadata.
Input: dataframe (DataFrame) - The DataFrame to calculate mean expression from.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the mean expression plot will be saved.
Input: output_name (str) - The name of the output mean expression plot file (default is "mean_expression").
Input: show (bool) - Whether to display the mean expression plot (default is True).
Output: None - Displays a boxen plot of mean expression values by GEO_Series.
Code: Click

Functions for machine learning

logistic_regression

Performs logistic regression on the given DataFrame and returns the model, training and test sets.
Input: dataframe (DataFrame) - The DataFrame to perform logistic regression on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Output: A dictionary containing the results of the logistic regression analysis.
Code: Click
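
A possible shape for such a workflow is shown below, using only scikit-learn classes from the package list. This is a sketch under assumptions, not the repository's code: the dictionary keys, the 75/25 stratified split, the scaling step, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def logistic_regression_sketch(expr, labels, test_size=0.25, seed=0):
    """Fit a logistic regression on samples x genes and return a result dict."""
    X = expr.T                      # assumed layout: rows = genes, columns = samples
    y = labels.loc[X.index]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Fit the scaler on the training set only to avoid information leakage.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    y_pred = model.predict(scaler.transform(X_test))
    return {
        "model": model,
        "scaler": scaler,
        "X_test": X_test,
        "y_test": y_test,
        "y_pred": y_pred,
        "accuracy": accuracy_score(y_test, y_pred),
    }


# Synthetic data: 20 genes, 40 samples, first 5 genes shifted in tumor samples.
rng = np.random.default_rng(0)
samples = [f"s{i}" for i in range(40)]
labels = pd.Series(["Normal"] * 20 + ["Tumor"] * 20, index=samples, name="Label")
expr = pd.DataFrame(
    rng.normal(size=(20, 40)),
    index=[f"gene_{i}" for i in range(20)],
    columns=samples,
)
expr.iloc[:5, 20:] += 2.0

result = logistic_regression_sketch(expr, labels)
print("accuracy:", result["accuracy"])
```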

random_forest_classification

Performs random forest classification on the given DataFrame and returns the model, training and test sets.
Input: dataframe (DataFrame) - The DataFrame to perform random forest classification on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Output: A dictionary containing the results of the random forest classification analysis.
Code: Click
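
A comparable sketch for a random forest follows. Tree-based models do not require feature scaling, so that step is omitted here. The dictionary keys, hyperparameters, and synthetic data are assumptions made for illustration and may not match the repository.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split


def random_forest_sketch(expr, labels, test_size=0.25, seed=0):
    """Fit a random forest on samples x genes and return a result dict."""
    X = expr.T                      # assumed layout: rows = genes, columns = samples
    y = labels.loc[X.index]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "model": model,
        "X_test": X_test,
        "y_test": y_test,
        "y_pred": y_pred,
        "accuracy": accuracy_score(y_test, y_pred),
        # String labels need an explicit positive class for the F1 score.
        "f1": f1_score(y_test, y_pred, pos_label="Tumor"),
    }


# Synthetic data: 20 genes, 40 samples, first 5 genes shifted in tumor samples.
rng = np.random.default_rng(1)
samples = [f"s{i}" for i in range(40)]
labels = pd.Series(["Normal"] * 20 + ["Tumor"] * 20, index=samples, name="Label")
expr = pd.DataFrame(
    rng.normal(size=(20, 40)),
    index=[f"gene_{i}" for i in range(20)],
    columns=samples,
)
expr.iloc[:5, 20:] += 2.0

result = random_forest_sketch(expr, labels)
print("accuracy:", result["accuracy"], "F1:", result["f1"])
```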

decision_tree

NOTE: This function was never really used.
Performs decision tree analysis on the given DataFrame and displays a plot of the results.
Input: dataframe (DataFrame) - The DataFrame to perform decision tree analysis on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: method (str) - The method/criterion to use for the decision tree ('gini' or 'entropy').
Output: A dictionary containing the results of the decision tree analysis.
Code: Click

plot_logistic_regression

Plots the confusion matrix and ROC curve for the logistic regression results.
Input: logreg_result (dict) - The result dictionary from logistic regression analysis.
Input: output_path (str) - The path where the logistic regression plot will be saved.
Input: output_name (str) - The name of the output logistic regression plot file (default is "logistic_regression").
Input: show (bool) - Whether to display the logistic regression plot (default is False).
Output: None - Displays a confusion matrix and ROC curve for the logistic regression results.
Code: Click

plot_decision_tree

Plots the confusion matrix and decision tree for the decision tree results.
Input: tree_result (dict) - The result dictionary from decision tree analysis.
Input: output_path (str) - The path where the decision tree plot will be saved.
Input: output_name (str) - The name of the output decision tree plot file (default is "decision_tree").
Input: show (bool) - Whether to display the decision tree plot (default is False).
Output: None - Displays a confusion matrix and decision tree plot for the decision tree results.
Code: Click

plot_feature_importance

Calculates feature importance using the specified method and displays a plot of the results.
Input: algorithm_result (dict) - The result dictionary from the algorithm analysis.
Input: output_path (str) - The path where the feature importance plot will be saved.
Input: output_name (str) - The name of the output feature importance plot file (default is "feature_importance").
Input: method (str) - The method to use for calculating feature importance ('logistic_regression' or 'forest_classification').
Input: show (bool) - Whether to display the feature importance plot (default is False).
Output: None - Displays a plot of feature importance based on the specified method.
Code: Click
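
How importance is read from a fitted model depends on the model type. A common convention, assumed here and not confirmed for this repository, is to use absolute coefficients for logistic regression and the impurity-based `feature_importances_` attribute for random forests. The helper name `feature_importance_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def feature_importance_sketch(model, feature_names) -> pd.Series:
    """Return a descending importance ranking for a fitted model."""
    if hasattr(model, "feature_importances_"):
        values = model.feature_importances_          # tree ensembles
    elif hasattr(model, "coef_"):
        values = np.abs(model.coef_).ravel()         # binary linear models
    else:
        raise TypeError("Model exposes neither feature_importances_ nor coef_.")
    return pd.Series(values, index=feature_names).sort_values(ascending=False)


# Toy data: 60 samples, 5 genes, gene_0 carries the class signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = pd.DataFrame(rng.normal(size=(60, 5)), columns=[f"gene_{i}" for i in range(5)])
X["gene_0"] += 3.0 * y

logreg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

logreg_rank = feature_importance_sketch(logreg, X.columns)
forest_rank = feature_importance_sketch(forest, X.columns)
print(logreg_rank)
print(forest_rank)
```

Coefficient magnitudes are only comparable across genes when the features are on a common scale, so standardizing the input before fitting the logistic regression is advisable.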

plot_perm_importance

Calculates permutation importance for the specified algorithm results and displays a plot of the results.
Input: algorithm_result (dict) - The result dictionary from the algorithm analysis.
Input: output_path (str) - The path where the permutation importance plot will be saved.
Input: output_name (str) - The name of the output permutation importance plot file (default is "permutation_importance").
Input: show (bool) - Whether to display the permutation importance plot (default is False).
Output: None - Displays a plot of permutation importance for the specified algorithm results.
Code: Click
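
Permutation importance measures how much a model's score drops when a single feature is shuffled on held-out data. The sketch below uses `sklearn.inspection.permutation_importance` from the package list; the model choice, the number of repeats, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data: 80 samples, 5 genes, gene_0 carries the class signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 40)
X = pd.DataFrame(rng.normal(size=(80, 5)), columns=[f"gene_{i}" for i in range(5)])
X["gene_0"] += 3.0 * y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the score drop.
perm = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

perm_rank = pd.Series(perm.importances_mean, index=X_test.columns).sort_values(
    ascending=False
)
print(perm_rank)
```

Unlike impurity-based importance, this measure is computed on held-out samples and works with any fitted estimator, which makes it usable for both model types described above.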

