Subject: Robustness of machine learning against batch effects in RNA-seq data
Supervised by Dr. Maximilien Sprang - The Mayer Lab Mainz
TH-Bingen - Bachelor's Degree Program "Angewandte Bioinformatik" - 2025
RNA sequencing (RNA-seq) enables the systematic study of gene expression and provides insights into biological processes at the molecular level. However, RNA-seq data are often affected by batch effects.
These arise from systematic technical variation, for example, due to sequencing runs, sample preparation, or laboratory conditions. Such variation can obscure true biological signals and complicate downstream
analyses. Machine learning (ML) methods are powerful tools for detecting complex patterns in high-dimensional data. Yet, in the context of RNA-seq, it remains unclear whether ML models capture genuine
biological variation or adapt to technical artifacts.
In this project, we investigate the robustness of selected ML methods against batch effects in RNA-seq data. Classical differential expression analysis is compared with ML-based approaches, and the resulting
gene lists are interpreted in biological context using over-representation analysis (ORA) and correlation analysis. This allows us to highlight differences between methods and assess which biological pathways
remain consistently detectable despite batch effects.
- Python
- R
os; time; pandas; seaborn; matplotlib.pyplot; numpy; plotly.express; functools → reduce; sklearn.decomposition → PCA; sklearn.preprocessing → StandardScaler; sklearn.linear_model → LogisticRegression; sklearn.ensemble → RandomForestClassifier; sklearn.model_selection → train_test_split; sklearn.metrics → confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score, f1_score; sklearn.inspection → permutation_importance; sklearn.tree → DecisionTreeClassifier, plot_tree; typing → Literal; scipy.stats → zscore; collections → Counter
Input: The DataFrame to be transformed.
Output: A DataFrame containing log2-transformed expression data.
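A log2 transformation of this kind could look roughly like the sketch below. The function name `log2_transform` and the pseudocount of 1 (added so that zero counts stay defined) are assumptions for illustration; the project code may handle zeros differently.

```python
import numpy as np
import pandas as pd


def log2_transform(dataframe, pseudocount=1.0):
    """Return log2(x + pseudocount) for every expression value."""
    # The pseudocount is an assumption: it keeps zero counts finite.
    return np.log2(dataframe + pseudocount)


# Toy counts: rows are genes, columns are samples.
counts = pd.DataFrame(
    {"s1": [0, 1, 3], "s2": [7, 15, 31]},
    index=["g1", "g2", "g3"],
)
logged = log2_transform(counts)
```

The result keeps the shape and labels of the input, so it can be passed on to the plotting and modelling steps unchanged.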
Reads Excel files from each subdirectory in the specified directory.
Input: The path to the directory containing folders with Excel files.
Output: A list of DataFrames read from the Excel files, or None if an error occurs while reading the files.
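One way to implement this reading step is sketched below. The name `read_excel_directory`, the `.xlsx` file pattern, and the broad exception handling are assumptions; `pd.read_excel` also needs an Excel engine such as openpyxl to be installed.

```python
from pathlib import Path

import pandas as pd


def read_excel_directory(directory):
    """Read every .xlsx file found one level below `directory`."""
    try:
        frames = []
        for subdir in sorted(Path(directory).iterdir()):
            if not subdir.is_dir():
                continue
            for excel_file in sorted(subdir.glob("*.xlsx")):
                frames.append(pd.read_excel(excel_file))
        return frames
    except Exception as error:  # e.g. missing directory or unreadable file
        print(f"Could not read Excel files: {error}")
        return None


# A directory that does not exist triggers the error path.
missing = read_excel_directory("this/directory/does/not/exist")
```

Returning None instead of raising matches the behaviour described above, so callers have to check the result before merging.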
Merges a list of DataFrames into a single DataFrame.
Input: A list of DataFrames to be merged.
Output: A single DataFrame containing all merged data, or None if no DataFrames are provided or if an error occurs.
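Since `functools.reduce` appears among the imports, the merge may be a pairwise reduction over the list. The sketch below assumes a shared key column called `gene` and an outer join; both are illustrative choices, not confirmed details of the project.

```python
from functools import reduce

import pandas as pd


def merge_dataframes(frames, on="gene"):
    """Merge a list of DataFrames pairwise on a shared key column."""
    if not frames:  # covers both None and an empty list
        return None
    try:
        return reduce(
            lambda left, right: pd.merge(left, right, on=on, how="outer"),
            frames,
        )
    except Exception as error:  # e.g. the key column is missing
        print(f"Could not merge DataFrames: {error}")
        return None


series_a = pd.DataFrame({"gene": ["A", "B"], "sample_1": [1.0, 2.0]})
series_b = pd.DataFrame({"gene": ["B", "C"], "sample_2": [3.0, 4.0]})
merged = merge_dataframes([series_a, series_b])
empty_result = merge_dataframes([])
```

An outer join keeps genes that are missing from some series and fills them with NaN; an inner join would keep only genes measured everywhere.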
Labels the group based on keywords indicating tumor or normal status.
Input: The group to label.
Output: The label for the group ("Normal", "Tumor", or "Other").
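Keyword-based labelling can be sketched as follows. The keyword lists and the choice to test for normal tissue first are assumptions; the actual keywords used in the project are not documented here.

```python
# Hypothetical keyword lists, chosen only for illustration.
NORMAL_KEYWORDS = ("normal", "healthy", "control")
TUMOR_KEYWORDS = ("tumor", "tumour", "cancer", "carcinoma")


def label_group(group):
    """Map a free-text group description to 'Normal', 'Tumor' or 'Other'."""
    text = str(group).lower()
    # Normal is checked first so that phrases mentioning both
    # (such as tumor-adjacent normal tissue) are labelled as normal.
    if any(keyword in text for keyword in NORMAL_KEYWORDS):
        return "Normal"
    if any(keyword in text for keyword in TUMOR_KEYWORDS):
        return "Tumor"
    return "Other"


labels = [label_group(g) for g in ("Primary Tumor", "adjacent normal tissue", "cell line")]
```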
Calculates the log2 fold change (Log2FC) between normal and tumor samples in the given DataFrame.
Input: dataframe (DataFrame) - The DataFrame containing gene expression data.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output (bool) - Whether to save the log2 fold change results to a CSV file (default is False).
Output: A Series containing the log2 fold change (Log2FC) values
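The calculation can be sketched as below, under three assumptions: the expression values are already log2-transformed (so the fold change is a difference of group means), the metadata is indexed by sample ID with a hypothetical `Label` column, and the direction is tumor minus normal. The CSV export controlled by `output` is left out.

```python
import pandas as pd


def calculate_log2fc(dataframe, metadata, label_col="Label"):
    """Log2FC per gene as mean(tumor) - mean(normal) on log2-scale data."""
    normal_samples = metadata.index[metadata[label_col] == "Normal"]
    tumor_samples = metadata.index[metadata[label_col] == "Tumor"]
    log2fc = dataframe[tumor_samples].mean(axis=1) - dataframe[normal_samples].mean(axis=1)
    return log2fc.rename("Log2FC")


# Toy log2 expression: rows are genes, columns are samples.
expr = pd.DataFrame(
    {"s1": [1.0, 4.0], "s2": [3.0, 4.0], "s3": [5.0, 2.0], "s4": [7.0, 2.0]},
    index=["g1", "g2"],
)
meta = pd.DataFrame(
    {"Label": ["Normal", "Normal", "Tumor", "Tumor"]},
    index=["s1", "s2", "s3", "s4"],
)
fold_change = calculate_log2fc(expr, meta)
```

With untransformed values the same quantity would instead be the log2 of the ratio of group means.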
Creates and displays a heatmap of the given DataFrame.
Input: dataframe (DataFrame) - The DataFrame to create a heatmap from.
Input: output_path (str) - The path where the heatmap will be saved.
Input: output_name (str) - The name of the output heatmap file (default is "heatmap").
Input: method (str) - The method to use for creating the heatmap ('seaborn' or 'plotly').
Input: top_var_genes_num (int) - The number of most variable genes to display in the heatmap (default is 250).
Input: mean_value (float) - The minimum mean value for filtering genes (default is 0).
Input: show (bool) - Whether to display the heatmap (default is True).
Output: None - Displays a heatmap of the DataFrame.
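The gene selection implied by `mean_value` and `top_var_genes_num` might work as in this sketch, which leaves out the plotting itself. The helper name `select_top_variable_genes` and the strict "greater than" comparison are assumptions.

```python
import pandas as pd


def select_top_variable_genes(dataframe, top_var_genes_num=250, mean_value=0):
    """Keep genes above a mean threshold, then the most variable ones."""
    # Assumption: genes whose mean equals the threshold are dropped.
    filtered = dataframe[dataframe.mean(axis=1) > mean_value]
    top_genes = filtered.var(axis=1).nlargest(top_var_genes_num).index
    return filtered.loc[top_genes]


expr = pd.DataFrame(
    {"s1": [0.0, 1.0, 2.0, 1.0], "s2": [0.0, 5.0, 2.0, 2.0], "s3": [0.0, 9.0, 2.0, 3.0]},
    index=["g1", "g2", "g3", "g4"],
)
selected = select_top_variable_genes(expr, top_var_genes_num=2)
```

The selected matrix could then be passed to a seaborn or plotly heatmap call, depending on `method`.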
Performs PCA on the given DataFrame and displays a scatter plot of the results.
Input: dataframe (DataFrame) - The DataFrame to perform PCA on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the PCA plot will be saved.
Input: output_name (str) - The name of the output PCA plot file (default is "pca").
Input: group (str) - The column name in metadata to use for grouping samples.
Input: show (bool) - Whether to display the PCA plot (default is True).
Output: None - Displays a PCA plot of the DataFrame.
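The numerical part of this step can be sketched with `StandardScaler` and `PCA` from the imports. The sketch assumes genes in rows and samples in columns, standardizes each gene, and returns two components; the function name `pca_coordinates` and the toy group values are hypothetical, and the scatter plot is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_coordinates(dataframe, metadata, group):
    """Project samples onto the first two principal components."""
    samples = dataframe.T  # samples as rows, genes as features
    scaled = StandardScaler().fit_transform(samples)
    pca = PCA(n_components=2)
    coords = pd.DataFrame(
        pca.fit_transform(scaled), index=samples.index, columns=["PC1", "PC2"]
    )
    coords[group] = metadata.loc[samples.index, group].to_numpy()
    return coords, pca.explained_variance_ratio_


rng = np.random.default_rng(0)
sample_ids = [f"s{i}" for i in range(6)]
expr = pd.DataFrame(
    rng.normal(size=(20, 6)),
    index=[f"g{i}" for i in range(20)],
    columns=sample_ids,
)
meta = pd.DataFrame({"GEO_Series": ["series_A"] * 3 + ["series_B"] * 3}, index=sample_ids)
coords, explained = pca_coordinates(expr, meta, group="GEO_Series")
```

Colouring the resulting points by `GEO_Series` is a quick visual check for batch effects: samples that cluster by series rather than by biology point to technical variation.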
Creates and displays a distribution plot of the specified column in the DataFrame.
Input: dataframe (DataFrame) - The DataFrame to create a distribution plot from.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the distribution plot will be saved.
Input: output_name (str) - The name of the output distribution plot file (default is "distribution_plot").
Input: show (bool) - Whether to display the distribution plot (default is True).
Output: None - Displays a distribution plot of the specified column.
Calculates the mean expression values for each group in the specified column of the metadata.
Input: dataframe (DataFrame) - The DataFrame to calculate mean expression from.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the mean expression plot will be saved.
Input: output_name (str) - The name of the output mean expression plot file (default is "mean_expression").
Input: show (bool) - Whether to display the mean expression plot (default is True).
Output: None - Displays a boxen plot of mean expression values by GEO_Series.
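One reading of this step is a per-gene mean within each metadata group, which a boxen plot would then summarize per series. The sketch below follows that reading; the function name and the toy series names are assumptions, and plotting is omitted.

```python
import pandas as pd


def mean_expression_by_group(dataframe, metadata, group="GEO_Series"):
    """Return a genes x groups table of mean expression values."""
    groups = metadata.loc[dataframe.columns, group]
    # Transpose so samples are rows, average within each group, transpose back.
    return dataframe.T.groupby(groups).mean().T


expr = pd.DataFrame(
    {"s1": [1.0, 4.0], "s2": [3.0, 4.0], "s3": [5.0, 2.0], "s4": [7.0, 2.0]},
    index=["g1", "g2"],
)
meta = pd.DataFrame(
    {"GEO_Series": ["series_A", "series_A", "series_B", "series_B"]},
    index=["s1", "s2", "s3", "s4"],
)
group_means = mean_expression_by_group(expr, meta)
```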
Performs logistic regression on the given DataFrame and returns the model, training and test sets.
Input: dataframe (DataFrame) - The DataFrame to perform logistic regression on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Output: A dictionary containing the results of the logistic regression analysis.
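A minimal version of this workflow is sketched below. The dictionary keys, the `Label` column, the stratified 75/25 split, and `max_iter=1000` are assumptions for illustration and may differ from the project code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def run_logistic_regression(dataframe, metadata, label_col="Label",
                            test_size=0.25, random_state=0):
    """Fit a logistic regression on samples (rows) x genes (columns)."""
    X = dataframe.T
    y = metadata.loc[X.index, label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return {
        "model": model,
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
    }


# Toy data: gene g0 is shifted upward in the tumor samples.
rng = np.random.default_rng(0)
sample_ids = [f"s{i}" for i in range(20)]
expr = pd.DataFrame(rng.normal(size=(5, 20)),
                    index=[f"g{i}" for i in range(5)], columns=sample_ids)
meta = pd.DataFrame({"Label": ["Normal"] * 10 + ["Tumor"] * 10}, index=sample_ids)
expr.loc["g0", sample_ids[10:]] += 3.0
result = run_logistic_regression(expr, meta)
```

Keeping the fitted model and both data splits in one dictionary lets the later plotting and importance steps reuse them without refitting.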
Performs random forest classification on the given DataFrame and returns the model, training and test sets.
Input: dataframe (DataFrame) - The DataFrame to perform random forest classification on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Output: A dictionary containing the results of the random forest classification analysis.
NOTE: This function was never really used.
Performs decision tree analysis on the given DataFrame and displays a plot of the results.
Input: dataframe (DataFrame) - The DataFrame to perform decision tree analysis on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: method (str) - The method/criterion to use for the decision tree ('gini' or 'entropy').
Output: A dictionary containing the results of the decision tree analysis.
Plots the confusion matrix and ROC curve for the logistic regression results.
Input: logreg_result (dict) - The result dictionary from logistic regression analysis.
Input: output_path (str) - The path where the logistic regression plot will be saved.
Input: output_name (str) - The name of the output logistic regression plot file (default is "logistic_regression").
Input: show (bool) - Whether to display the logistic regression plot (default is False).
Output: None - Displays a confusion matrix and ROC curve for the logistic regression results.
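The quantities behind such a plot can be computed with `roc_curve`, `roc_auc_score` and `confusion_matrix` from the imports. The sketch takes plain arrays rather than the result dictionary, and treats "Tumor" as the positive class; both choices are assumptions. In practice the scores would come from something like `model.predict_proba` on the test set.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve


def roc_summary(y_true, y_score, y_pred, positive="Tumor", negative="Normal"):
    """Collect ROC curve points, AUC and confusion matrix for two classes."""
    fpr, tpr, _ = roc_curve(y_true, y_score, pos_label=positive)
    auc = roc_auc_score(np.asarray(y_true) == positive, y_score)
    # Rows are true classes, columns are predicted classes.
    matrix = confusion_matrix(y_true, y_pred, labels=[negative, positive])
    return {"fpr": fpr, "tpr": tpr, "auc": auc, "confusion_matrix": matrix}


y_true = ["Normal", "Normal", "Tumor", "Tumor"]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of "Tumor"
y_pred = ["Normal", "Normal", "Normal", "Tumor"]
summary = roc_summary(y_true, y_score, y_pred)
```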
Plots the confusion matrix and decision tree for the decision tree results.
Input: tree_result (dict) - The result dictionary from decision tree analysis.
Input: output_path (str) - The path where the decision tree plot will be saved.
Input: output_name (str) - The name of the output decision tree plot file (default is "decision_tree").
Input: show (bool) - Whether to display the decision tree plot (default is False).
Output: None - Displays a confusion matrix and decision tree plot for the decision tree results.
Calculates feature importance using the specified method and displays a plot of the results.
Input: algorithm_result (dict) - The result dictionary from the algorithm analysis.
Input: output_path (str) - The path where the feature importance plot will be saved.
Input: output_name (str) - The name of the output feature importance plot file (default is "feature_importance").
Input: method (str) - The method to use for calculating feature importance ('logistic_regression' or 'forest_classification').
Input: show (bool) - Whether to display the feature importance plot (default is False).
Output: None - Displays a plot of feature importance based on the specified method.
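The two `method` values suggest model-specific importance scores. A possible sketch: absolute coefficients for logistic regression and impurity-based `feature_importances_` for the random forest. Using the absolute coefficient as an importance score, and passing the fitted model directly instead of the result dictionary, are assumptions; plotting is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def feature_importance(model, feature_names, method):
    """Rank features by a model-specific importance score."""
    if method == "logistic_regression":
        scores = np.abs(model.coef_[0])  # binary case: one row of coefficients
    elif method == "forest_classification":
        scores = model.feature_importances_
    else:
        raise ValueError(f"Unknown method: {method}")
    return pd.Series(scores, index=feature_names).sort_values(ascending=False)


# Toy data in which only gene g0 determines the class.
rng = np.random.default_rng(0)
genes = ["g0", "g1", "g2", "g3"]
X = pd.DataFrame(rng.normal(size=(80, 4)), columns=genes)
y = np.where(X["g0"] > 0, "Tumor", "Normal")

logreg = LogisticRegression(max_iter=1000).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
logreg_ranking = feature_importance(logreg, genes, "logistic_regression")
forest_ranking = feature_importance(forest, genes, "forest_classification")
```

Coefficient magnitudes are only comparable across genes when the features share a scale, so standardizing the expression values first is advisable for the logistic regression variant.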
Calculates permutation importance for the specified algorithm results and displays a plot of the results.
Input: algorithm_result (dict) - The result dictionary from the algorithm analysis.
Input: output_path (str) - The path where the permutation importance plot will be saved.
Input: output_name (str) - The name of the output permutation importance plot file (default is "permutation_importance").
Input: show (bool) - Whether to display the permutation importance plot (default is False).
Output: None - Displays a plot of permutation importance for the specified algorithm results.
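Permutation importance measures how much the test score drops when a single feature is shuffled. The sketch below uses `sklearn.inspection.permutation_importance`; the helper name `permutation_ranking`, the number of repeats, and the toy data are assumptions, and plotting is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression


def permutation_ranking(model, X_test, y_test, n_repeats=10, random_state=0):
    """Rank features by the mean score drop after shuffling each one."""
    result = permutation_importance(
        model, X_test, y_test, n_repeats=n_repeats, random_state=random_state
    )
    return pd.Series(result.importances_mean, index=X_test.columns).sort_values(
        ascending=False
    )


# Toy data in which only gene g0 determines the class.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(80, 4)), columns=["g0", "g1", "g2", "g3"])
y = pd.Series(np.where(X["g0"] > 0, "Tumor", "Normal"))

model = LogisticRegression(max_iter=1000).fit(X.iloc[:60], y.iloc[:60])
ranking = permutation_ranking(model, X.iloc[60:], y.iloc[60:])
```

Unlike coefficients or impurity-based scores, this measure is model-agnostic, which makes rankings from different classifiers directly comparable.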