LzLang/Praxisphase

Practical Phase Laszlo Lang

Subject: Robustness of machine learning against batch effects in RNA-seq data
Supervised by Dr. Maximilien Sprang - The Mayer Lab Mainz
TH-Bingen - Bachelor's Degree Program "Angewandte Bioinformatik" - 2025


RNA sequencing (RNA-seq) enables the systematic study of gene expression and provides insights into biological processes at the molecular level. However, RNA-seq data are often affected by batch effects. These arise from systematic technical variation, for example, due to sequencing runs, sample preparation, or laboratory conditions. Such variation can obscure true biological signals and complicate downstream analyses. Machine learning (ML) methods are powerful tools for detecting complex patterns in high-dimensional data. Yet, in the context of RNA-seq, it remains unclear whether ML models capture genuine biological variation or adapt to technical artifacts.
In this project, we investigate the robustness of selected ML methods against batch effects in RNA-seq data. Classical differential expression analysis is compared with ML-based approaches, and the resulting gene lists are interpreted in biological context using over-representation analysis (ORA) and correlation analysis. This allows us to highlight differences between methods and assess which biological pathways remain consistently detectable despite batch effects.


Chapters:


Python

Packages

  • os
  • time
  • pandas
  • seaborn
  • matplotlib.pyplot
  • numpy
  • plotly.express
  • functools: reduce
  • sklearn.decomposition: PCA
  • sklearn.preprocessing: StandardScaler
  • sklearn.linear_model: LogisticRegression
  • sklearn.ensemble: RandomForestClassifier
  • sklearn.model_selection: train_test_split
  • sklearn.metrics: confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score, f1_score
  • sklearn.inspection: permutation_importance
  • sklearn.tree: DecisionTreeClassifier, plot_tree
  • typing: Literal
  • scipy.stats: zscore
  • collections: Counter

Functions to prepare the DataFrame

prepare_dataframe

Input: The dataframe to be transformed
Output: A dataframe containing log2-transformed expression data
Code: Click
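
The log2 transformation could look roughly like the following. This is a minimal sketch, not the repository's implementation: the name prepare_dataframe_sketch, the genes-as-rows layout, and the pseudocount of 1 (to avoid log2(0)) are all assumptions.

```python
import numpy as np
import pandas as pd


def prepare_dataframe_sketch(dataframe: pd.DataFrame, pseudocount: float = 1.0) -> pd.DataFrame:
    """Return a log2-transformed copy of the numeric expression columns.

    Assumption: values are non-negative and a pseudocount is added
    before taking the logarithm so that zeros stay finite.
    """
    numeric = dataframe.select_dtypes(include="number")
    return np.log2(numeric + pseudocount)


# Hypothetical toy data: genes as rows, samples as columns.
counts = pd.DataFrame(
    {"sample_a": [0, 1, 3], "sample_b": [7, 15, 31]},
    index=["gene_1", "gene_2", "gene_3"],
)
log_counts = prepare_dataframe_sketch(counts)
```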

read_excel_file

Reads Excel files from each subdirectory in the specified directory.
Input: The path to the directory containing folders with Excel files.
Output: A list of DataFrames read from the Excel files, or None if an error occurs while reading the files.
Code: Click

merge_dfs

Merges a list of DataFrames into a single DataFrame.
Input: A list of DataFrames to be merged.
Output: A single DataFrame containing all merged data, or None if no DataFrames are provided or if an error occurs.
Code: Click
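
Since functools.reduce is among the listed packages, the merge may be a pairwise reduction over the list. The sketch below assumes that the DataFrames share gene identifiers in their index and are combined with an inner join. Both the join key and the join type are assumptions, as is the name merge_dfs_sketch.

```python
from functools import reduce
from typing import List, Optional

import pandas as pd


def merge_dfs_sketch(dfs: List[pd.DataFrame]) -> Optional[pd.DataFrame]:
    """Merge DataFrames pairwise on their index; return None on failure."""
    if not dfs:
        return None
    try:
        return reduce(
            # Assumption: rows are matched on the index (e.g. gene IDs).
            lambda left, right: pd.merge(
                left, right, left_index=True, right_index=True, how="inner"
            ),
            dfs,
        )
    except Exception as error:
        print(f"Merging failed: {error}")
        return None


# Hypothetical toy data with partially overlapping genes.
df_a = pd.DataFrame({"s1": [1, 2, 3]}, index=["g1", "g2", "g3"])
df_b = pd.DataFrame({"s2": [4, 5, 6]}, index=["g2", "g3", "g4"])
merged = merge_dfs_sketch([df_a, df_b])
```

With an inner join only genes present in every dataset survive; an outer join would keep all genes and introduce missing values instead.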

label_group

Labels the group based on keywords indicating tumor or normal status.
Input: The group to label.
Output: The label for the group ("Normal", "Tumor", or "Other").
Code: Click
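
A keyword-based labelling step might be sketched as follows. The keyword lists and the choice to test for "normal" keywords first are illustrative assumptions; the repository's actual keywords are not shown here.

```python
def label_group_sketch(group: str) -> str:
    """Map a free-text group description to "Normal", "Tumor", or "Other"."""
    text = str(group).lower()
    # Hypothetical keyword lists; adjust to the metadata at hand.
    normal_keywords = ("normal", "healthy", "control")
    tumor_keywords = ("tumor", "tumour", "cancer", "carcinoma")
    # Normal is checked first so that "tumor-adjacent normal" maps to Normal.
    if any(keyword in text for keyword in normal_keywords):
        return "Normal"
    if any(keyword in text for keyword in tumor_keywords):
        return "Tumor"
    return "Other"


labels = [label_group_sketch(g) for g in ("Primary Tumor", "adjacent normal tissue", "cell line")]
```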

calc_log2fc

Calculates the log2 fold change (Log2FC) between normal and tumor samples in the given DataFrame.
Input: dataframe (DataFrame) - The DataFrame containing gene expression data.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output (bool) - Whether to save the log2 fold change results to a CSV file (default is False).
Output: A Series containing the log2 fold change (Log2FC) values
Code: Click
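
On data that is already log2-transformed, the log2 fold change reduces to a difference of group means. The sketch below assumes genes as rows, samples as columns, a metadata table indexed by sample ID with a hypothetical "Group" column, and tumor relative to normal as the direction of the fold change. The optional CSV output is omitted.

```python
import pandas as pd


def calc_log2fc_sketch(
    dataframe: pd.DataFrame, metadata: pd.DataFrame, group_col: str = "Group"
) -> pd.Series:
    """Log2FC per gene as mean(tumor) - mean(normal) on log2-scale data."""
    normal_samples = metadata[metadata[group_col] == "Normal"].index
    tumor_samples = metadata[metadata[group_col] == "Tumor"].index
    normal_mean = dataframe[normal_samples].mean(axis=1)
    tumor_mean = dataframe[tumor_samples].mean(axis=1)
    return (tumor_mean - normal_mean).rename("Log2FC")


# Hypothetical toy data on the log2 scale.
expression = pd.DataFrame(
    {"n1": [1.0, 4.0], "n2": [1.0, 4.0], "t1": [3.0, 2.0], "t2": [3.0, 2.0]},
    index=["gene_up", "gene_down"],
)
metadata = pd.DataFrame(
    {"Group": ["Normal", "Normal", "Tumor", "Tumor"]},
    index=["n1", "n2", "t1", "t2"],
)
log2fc = calc_log2fc_sketch(expression, metadata)
```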


Functions for basic analysis and visualization

heatmap

Creates and displays a heatmap of the given DataFrame.
Input: dataframe (DataFrame) - The DataFrame to create a heatmap from.
Input: output_path (str) - The path where the heatmap will be saved.
Input: output_name (str) - The name of the output heatmap file (default is "heatmap").
Input: method (str) - The method to use for creating the heatmap ('seaborn' or 'plotly').
Input: top_var_genes_num (int) - The number of most variable genes to display in the heatmap (default is 250).
Input: mean_value (float) - The minimum mean value for filtering genes (default is 0).
Input: show (bool) - Whether to display the heatmap (default is True).
Output: None - Displays a heatmap of the DataFrame.
Code: Click
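
The gene selection implied by top_var_genes_num and mean_value can be illustrated without any plotting. This is a sketch under assumptions: genes are rows, the mean filter keeps genes whose mean is at least mean_value, and variance is used as the variability measure. The helper name is hypothetical.

```python
import pandas as pd


def select_top_variable_genes(
    dataframe: pd.DataFrame, top_var_genes_num: int = 250, mean_value: float = 0.0
) -> pd.DataFrame:
    """Keep genes with mean >= mean_value, then the most variable of those."""
    filtered = dataframe[dataframe.mean(axis=1) >= mean_value]
    top_genes = filtered.var(axis=1).nlargest(top_var_genes_num).index
    return filtered.loc[top_genes]


# Hypothetical toy data: one flat gene, one lowly expressed gene, two variable genes.
expression = pd.DataFrame(
    {"s1": [5.0, 0.0, 1.0, 2.0], "s2": [5.0, 0.0, 5.0, 4.0], "s3": [5.0, 0.3, 9.0, 6.0]},
    index=["flat", "low", "high_var", "mid_var"],
)
selected = select_top_variable_genes(expression, top_var_genes_num=2, mean_value=1.0)
```

The resulting table could then be passed to a plotting call such as seaborn.heatmap or plotly.express.imshow, matching the two values of the method parameter.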

pca

Performs PCA on the given DataFrame and displays a scatter plot of the results.
Input: dataframe (DataFrame) - The DataFrame to perform PCA on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the PCA plot will be saved.
Input: output_name (str) - The name of the output PCA plot file (default is "pca").
Input: group (str) - The column name in metadata to use for grouping samples.
Input: show (bool) - Whether to display the PCA plot (default is True).
Output: None - Displays a PCA plot of the DataFrame.
Code: Click
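
The numerical part of such a PCA can be sketched with StandardScaler and PCA from the package list. The example assumes genes as rows (so the table is transposed to samples by genes), two components, and a metadata table indexed by sample ID. The series labels in the toy metadata are invented, and plotting is left out.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_coordinates_sketch(dataframe, metadata, group, n_components=2):
    """Return per-sample PCA coordinates plus the explained variance ratios."""
    samples_by_genes = dataframe.T  # assumption: input has genes as rows
    scaled = StandardScaler().fit_transform(samples_by_genes)
    model = PCA(n_components=n_components)
    coords = model.fit_transform(scaled)
    result = pd.DataFrame(
        coords,
        index=samples_by_genes.index,
        columns=[f"PC{i + 1}" for i in range(n_components)],
    )
    # Attach the grouping column so a scatter plot can be coloured by it.
    result[group] = metadata.loc[result.index, group].to_numpy()
    return result, model.explained_variance_ratio_


# Hypothetical toy data: 20 genes, 6 samples from two series.
rng = np.random.default_rng(0)
expression = pd.DataFrame(
    rng.normal(size=(20, 6)),
    index=[f"gene_{i}" for i in range(20)],
    columns=[f"sample_{i}" for i in range(6)],
)
metadata = pd.DataFrame(
    {"GEO_Series": ["GSE_A"] * 3 + ["GSE_B"] * 3}, index=expression.columns
)
coords, explained = pca_coordinates_sketch(expression, metadata, group="GEO_Series")
```

Colouring the first two components by a column such as GEO_Series is a common way to see whether samples cluster by batch rather than by biology.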

distribution_plot

Creates and displays a distribution plot of the specified column in the DataFrame.
Input: dataframe (DataFrame) - The DataFrame to create a distribution plot from.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the distribution plot will be saved.
Input: output_name (str) - The name of the output distribution plot file (default is "distribution_plot").
Input: show (bool) - Whether to display the distribution plot (default is True).
Output: None - Displays a distribution plot of the specified column.
Code: Click

mean_expression

Calculates the mean expression values for each group in the specified column of the metadata.
Input: dataframe (DataFrame) - The DataFrame to calculate mean expression from.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: output_path (str) - The path where the mean expression plot will be saved.
Input: output_name (str) - The name of the output mean expression plot file (default is "mean_expression").
Input: show (bool) - Whether to display the mean expression plot (default is True).
Output: None - Displays a boxen plot of mean expression values by GEO_Series.
Code: Click


Functions for machine learning

logistic_regression

Performs logistic regression on the given DataFrame and returns the model, training and test sets.
Input: dataframe (DataFrame) - The DataFrame to perform logistic regression on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Output: A dictionary containing the results of the logistic regression analysis.
Code: Click
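
A possible shape of this workflow is sketched below. The label column "Group", the 75/25 stratified split, the feature scaling, and the keys of the result dictionary are assumptions made for illustration; the repository's dictionary may be organised differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def logistic_regression_sketch(dataframe, metadata, label_col="Group", random_state=0):
    """Fit a logistic regression on samples-by-genes data and return a result dict."""
    X = dataframe.T  # assumption: input has genes as rows
    y = metadata.loc[X.index, label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    # The scaler is fitted on the training set only to avoid leaking test data.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression(max_iter=1000)
    model.fit(scaler.transform(X_train), y_train)
    y_pred = model.predict(scaler.transform(X_test))
    return {
        "model": model, "scaler": scaler,
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "y_pred": y_pred, "accuracy": accuracy_score(y_test, y_pred),
    }


# Hypothetical toy data: gene_0 is strongly shifted in tumor samples.
rng = np.random.default_rng(0)
expression = pd.DataFrame(
    rng.normal(size=(10, 40)),
    index=[f"gene_{i}" for i in range(10)],
    columns=[f"sample_{i}" for i in range(40)],
)
expression.loc["gene_0", expression.columns[20:]] += 5.0
metadata = pd.DataFrame({"Group": ["Normal"] * 20 + ["Tumor"] * 20}, index=expression.columns)
result = logistic_regression_sketch(expression, metadata)
```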

random_forest_classification

Performs random forest classification on the given DataFrame and returns the model, training and test sets.
Input: dataframe (DataFrame) - The DataFrame to perform random forest classification on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Output: A dictionary containing the results of the random forest classification analysis.
Code: Click
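
A random forest variant could be sketched in the same way. As before, the label column, split ratio, number of trees, and dictionary keys are assumptions. The sketch also stores the impurity-based feature importances, which are one natural input for a feature importance plot.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def random_forest_classification_sketch(dataframe, metadata, label_col="Group", random_state=0):
    """Fit a random forest on samples-by-genes data and return a result dict."""
    X = dataframe.T  # assumption: input has genes as rows
    y = metadata.loc[X.index, label_col]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(n_estimators=200, random_state=random_state)
    model.fit(X_train, y_train)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return {
        "model": model,
        "X_train": X_train, "X_test": X_test,
        "y_train": y_train, "y_test": y_test,
        "accuracy": model.score(X_test, y_test),
        "feature_importances": importances.sort_values(ascending=False),
    }


# Hypothetical toy data: gene_0 is strongly shifted in tumor samples.
rng = np.random.default_rng(0)
expression = pd.DataFrame(
    rng.normal(size=(10, 40)),
    index=[f"gene_{i}" for i in range(10)],
    columns=[f"sample_{i}" for i in range(40)],
)
expression.loc["gene_0", expression.columns[20:]] += 5.0
metadata = pd.DataFrame({"Group": ["Normal"] * 20 + ["Tumor"] * 20}, index=expression.columns)
result = random_forest_classification_sketch(expression, metadata)
```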

decision_tree

NOTE: This function was never really used.
Performs decision tree analysis on the given DataFrame and displays a plot of the results.
Input: dataframe (DataFrame) - The DataFrame to perform decision tree analysis on.
Input: metadata (DataFrame) - The metadata DataFrame containing sample information.
Input: method (str) - The method/criterion to use for the decision tree ('gini' or 'entropy').
Output: A dictionary containing the results of the decision tree analysis.
Code: Click

plot_logistic_regression

Plots the confusion matrix and ROC curve for the logistic regression results.
Input: logreg_result (dict) - The result dictionary from logistic regression analysis.
Input: output_path (str) - The path where the logistic regression plot will be saved.
Input: output_name (str) - The name of the output logistic regression plot file (default is "logistic_regression").
Input: show (bool) - Whether to display the logistic regression plot (default is False).
Output: None - Displays a confusion matrix and ROC curve for the logistic regression results.
Code: Click
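
The quantities behind such a plot can be computed with the sklearn.metrics functions from the package list. The sketch below uses a tiny hand-made example (hypothetical labels and scores, with a 0.5 decision threshold as an assumption) rather than a fitted model, and leaves the actual plotting aside.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Hypothetical true labels (1 = tumor) and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Confusion matrix at a 0.5 threshold: rows are true classes, columns predictions.
y_pred = (y_score >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)

# ROC curve points and the area under the curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```

Here one of the two positive samples scores below a negative sample, so the AUC is 0.75 rather than 1.0.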

plot_decision_tree

Plots the confusion matrix and decision tree for the decision tree results.
Input: tree_result (dict) - The result dictionary from decision tree analysis.
Input: output_path (str) - The path where the decision tree plot will be saved.
Input: output_name (str) - The name of the output decision tree plot file (default is "decision_tree").
Input: show (bool) - Whether to display the decision tree plot (default is False).
Output: None - Displays a confusion matrix and decision tree plot for the decision tree results.
Code: Click

plot_feature_importance

Calculates feature importance using the specified method and displays a plot of the results.
Input: algorithm_result (dict) - The result dictionary from the algorithm analysis.
Input: output_path (str) - The path where the feature importance plot will be saved.
Input: output_name (str) - The name of the output feature importance plot file (default is "feature_importance").
Input: method (str) - The method to use for calculating feature importance ('logistic_regression' or 'forest_classification').
Input: show (bool) - Whether to display the feature importance plot (default is False).
Output: None - Displays a plot of feature importance based on the specified method.
Code: Click

plot_perm_importance

Calculates permutation importance for the specified algorithm results and displays a plot of the results.
Input: algorithm_result (dict) - The result dictionary from the algorithm analysis.
Input: output_path (str) - The path where the permutation importance plot will be saved.
Input: output_name (str) - The name of the output permutation importance plot file (default is "permutation_importance").
Input: show (bool) - Whether to display the permutation importance plot (default is False).
Output: None - Displays a plot of permutation importance for the specified algorithm results.
Code: Click
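
The computation behind the plot could be sketched with sklearn.inspection.permutation_importance. The dictionary keys "model", "X_test", and "y_test", the number of repeats, and the helper name are assumptions for illustration; plotting is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def perm_importance_sketch(algorithm_result: dict, n_repeats: int = 20) -> pd.Series:
    """Mean drop in test score when each feature is shuffled, sorted descending."""
    perm = permutation_importance(
        algorithm_result["model"],
        algorithm_result["X_test"],
        algorithm_result["y_test"],
        n_repeats=n_repeats,
        random_state=0,
    )
    features = algorithm_result["X_test"].columns
    return pd.Series(perm.importances_mean, index=features).sort_values(ascending=False)


# Hypothetical toy data: samples as rows, gene_0 separates the two classes.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(60, 8)), columns=[f"gene_{i}" for i in range(8)])
y = np.array([0] * 30 + [1] * 30)
X.loc[y == 1, "gene_0"] += 5.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
ranking = perm_importance_sketch({"model": model, "X_test": X_test, "y_test": y_test})
```

Because the importance is measured on held-out samples, it reflects what the model actually relies on for prediction, which makes it a useful check on whether top-ranked genes track biology or batch.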


R

Packages
