
🎧 Spotify End-To-End Azure Data Engineering Project

📌 Project Overview

This project demonstrates a complete end-to-end data engineering pipeline built on the Microsoft Azure and Databricks ecosystem using the Medallion Architecture (Bronze → Silver → Gold).

The pipeline ingests data from Azure SQL Database and GitHub, orchestrates it with Azure Data Factory, transforms and models it using Databricks and Delta Live Tables, and finally exposes curated datasets through a Data Warehouse layer.

🧰 Tech Stack

☁️ Cloud & Storage

  • Azure Cloud
  • Azure Data Lake Storage (ADLS Gen2)

🔄 Data Ingestion & Orchestration

  • Azure Data Factory
  • Logic Apps
  • Incremental Pipelines
  • Backfilling Pipelines
  • Loop-based Dynamic Pipelines

⚡ Data Processing

  • Azure Databricks
  • Spark Structured Streaming
  • Databricks Autoloader
  • PySpark
  • Python Utilities
  • Metadata Driven Pipelines (Jinja2)

🧱 Data Modeling

  • Unity Catalog
  • Star Schema
  • Slowly Changing Dimensions (SCD Type 2)
  • Delta Live Tables (DLT)

🚀 DevOps & CI/CD

  • GitHub
  • Databricks Asset Bundles
  • Git Branch Collaboration

✨ Project Features

  • ✅ Incremental Data Processing
  • ✅ Backfilling Support
  • ✅ Real-time Stream Processing
  • ✅ Metadata Driven Code
  • ✅ Dynamic Pipeline Execution
  • ✅ Star Schema Data Modeling
  • ✅ CI/CD Integration
  • ✅ Custom PySpark Utilities

Project Architecture

Project architecture diagram

This project covers data ingestion with Azure Data Factory, where I have built real-time dynamic pipelines with backfilling capabilities. I used Azure Databricks with Spark Structured Streaming and Autoloader for big data processing, along with custom Python utilities. I built slowly changing dimensions and a star schema using DLT and Lakeflow pipelines. I also used Databricks Asset Bundles to push and deploy my code to GitHub.

📊 Data Flow Explanation

🔹 Data Sources

  • Azure SQL Database
  • GitHub (Static Files)

🥉 Bronze Layer (Raw Data)

👉 Implemented using Azure Data Factory

  • Incremental ingestion from SQL Database
  • Backfilling support
  • Dynamic pipeline execution
  • CDC-based loading
  • Stored data in ADLS in Parquet format

🥈 Silver Layer (Cleaned & Transformed Data)

👉 Implemented using Azure Databricks

  • Unity Catalog & Metastore configuration
  • External locations & credential setup
  • Spark Structured Streaming using Autoloader
  • PySpark transformations
  • Metadata-driven joins using Jinja2

🥇 Gold Layer (Business Ready Data)

👉 Implemented using Delta Live Tables

  • Star Schema implementation
  • SCD Type 2 implementation
  • Final curated datasets

🏬 Warehouse Layer

  • Exposes analytical datasets
  • Provides endpoints for downstream applications
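
Downstream consumers simply query the curated Gold tables. A minimal sketch of such a star-schema query, assuming hypothetical catalog and table names (`spotify_catalog.gold.fact_stream`, `dim_track`) that stand in for the project's actual Gold objects:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Join the (hypothetical) fact table to a dimension for a simple "top tracks" report.
top_tracks = spark.sql("""
    SELECT t.track_name, COUNT(*) AS plays
    FROM spotify_catalog.gold.fact_stream f
    JOIN spotify_catalog.gold.dim_track t
      ON f.track_id = t.track_id
    GROUP BY t.track_name
    ORDER BY plays DESC
    LIMIT 10
""")
top_tracks.show()
```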

☁️ Azure Infrastructure Setup

🔹 Resources Used

  • Access Connector for Azure Databricks
  • Azure Databricks
  • Azure Data Factory
  • Logic Apps
  • Azure SQL Database
  • Azure SQL Server
  • Azure Data Lake Storage (ADLS Gen2)
  • Outlook API Connection

Azure Resource Group


πŸ—„οΈ Storage Layer (ADLS Gen2)

Containers

ADLS Containers

Bronze Layer Structure

Bronze Layer


🔄 Azure Data Factory Implementation

Git configuration with Azure Data Factory

Git Configuration with ADF

Linked Services for connecting ADF with SQL database and ADLS

Connecting ADF with SQL database and ADLS

Dynamic pipeline to incrementally load the raw data from the SQL Database to ADLS (Bronze layer)

Pipeline for incrementally loading the raw data from the SQL database to ADLS (Bronze layer)

Pipeline Features

  • Incremental Loading
  • Backfilling Support

Datasets Used

  1. azure sql - Defines the location of the data in the SQL Database
  2. json_dynamic - Dynamically defines the location of the JSON data in ADLS
  3. parquet_dynamic - Dynamically defines the location of the Parquet data in ADLS

Pipeline Parameters

Parameters used in the pipeline - incremental_ingestion

Pipeline Variables

Variables used in the pipeline - incremental_ingestion
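
Conceptually, the incremental pipeline follows the usual watermark pattern: look up the last loaded value, pull only newer rows from the SQL source, land them in a parameterised folder in the Bronze container, and advance the watermark (backfilling is just the same logic run with an older watermark). A minimal Python sketch of that logic, assuming illustrative JDBC, table, and column names; the project itself implements this with ADF Lookup/Copy activities over the dynamic datasets above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection and sink locations -- the real pipeline receives these as ADF parameters.
jdbc_url   = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
bronze_dir = "abfss://bronze@<storage>.dfs.core.windows.net/spotify"

def incremental_load(table: str, watermark_col: str, last_watermark: str) -> str:
    """Copy only rows newer than the stored watermark into the Bronze layer."""
    # Push the incremental filter down to the SQL source (what the ADF Copy activity does).
    query = f"(SELECT * FROM {table} WHERE {watermark_col} > '{last_watermark}') src"
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", query)
          .load())

    # Dynamic sink path, equivalent to the parameterised parquet_dynamic dataset.
    df.write.mode("append").parquet(f"{bronze_dir}/{table}")

    # New watermark to persist for the next (or a backfilled) run.
    new_wm = df.agg(F.max(watermark_col)).first()[0]
    return str(new_wm) if new_wm is not None else last_watermark

# Example: a backfill is simply a call with an older watermark value.
# new_watermark = incremental_load("dbo.stream_events", "updated_at", "2024-01-01")
```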

Clone of the same pipeline with a loop and alerts (using a Logic App)

Pipeline for incrementally loading the raw data from the SQL database to ADLS (Bronze layer) using a loop

Pipeline Parameters

Parameter used in the pipeline - incremental_loop

Pipeline Variables

Variables used in the pipeline - incremental_loop

Logic Apps Integration

Used for sending notifications via Outlook when a pipeline fails. Logic App designer page

Setting the trigger - When an HTTP request is received

Setting the action - Send an Email (V2)
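
On failure, the ADF pipeline posts a small JSON payload to the Logic App's "When an HTTP request is received" trigger, and the Logic App forwards it through the Send an Email (V2) Outlook action. A hedged sketch of the kind of call the Web activity makes (the trigger URL and payload field names below are placeholders, not the project's actual values):

```python
import requests

# Placeholder: the real URL is generated by the Logic App's HTTP trigger.
logic_app_url = "https://<region>.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke"

# Example failure payload; field names must match the trigger's request-body schema.
payload = {
    "pipeline_name": "incremental_loop",
    "run_id": "example-run-id",
    "error_message": "Copy activity failed for table dbo.stream_events",
}

resp = requests.post(logic_app_url, json=payload, timeout=30)
resp.raise_for_status()  # the Logic App returns success once the email action is queued
```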


Databricks Implementation

Workspace Overview

Databricks Dashboard

Access connector Configuration

Access connector for databricks dashboard

Role Assignment

In ADLS, I added a role assignment (Storage Blob Data Contributor) to the managed identity ("AccessConnectorForDatabricks_SpotifyProject") so that Databricks can access the ADLS data through this access connector.

Storage Blob Data Contributor role assignment to the Databricks access connector


Unity Catalog Setup

Created a dedicated container named "databricksmetastore" in ADLS for the Databricks metastore.

Deleted the default metastore and created a new metastore.

Associated the workspace with the new metastore.

Created a catalog in the Databricks workspace.
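
Once the workspace is attached to the new metastore, the catalog and layer schemas can be created with plain Unity Catalog SQL. A minimal sketch, assuming hypothetical names (`spotify_catalog`, `silver`, `gold`) that stand in for the objects shown in the screenshots:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical catalog/schema names -- adjust to the ones in the workspace.
spark.sql("CREATE CATALOG IF NOT EXISTS spotify_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS spotify_catalog.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS spotify_catalog.gold")
```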


External Locations & Credentials

Created a storage credential and external locations for the ADLS Bronze, Silver, and Gold layers. Credential creation. External locations
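
External locations bind the storage credential (backed by the access connector's managed identity) to the ADLS containers. A hedged SQL sketch, with placeholder credential and storage-account names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder credential/storage-account names; the credential itself is backed by
# the AccessConnectorForDatabricks_SpotifyProject managed identity.
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_ext
        URL 'abfss://{layer}@<storageaccount>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL spotify_credential)
    """)
```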


🔄 Streaming & Transformations

Databricks Asset Bundle

Databricks Asset Bundle

Autoloader + PySpark Transformations

Created a notebook that uses Autoloader to load the data as a stream and applies PySpark transformations to clean it.

PySpark transformation screenshots (1–11)
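
A minimal sketch of the Autoloader + PySpark pattern used in the notebook, assuming hypothetical paths, column names, and target table; the actual transformations are the ones shown in the screenshots:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations: the Bronze folder written by ADF plus schema/checkpoint paths.
bronze_path     = "abfss://bronze@<storage>.dfs.core.windows.net/spotify/stream_events"
schema_path     = "abfss://silver@<storage>.dfs.core.windows.net/_schemas/stream_events"
checkpoint_path = "abfss://silver@<storage>.dfs.core.windows.net/_checkpoints/stream_events"

# Autoloader (cloudFiles) incrementally discovers new Parquet files as ADF drops them.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "parquet")
       .option("cloudFiles.schemaLocation", schema_path)
       .load(bronze_path))

# Illustrative PySpark transformations (column names are assumptions).
cleaned = (raw
           .dropDuplicates(["event_id"])
           .withColumn("event_date", F.to_date("event_ts"))
           .filter(F.col("user_id").isNotNull()))

# Write the cleaned stream into the Silver schema of the Unity Catalog.
(cleaned.writeStream
 .option("checkpointLocation", checkpoint_path)
 .trigger(availableNow=True)   # process the current backlog, then stop
 .toTable("spotify_catalog.silver.factstream"))
```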

Custom Utility Functions

Created a utility file with a reusable method for deleting columns.
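
A sketch of what such a reusable column-dropping helper could look like (the project's actual implementation may differ):

```python
from typing import List
from pyspark.sql import DataFrame

def drop_columns(df: DataFrame, columns: List[str]) -> DataFrame:
    """Drop the given columns from the DataFrame, ignoring any that are not present."""
    existing = [c for c in columns if c in df.columns]
    return df.drop(*existing)

# Example usage inside a notebook (column names are illustrative):
# cleaned = drop_columns(raw_df, ["_rescued_data", "ingest_file_name"])
```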


Silver Layer Tables

Wrote the processed tables to the Databricks catalog inside the Silver schema. Catalog - Silver Schema - Tables


🧠 Metadata Driven Pipelines (Jinja2)

Created another notebook named "jinja_notebook" and used Jinja2 to dynamically apply joins between the FactStream, dimuser, and dimtrack tables in the Silver schema of the Databricks Unity Catalog.

Jinja notebook screenshots (1–5)
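
A minimal sketch of the metadata-driven join idea: the join configuration lives in metadata, and Jinja2 renders it into SQL that is then executed with spark.sql. The catalog name and key columns below are assumptions; the table names come from the Silver schema described above:

```python
from jinja2 import Template

# Metadata describing the joins; in the project this would come from a config, not be hard-coded.
join_config = {
    "fact": "spotify_catalog.silver.factstream",
    "dims": [
        {"table": "spotify_catalog.silver.dimuser",  "key": "user_id"},
        {"table": "spotify_catalog.silver.dimtrack", "key": "track_id"},
    ],
}

# Jinja2 template that renders one LEFT JOIN per dimension listed in the metadata.
sql_template = Template("""
SELECT f.*
FROM {{ fact }} f
{% for dim in dims %}
LEFT JOIN {{ dim.table }} d{{ loop.index }}
       ON f.{{ dim.key }} = d{{ loop.index }}.{{ dim.key }}
{% endfor %}
""")

rendered_sql = sql_template.render(**join_config)
print(rendered_sql)
# In Databricks the rendered statement would then be run with spark.sql(rendered_sql).
```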

🥇 Gold Layer Implementation (DLT)

Star Schema + SCD Type 2

Created a new ETL pipeline in Databricks using DLT to build the star schema and slowly changing dimensions.

ETL pipeline using DLT screenshots (1–5)
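
The SCD Type 2 dimensions are the standard DLT APPLY CHANGES pattern. A hedged sketch, assuming hypothetical source/target table names and an `updated_at` sequencing column (this code runs inside a DLT pipeline, where `spark` is provided):

```python
import dlt
from pyspark.sql import functions as F

# Source view over the Silver user dimension feed (table name is an assumption).
@dlt.view
def users_source():
    return spark.readStream.table("spotify_catalog.silver.dimuser")

# Target streaming table that will hold the SCD Type 2 history.
dlt.create_streaming_table("dim_user_scd2")

# APPLY CHANGES handles the SCD Type 2 bookkeeping (__START_AT / __END_AT columns).
dlt.apply_changes(
    target="dim_user_scd2",
    source="users_source",
    keys=["user_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,
)
```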

DAG View for the ETL Pipeline

DAG for Gold Pipeline

SCD Type 2 Output

SCD Type 2 successfully implemented in the Gold layer data.


Gold Schema Tables

All tables were successfully written to the Gold schema in the catalog. Gold layer schema tables


🚀 CI/CD Using Databricks Asset Bundles

Deployed the asset bundle to the dev environment, which created the .bundle folder in the workspace.

Bundle Configuration - databricks.yml

databricks yml file

Bundle Deployment

bundle file

🎯 Final Outcome

  • Fully automated Azure Data Engineering pipeline
  • Real-time streaming ingestion
  • Enterprise-grade medallion architecture
  • Production-ready CI/CD implementation
  • Scalable metadata-driven framework

πŸ‘¨β€πŸ’» Author

Kunal Kumar Das
