
🎧 Spotify End-To-End Azure Data Engineering Project

📌 Project Overview

This project demonstrates a complete end-to-end data engineering pipeline built on the Microsoft Azure and Databricks ecosystem using the Medallion Architecture (Bronze → Silver → Gold).

The pipeline ingests data from Azure SQL Database and GitHub, orchestrates it with Azure Data Factory, transforms and models it using Databricks and Delta Live Tables, and finally exposes curated datasets through a Data Warehouse layer.

🧰 Tech Stack

☁️ Cloud & Storage

  • Azure Cloud
  • Azure Data Lake Storage (ADLS Gen2)

🔄 Data Ingestion & Orchestration

  • Azure Data Factory
  • Logic Apps
  • Incremental Pipelines
  • Backfilling Pipelines
  • Loop-based Dynamic Pipelines

⚡ Data Processing

  • Azure Databricks
  • Spark Structured Streaming
  • Databricks Autoloader
  • PySpark
  • Python Utilities
  • Metadata Driven Pipelines (Jinja2)

🧱 Data Modeling

  • Unity Catalog
  • Star Schema
  • Slowly Changing Dimensions (SCD Type 2)
  • Delta Live Tables (DLT)

🚀 DevOps & CI/CD

  • GitHub
  • Databricks Asset Bundles
  • Git Branch Collaboration

✨ Project Features

  • ✅ Incremental Data Processing
  • ✅ Backfilling Support
  • ✅ Real-time Stream Processing
  • ✅ Metadata Driven Code
  • ✅ Dynamic Pipeline Execution
  • ✅ Star Schema Data Modeling
  • ✅ CI/CD Integration
  • ✅ Custom PySpark Utilities

Project Architecture

Project architecture diagram

This project covers data ingestion with Azure Data Factory, where I have built real-time dynamic pipelines with backfilling capabilities. I used Azure Databricks with Spark Structured Streaming and Autoloader for big data processing, along with custom Python utilities. I built slowly changing dimensions and a star schema using DLT and Lakeflow pipelines. I also used Databricks Asset Bundles to push and deploy my code to GitHub.

📊 Data Flow Explanation

🔹 Data Sources

  • Azure SQL Database
  • GitHub (Static Files)

🥉 Bronze Layer (Raw Data)

👉 Implemented using Azure Data Factory

  • Incremental ingestion from SQL Database
  • Backfilling support
  • Dynamic pipeline execution
  • CDC-based loading
  • Stored data in ADLS in Parquet format

🥈 Silver Layer (Cleaned & Transformed Data)

👉 Implemented using Azure Databricks

  • Unity Catalog & Metastore configuration
  • External locations & credential setup
  • Spark Structured Streaming using Autoloader
  • PySpark transformations
  • Metadata-driven joins using Jinja2

🥇 Gold Layer (Business Ready Data)

👉 Implemented using Delta Live Tables

  • Star Schema implementation
  • SCD Type 2 implementation
  • Final curated datasets

🏬 Warehouse Layer

  • Exposes analytical datasets
  • Provides endpoints for downstream applications
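
Downstream consumers simply query the curated Gold tables. A minimal sketch of such a star-schema query, assuming hypothetical catalog and table names (`spotify_catalog.gold.fact_stream`, `dim_track`) that stand in for the project's actual Gold objects:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Join the (hypothetical) fact table to a dimension for a simple "top tracks" report.
top_tracks = spark.sql("""
    SELECT t.track_name, COUNT(*) AS plays
    FROM spotify_catalog.gold.fact_stream f
    JOIN spotify_catalog.gold.dim_track t
      ON f.track_id = t.track_id
    GROUP BY t.track_name
    ORDER BY plays DESC
    LIMIT 10
""")
top_tracks.show()
```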

☁️ Azure Infrastructure Setup

🔹 Resources Used

  • Access Connector for Azure Databricks
  • Azure Databricks
  • Azure Data Factory
  • Logic Apps
  • Azure SQL Database
  • Azure SQL Server
  • Azure Data Lake Storage (ADLS Gen2)
  • Outlook API Connection

Azure Resource Group


πŸ—„οΈ Storage Layer (ADLS Gen2)

Containers

ADLS Containers

Bronze Layer Structure

Bronze Layer


🔄 Azure Data Factory Implementation

Git configuration with Azure Data Factory

Git Configuration with ADF

Linked Services for connecting ADF with SQL database and ADLS

Connecting ADF with SQL database and ADLS

Dynamic pipeline to incrementally load the raw data from the SQL Database to ADLS (Bronze layer)

Pipeline for incrementally loading the raw data from the SQL database to ADLS (Bronze layer)

Pipeline Features

  • Incremental Loading
  • Backfilling Support

Datasets Used

  1. azure sql - Defines the location of the data in the SQL Database
  2. json_dynamic - Dynamically defines the location of the JSON data in ADLS
  3. parquet_dynamic - Dynamically defines the location of the Parquet data in ADLS

Pipeline Parameters

Parameters used in the pipeline - incremental_ingestion

Pipeline Variables

Variables used in the pipeline - incremental_ingestion
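
Conceptually, the incremental pipeline follows the usual watermark pattern: look up the last loaded value, pull only newer rows from the SQL source, land them in a parameterised folder in the Bronze container, and advance the watermark (backfilling is just the same logic run with an older watermark). A minimal Python sketch of that logic, assuming illustrative JDBC, table, and column names; the project itself implements this with ADF Lookup/Copy activities over the dynamic datasets above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection and sink locations -- the real pipeline receives these as ADF parameters.
jdbc_url   = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
bronze_dir = "abfss://bronze@<storage>.dfs.core.windows.net/spotify"

def incremental_load(table: str, watermark_col: str, last_watermark: str) -> str:
    """Copy only rows newer than the stored watermark into the Bronze layer."""
    # Push the incremental filter down to the SQL source (what the ADF Copy activity does).
    query = f"(SELECT * FROM {table} WHERE {watermark_col} > '{last_watermark}') src"
    df = (spark.read.format("jdbc")
          .option("url", jdbc_url)
          .option("dbtable", query)
          .load())

    # Dynamic sink path, equivalent to the parameterised parquet_dynamic dataset.
    df.write.mode("append").parquet(f"{bronze_dir}/{table}")

    # New watermark to persist for the next (or a backfilled) run.
    new_wm = df.agg(F.max(watermark_col)).first()[0]
    return str(new_wm) if new_wm is not None else last_watermark

# Example: a backfill is simply a call with an older watermark value.
# new_watermark = incremental_load("dbo.stream_events", "updated_at", "2024-01-01")
```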

Clone of the same pipeline with a loop and alerts (using a Logic App)

Pipeline for incrementally loading the raw data from the SQL database to ADLS (Bronze layer) using a loop

Pipeline Parameters

Parameter used in the pipeline - incremental_loop

Pipeline Variables

Variables used in the pipeline - incremental_loop

Logic Apps Integration

Used for sending notifications via Outlook when a pipeline fails. Logic App designer page

Setting the trigger - When an HTTP request is received

Setting the action - Send an Email (V2)
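
On failure, the ADF pipeline posts a small JSON payload to the Logic App's "When an HTTP request is received" trigger, and the Logic App forwards it through the Send an Email (V2) Outlook action. A hedged sketch of the kind of call the Web activity makes (the trigger URL and payload field names below are placeholders, not the project's actual values):

```python
import requests

# Placeholder: the real URL is generated by the Logic App's HTTP trigger.
logic_app_url = "https://<region>.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke"

# Example failure payload; field names must match the trigger's request-body schema.
payload = {
    "pipeline_name": "incremental_loop",
    "run_id": "example-run-id",
    "error_message": "Copy activity failed for table dbo.stream_events",
}

resp = requests.post(logic_app_url, json=payload, timeout=30)
resp.raise_for_status()  # the Logic App returns success once the email action is queued
```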


Databricks Implementation

Workspace Overview

Databricks Dashboard

Access connector Configuration

Access connector for databricks dashboard

Role Assignment

In ADLS, I added a role assignment (Storage Blob Data Contributor) to the managed identity ("AccessConnectorForDatabricks_SpotifyProject") so that Databricks can access the ADLS data through this access connector.

Storage Blob Data Contributor role assignment to the Databricks access connector


Unity Catalog Setup

Created a dedicated container named "databricksmetastore" in ADLS for the Databricks metastore.

Deleted the default metastore and created a new metastore.

Associated the workspace with the new metastore.

Created a catalog in the Databricks workspace.
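
Once the workspace is attached to the new metastore, the catalog and layer schemas can be created with plain Unity Catalog SQL. A minimal sketch, assuming hypothetical names (`spotify_catalog`, `silver`, `gold`) that stand in for the objects shown in the screenshots:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical catalog/schema names -- adjust to the ones in the workspace.
spark.sql("CREATE CATALOG IF NOT EXISTS spotify_catalog")
spark.sql("CREATE SCHEMA IF NOT EXISTS spotify_catalog.silver")
spark.sql("CREATE SCHEMA IF NOT EXISTS spotify_catalog.gold")
```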


External Locations & Credentials

Created a storage credential and external locations for the ADLS Bronze, Silver, and Gold layers. Credential creation. External locations
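
External locations bind the storage credential (backed by the access connector's managed identity) to the ADLS containers. A hedged SQL sketch, with placeholder credential and storage-account names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder credential/storage-account names; the credential itself is backed by
# the AccessConnectorForDatabricks_SpotifyProject managed identity.
for layer in ("bronze", "silver", "gold"):
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_ext
        URL 'abfss://{layer}@<storageaccount>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL spotify_credential)
    """)
```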


🔄 Streaming & Transformations

Databricks Asset Bundle

Databricks Asset Bundle

Autoloader + PySpark Transformations

Created a notebook that uses Autoloader to load the data as a stream and applies PySpark transformations to clean it.

PySpark transformation screenshots (1–11)
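
A minimal sketch of the Autoloader + PySpark pattern used in the notebook, assuming hypothetical paths, column names, and target table; the actual transformations are the ones shown in the screenshots:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations: the Bronze folder written by ADF plus schema/checkpoint paths.
bronze_path     = "abfss://bronze@<storage>.dfs.core.windows.net/spotify/stream_events"
schema_path     = "abfss://silver@<storage>.dfs.core.windows.net/_schemas/stream_events"
checkpoint_path = "abfss://silver@<storage>.dfs.core.windows.net/_checkpoints/stream_events"

# Autoloader (cloudFiles) incrementally discovers new Parquet files as ADF drops them.
raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "parquet")
       .option("cloudFiles.schemaLocation", schema_path)
       .load(bronze_path))

# Illustrative PySpark transformations (column names are assumptions).
cleaned = (raw
           .dropDuplicates(["event_id"])
           .withColumn("event_date", F.to_date("event_ts"))
           .filter(F.col("user_id").isNotNull()))

# Write the cleaned stream into the Silver schema of the Unity Catalog.
(cleaned.writeStream
 .option("checkpointLocation", checkpoint_path)
 .trigger(availableNow=True)   # process the current backlog, then stop
 .toTable("spotify_catalog.silver.factstream"))
```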

Custom Utility Functions

Created a utility file with a reusable method for deleting columns.
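
A sketch of what such a reusable column-dropping helper could look like (the project's actual implementation may differ):

```python
from typing import List
from pyspark.sql import DataFrame

def drop_columns(df: DataFrame, columns: List[str]) -> DataFrame:
    """Drop the given columns from the DataFrame, ignoring any that are not present."""
    existing = [c for c in columns if c in df.columns]
    return df.drop(*existing)

# Example usage inside a notebook (column names are illustrative):
# cleaned = drop_columns(raw_df, ["_rescued_data", "ingest_file_name"])
```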


Silver Layer Tables

Wrote the processed tables to the Databricks catalog inside the Silver schema. Catalog - Silver Schema - Tables


🧠 Metadata Driven Pipelines (Jinja2)

Created another notebook named "jinja_notebook" and used Jinja2 to dynamically apply joins between the FactStream, dimuser, and dimtrack tables in the Silver schema of the Databricks Unity Catalog.

Jinja notebook screenshots (1–5)
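
A minimal sketch of the metadata-driven join idea: the join configuration lives in metadata, and Jinja2 renders it into SQL that is then executed with spark.sql. The catalog name and key columns below are assumptions; the table names come from the Silver schema described above:

```python
from jinja2 import Template

# Metadata describing the joins; in the project this would come from a config, not be hard-coded.
join_config = {
    "fact": "spotify_catalog.silver.factstream",
    "dims": [
        {"table": "spotify_catalog.silver.dimuser",  "key": "user_id"},
        {"table": "spotify_catalog.silver.dimtrack", "key": "track_id"},
    ],
}

# Jinja2 template that renders one LEFT JOIN per dimension listed in the metadata.
sql_template = Template("""
SELECT f.*
FROM {{ fact }} f
{% for dim in dims %}
LEFT JOIN {{ dim.table }} d{{ loop.index }}
       ON f.{{ dim.key }} = d{{ loop.index }}.{{ dim.key }}
{% endfor %}
""")

rendered_sql = sql_template.render(**join_config)
print(rendered_sql)
# In Databricks the rendered statement would then be run with spark.sql(rendered_sql).
```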

🥇 Gold Layer Implementation (DLT)

Star Schema + SCD Type 2

Created a new ETL pipeline in Databricks using DLT to build the star schema and slowly changing dimensions.

ETL pipeline using DLT screenshots (1–5)
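
The SCD Type 2 dimensions are the standard DLT APPLY CHANGES pattern. A hedged sketch, assuming hypothetical source/target table names and an `updated_at` sequencing column (this code runs inside a DLT pipeline, where `spark` is provided):

```python
import dlt
from pyspark.sql import functions as F

# Source view over the Silver user dimension feed (table name is an assumption).
@dlt.view
def users_source():
    return spark.readStream.table("spotify_catalog.silver.dimuser")

# Target streaming table that will hold the SCD Type 2 history.
dlt.create_streaming_table("dim_user_scd2")

# APPLY CHANGES handles the SCD Type 2 bookkeeping (__START_AT / __END_AT columns).
dlt.apply_changes(
    target="dim_user_scd2",
    source="users_source",
    keys=["user_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,
)
```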

DAG View for the ETL Pipeline

DAG for Gold Pipeline

SCD Type 2 Output

SCD Type 2 successfully implemented in the Gold layer data.


Gold Schema Tables

All tables were successfully written to the Gold schema in the catalog. Gold layer schema tables


🚀 CI/CD Using Databricks Asset Bundles

Deployed the asset bundle to the dev environment, which created the .bundle folder in the workspace.

Bundle Configuration - databricks.yml

databricks yml file

Bundle Deployment

bundle file

🎯 Final Outcome

  • Fully automated Azure Data Engineering pipeline
  • Real-time streaming ingestion
  • Enterprise-grade medallion architecture
  • Production-ready CI/CD implementation
  • Scalable metadata-driven framework

πŸ‘¨β€πŸ’» Author

Kunal Kumar Das
