This project demonstrates a complete end-to-end Data Engineering pipeline built on the Microsoft Azure + Databricks ecosystem using the Medallion Architecture (Bronze → Silver → Gold).
The pipeline ingests data from Azure SQL Database & GitHub, processes it using Azure Data Factory, transforms and models data using Databricks & Delta Live Tables, and finally exposes curated datasets through a Data Warehouse layer.
- Azure Cloud
- Azure Data Lake Storage (ADLS Gen2)
- Azure Data Factory
- Logic Apps
- Incremental Pipelines
- Backfilling Pipelines
- Loop-based Dynamic Pipelines
- Azure Databricks
- Spark Structured Streaming
- Databricks Autoloader
- PySpark
- Python Utilities
- Metadata-Driven Pipelines (Jinja2)
- Unity Catalog
- Star Schema
- Slowly Changing Dimensions (SCD Type 2)
- Delta Live Tables (DLT)
- GitHub
- Databricks Asset Bundles
- Git Branch Collaboration
- ✅ Incremental Data Processing
- ✅ Backfilling Support
- ✅ Real-time Stream Processing
- ✅ Metadata-Driven Code
- ✅ Dynamic Pipeline Execution
- ✅ Star Schema Data Modeling
- ✅ CI/CD Integration
- ✅ Custom PySpark Utilities
This project includes data ingestion using Azure Data Factory, where I built real-time, dynamic pipelines with backfilling capabilities. I used Azure Databricks with Spark Structured Streaming and Auto Loader for big-data processing, supported by custom Python utilities. I built slowly changing dimensions and a star schema using DLT and Lakeflow pipelines, and I used Databricks Asset Bundles to push and deploy my code to GitHub.
- Azure SQL Database
- GitHub (Static Files)
Implemented using Azure Data Factory
- Incremental ingestion from SQL Database (high-water-mark pattern; see the sketch after this list)
- Backfilling support
- Dynamic pipeline execution
- CDC-based loading
- Stored data in ADLS in Parquet format
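The incremental load follows the classic high-water-mark pattern: look up the last loaded watermark, copy only the rows that arrived after it, then advance the watermark. The ADF pipeline does this with Lookup and Copy activities; the PySpark sketch below is only a conceptual analogue, and all table, column, and path names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Look up the high-water mark from the previous run (hypothetical log table).
last_watermark = (
    spark.table("bronze.watermark_log")
    .agg(F.max("last_loaded_at"))
    .collect()[0][0]
)

# 2. Pull only the rows newer than the watermark (CDC-style incremental filter).
new_rows = (
    spark.read.table("source.fact_stream")
    .filter(F.col("updated_at") > F.lit(last_watermark))
)

# 3. Land the delta in the Bronze layer as Parquet, as the Copy activity does.
new_rows.write.mode("append").parquet(
    "abfss://bronze@<storage-account>.dfs.core.windows.net/fact_stream/"
)
```

Backfilling reuses the same pipeline with an explicitly supplied watermark range instead of the logged one.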
Implemented using Azure Databricks
- Unity Catalog & Metastore configuration
- External locations & credential setup
- Spark Structured Streaming using Autoloader
- PySpark transformations
- Metadata-driven joins using Jinja2
Implemented using Delta Live Tables
- Star Schema implementation
- SCD Type 2 implementation
- Final curated datasets
- Exposes analytical datasets
- Provides endpoints for downstream applications
- Access Connector for Azure Databricks
- Azure Databricks
- Azure Data Factory
- Logic Apps
- Azure SQL Database
- Azure SQL Server
- Azure Data Lake Storage (ADLS) Gen2
- Outlook API Connection
- Incremental Loading
- Backfilling Support
- azure sql - Defines the location of the data in the Azure SQL Database
- json_dynamic - Dynamically defines the location of JSON data in ADLS
- parquet_dynamic - Dynamically defines the location of Parquet-format data in ADLS
Used for sending notifications via Outlook when a pipeline fails.

Setting the trigger - When an HTTP request is received

Setting the action - Send an email (V2)
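
On failure, the ADF pipeline invokes the Logic App's HTTP trigger (via a Web activity) with details of the failed run. Below is a minimal Python sketch of an equivalent call; the URL and payload fields are placeholders.

```python
import requests

# Placeholder URL: the "When an HTTP request is received" trigger
# generates the real one once the Logic App is saved.
logic_app_url = "https://<region>.logic.azure.com/workflows/<workflow-id>/triggers/manual/paths/invoke"

# Hypothetical payload fields; the "Send an email (V2)" action reads
# these to build the notification body.
payload = {
    "pipeline_name": "incremental_ingestion",
    "error_message": "Copy activity failed",
}
requests.post(logic_app_url, json=payload, timeout=30)
```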

In ADLS, I added a role assignment (Storage Blob Data Contributor) to the managed identity "AccessConnectorForDatabricks_SpotifyProject" so that Databricks can access the ADLS data through this access connector.

Created a dedicated container named "databricksmetastore" in ADLS for Databricks

Deleted the default metastore and created a new metastore

Associated the workspace with the metastore

Created a catalog in the Databricks workspace

Created a storage credential and external locations for the ADLS Bronze, Silver, and Gold layers.
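
A sketch of the equivalent SQL, run from a notebook where `spark` is predefined; the credential name, location names, and storage-account URL are placeholders for the ones created here.

```python
# Assumes a storage credential backed by the Databricks access connector
# already exists; names and the URL below are placeholders.
for layer in ["bronze", "silver", "gold"]:
    spark.sql(f"""
        CREATE EXTERNAL LOCATION IF NOT EXISTS {layer}_location
        URL 'abfss://{layer}@<storage-account>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL access_connector_credential)
    """)
```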

Created a notebook - used Auto Loader to ingest the data as a stream and PySpark transformations to process it
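
A minimal sketch of the Auto Loader read, assuming JSON files landing in the Bronze container; paths, options, and the added column are illustrative.

```python
from pyspark.sql import functions as F

# Auto Loader ("cloudFiles") incrementally discovers new files in Bronze.
raw_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option(
        "cloudFiles.schemaLocation",
        "abfss://silver@<storage-account>.dfs.core.windows.net/_schemas/streams/",
    )
    .load("abfss://bronze@<storage-account>.dfs.core.windows.net/streams/")
)

# Example PySpark transformation: stamp each record with its load time.
clean_stream = raw_stream.withColumn("ingested_at", F.current_timestamp())
```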

Created a utility file with a reusable method to drop columns
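
A sketch of such a utility; the actual method name in the repo may differ.

```python
# utils.py - reusable helper to drop unwanted columns from any DataFrame.
from pyspark.sql import DataFrame

def drop_columns(df: DataFrame, columns: list[str]) -> DataFrame:
    """Drop each listed column if present, silently skipping the rest."""
    existing = [c for c in columns if c in df.columns]
    return df.drop(*existing)
```

Called from the streaming notebook, e.g. `clean_stream = drop_columns(clean_stream, ["_rescued_data"])`.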

Wrote the processed table to the Databricks catalog in the Silver schema.
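
Continuing the sketch above, the stream is written as a managed Delta table; the catalog and table names are assumed.

```python
# Write the transformed stream into the Silver schema of the Unity Catalog.
(
    clean_stream.writeStream
    .option(
        "checkpointLocation",
        "abfss://silver@<storage-account>.dfs.core.windows.net/_checkpoints/streams/",
    )
    .trigger(availableNow=True)  # drain everything available, then stop
    .toTable("spotify_catalog.silver.factstream")
)
```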

Created another notebook named "jinja_notebook" and used Jinja2 to dynamically apply joins between the FactStream, dimuser, and dimtrack tables in the Silver schema of the Databricks Unity Catalog.
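
A minimal sketch of the metadata-driven join: the join logic lives in a Jinja2 template and is rendered with table names supplied as metadata. The fully qualified names and join columns below are assumptions.

```python
from jinja2 import Template

# The join recipe is metadata; Jinja2 renders it into executable SQL.
join_template = Template("""
    SELECT f.*, u.user_name, t.track_name
    FROM {{ fact }} f
    JOIN {{ dim_user }} u ON f.user_id = u.user_id
    JOIN {{ dim_track }} t ON f.track_id = t.track_id
""")

sql = join_template.render(
    fact="spotify_catalog.silver.factstream",
    dim_user="spotify_catalog.silver.dimuser",
    dim_track="spotify_catalog.silver.dimtrack",
)
joined_df = spark.sql(sql)
```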

Created a new ETL pipeline in Databricks using DLT to build the star schema and slowly changing dimensions
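
A condensed sketch of the DLT pipeline for one dimension; the source table, key, and sequencing columns are assumptions.

```python
import dlt
from pyspark.sql import functions as F

# Stream the Silver dimension as the change feed for the Gold table.
@dlt.view
def dimuser_updates():
    return spark.readStream.table("spotify_catalog.silver.dimuser")

# Target table for the historized dimension in the Gold layer.
dlt.create_streaming_table("gold_dimuser")

# SCD Type 2: old versions are end-dated, new versions appended.
dlt.apply_changes(
    target="gold_dimuser",
    source="dimuser_updates",
    keys=["user_id"],
    sequence_by=F.col("updated_at"),
    stored_as_scd_type=2,
)
```

The fact table and remaining dimensions follow the same pattern, together forming the star schema in the Gold layer.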

SCD Type 2 successfully implemented on the Gold-layer data

All tables were successfully written to the Gold schema in the catalog

Deployed the asset bundle to the dev environment, which created the .bundle directory
- Fully automated Azure Data Engineering pipeline
- Real-time streaming ingestion
- Enterprise-grade medallion architecture
- Production-ready CI/CD implementation
- Scalable metadata-driven framework
Kunal Kumar Das



