Commit 9f71813: Merge pull request #1490 from azfar-2/main (Databricks Setup Guide)
1 file changed: Databricks Installation & Setup Guide, +197 -0 lines
<div align="center">

# ⚡ Databricks Setup on AWS EC2

[![AWS](https://img.shields.io/badge/AWS-EC2-orange?style=for-the-badge&logo=amazon-aws)](https://aws.amazon.com/)
[![Spark](https://img.shields.io/badge/Apache-Spark-E25A1C?style=for-the-badge&logo=apache-spark)](https://spark.apache.org/)
[![Python](https://img.shields.io/badge/Python-3.8+-3776AB?style=for-the-badge&logo=python)](https://python.org/)
[![Jupyter](https://img.shields.io/badge/Jupyter-Notebook-F37626?style=for-the-badge&logo=jupyter)](https://jupyter.org/)

*Databricks is a popular platform for **Big Data, Machine Learning, and Analytics**. While the official service is cloud-based, you can replicate the **core Databricks environment** (Spark + MLflow + Delta Lake) on an **AWS EC2 instance** for hands-on learning.* 🚀

</div>

---

## 📋 Table of Contents

- [🛠️ Prerequisites](#️-prerequisites)
- [⚡ Installation Guide](#-installation-guide)
- [🎯 What You Get](#-what-you-get)
- [📚 Next Steps](#-next-steps)
- [💡 Benefits](#-benefits)

---

## 🛠️ Prerequisites

| Requirement | Description |
|-------------|-------------|
| 🏗️ **AWS Account** | Active AWS account with EC2 access |
| 💻 **EC2 Instance** | Ubuntu 20.04 LTS (t2.medium or larger) |
| 🔑 **SSH Access** | Basic knowledge of Linux & SSH |
| 🧠 **Knowledge** | Familiarity with command-line operations |

---

## ⚡ Installation Guide

### 🚀 Step 1: Launch an EC2 Instance

> **💡 Pro Tip:** Choose the right instance type for your workload!

1. **Navigate to AWS Console** → EC2 → Launch Instance
2. **Select AMI:** Ubuntu 20.04 LTS
3. **Instance Type:** At least `t2.medium` (2 vCPU, 4 GB RAM)
4. **Security Group:** Allow SSH (22) and Jupyter (8888)

```bash
# Connect to your instance
ssh -i your-key.pem ubuntu@<EC2-PUBLIC-IP>
```
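
The security-group rules from step 4 can also be opened from the command line. A minimal sketch, assuming the AWS CLI is configured on your machine and `sg-0123456789abcdef0` is a placeholder for your own security group ID:

```bash
# Hypothetical security group ID; replace with your own
SG_ID=sg-0123456789abcdef0

# Allow SSH (22) and Jupyter (8888); for anything beyond a short-lived
# lab box, restrict --cidr to your own IP instead of 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 8888 --cidr 0.0.0.0/0
```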

---

### 🔧 Step 2: Update & Install Dependencies

```bash
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential dependencies
sudo apt install -y openjdk-11-jdk python3-pip git wget

# Verify the Java installation
java -version
```

---

### ⚙️ Step 3: Install Apache Spark

```bash
# Download Apache Spark
# (older releases move to https://archive.apache.org/dist/spark/)
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Extract and move to /opt
tar xvf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
```

**Configure Environment Variables:**

```bash
# Add Spark to PATH (single quotes keep $SPARK_HOME and $PATH from
# being expanded before they are written to ~/.bashrc)
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# Verify the installation
spark-shell --version
```

---

### 📊 Step 4: Install PySpark & Jupyter

```bash
# Install Python packages
pip3 install pyspark notebook pandas mlflow delta-spark
```
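
One caveat with the unpinned install above: `delta-spark` releases are built against specific Spark lines, so mismatched versions fail at runtime. A hedged variant that pins compatible versions, assuming the Spark 3.5.0 build from Step 3:

```bash
# Pin pyspark to the installed Spark version and pick a delta-spark
# release from the Delta 3.x line, which targets Spark 3.5
pip3 install "pyspark==3.5.0" "delta-spark==3.0.0" notebook pandas mlflow
```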

---

### 🔬 Step 5: Launch Jupyter Notebook

```bash
# Start Jupyter with external access
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
```

**Access your notebook at:**
```
🌐 http://<EC2-PUBLIC-IP>:8888
```
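
Exposing port 8888 to the internet is convenient for a lab box, but it leaves the notebook reachable by anyone who obtains the token. An alternative sketch: keep 8888 closed in the security group and tunnel over SSH instead, then browse to `http://localhost:8888`:

```bash
# Forward local port 8888 to Jupyter on the instance over SSH
ssh -i your-key.pem -L 8888:localhost:8888 ubuntu@<EC2-PUBLIC-IP>
```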

---

### ✅ Step 6: Test Your Setup

Create a new Python notebook and run this test code:

```python
from pyspark.sql import SparkSession

# Initialize Spark with Delta Lake support
spark = SparkSession.builder \
    .appName("EC2-Databricks-Clone") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

# Test basic functionality
data = spark.range(0, 10)
data.show()

print("🎉 Success! Your Databricks-like environment is ready!")
```

---

## 🎯 What You Get

<div align="center">

| Feature | Status | Description |
|---------|--------|-------------|
| ⚡ **Apache Spark** | ✅ | Distributed computing engine |
| 🔺 **Delta Lake** | ✅ | ACID transactions for big data |
| 🧪 **MLflow** | ✅ | ML lifecycle management |
| 📊 **Jupyter Notebook** | ✅ | Interactive development environment |
| 🐍 **PySpark** | ✅ | Python API for Spark |

</div>

---

## 📚 Next Steps

### 🚀 Advanced Configurations

```bash
# 1. Integrate with S3 for data storage
# 2. Use Terraform to automate EC2 + Spark setup
# 3. Connect MLflow tracking server to store models in S3
# 4. Set up cluster mode for distributed computing
```

### 🛠️ Recommended Enhancements

- [ ] **S3 Integration** - Store data and models in S3
- [ ] **Infrastructure as Code** - Use Terraform for automation
- [ ] **Monitoring** - Set up CloudWatch for instance monitoring
- [ ] **Security** - Configure VPC and proper IAM roles
- [ ] **Scaling** - Implement auto-scaling groups
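
As a sketch of the S3 integration item: Spark can read `s3a://` paths once the `hadoop-aws` connector is on the classpath. The connector version and bucket name below are illustrative assumptions (the connector should match the Hadoop 3 build bundled with Spark 3.5.x), and credentials are expected to come from an IAM role attached to the instance:

```bash
# Launch PySpark with the S3A connector pulled from Maven
pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4

# Then, inside the shell (hypothetical bucket and key):
#   spark.read.csv("s3a://your-bucket/data.csv", header=True).show()
```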

---

## 💡 Benefits

<div align="center">

🎯 **Perfect for Learning** • 💰 **Cost Control** • 🔧 **Full Customization** • 🚀 **DevOps Practice**

*This EC2 setup gives you the "Databricks feel" without using the managed workspace. Great for practice, DevOps automation, and understanding the underlying infrastructure!*

</div>

---

<div align="center">

**⭐ Star this repo if it helped you!** • **🐛 Report issues** • **🤝 Contribute**

Made with ❤️ by [Azfar](https://www.linkedin.com/in/md-azfar-alam/) for the Data Engineering Community

</div>
