<div align="center">

# ⚡ Databricks Setup on AWS EC2

[AWS](https://aws.amazon.com/) · [Apache Spark](https://spark.apache.org/) · [Python](https://python.org/) · [Jupyter](https://jupyter.org/)

*Databricks is a popular platform for **Big Data, Machine Learning, and Analytics**.
While the official service is cloud-based, you can replicate the **core Databricks environment** (Spark + MLflow + Delta Lake) on an **AWS EC2 instance** for hands-on learning.* 🚀

</div>

---

## 📋 Table of Contents
- [🛠️ Prerequisites](#️-prerequisites)
- [⚡ Installation Guide](#-installation-guide)
- [🎯 What You Get](#-what-you-get)
- [📚 Next Steps](#-next-steps)
- [💡 Benefits](#-benefits)

---

## 🛠️ Prerequisites

| Requirement | Description |
|-------------|-------------|
| 🏗️ **AWS Account** | Active AWS account with EC2 access |
| 💻 **EC2 Instance** | Ubuntu 20.04 LTS (t2.medium or larger) |
| 🔑 **SSH Access** | Basic knowledge of Linux & SSH |
| 🧠 **Knowledge** | Familiarity with command-line operations |

---

## ⚡ Installation Guide

### 🚀 Step 1: Launch an EC2 Instance

> **💡 Pro Tip:** Spark is memory-hungry, so treat `t2.medium` (2 vCPU, 4 GB RAM) as the practical minimum; a larger instance type runs noticeably smoother.

1. **Navigate to AWS Console** → EC2 → Launch Instance
2. **Select AMI:** Ubuntu 20.04 LTS
3. **Instance Type:** At least `t2.medium` (2 vCPU, 4GB RAM)
4. **Security Group:** Allow SSH (22) and Jupyter (8888), ideally restricted to your own IP

```bash
# Connect to your instance
ssh -i your-key.pem ubuntu@<EC2-PUBLIC-IP>
```

---

### 🔧 Step 2: Update & Install Dependencies

```bash
# Update system packages
sudo apt update && sudo apt upgrade -y

# Install essential dependencies
sudo apt install -y openjdk-11-jdk python3-pip git wget

# Verify Java installation
java -version
```

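Spark 3.x supports Java 8, 11, and 17. If you want this check to be scriptable rather than eyeballing the output, a small sketch like the following works; the sample line and the `sed` pattern assume the usual OpenJDK output format:

```shell
# Sketch: extract the Java major version from a `java -version`-style line.
# In a real script, capture it with: java -version 2>&1 | head -n 1
# (Note: Java 8 reports "1.8.0", so this yields 1 there.)
ver_line='openjdk version "11.0.20" 2023-07-18'   # sample line for illustration
major=$(echo "$ver_line" | sed -E 's/.*"([0-9]+)\..*/\1/')

if [ "$major" -ge 11 ]; then
  echo "Java $major detected: OK"
else
  echo "Java 11+ recommended, found major version $major" >&2
fi
```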
---

### ⚙️ Step 3: Install Apache Spark

```bash
# Download Apache Spark (older releases are moved off this mirror;
# try https://archive.apache.org/dist/spark/ if the link 404s)
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz

# Extract and move to /opt
tar xvf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
```

**Configure Environment Variables:**

```bash
# Add Spark to PATH (single quotes keep $SPARK_HOME literal in ~/.bashrc)
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$SPARK_HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

# Verify installation
spark-shell --version
```

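One quoting subtlety when appending `export` lines to `~/.bashrc`: inside double quotes the shell expands `$SPARK_HOME` immediately (before it has even been set), while single quotes write the literal text so expansion happens when `.bashrc` is sourced. A minimal demonstration:

```shell
# Simulate a fresh shell where SPARK_HOME is not set yet
unset SPARK_HOME

double="export PATH=$SPARK_HOME/bin:$PATH"   # $SPARK_HOME expands NOW (to nothing)
single='export PATH=$SPARK_HOME/bin:$PATH'   # literal text, expands later

echo "$double"   # broken line: begins with "export PATH=/bin:"
echo "$single"   # correct line to write into ~/.bashrc
```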
---

### 📊 Step 4: Install PySpark & Jupyter

```bash
# Install Python packages
pip3 install pyspark notebook pandas mlflow delta-spark
```

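Note that `delta-spark` major versions are tied to specific Spark releases (Delta Lake 3.0.x targets Spark 3.5.x), so unpinned installs can drift out of sync. A hypothetical `requirements.txt` with illustrative pins; verify them against the Delta Lake compatibility matrix before use:

```text
# requirements.txt (illustrative version pins, not authoritative)
pyspark==3.5.0
delta-spark==3.0.0
notebook
pandas
mlflow
```

Install with `pip3 install -r requirements.txt`.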
---

### 🔬 Step 5: Launch Jupyter Notebook

```bash
# Start Jupyter with external access
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
```

**Access your notebook at** (append the `?token=...` value Jupyter prints at startup):
```
🌐 http://<EC2-PUBLIC-IP>:8888
```

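Instead of repeating the flags on every launch, the same settings can live in a Jupyter config file. A sketch for the classic Notebook (pre-7); newer Jupyter Server reads `c.ServerApp.*` settings from `jupyter_server_config.py` instead:

```python
# ~/.jupyter/jupyter_notebook_config.py  (generate with: jupyter notebook --generate-config)
c = get_config()  # injected by Jupyter when the file is loaded

c.NotebookApp.ip = "0.0.0.0"        # listen on all interfaces
c.NotebookApp.port = 8888
c.NotebookApp.open_browser = False
# Leave token authentication enabled; disabling it on a public IP is unsafe.
```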
---

### ✅ Step 6: Test Your Setup

Create a new Python notebook and run this test code:

```python
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

# Build a Spark session with Delta Lake support; configure_spark_with_delta_pip
# puts the Delta jars on the classpath for the pip-installed delta-spark package
builder = SparkSession.builder \
    .appName("EC2-Databricks-Clone") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")

spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Test basic functionality
data = spark.range(0, 10)
data.show()

print("🎉 Success! Your Databricks-like environment is ready!")
```

---

## 🎯 What You Get

<div align="center">

| Feature | Status | Description |
|---------|--------|-------------|
| ⚡ **Apache Spark** | ✅ | Distributed computing engine |
| 🔺 **Delta Lake** | ✅ | ACID transactions for big data |
| 🧪 **MLflow** | ✅ | ML lifecycle management |
| 📊 **Jupyter Notebook** | ✅ | Interactive development environment |
| 🐍 **PySpark** | ✅ | Python API for Spark |

</div>

---

## 📚 Next Steps

### 🚀 Advanced Configurations

1. Integrate with S3 for data storage
2. Use Terraform to automate EC2 + Spark setup
3. Connect an MLflow tracking server that stores models in S3
4. Set up cluster mode for distributed computing

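For the S3 item, Spark talks to S3 through the Hadoop `s3a` connector. A sketch of the relevant `spark-defaults.conf` entries, assuming the `hadoop-aws` version matches the Hadoop 3.3.x build bundled with Spark 3.5 and that credentials come from the instance's IAM role (both are assumptions to verify for your setup):

```text
# /opt/spark/conf/spark-defaults.conf (illustrative)
spark.jars.packages   org.apache.hadoop:hadoop-aws:3.3.4
spark.hadoop.fs.s3a.aws.credentials.provider   com.amazonaws.auth.InstanceProfileCredentialsProvider
```

With this in place, paths like `s3a://my-bucket/data.parquet` (bucket name hypothetical) can be read directly with `spark.read.parquet(...)`.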
### 🛠️ Recommended Enhancements

- [ ] **S3 Integration** - Store data and models in S3
- [ ] **Infrastructure as Code** - Use Terraform for automation
- [ ] **Monitoring** - Set up CloudWatch for instance monitoring
- [ ] **Security** - Configure VPC and proper IAM roles
- [ ] **Scaling** - Implement auto-scaling groups

---

## 💡 Benefits

<div align="center">

🎯 **Perfect for Learning** • 💰 **Cost Control** • 🔧 **Full Customization** • 🚀 **DevOps Practice**

*This EC2 setup gives you the "Databricks feel" without the managed workspace.
Great for practice, DevOps automation, and understanding the underlying infrastructure!*

</div>

---

<div align="center">

**⭐ Star this repo if it helped you!** • **🐛 Report issues** • **🤝 Contribute**

Made with ❤️ by [Azfar](https://www.linkedin.com/in/md-azfar-alam/) for the Data Engineering Community

</div>