This project demonstrates the use of Apache Airflow to build a scalable data pipeline for processing YouTube video data. The pipeline automates tasks such as scraping data, preprocessing it, performing sentiment analysis on video descriptions, and generating insightful visualizations.
- Automated scraping of YouTube video data.
- Data preprocessing to clean and standardize data.
- Sentiment analysis using the Hugging Face Transformers library.
- Visualizations of key metrics such as top videos by view count and word clouds of video titles.
- Full pipeline orchestration using Apache Airflow.
project/
├── dags/
│   ├── analysis_dag.py       # DAG for sentiment analysis and visualizations
│   └── preprocessing_dag.py  # DAG for data preprocessing
├── data/
│   ├── scraping/             # Raw scraped data
│   ├── processed/            # Processed data
│   └── analysis/             # Analysis outputs (e.g., charts, sentiment results)
├── logs/                     # Airflow logs
├── config/                   # Configuration files (e.g., Airflow, environment variables)
├── report.pdf                # Detailed project report
├── requirements.txt          # Python dependencies
└── README.md                 # Project documentation
- Python 3.8 or higher
- Apache Airflow 2.x
- Docker (optional, for containerized setup)
- Required Python packages (listed in requirements.txt)
git clone https://github.com/hamzaben404/YouTube-Data-Analysis-with-Airflow
cd YouTube-Data-Analysis-with-Airflow
Install the necessary Python packages:
pip install -r requirements.txt
- Make sure Docker is installed on your machine.
- Start Airflow using Docker Compose:
docker-compose up --build
- Access the Airflow web interface at http://localhost:8080.
- Initialize Airflow:
airflow db init
- Start the Airflow webserver:
airflow webserver -p 8080
- Start the Airflow scheduler in another terminal:
airflow scheduler
Ensure that all necessary Airflow connections (e.g., API keys, database connections) are configured in the Airflow UI under "Admin > Connections."
- Trigger the preprocessing_dag to process raw scraped YouTube data.
- Once preprocessing is complete, the analysis_dag will automatically execute (or can be manually triggered) to perform sentiment analysis and generate visualizations.
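Under the hood, this hand-off between preprocessing and analysis comes down to passing file paths between tasks. The sketch below shows the pattern with framework-free callables (function and field names are illustrative, not the repo's actual code); with Airflow's PythonOperator, the return value of the first callable is pushed to XCom and handed to the next task as an argument.

```python
import json
import os


def preprocess(raw_rows, out_dir):
    """Clean raw rows, write them to disk, and return the output file path.

    In Airflow, this return value would be pushed to XCom so the downstream
    task receives an explicit path instead of guessing where the file lives.
    """
    cleaned = [row for row in raw_rows if row.get("title")]  # drop rows without titles
    path = os.path.join(out_dir, "processed.json")
    with open(path, "w") as f:
        json.dump(cleaned, f)
    return path


def analyze(processed_path):
    """Load the processed file via the path handed down from upstream."""
    with open(processed_path) as f:
        return len(json.load(f))
```

Keeping the callables free of Airflow imports like this also makes them trivial to unit-test outside the scheduler.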
- Processed Data: Stored in the data/processed/ directory.
- Sentiment Analysis Results: Stored as a CSV file in the data/analysis/ directory.
- Visualizations: Charts and word clouds saved in the data/analysis/ directory.
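For context, the "top videos by view count" chart reduces to a simple ranking step over the processed rows before plotting; a minimal sketch of that step (the view_count field name is an assumption about the processed schema, not confirmed by the repo):

```python
def top_videos_by_views(rows, n=10):
    """Return the n videos with the highest view count, most viewed first.

    `rows` is a list of dicts from the processed data; the "view_count"
    key is assumed here and may differ in the actual dataset.
    """
    return sorted(rows, key=lambda r: r.get("view_count", 0), reverse=True)[:n]
```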
- Handling Missing Data: Used robust data cleaning techniques in the preprocessing step.
- Sentiment Analysis: Truncated video descriptions to avoid token length errors in Hugging Face models.
- XCom Issues: Ensured proper data passing between tasks in Airflow DAGs by explicitly managing file paths.
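The truncation fix mentioned above can be as simple as capping the description length before it reaches the model. A hedged sketch (the 512-character cap is a conservative stand-in for the model's real limit, which is counted in tokens, and the helper name is illustrative):

```python
MAX_CHARS = 512  # conservative character cap; real models limit *tokens*, not characters


def truncate_description(text, max_chars=MAX_CHARS):
    """Trim a video description so a downstream sentiment model never overflows."""
    if not text:
        return ""
    return text[:max_chars]
```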
- Extend analysis to include metrics like comment sentiment and engagement rates.
- Implement periodic scheduling to keep data and visualizations updated.
- Incorporate advanced NLP models for deeper insights.
- Optimize pipeline performance and reduce DAG complexity.
We welcome contributions to enhance the project! To contribute:
- Fork the repository.
- Create a feature branch:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add your message here"
- Push your branch and submit a pull request.