This project demonstrates the use of Apache Airflow to build a scalable data pipeline for processing YouTube video data. The pipeline automates tasks such as scraping data, preprocessing it, performing sentiment analysis on video descriptions, and generating insightful visualizations.
- Automated scraping of YouTube video data.
- Data preprocessing to clean and standardize data.
- Sentiment analysis using the Hugging Face Transformers library.
- Visualizations of key metrics such as top videos by view count and word clouds of video titles.
- Full pipeline orchestration using Apache Airflow.
project/
├── dags/
│   ├── analysis_dag.py       # DAG for sentiment analysis and visualizations
│   └── preprocessing_dag.py  # DAG for data preprocessing
├── data/
│   ├── scraping/             # Raw scraped data
│   ├── processed/            # Processed data
│   └── analysis/             # Analysis outputs (e.g., charts, sentiment results)
├── logs/                     # Airflow logs
├── config/                   # Configuration files (e.g., Airflow, environment variables)
├── report.pdf                # Detailed project report
├── requirements.txt          # Python dependencies
└── README.md                 # Project documentation
- Python 3.8 or higher
- Apache Airflow 2.x
- Docker (optional, for containerized setup)
- Required Python packages (listed in requirements.txt)
git clone https://github.com/hamzaben404/YouTube-Data-Analysis-with-Airflow
cd YouTube-Data-Analysis-with-Airflow
Install the necessary Python packages:
pip install -r requirements.txt
- Make sure Docker is installed on your machine.
- Start Airflow using Docker Compose:
docker-compose up --build
- Access the Airflow web interface at http://localhost:8080.
- Initialize Airflow:
airflow db init
- Start the Airflow webserver:
airflow webserver -p 8080
- Start the Airflow scheduler in another terminal:
airflow scheduler
Ensure that all necessary Airflow connections (e.g., API keys, database connections) are configured in the Airflow UI under "Admin > Connections."
- Trigger the preprocessing_dag to process raw scraped YouTube data.
- Once preprocessing is complete, the analysis_dag will automatically execute (or can be manually triggered) to perform sentiment analysis and generate visualizations.
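Under the hood, this hand-off between preprocessing and analysis comes down to passing file paths between tasks. The sketch below shows the pattern with framework-free callables (function and field names are illustrative, not the repo's actual code); with Airflow's PythonOperator, the return value of the first callable is pushed to XCom and handed to the next task as an argument.

```python
import json
import os


def preprocess(raw_rows, out_dir):
    """Clean raw rows, write them to disk, and return the output file path.

    In Airflow, this return value would be pushed to XCom so the downstream
    task receives an explicit path instead of guessing where the file lives.
    """
    cleaned = [row for row in raw_rows if row.get("title")]  # drop rows without titles
    path = os.path.join(out_dir, "processed.json")
    with open(path, "w") as f:
        json.dump(cleaned, f)
    return path


def analyze(processed_path):
    """Load the processed file via the path handed down from upstream."""
    with open(processed_path) as f:
        return len(json.load(f))
```

Keeping the callables free of Airflow imports like this also makes them trivial to unit-test outside the scheduler.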
- Processed Data: Stored in the data/processed/ directory.
- Sentiment Analysis Results: Stored as a CSV file in the data/analysis/ directory.
- Visualizations: Charts and word clouds saved in the data/analysis/ directory.
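For context, the "top videos by view count" chart reduces to a simple ranking step over the processed rows before plotting; a minimal sketch of that step (the view_count field name is an assumption about the processed schema, not confirmed by the repo):

```python
def top_videos_by_views(rows, n=10):
    """Return the n videos with the highest view count, most viewed first.

    `rows` is a list of dicts from the processed data; the "view_count"
    key is assumed here and may differ in the actual dataset.
    """
    return sorted(rows, key=lambda r: r.get("view_count", 0), reverse=True)[:n]
```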
- Handling Missing Data: Used robust data cleaning techniques in the preprocessing step.
- Sentiment Analysis: Truncated video descriptions to avoid token length errors in Hugging Face models.
- XCom Issues: Ensured proper data passing between tasks in Airflow DAGs by explicitly managing file paths.
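The truncation fix mentioned above can be as simple as capping the description length before it reaches the model. A hedged sketch (the 512-character cap is a conservative stand-in for the model's real limit, which is counted in tokens, and the helper name is illustrative):

```python
MAX_CHARS = 512  # conservative character cap; real models limit *tokens*, not characters


def truncate_description(text, max_chars=MAX_CHARS):
    """Trim a video description so a downstream sentiment model never overflows."""
    if not text:
        return ""
    return text[:max_chars]
```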
- Extend analysis to include metrics like comment sentiment and engagement rates.
- Implement periodic scheduling to keep data and visualizations updated.
- Incorporate advanced NLP models for deeper insights.
- Optimize pipeline performance and reduce DAG complexity.
We welcome contributions to enhance the project! To contribute:
- Fork the repository.
- Create a feature branch:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add your message here"
- Push your branch and submit a pull request.