Welcome to the NLP-basics repository! It contains a collection of Jupyter notebooks that teach the foundational concepts of Natural Language Processing (NLP). Working through them step by step, you will learn key techniques such as tokenization, stemming, lemmatization, and text vectorization, and then apply machine learning models to real-world datasets.
Below is an overview of the notebooks and files included in this repository:
- An introductory notebook that walks through basic concepts in NLP and provides an overview of the steps required to build a simple NLP pipeline.
- Demonstrates how to split text into meaningful units (tokens), covering word tokenization and sentence tokenization.
- Shows how to reduce words to their stems using stemming algorithms such as the Porter and Snowball stemmers.
- Explores lemmatization, which reduces words to their base (dictionary) form while taking context, such as part of speech, into account.
- Covers the concept of stop words and how to remove them to clean up text data.
- Walks through part-of-speech tagging, a process that labels words with their respective part of speech (e.g., noun, verb).
- Explains Named Entity Recognition (NER), which is used to identify entities like names, locations, and organizations within text.
- Provides an outline of advanced NLP topics that can be explored after mastering the basics.
- Introduces One-Hot Encoding, a common method for representing categorical data as binary vectors.
- Introduces the Bag of Words (BoW) model, an important text vectorization technique for representing text data as numerical features.
- Explains Term Frequency-Inverse Document Frequency (TF-IDF), a technique to weigh the importance of words in a document relative to a corpus.
- Covers Word2Vec, a popular word embedding model that captures semantic meaning by representing words as vectors in a continuous space.
- A project notebook that demonstrates how to classify SMS messages as spam or ham (not spam) using the Bag of Words model and machine learning algorithms.
- Another spam/ham classification project, this time using TF-IDF for feature extraction, along with machine learning models.
- A project notebook that applies Word2Vec and Average Word2Vec for spam and ham classification, using the vectorized representation of text.
- A sentiment analysis project on Kindle reviews, showcasing how to preprocess reviews and use machine learning models for sentiment classification.
- SMSSpamCollection.txt: A dataset of SMS messages used for spam/ham classification.
- all_kindle_review.csv: A dataset of Kindle reviews used for sentiment analysis.
- finalNLPnotes.pdf: A summary of key NLP concepts covered in the repository.
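As a rough illustration of how the SMS dataset listed above might be read, the sketch below splits each line into a label and a message. It assumes the file uses the tab-separated `label<TAB>message` layout of the public SMS Spam Collection; the copy in this repository may differ. The function name `parse_sms_lines` and the sample lines are hypothetical and are not taken from the notebooks or the dataset.

```python
from typing import Iterable, List, Tuple


def parse_sms_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, sep, message = line.partition("\t")
        if not sep:
            continue  # skip lines that have no tab separator
        pairs.append((label, message))
    return pairs


# Hypothetical sample lines, made up for illustration (assumed layout).
sample = [
    "ham\tSee you at lunch tomorrow\n",
    "spam\tWin a free prize now\n",
]
parsed = parse_sms_lines(sample)
```

To try it on the real file, an open file handle can be passed in directly, for example `parse_sms_lines(open("SMSSpamCollection.txt", encoding="utf-8"))`; the encoding here is also an assumption.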
- Clone the repository:
git clone https://github.com/Abhigyan-RA/NLP-basics.git
cd NLP-basics
- Install dependencies:
Make sure Python and Jupyter are installed, then install the required packages by running:
pip install -r requirements.txt
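
Once the environment is set up, the core idea behind the Bag of Words and TF-IDF notebooks can be previewed with a few lines of standard-library Python. This is a minimal sketch, not the notebooks' own code: the function names (`tokenize`, `build_vocabulary`, `bow_vectors`, `tfidf_vectors`) and the two sample sentences are hypothetical, and the TF-IDF formula is the plain textbook form, tf × log(N / df). Libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(tokenized_docs: List[List[str]]) -> List[str]:
    """Return the sorted list of distinct tokens across all documents."""
    return sorted({tok for doc in tokenized_docs for tok in doc})


def bow_vectors(tokenized_docs: List[List[str]], vocab: List[str]) -> List[List[int]]:
    """Bag of Words: one raw term count per vocabulary entry, per document."""
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        vectors.append([counts[term] for term in vocab])
    return vectors


def tfidf_vectors(tokenized_docs: List[List[str]], vocab: List[str]) -> List[List[float]]:
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(1 for doc in tokenized_docs if term in doc) for term in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid dividing by zero for empty documents
        vectors.append(
            [(counts[term] / length) * math.log(n_docs / df[term]) for term in vocab]
        )
    return vectors


# Hypothetical sample documents, made up for illustration.
docs = ["Free prize now", "See you now"]
tokens = [tokenize(d) for d in docs]
vocab = build_vocabulary(tokens)
bow = bow_vectors(tokens, vocab)
tfidf = tfidf_vectors(tokens, vocab)
```

In this sketch the word "now" appears in both documents, so its inverse document frequency is log(2 / 2) = 0 and it contributes nothing to either TF-IDF vector, while its Bag of Words count is still 1. That difference is the reason TF-IDF is described as weighing a word's importance relative to the corpus.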