
NLP Basics Repository

Welcome to the NLP-basics repository! This repository contains a comprehensive collection of Jupyter notebooks designed to teach the foundational concepts of Natural Language Processing (NLP). Following a step-by-step approach, you will learn key techniques such as tokenization, stemming, lemmatization, and text vectorization, and then apply machine learning models to real-world datasets.

Repository Structure

Below is an overview of the notebooks and files included in this repository:

1-Lesson.ipynb

  • An introductory notebook that walks through basic concepts in NLP and provides an overview of the steps required to build a simple NLP pipeline.

2-Tokenization.ipynb

  • Demonstrates how to split text into meaningful units (tokens), covering word tokenization and sentence tokenization.
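As a rough illustration of what tokenization does, here is a minimal regex-based sketch (the notebook likely uses a library tokenizer such as NLTK's, which handles far more edge cases):

```python
import re

def sent_tokenize(text):
    # Split on sentence-ending punctuation followed by whitespace (simplified rule).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    # Words are runs of letters/digits/apostrophes; listed punctuation marks
    # become their own tokens.
    return re.findall(r"[A-Za-z0-9']+|[.,!?;]", sentence)

sentences = sent_tokenize("NLP is fun. Tokenization splits text!")
tokens = word_tokenize(sentences[0])
```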

3-Stemming.ipynb

  • Shows how to reduce words to their root form using stemming techniques like Porter and Snowball stemmers.
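To give a feel for the idea, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm (which applies several ordered phases of rewrite rules), only an illustration of why stemming can produce non-dictionary roots:

```python
def simple_stem(word):
    # Strip one common suffix, keeping at least a 3-letter root.
    # Real stemmers (Porter, Snowball) use ordered rule phases and
    # measure-based conditions instead of a single pass.
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```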

4-Lemmatization.ipynb

  • Explores lemmatization, a process that reduces words to their base or dictionary form, considering the context.
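The core contrast with stemming is that a lemmatizer maps to a real dictionary form. A toy lookup-based sketch (real lemmatizers such as NLTK's WordNetLemmatizer consult a full lexical database and the word's part of speech):

```python
# Hypothetical miniature lemma dictionary for illustration only.
LEMMA_DICT = {
    "better": "good", "ran": "run", "running": "run",
    "mice": "mouse", "geese": "goose", "was": "be",
}

def lemmatize(word, lemma_dict=LEMMA_DICT):
    # Look the word up; fall back to the lowercased word itself.
    return lemma_dict.get(word.lower(), word.lower())
```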

5-stopWords.ipynb

  • Covers the concept of stop words and how to remove them to clean up text data.
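Stop-word removal is a simple filter against a known word list. A minimal sketch (libraries like NLTK ship much larger curated stop-word lists per language):

```python
# Tiny illustrative stop-word set; real lists contain well over 100 entries.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "and", "to"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop-word set (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words(["The", "cat", "is", "in", "the", "garden"])
```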

6-POS-tagging.ipynb

  • Walks through part-of-speech tagging, a process that labels words with their respective part of speech (e.g., noun, verb).
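As a sketch of the task (not of any real tagger), here is a toy lookup-plus-suffix heuristic. Production taggers, such as the one behind NLTK's `pos_tag`, are trained statistical models that use surrounding context:

```python
def heuristic_pos_tag(tokens):
    # Hypothetical mini-lexicon plus naive suffix rules, for illustration only.
    lexicon = {"the": "DET", "a": "DET", "is": "VERB", "quickly": "ADV"}
    tags = []
    for tok in tokens:
        low = tok.lower()
        if low in lexicon:
            tags.append((tok, lexicon[low]))
        elif low.endswith("ing") or low.endswith("ed"):
            tags.append((tok, "VERB"))
        elif low.endswith("ly"):
            tags.append((tok, "ADV"))
        else:
            tags.append((tok, "NOUN"))  # default guess
    return tags
```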

7-Named-Entity-Recognition.ipynb

  • Explains Named Entity Recognition (NER), which is used to identify entities like names, locations, and organizations within text.
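To show why NER is harder than it looks, here is a naive capitalization heuristic. It misses sentence-initial entities and flags any mid-sentence capitalized word; real NER models (e.g. in spaCy or NLTK) use context and labeled training data:

```python
def naive_ner(tokens):
    # Flag capitalized tokens that are not sentence-initial as candidate
    # entities. Note the blind spot: an entity at position 0 is skipped.
    return [tok for i, tok in enumerate(tokens) if i > 0 and tok[:1].isupper()]

candidates = naive_ner(["Yesterday", "Alice", "flew", "to", "Paris"])
```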

8-NextSteps.ipynb

  • Provides an outline of advanced NLP topics that can be explored after mastering the basics.

9-One-Hot-Encoding.ipynb

  • Introduces One-Hot Encoding, a common method for representing categorical data as binary vectors.
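The idea can be sketched in a few lines: build a vocabulary, then represent each word as a binary vector with a single 1 at that word's index:

```python
def one_hot_encode(words):
    # Sorted vocabulary gives each distinct word a stable index.
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for w in words:
        vec = [0] * len(vocab)
        vec[index[w]] = 1  # exactly one "hot" position per word
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = one_hot_encode(["cat", "dog", "cat"])
```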

10-BagofWords.ipynb

  • Introduces the Bag of Words (BoW) model, an important text vectorization technique for representing text data as numerical features.
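A minimal pure-Python sketch of the BoW idea (the notebook likely uses scikit-learn's `CountVectorizer`, which adds tokenization options, n-grams, and sparse output):

```python
from collections import Counter

def bag_of_words(documents):
    # Build a shared vocabulary, then count each word's occurrences per
    # document; word order is discarded, hence "bag" of words.
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat sat on the mat"])
```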

11-TF-IDF.ipynb

  • Explains Term Frequency-Inverse Document Frequency (TF-IDF), a technique to weigh the importance of words in a document relative to a corpus.
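A sketch of the textbook formula, tf-idf(w, d) = tf(w, d) × log(N / df(w)). Note that library implementations differ slightly (scikit-learn's `TfidfVectorizer` smooths the idf term and normalizes rows), so exact numbers will not match this version:

```python
import math

def tf_idf(documents):
    # tf = count / len(doc); idf = log(N / df), df = number of docs
    # containing the term. Words in every document get weight 0.
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    matrix = []
    for doc in tokenized:
        matrix.append([
            (doc.count(w) / len(doc)) * math.log(n_docs / df[w])
            for w in vocab
        ])
    return vocab, matrix

vocab, matrix = tf_idf(["cat sat", "dog sat"])
```

Here "sat" appears in both documents, so its idf is log(2/2) = 0 and it carries no weight, while "cat" and "dog" discriminate between the two documents.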

12-Word2Vec.ipynb

  • Covers Word2Vec, a popular word embedding model that captures semantic meaning by representing words as vectors in a continuous space.
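The key property is that semantically related words end up close together in vector space, usually measured by cosine similarity. A sketch with hypothetical hand-made 3-d vectors (a trained Word2Vec model, e.g. from gensim, learns hundreds of dimensions from a large corpus):

```python
import math

# Toy "embeddings" for illustration only; not learned from any corpus.
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.12],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # cos(a, b) = (a · b) / (|a| * |b|); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm
```

With these vectors, cosine_similarity(king, queen) is much higher than cosine_similarity(king, apple), which is exactly the structure Word2Vec learns automatically.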

13-Spam Ham Classification Project Using BOW And ML.ipynb

  • A project notebook that demonstrates how to classify SMS messages as spam or ham (not spam) using the Bag of Words model and machine learning algorithms.
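The overall shape of such a project can be sketched with scikit-learn on a tiny made-up corpus (the notebook itself works on the SMSSpamCollection dataset and may use different models and preprocessing):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus standing in for the real SMS dataset.
messages = [
    "win a free prize now", "claim your free cash reward",
    "free entry win cash", "are we meeting for lunch today",
    "see you at the office tomorrow", "call me when you get home",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

vectorizer = CountVectorizer()            # Bag of Words features
X = vectorizer.fit_transform(messages)
model = MultinomialNB().fit(X, labels)    # Naive Bayes classifier

prediction = model.predict(vectorizer.transform(["free cash prize"]))[0]
```

Swapping `CountVectorizer` for `TfidfVectorizer` gives the TF-IDF variant explored in notebook 14 with essentially the same pipeline.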

14-Spam Ham Classification Project Using tf-idf And ML.ipynb

  • Another spam/ham classification project, this time using TF-IDF for feature extraction, along with machine learning models.

15-Spam Ham Projects Using Word2vec, AvgWord2vec.ipynb

  • A project notebook that applies Word2Vec and Average Word2Vec for spam and ham classification, using the vectorized representation of text.
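The Average Word2Vec step can be sketched as follows: a whole message is represented by the mean of the embeddings of its known words, yielding one fixed-length feature vector per message. The toy 2-d vectors below are hypothetical stand-ins for a trained model:

```python
def average_vector(tokens, vectors, dim):
    # Average the embeddings of the tokens we have vectors for; a zero
    # vector stands in for messages with no known words.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(vals) / len(known) for vals in zip(*known)]

# Hypothetical 2-d embeddings for illustration; "now" is out-of-vocabulary.
toy = {"free": [1.0, 0.0], "cash": [0.8, 0.2], "lunch": [0.0, 1.0]}
features = average_vector(["free", "cash", "now"], toy, dim=2)
```

The resulting `features` vector can then be fed to any standard classifier, just like the BoW and TF-IDF features in the earlier projects.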

16-Kindle Review Sentiment Analysis.ipynb

  • A sentiment analysis project on Kindle reviews, showcasing how to preprocess reviews and use machine learning models for sentiment classification.
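Review text typically needs cleaning before vectorization. A sketch of a typical preprocessing step (the notebook's actual cleaning rules may differ):

```python
import re

def preprocess_review(text):
    # Lowercase, strip HTML tags and URLs, keep only letters and spaces,
    # and collapse repeated whitespace.
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)          # HTML remnants like <br/>
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)          # digits and punctuation
    return " ".join(text.split())

cleaned = preprocess_review("Loved this book!!! 5 stars <br/> http://example.com")
```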

Other Files

  • SMSSpamCollection.txt: A dataset of SMS messages used for spam/ham classification.
  • all_kindle_review.csv: A dataset of Kindle reviews used for sentiment analysis.
  • finalNLPnotes.pdf: A summary of key NLP concepts covered in the repository.

Getting Started

  1. Clone the repository:
    git clone https://github.com/Abhigyan-RA/NLP-basics.git
    cd NLP-basics
  2. Install dependencies: Make sure you have Python and Jupyter installed. You can install the necessary packages by running:
    pip install -r requirements.txt
