Added the stopword removal transformation #268
README.md (new file):

# Stopword Removal

Removes stopwords from a piece of text.

Author: Juan Yi Loke
Email: juanyi.loke@mail.utoronto.ca
Affiliation: University of Toronto
Contributor comment: I believe you would need to add the Robustness Evaluation as per the instructions in the email :)
## What type of a transformation is this?

By default, this simple stopword removal parses a text, removes stopwords, and returns a detokenized version of the text, using NLTK's toktok tokenizer and treebank word detokenizer. All stopwords are drawn from NLTK's stopword list.

## What tasks does it intend to benefit?

Removing stopwords is often a key text-preprocessing step, reducing the amount of text data one has to deal with.

## What are the limitations of this transformation?

The set of stopwords is constrained to NLTK's stopword list. Other libraries such as spaCy or gensim may include or exclude certain stopwords that appear in NLTK's list. NLTK was chosen simply for its popularity relative to the other libraries.
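To illustrate the filtering step the README describes, here is a dependency-free sketch: a tiny hard-coded stopword set stands in for NLTK's English list, and `str.split()` stands in for the toktok tokenizer, so punctuation handling differs from the real transformation (punctuation stays attached to words here).

```python
# Minimal sketch of stopword filtering. The stopword set below is a
# small illustrative subset, NOT NLTK's full English list, and
# str.split() is a crude stand-in for ToktokTokenizer.
STOP_WORDS = {"a", "an", "the", "is", "this", "to", "that", "or", "not", "be"}

def remove_stopwords(text):
    # Keep only tokens whose lowercase form is not a stopword.
    kept = [tok for tok in text.split() if tok.lower() not in STOP_WORDS]
    return " ".join(kept)

print(remove_stopwords("This is a test."))  # -> "test."
print(remove_stopwords("To to to to"))      # -> "" (everything removed)
```

Because the filter compares each token's lowercase form against the set, case variants like "To" and "to" are removed alike, which is why a sentence made entirely of stopwords collapses to an empty string.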
__init__.py (new file):

```python
from .transformation import *
```
requirements.txt (new file):

```
nltk
```
test.json (new file):

```json
{
  "type": "stopword_removal",
  "test_cases": [
    {
      "class": "StopwordRemoval",
      "inputs": {"sentence": "This is a test."},
      "outputs": [{"sentence": "test."}]
    },
    {
      "class": "StopwordRemoval",
      "inputs": {"sentence": "To be or not to be, that is the question?"},
      "outputs": [{"sentence": ", question?"}]
    },
    {
      "class": "StopwordRemoval",
      "inputs": {"sentence": "OMG!!! jUSTin is AmAZEballs!!!"},
      "outputs": [{"sentence": "OMG!!! jUSTin AmAZEballs!!!"}]
    },
    {
      "class": "StopwordRemoval",
      "inputs": {"sentence": "To to to to"},
      "outputs": [{"sentence": ""}]
    },
    {
      "class": "StopwordRemoval",
      "inputs": {"sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."},
      "outputs": [{"sentence": "Neuroplasticity continuous processing allowing short-term, medium-term, long-term remodeling neuronosynaptic organization."}]
    }
  ]
}
```
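A test file in this shape can be consumed by a simple loop over `test_cases`. The sketch below is a hypothetical runner (the repository's real evaluation harness may differ), with a placeholder transform and a tiny stopword set standing in for the actual `StopwordRemoval` class:

```python
import json

# Hypothetical sketch of a runner for test cases in the format above;
# the repository's actual test harness may differ.
TEST_JSON = """
{
  "type": "stopword_removal",
  "test_cases": [
    {"class": "StopwordRemoval",
     "inputs": {"sentence": "This is a test."},
     "outputs": [{"sentence": "test."}]}
  ]
}
"""

def fake_transform(sentence):
    # Stand-in for StopwordRemoval.generate(); tiny illustrative
    # stopword set, not NLTK's full list.
    stop = {"this", "is", "a"}
    return " ".join(w for w in sentence.split() if w.lower() not in stop)

spec = json.loads(TEST_JSON)
for case in spec["test_cases"]:
    got = fake_transform(case["inputs"]["sentence"])
    expected = case["outputs"][0]["sentence"]
    assert got == expected, (got, expected)
print("all cases pass")
```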
transformation.py (new file):

```python
from nltk.corpus import stopwords
from nltk.tokenize import ToktokTokenizer
from nltk.tokenize.treebank import TreebankWordDetokenizer

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType


def stopword_remove(text):
    """
    Remove stopwords using a standard list comprehension.
    Assumes the input text is in the English language.
    Requires NLTK's 'stopwords' corpus (nltk.download('stopwords')).
    Returns a single-element list containing the detokenized text
    with stopwords removed.
    """
    stop_words = set(stopwords.words("english"))
    text_tokenized = ToktokTokenizer().tokenize(text)
    return [
        TreebankWordDetokenizer().detokenize(
            [word for word in text_tokenized if word.lower() not in stop_words]
        )
    ]


class StopwordRemoval(SentenceOperation):
    """
    This class offers a stopword removal transformation: it removes
    stopwords from a text. The stopwords used are those in NLTK's
    English stopword list.
    """

    tasks = [
        TaskType.TEXT_CLASSIFICATION,
        TaskType.TEXT_TO_TEXT_GENERATION,
    ]
    languages = ["en"]
    heavy = False

    def __init__(self, seed=0, max_outputs=1):
        super().__init__(seed, max_outputs=max_outputs)

    def generate(self, raw_text: str):
        # stopword_remove() takes only the text and already returns a
        # list, so no max_outputs argument is passed here.
        perturbed_text = stopword_remove(text=raw_text)
        return perturbed_text
```

Contributor comment (on the docstring): the
Author comment: Fixed!

Contributor comment (after `heavy = False`): Need to add the
Contributor comment: Seconded
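For readers without the NL-Augmenter repo on hand, here is a self-contained approximation of the class's behavior. It is a sketch only: a plain class stands in for the repo's `SentenceOperation` interface, a tiny hard-coded stopword set stands in for NLTK's list, and whitespace splitting replaces the toktok tokenize/detokenize round-trip, so punctuation stays attached to words.

```python
class MiniStopwordRemoval:
    """Illustrative stand-in for StopwordRemoval; the real class
    derives from SentenceOperation and uses NLTK throughout."""

    # Tiny illustrative subset of English stopwords, NOT NLTK's list.
    STOP_WORDS = {"a", "an", "is", "this", "the", "to", "that", "or", "not", "be"}

    def __init__(self, seed=0, max_outputs=1):
        self.seed = seed
        self.max_outputs = max_outputs

    def generate(self, raw_text):
        # Filter out stopwords (case-insensitively) and return a list
        # of perturbed sentences, capped at max_outputs.
        kept = [w for w in raw_text.split() if w.lower() not in self.STOP_WORDS]
        return [" ".join(kept)][: self.max_outputs]

print(MiniStopwordRemoval().generate("This is a test."))  # -> ['test.']
```

Note that returning a list (even for a single output) mirrors the interface's convention that `generate()` yields up to `max_outputs` perturbations.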
Reviewer comment: Hi @juanyiloke please add your name, email and affiliation.

Author comment: done! @kaustubhdhole

Reviewer comment: thanks!