How to Perform Data Augmentation in NLP Projects

A simple way to conduct Data Augmentation by using TextAttack Library

In machine learning, a large amount of data is often crucial to achieving strong model performance. With data augmentation, you can create more data for your machine learning project. Data augmentation is a collection of techniques for automatically generating new, high-quality examples from existing data.

In computer vision applications, augmentation approaches are extremely prevalent. If you are working on a computer vision project (e.g., image classification), you can apply dozens of techniques to each image: shift, modify color intensities, scale, rotate, crop, etc.

If you have a tiny dataset for your ML project or wish to reduce overfitting in your machine learning models, applying data augmentation is recommended.

“We don’t have better algorithms. We just have more data.” - Peter Norvig

In Natural Language Processing (NLP), the tremendous complexity of language makes text difficult to augment: changing even a single word can alter the meaning of a sentence or break its grammar. Augmenting text data is therefore less straightforward than augmenting images.

In this article, you will learn how to use a library called TextAttack to augment text data for natural language processing.

What is TextAttack?

TextAttack is a Python framework that was built by the QData team for the purpose of conducting adversarial attacks, adversarial training, and data augmentation in natural language processing. TextAttack has components that can be utilized independently for a variety of basic natural language processing tasks, including sentence encoding, grammar checking, and word substitution.

TextAttack excels in performing the following three functions:

Adversarial attacks (Python: textattack.Attack, Bash: textattack attack).
Data augmentation (Python: textattack.augmentation.Augmenter, Bash: textattack augment).
Model training (Python: textattack.Trainer, Bash: textattack train).

Note: For this article, we will focus on how to use the TextAttack library for Data augmentation.

How to Install TextAttack

To use this library, make sure you have Python 3.6 or above in your environment.

Run the following command to install TextAttack.

pip install textattack

Note: Once you have installed TextAttack, you can run it via the Python module or via the command line.
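Before moving on, it can help to confirm that your environment meets the requirements above. The following is a minimal sketch that uses only the standard library; the helper name check_environment is hypothetical and is not part of TextAttack.

```python
import importlib.util
import sys


def check_environment(min_version=(3, 6), package="textattack"):
    """Report whether Python is new enough and the package can be located."""
    python_ok = sys.version_info >= min_version
    # find_spec returns None when a top-level package cannot be found.
    package_found = importlib.util.find_spec(package) is not None
    return python_ok, package_found


python_ok, package_found = check_environment()
print(f"Python >= 3.6: {python_ok}")
print(f"textattack importable: {package_found}")
```

If the second line prints False, the pip command above has probably been run in a different environment from the one you are using now.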

Data Augmentation Techniques for Text Data

TextAttack library has various augmentation techniques that you can use in your NLP project to add more text data. Here are some of the techniques that you can apply:

1. CharSwapAugmenter
It augments words by swapping characters out for other characters.

from textattack.augmentation import CharSwapAugmenter

text = "I have enjoyed watching that movie, it was amazing."

# Create the augmenter with its default settings
charswap_aug = CharSwapAugmenter()

# augment() returns a list of augmented strings
charswap_aug.augment(text)

['I have enjoyed watching that omvie, it was amazing.']

The augmenter has swapped two characters in the word “movie”, producing “omvie”. Because the swap is random, your output may differ.
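To make the idea concrete, here is a minimal pure-Python sketch of a character swap. It is not TextAttack's implementation: the function names are hypothetical, words are split on whitespace only (so punctuation stays attached to its word), and the seed parameter exists only to make runs repeatable.

```python
import random


def swap_adjacent_chars(word, rng):
    """Swap one random pair of neighbouring characters inside the word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def char_swap_augment(text, seed=0):
    """Return a copy of text in which one randomly chosen word is perturbed."""
    rng = random.Random(seed)
    words = text.split()
    # Only words with at least two characters can be swapped.
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return text
    idx = rng.choice(candidates)
    words[idx] = swap_adjacent_chars(words[idx], rng)
    return " ".join(words)


text = "I have enjoyed watching that movie, it was amazing."
print(char_swap_augment(text, seed=1))
```

Note that swapping two identical neighbouring letters leaves the word unchanged, so a production version would probably retry until the text actually differs.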

2. DeletionAugmenter
It augments the text by deleting some parts of the text to make new text.

from textattack.augmentation import DeletionAugmenter

text = "I have enjoyed watching that movie, it was amazing."

# Create the augmenter with its default settings
deletion_aug = DeletionAugmenter()

# augment() returns a list of augmented strings
deletion_aug.augment(text)

['I have watching that, it was amazing.']

This method has removed the words “enjoyed” and “movie” to create a new augmented text.
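The deletion step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the library's code: delete_words and its pct_words_to_delete parameter are hypothetical names, and words are split on whitespace only.

```python
import random


def delete_words(text, pct_words_to_delete=0.2, seed=0):
    """Drop a random fraction of the words while keeping the original order."""
    rng = random.Random(seed)
    words = text.split()
    if len(words) < 2:
        return text
    # Delete at least one word, but always keep at least one.
    n_delete = max(1, int(len(words) * pct_words_to_delete))
    n_delete = min(n_delete, len(words) - 1)
    drop = set(rng.sample(range(len(words)), n_delete))
    return " ".join(w for i, w in enumerate(words) if i not in drop)


text = "I have enjoyed watching that movie, it was amazing."
print(delete_words(text, pct_words_to_delete=0.2, seed=0))
```

Deleting words at random can remove the very word that carries the label (for example “amazing” in a sentiment task), so it is worth keeping the deleted fraction small and checking a sample of the results by hand.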

3. EasyDataAugmenter
This augments the text with a combination of different methods:

Randomly swap the positions of the words in the sentence.
Randomly remove words from the sentence.
Randomly insert a random synonym of a random word at a random location.
Randomly replace words with their synonyms.

from textattack.augmentation import EasyDataAugmenter

text = "I was billed twice for the service and this is the second time it has happened"

# Create the augmenter with its default settings
eda_aug = EasyDataAugmenter()

# augment() returns a list with several augmented strings
eda_aug.augment(text)

['I was billed twice for the service and this is the second time it has happen',
'I was billed twice for the one service and this is the second time it has happened',
'I billed twice for the service and this is the second time it has happened',
'I was billed twice for the this and service is the second time it has happened']

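As you can see, each augmented text reflects a different method. In the first, the last word has been replaced (“happened” became “happen”); in the second, the word “one” has been inserted; in the third, the word “was” has been removed; and in the fourth, the words “service” and “this” have swapped positions.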
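Two of the four operations listed above, word swapping and synonym insertion, can be sketched as follows. This is a simplified illustration, not the library's code: TOY_SYNONYMS is a hypothetical hand-written thesaurus standing in for a real lexical resource, and all function names are made up for the example.

```python
import random

# Hypothetical toy thesaurus; a real pipeline would query a resource such as WordNet.
TOY_SYNONYMS = {
    "billed": ["charged", "invoiced"],
    "service": ["assistance"],
    "happened": ["occurred"],
}


def random_swap(words, rng):
    """Swap the positions of two randomly chosen words."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    out = list(words)
    out[i], out[j] = out[j], out[i]
    return out


def random_insertion(words, rng, synonyms=TOY_SYNONYMS):
    """Insert a synonym of a random known word at a random position."""
    known = [w for w in words if w in synonyms]
    if not known:
        return list(words)
    synonym = rng.choice(synonyms[rng.choice(known)])
    out = list(words)
    out.insert(rng.randrange(len(out) + 1), synonym)
    return out


def eda_sketch(text, seed=0):
    """Return one variant per operation: [swapped, inserted]."""
    rng = random.Random(seed)
    words = text.split()
    return [
        " ".join(random_swap(words, rng)),
        " ".join(random_insertion(words, rng)),
    ]


text = "I was billed twice for the service and this is the second time it has happened"
for variant in eda_sketch(text, seed=3):
    print(variant)
```

Random deletion and synonym replacement follow the same pattern: pick positions at random, edit the word list, and join it back into a sentence.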

4. WordNetAugmenter
It augments the text by replacing words with synonyms from the WordNet thesaurus.

from textattack.augmentation import WordNetAugmenter

text = "I was billed twice for the service and this is the second time it has happened"

# Create the augmenter with its default settings
wordnet_aug = WordNetAugmenter()

# augment() returns a list of augmented strings
wordnet_aug.augment(text)

['I was billed twice for the service and this is the second time it has pass']

This method has changed the word “happened” to “pass” in order to create a new augmented text.
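Synonym replacement itself is simple to sketch. The version below is an assumption-laden illustration, not TextAttack's code: it uses a hypothetical hand-written dictionary instead of WordNet, and replace_with_synonyms and pct_words_to_swap are names chosen for this example.

```python
import random

# Hypothetical toy thesaurus standing in for WordNet.
TOY_SYNONYMS = {
    "billed": ["charged", "invoiced"],
    "twice": ["doubly"],
    "happened": ["occurred"],
}


def replace_with_synonyms(text, synonyms, pct_words_to_swap=0.1, seed=0):
    """Replace a fraction of the words that have a known synonym."""
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    if not candidates:
        return text
    # Swap at least one word, but never more than we have candidates for.
    n_swap = max(1, int(len(words) * pct_words_to_swap))
    n_swap = min(n_swap, len(candidates))
    for i in rng.sample(candidates, n_swap):
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


text = "I was billed twice for the service and this is the second time it has happened"
print(replace_with_synonyms(text, TOY_SYNONYMS, pct_words_to_swap=0.1, seed=0))
```

A dictionary lookup like this ignores tense and context, which is one reason a thesaurus-based replacement can yield an ungrammatical result such as “it has pass” in the output above.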

5. Create your Own Augmenter
Importing transformations and constraints from textattack.transformations and textattack.constraints allows you to build your own augmenter from the ground up. The following example uses the WordSwapRandomCharacterDeletion transformation to produce augmentations of a string:

from textattack.transformations import WordSwapRandomCharacterDeletion
from textattack.transformations import CompositeTransformation
from textattack.augmentation import Augmenter

# Wrap one or more transformations in a single composite transformation
my_transformation = CompositeTransformation([WordSwapRandomCharacterDeletion()])
# Ask for three augmented versions of each input text
augmenter = Augmenter(transformation=my_transformation, transformations_per_example=3)

text = 'Siri became confused when we reused to follow her directions.'

augmenter.augment(text)

['Siri became cnfused when we reused to follow her directions.',
'Siri became confused when e reused to follow her directions.',
'Siri became confused when we reused to follow hr directions.']

The output shows different augmented texts after applying the WordSwapRandomCharacterDeletion transformation. For example, in the first augmented text, the transformation randomly removed the character “o” from the word “confused”.
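The pattern of plugging a transformation into a generic augmenter can be sketched in plain Python. This is only an illustration of the idea, not how TextAttack is implemented: delete_random_char and augment are hypothetical helpers, and the transformations_per_example argument merely mirrors the name of the parameter used above.

```python
import random


def delete_random_char(word, rng):
    """Remove one randomly chosen character from the word."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def augment(text, transformation, transformations_per_example=3, seed=0, max_tries=100):
    """Apply the transformation to one random word at a time until enough
    distinct variants have been collected (or max_tries is reached)."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return []
    variants = []
    for _ in range(max_tries):
        if len(variants) == transformations_per_example:
            break
        i = rng.randrange(len(words))
        candidate = words[:i] + [transformation(words[i], rng)] + words[i + 1:]
        sentence = " ".join(candidate)
        # Keep only variants that differ from the input and from each other.
        if sentence != text and sentence not in variants:
            variants.append(sentence)
    return variants


text = "Siri became confused when we reused to follow her directions."
for variant in augment(text, delete_random_char, transformations_per_example=3):
    print(variant)
```

Because the transformation is passed in as a function, you could swap in a different word-level edit without touching the augment loop.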

Conclusion

In this article, you have learned the significance of data augmentation for your Machine Learning project. In addition, you have learned how to execute data augmentation for textual data using the TextAttack library.

To the best of my knowledge, these techniques are among the most practical ways to augment text data for an NLP project. Hopefully, they’ll be of use to you in your work.

You can also try to use other available augmentation techniques from the TextAttack library such as:

EmbeddingAugmenter
CheckListAugmenter
CLAREAugmenter

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter @Davis_McDavid.

This article was first published here.

How to Perform Data Augmentation in NLP Projects was originally published in Towards Data Science on Medium.
