Unsupervised Text Classification with Lbl2Vec


An introduction to embedding-based classification of unlabeled text documents

Photo by Patrick Tomasso on Unsplash.

Text classification is the task of assigning a sentence or document an appropriate category. The categories depend on the selected dataset and can cover arbitrary subjects. Therefore, text classifiers can be used to organize, structure, and categorize any kind of text.

Common approaches use supervised learning to classify texts. In recent years, BERT-based language models in particular have achieved very good text classification results. These conventional text classification approaches usually require a large amount of labeled training data. In practice, however, an annotated text dataset for training state-of-the-art classification algorithms is often unavailable, and annotating data usually involves a lot of manual effort and high expense. Unsupervised approaches therefore offer the opportunity to run low-cost text classification on unlabeled datasets. In this article, you will learn how to use Lbl2Vec to perform unsupervised text classification.

How does Lbl2Vec work?

Lbl2Vec is an algorithm for unsupervised document classification and unsupervised document retrieval. It automatically generates jointly embedded label, document, and word vectors and returns documents of the categories modeled by manually predefined keywords. The key idea of the algorithm is that many semantically similar keywords can represent a category. In the first step, the algorithm creates a joint embedding of document and word vectors. Once documents and words are embedded in a shared vector space, the goal of the algorithm is to learn label vectors from the previously defined keywords that represent a category. Finally, the algorithm can predict the affiliation of documents to categories based on the similarities of the document vectors with the label vectors. At a high level, the algorithm performs the following steps to classify unlabeled texts:

1. Use Manually Defined Keywords for Each Category of Interest

First, we have to define keywords to describe each classification category of interest. This process requires some degree of domain knowledge to define keywords that describe classification categories and are semantically similar to each other within the classification categories.

Example keywords for different sports classification categories. Image by author.
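For illustration, such keyword definitions are just lists of terms, one list per category. The keyword choices below are illustrative assumptions, not the predefined keywords used later in the tutorial:

# Illustrative keyword lists; each inner list describes one classification category.
keywords_list = [
    ["baseball", "pitcher", "batter", "inning"],  # rec.sport.baseball
    ["hockey", "goalie", "puck", "nhl"],          # rec.sport.hockey
    ["motorcycle", "bike", "ride", "helmet"],     # rec.motorcycles
    ["encryption", "cipher", "key", "privacy"],   # sci.crypt
]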

2. Create Jointly Embedded Document and Word Vectors

An embedding vector is a vector that allows us to represent a word or text document in multi-dimensional space. The idea behind embedding vectors is that similar words or text documents will have similar vectors. -Amol Mavuduru

Therefore, after creating jointly embedded vectors, documents are located close to other similar documents and close to the most distinguishing words.

Jointly embedded word and document vectors. Image by author.

Once we have a set of word and document vectors, we can move on to the next step.
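Lbl2Vec uses Doc2Vec to create this joint embedding. As a minimal sketch of the idea with gensim (the toy corpus and parameter values are assumptions for illustration, not the library's internal call), a Doc2Vec model in DBOW mode with dbow_words=1 trains word and document vectors in the same space:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus; in practice these are the tokenized news articles.
docs = [
    TaggedDocument(words=["the", "pitcher", "threw", "a", "strike"], tags=["0"]),
    TaggedDocument(words=["the", "goalie", "blocked", "the", "puck"], tags=["1"]),
]

# dm=0 selects DBOW; dbow_words=1 additionally trains word vectors,
# so word and document vectors end up in one shared vector space.
model = Doc2Vec(docs, dm=0, dbow_words=1, vector_size=50, min_count=1, epochs=20)

word_vector = model.wv["pitcher"]  # word vector
doc_vector = model.dv["0"]         # document vector in the same space (gensim 4.x)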

3. Find Document Vectors that are Similar to the Keyword Vectors of Each Classification Category

Now we can compute cosine similarities between the document vectors and the keyword vectors of each category. Documents that are similar to the category keywords are assigned to the set of candidate documents of that category.

Classification category keywords with their respective set of candidate documents. Each color represents a different classification category. Image by author.
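The following sketch shows the idea with plain NumPy. The toy vectors, the threshold value, and the aggregation over keyword similarities are assumptions for illustration; the library's exact candidate-selection rule may differ:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for the embeddings from the previous step.
keyword_vectors = [np.array([1.0, 0.2]), np.array([0.9, 0.1])]  # one category's keywords
doc_vectors = {"doc_a": np.array([0.95, 0.15]), "doc_b": np.array([0.1, 1.0])}

# Documents whose mean similarity to the category's keyword vectors exceeds
# a threshold are kept as candidate documents for that category.
threshold = 0.8
candidates = [key for key, vec in doc_vectors.items()
              if np.mean([cosine_similarity(vec, kv) for kv in keyword_vectors]) > threshold]
print(candidates)  # ['doc_a']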

4. Clean Outlier Documents for Each Classification Category

The algorithm uses the local outlier factor (LOF) to clean each set of candidate documents of outliers, i.e., documents that may be related to some of the descriptive keywords but do not properly match the intended classification category.

Red documents are outliers that are removed from the set of candidate documents. Image by author.
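The cleaning idea can be sketched with scikit-learn's LocalOutlierFactor (toy vectors for one category; not the library's internal call):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy candidate document vectors for one category; the last one is an outlier.
candidate_vectors = np.array([
    [0.90, 0.10],
    [0.85, 0.15],
    [0.95, 0.05],
    [0.10, 0.90],
])

# fit_predict returns 1 for inliers and -1 for outliers.
lof = LocalOutlierFactor(n_neighbors=2)
inlier_mask = lof.fit_predict(candidate_vectors) == 1
cleaned_vectors = candidate_vectors[inlier_mask]  # outlier removed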

5. Compute the Centroid of the Outlier Cleaned Document Vectors as Label Vector for Each Classification Category

To get embedding representations of the classification categories, we compute label vectors. Later, the similarity of documents to label vectors will be used to classify text documents. Each label vector is the centroid of the outlier-cleaned document vectors of its category. The algorithm computes document rather than keyword centroids because experiments showed that classifying documents based on similarities to keywords alone is more difficult, even when they share the same vector space.

Label vectors, calculated as centroid of the respective cleaned candidate document vectors. Points represent the label vectors of the respective topics. Image by author.
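In code, this step reduces to a mean over the cleaned candidate vectors (continuing with toy values):

import numpy as np

# Outlier-cleaned candidate document vectors for one category (toy values).
cleaned_vectors = np.array([[0.90, 0.10], [0.85, 0.15], [0.95, 0.05]])

# The label vector is the element-wise mean (centroid) of these vectors.
label_vector = cleaned_vectors.mean(axis=0)
print(label_vector)  # [0.9 0.1]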

6. Text Document Classification

The algorithm computes label vector <-> document vector similarities for each pair of label vector and document vector in the dataset. Finally, each text document is classified as the category with the highest label vector <-> document vector similarity.

Classification results for all documents in the dataset. Points represent label vectors of a classification category. Document colors represent their predicted classification category. Image by author.
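As a minimal sketch with toy values, the prediction is an argmax over cosine similarities to the label vectors:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy label vectors and one document vector.
label_vectors = {"baseball": np.array([0.9, 0.1]), "cryptography": np.array([0.1, 0.9])}
doc_vector = np.array([0.8, 0.2])

# Classify the document as the category with the most similar label vector.
predicted = max(label_vectors, key=lambda c: cosine_similarity(doc_vector, label_vectors[c]))
print(predicted)  # baseball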

Lbl2Vec Tutorial

In this tutorial, we will use Lbl2Vec to classify text documents from the 20 Newsgroups dataset. It is a collection of approximately 20,000 text documents, partitioned evenly across 20 different newsgroup categories. We will focus on a subset of the 20 Newsgroups dataset consisting of the categories “rec.motorcycles”, “rec.sport.baseball”, “rec.sport.hockey”, and “sci.crypt”. Furthermore, we will use already predefined keywords for each classification category. The predefined keywords can be downloaded here. You can also access more Lbl2Vec examples on GitHub.

Installing Lbl2Vec

We can install Lbl2Vec using pip with the following command:

pip install lbl2vec

Reading the Data

We store the downloaded “20newsgroups_keywords.csv” file in the same directory as our Python script. Then we read the CSV with pandas and fetch the 20 Newsgroups dataset from Scikit-learn.
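A sketch of this step follows. The separator and column layout of the keywords CSV are assumptions; adjust them to the actual file:

import pandas as pd
from sklearn.datasets import fetch_20newsgroups

# Read the predefined category keywords (semicolon separator assumed).
labels = pd.read_csv("20newsgroups_keywords.csv", sep=";")

# Fetch only the four categories used in this tutorial.
categories = ["rec.motorcycles", "rec.sport.baseball", "rec.sport.hockey", "sci.crypt"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

df_train = pd.DataFrame({"article": train.data,
                         "class_name": [train.target_names[t] for t in train.target]})
df_test = pd.DataFrame({"article": test.data,
                        "class_name": [test.target_names[t] for t in test.target]})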

Preprocessing the Data

To train a Lbl2Vec model, we need to preprocess the data. First, we process the keywords to be used as input for Lbl2Vec.
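A sketch, assuming the CSV stores the keywords of each category as one space-separated string in a "keywords" column alongside a "class_name" column:

# Split each category's keyword string into a list of keywords.
labels["keywords"] = labels["keywords"].apply(lambda s: s.split(" "))
labels["number_of_keywords"] = labels["keywords"].apply(len)
print(labels[["class_name", "number_of_keywords"]])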

We see that the keywords describe each classification category and that the number of keywords per category varies.

Furthermore, we also need to preprocess the news articles. To this end, we tokenize each document into words and add gensim.models.doc2vec.TaggedDocument tags, since Lbl2Vec needs tokenized and tagged documents as its training input.
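A sketch of this preprocessing, using gensim's simple_preprocess for tokenization (the exact tokenization used in the original tutorial may differ):

from gensim.models.doc2vec import TaggedDocument
from gensim.parsing.preprocessing import strip_tags
from gensim.utils import simple_preprocess

def tokenize(text):
    # Strip markup and split into lowercase word tokens.
    return simple_preprocess(strip_tags(text), deacc=True, min_len=2, max_len=15)

# Each document gets its token list plus a unique tag.
df_train["tagged_docs"] = [TaggedDocument(tokenize(text), [str(i)])
                           for i, text in enumerate(df_train["article"])]
df_test["tagged_docs"] = [TaggedDocument(tokenize(text), [str(i)])
                          for i, text in enumerate(df_test["article"])]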

We can see the article texts and their classification categories in the dataframe. The “tagged_docs” column contains the preprocessed documents that Lbl2Vec needs as input. The classification categories in the “class_name” column are used for evaluation only, not for Lbl2Vec training.

Training Lbl2Vec

After preparing the data, we can now train a Lbl2Vec model on the train dataset. We initialize the model with the following parameters (a training sketch follows the list):

  • keywords_list : iterable list of lists with descriptive keywords for each category.
  • tagged_documents : iterable list of gensim.models.doc2vec.TaggedDocument elements. Each element consists of one document.
  • label_names : iterable list of custom names for each label. Label names and keywords of the same topic must have the same index.
  • similarity_threshold : only documents with a similarity to the respective description keywords above this threshold are used to calculate the label embedding.
  • min_num_docs : minimum number of documents that are used to calculate the label embedding.
  • epochs : number of iterations over the corpus.
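The parameter names below are taken from the list above; the concrete similarity_threshold and min_num_docs values are assumptions, so tune them for your data:

from lbl2vec import Lbl2Vec

# Initialize the model; keyword lists and label names must share the same index.
lbl2vec_model = Lbl2Vec(keywords_list=list(labels["keywords"]),
                        tagged_documents=df_train["tagged_docs"],
                        label_names=list(labels["class_name"]),
                        similarity_threshold=0.43,
                        min_num_docs=10,
                        epochs=10)

# Train the model.
lbl2vec_model.fit()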

Classification of Text Documents

After the model is trained, we can predict the categories of documents used to train the Lbl2Vec model.
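A sketch of the prediction and evaluation, assuming the returned DataFrame exposes a "most_similar_label" column (as in the Lbl2Vec documentation), preserves the training document order, and that the label names in the keywords CSV match the newsgroup category names:

from sklearn.metrics import f1_score

# Predict the categories of the documents used during training.
model_docs_lbl_similarities = lbl2vec_model.predict_model_docs()

# Compare predictions with the held-back ground-truth labels.
y_true = df_train["class_name"]
y_pred = model_docs_lbl_similarities["most_similar_label"]
print("F1 Score:", f1_score(y_true, y_pred, average="micro"))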

[Out]: F1 Score: 0.8761506276150628

Our model predicts the correct document categories with a respectable F1 score of 0.88. This is achieved without ever seeing the document labels during training.

Moreover, we can also predict the classification categories of documents that were not used to train the Lbl2Vec model and are therefore completely unknown to it. To this end, we predict the categories of documents from the previously unused test dataset.
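Analogously for the test documents (same assumptions as above):

# Predict the categories of documents that are unknown to the trained model.
new_docs_lbl_similarities = lbl2vec_model.predict_new_docs(tagged_docs=df_test["tagged_docs"])

y_true = df_test["class_name"]
y_pred = new_docs_lbl_similarities["most_similar_label"]
print("F1 Score:", f1_score(y_true, y_pred, average="micro"))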

[Out]: F1 Score: 0.8610062893081761

Our trained Lbl2Vec model can even predict the classification categories of new documents with an F1 score of 0.86. As mentioned before, this is achieved with a completely unsupervised approach in which no label information was used during training.

For more details about the features available in Lbl2Vec, please check out the Lbl2Vec GitHub repository. I hope you found this tutorial to be useful.

Summary

Lbl2Vec is a recently developed approach to unsupervised text document classification. Unlike other state-of-the-art approaches, it needs no label information during training and therefore offers the opportunity to run low-cost text classification on unlabeled datasets. The open-source Lbl2Vec library is also very easy to use and allows developers to train models in just a few lines of code.

Sources

  1. Schopf, T., Braun, D., and Matthes, F. (2021). Lbl2Vec: An Embedding-based Approach for Unsupervised Document Retrieval on Predefined Topics. In Proceedings of the 17th International Conference on Web Information Systems and Technologies.
  2. https://github.com/sebischair/Lbl2Vec
