How to efficiently summarize a text with Python and NLTK
Sometimes you need a summary of a given text. I ran into this challenge when I was building a collection of news posts. Using the complete text to interpret the meaning of an article took a lot of time (I have collected about 250,000 of them), so I started looking for a way to summarize a text in three sentences. This article describes a relatively simple but surprisingly effective way to create such a summary.
The algorithm
The goal of this algorithm is to summarize the content of a few paragraphs of text into a few sentences. The sentences will be taken from the original text. No text generation will be used.
The idea is that the objective of the text can be found by identifying the most used words in it. Common stop words will be excluded. After finding the most used words, the sentences that contain these words the most are selected, using a weighted counting algorithm (the more often a word is used in the text, the higher its weight). The sentences with the highest weight are selected and form the summary:
1. Count occurrences per word in the text (stop words excluded)
2. Calculate weight per used word
3. Calculate sentence weight by summing the weights of its words
4. Find sentences with the highest weight
5. Place these sentences in their original order
The algorithm uses the Natural Language Toolkit (NLTK) to split the text into sentences and the sentences into words. NLTK, together with numpy, is installed using pip:
pip install nltk numpy
We start by importing the required modules and building the list of stop words to exclude from the algorithm:
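A minimal sketch of this setup, with a small illustrative sample of English stop words (a real list would be much longer):

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download('punkt')  # tokenizer models, only needed once

# A small sample of English stop words; use a complete list in practice.
stop_words = {
    'the', 'a', 'an', 'and', 'or', 'but', 'if', 'of', 'to', 'in',
    'on', 'for', 'with', 'as', 'at', 'by', 'is', 'are', 'was', 'were',
    'be', 'been', 'it', 'this', 'that', 'i', 'you', 'he', 'she', 'we',
    'they', 'not', 'no', 'so', 'do', 'does', 'have', 'has', 'had',
}
```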
A list of stop words per language can be found easily on the internet.
The first step of the algorithm is building the dictionary with the word frequencies of the text:
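A sketch of this step, building on the setup above (`text` holds the input text):

```python
# Step 1: count occurrences per word, excluding stop words and punctuation
word_weights = {}
for word in word_tokenize(text):
    word = word.lower()                  # 'Cat' and 'cat' are the same word
    if len(word) > 1 and word not in stop_words:
        if word in word_weights:
            word_weights[word] += 1      # seen before: increase the count
        else:
            word_weights[word] = 1       # first occurrence
```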
The code builds a dictionary with the words as keys and the number of occurrences of each word as values. The text is split into words using the word tokenizer of NLTK, and all words are converted to lower case so that a word in the middle of a sentence and a word at the start of a sentence are counted as the same word. Otherwise, 'cat' would be identified as a different word than 'Cat'.
If the length of the word is at least two characters, which filters out punctuation tokens like ',' and '.', and it is not in the list of stop words, the number of occurrences is increased by one if the word is already in the dictionary, or the word is added to the dictionary with a count of 1.
When the complete text has been parsed, the word_weights dictionary contains all words with their respective counts. The actual count is used as the word weight. It is possible to scale these values between 0 and 1 by dividing each count by the highest count, but although this might be the intuitive thing to do, it does not change the working of the algorithm. Skipping this division saves precious time.
Now we can determine the weight of each sentence in the text and find the highest weights:
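A sketch of this step, reusing the `word_weights` dictionary from above (`n` is the number of sentences the summary should have):

```python
# Step 3: a sentence's weight is the sum of the weights of its words
sentence_weights = {}
for sentence in sent_tokenize(text):
    sentence_weights[sentence] = 0
    for word in word_tokenize(sentence):
        word = word.lower()
        if word in word_weights:
            sentence_weights[sentence] += word_weights[word]

# Step 4: the weights of the n most important sentences
n = 3
highest_weights = sorted(sentence_weights.values())[-n:]
```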
First, the text is split into sentences with sent_tokenize(). Each sentence is then split into words, and the individual word weights are summed per sentence to determine the sentence weights. After these loops, the dictionary sentence_weights contains the weight, and thus the importance, of each sentence.
The most important sentence weights can be found by taking the values from the dictionary, sorting them, and taking the last n values, where n is the number of sentences we want in the summary. The variable highest_weights contains the weights of the sentences that need to end up in the summary.
The last step is combining these sentences into a summary. There are two options: we can put them in order of importance, or we can use the original order in which they occur in the supplied text. After some experiments, the latter option proved to be the better one:
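A sketch of this final step:

```python
# Step 5: build the summary in the original sentence order.
# Python dictionaries preserve insertion order (3.7+), so iterating
# over sentence_weights visits the sentences as they occur in the text.
summary = ''
for sentence, weight in sentence_weights.items():
    if weight in highest_weights:
        summary += sentence + ' '
summary = summary.strip()  # cleanup: remove the trailing space
```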
The summary is created by walking through all sentences and their weights and keeping the sentences whose weight appears in highest_weights. Since a dictionary keeps the order of addition, the sentences are visited according to their occurrence in the text. Finally, some cleanup takes place and we end up with a surprisingly accurate summary.
The final step is to combine all steps into a function:
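Combining the snippets above, the function could look like this (the name `summarize` and the `num_sentences` parameter are illustrative choices):

```python
def summarize(text, num_sentences=3):
    # Steps 1 and 2: count word occurrences; the count doubles as the weight
    word_weights = {}
    for word in word_tokenize(text):
        word = word.lower()
        if len(word) > 1 and word not in stop_words:
            word_weights[word] = word_weights.get(word, 0) + 1

    # Step 3: sum the word weights per sentence
    sentence_weights = {}
    for sentence in sent_tokenize(text):
        sentence_weights[sentence] = 0
        for word in word_tokenize(sentence):
            word = word.lower()
            if word in word_weights:
                sentence_weights[sentence] += word_weights[word]

    # Step 4: the weights of the most important sentences
    highest_weights = sorted(sentence_weights.values())[-num_sentences:]

    # Step 5: keep the selected sentences in their original order
    summary = ''
    for sentence, weight in sentence_weights.items():
        if weight in highest_weights:
            summary += sentence + ' '
    return summary.strip()
```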
This function plays an important role in my news archive project. Reducing a text from tens of sentences to three sentences really helps.
The code is not flawless. The sentence tokenizer is not the best, but it is quite fast. Sometimes I end up with a summary of four sentences because it misses the separation between two sentences. But the speed is worth these errors.
The same is true for the final part, where the summary is created. If multiple sentences share the lowest weight present in highest_weights, all of them will be added to the summary.
It is not always necessary to write flawless code. A summary with one sentence too many does not break the bank or crash the system. In this case, speed takes precedence over accuracy.
If a larger text needs to be summarized, break it up into parts of 20 to 50 sentences, paragraphs, or chapters, and generate a summary for each part. Combining these summaries results in the summary for the whole text. One naive way to do this:
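A sketch of this approach, assuming the `summarize` function from above (the name `summarize_large_text` and the `chunk_size` parameter are illustrative):

```python
def summarize_large_text(text, chunk_size=30, num_sentences=3):
    # Split into sentences, group them into chunks, summarize each chunk
    sentences = sent_tokenize(text)
    partial_summaries = []
    for start in range(0, len(sentences), chunk_size):
        chunk = ' '.join(sentences[start:start + chunk_size])
        partial_summaries.append(summarize(chunk, num_sentences))
    return ' '.join(partial_summaries)
```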
Note that this code extracts sentences from the text, groups them, and concatenates them into a single string before calling the summarize method, which then extracts the sentences again. If large texts need to be summarized regularly, adapt the summarize function to accept a list of sentences.
The function works for multiple languages. I have tested it with English, Dutch, and German, and when you use the right list of stop words it works for each of them.
Enjoy!
Final words
I hope you enjoyed this article. For more inspiration, check some of my other articles:
- Getting started with F1 analysis and Python
- Solar panel power generation analysis
- Perform a function on columns in a CSV file
- Create a heatmap from the logs of your activity tracker
- Parallel web requests with Python
If you like this story, please hit the Follow button!
Disclaimer: The views and opinions included in this article belong only to the author.