With more than 40,000 searches happening on Google every second, Google Trends is a powerful tool that lets us visualize search behavior and uncover trends in Web Search, Google News, Google Images, Google Shopping, and YouTube.
A sample of that size can provide plenty of insight to inform a marketing strategy, decide which products or services to focus on, identify interests by location, and much more.
To get more out of the tool, today we’ll build a simple Google Trends scraper using PyTrends, an unofficial Google Trends API.
How to Build a Google Trends Scraper with PyTrends
Now that we know the basics, let’s start writing our PyTrends Script:
1. Install Python and PyTrends
If you’re using a Mac, you probably already have a version of Python installed on your machine. To check if that’s the case, enter python --version into your terminal.
For those of you who don’t have any version of Python installed or want to upgrade, we recommend using Homebrew; installation instructions are available on the Homebrew site.
With the Homebrew package manager installed, you can now install the latest version of Python with the command brew install python3. You’ll be able to verify the installation with the python3 --version command.
The next step is to install PyTrends using pip install pytrends. If you don’t have pip installed on your machine, Python can now install it without any extra tool. Just go to your terminal and type sudo -H python -m ensurepip.
Your development environment should be ready to go now!
Note: when we check the documentation for PyTrends, it says that it requires Requests, LXML, and Pandas. If you don’t have those, just pip install them as well.
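As a quick sanity check, a short script along these lines can report which of those packages Python can currently find. It is only a sketch: the missing_packages helper is a name made up for this example, and it checks that a package is importable, not which version is installed.

```python
from importlib.util import find_spec

# Packages the PyTrends documentation lists, plus PyTrends itself.
REQUIRED = ["pytrends", "requests", "lxml", "pandas"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that Python cannot locate."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing:", ", ".join(missing))
    print("Try: pip install " + " ".join(missing))
else:
    print("All dependencies found.")
```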
2. Connect to Google Trends Using PyTrends
Alright, let’s create a new file named ‘gtrends-scraper.py’, open it in a text editor, and add our first lines to connect to Google Trends.
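Those first lines might look something like the sketch below. TrendReq is the client class PyTrends exposes; the connect wrapper and the tz=360 value (a timezone offset in minutes, as used in the PyTrends README) are choices made for this example rather than requirements.

```python
def connect(hl="en-US", tz=360):
    """Return a PyTrends client.

    Creating the client contacts Google, so this needs a network connection.
    """
    # Imported inside the function so this file still loads if pytrends
    # has not been installed yet.
    from pytrends.request import TrendReq

    return TrendReq(hl=hl, tz=tz)


# Usage, once pytrends is installed:
# pytrends = connect()
```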
Note: hl stands for ‘host language’ and can be changed to any other language you might need.
3. Write the Payload
The payload is where we’ll store all the parameters of our request to be sent to the server. When checking PyTrends’ documentation, we can see there are five different inputs we can add to our payload (the same ones we would use on Google Trends itself).
- kw_list (list of keywords we want to analyze)
- cat (category)
- timeframe
- geo (region or location of the data)
- gprop (Google’s property)
Let’s take a look at the snippet presented in the documentation:
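The example in the PyTrends README is close to the following sketch. The parameters are grouped in a dictionary and sent through a small send_payload helper, both of which are conveniences added for this example; build_payload itself is the PyTrends method, and pytrends is assumed to be an already connected TrendReq client.

```python
# The keyword is passed as a list, even though there is only one.
kw_list = ["Blockchain"]

# The remaining four inputs from the list above.
payload = {
    "cat": 0,                  # 0 = all categories (the default)
    "timeframe": "today 5-y",  # the last five years
    "geo": "",                 # empty string = worldwide
    "gprop": "",               # empty string = web search
}


def send_payload(pytrends, kw_list, payload):
    """Hand the request parameters to a connected PyTrends client."""
    pytrends.build_payload(kw_list, **payload)
```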
The cat is equal to 0 as a default. If you go to Google Trends and change the category, you’ll be able to see that every category has a value assigned.
For example, for ‘Arts & Entertainment’ cat is equal to 3. Depending on your needs, you can change this value to whatever category you want to select. For this example, let’s set it as 14 for ‘People & Society’.
Note: Notice that although ‘blockchain’ is only one keyword, it is still passed as a list. Also, we can send at most 5 keywords at a time, which is the same limit Google Trends’ site places on comparisons. You can still keep more than 5 in your list as long as you pass them one by one.
We’ll store our parameter values in variables to make them easier to work with. So here’s how our code is looking so far:
- We use a variable named all_keywords because we’ll use it to pass each keyword individually; if not, it would compare them with each other.
- We added a variable for timeframes to be able to analyze the keyword over different timeframes. However, the [0] will select the first one in the list.
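Put together, the setup described above could look roughly like this. The keyword list and the set of timeframes are assumptions for illustration (the two keywords are the ones discussed later in this article), and payload_for is a hypothetical helper that simply gathers the arguments for one keyword.

```python
# Keywords to analyze. Each one is sent on its own so they are not
# compared against each other. (Example values, not a fixed list.)
all_keywords = ["event management", "event planning"]

# Candidate timeframes; [0] below picks the first one.
timeframes = ["today 5-y", "today 12-m", "today 3-m", "today 1-m"]

cat = 14    # 'People & Society'
geo = ""    # empty string = worldwide
gprop = ""  # empty string = web search


def payload_for(keyword):
    """Collect the build_payload arguments for a single keyword."""
    return {
        "kw_list": [keyword],  # still a list, even with one keyword
        "cat": cat,
        "timeframe": timeframes[0],
        "geo": geo,
        "gprop": gprop,
    }


print(payload_for(all_keywords[0]))
```

With a connected client, these arguments could then be passed along as pytrends.build_payload(**payload_for(kw)).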
4. Extract Data from Google Trends
The first thing we need to implement is a temporary list called keywords, which starts out empty: keywords = []. Then, we’ll wrap our payload inside a function called check_trends().
Plus, we’ll add a new variable to our function that will hold the pandas.DataFrame returned by PyTrends: data = pytrends.interest_over_time().
The for loop at the end will take a keyword from our main list and append it to our empty list. Then it will pass only that keyword to our check_trends() function for analysis before popping it out, leaving an empty list again, and pushing the next one.
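The function and loop described above can be sketched as follows. To keep the example easy to test, the client is passed in as an argument rather than created globally, and the results are collected in a dictionary by a collect_trends helper; both are choices made here, not something the original script necessarily does.

```python
def check_trends(pytrends, keywords, timeframe="today 5-y",
                 cat=14, geo="", gprop=""):
    """Send one payload and return the interest-over-time data."""
    pytrends.build_payload(keywords, cat=cat, timeframe=timeframe,
                           geo=geo, gprop=gprop)
    # With a real TrendReq client this is a pandas DataFrame.
    data = pytrends.interest_over_time()
    return data


def collect_trends(pytrends, all_keywords):
    """Query each keyword on its own so they are never compared."""
    keywords = []  # temporary list holding one keyword at a time
    results = {}
    for kw in all_keywords:
        keywords.append(kw)                      # push the keyword
        results[kw] = check_trends(pytrends, keywords)
        keywords.pop()                           # back to an empty list
    return results


# Usage, assuming `pytrends` is a connected TrendReq client:
# results = collect_trends(pytrends, ["event management", "event planning"])
```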
5. Determining If There’s a Trend
Our code as-is will get data from Google Trends and bring it back. However, we’ll need to do further analysis to interpret the data.
Let’s start by calculating the mean:
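One way to do that, shown here on made-up weekly values so it runs without a network connection, is a small helper like mean_interest (a name chosen for this sketch). With the DataFrame returned by interest_over_time(), pandas can compute the same figure directly with data[kw].mean().

```python
from statistics import fmean


def mean_interest(weekly_values):
    """Average interest over the period, rounded to two decimals."""
    return round(fmean(weekly_values), 2)


# Made-up interest scores (0-100) standing in for one keyword's column.
sample = [20, 25, 30, 100, 25]

print("mean:", mean_interest(sample))
print("peak:", max(sample))
```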
Here’s what the output should look like:
Here’s where understanding how Google Trends works will pay off. Google Trends scales interest over time relative to the peak, so within our five-year timeframe there was a point when interest hit 100. A low mean therefore tells us that, over the five years, interest has stayed well below that peak.
To make it easier to visualize, here’s the graph for ‘event management’:
As you can see, over the period of five years, it has stayed pretty stable with the highest peak around February 2020. We could read this query as stable but not rising in popularity. So for a new business, there’s a stable need for event management professionals, services, and software.
Ok, we might need more information to jump to that conclusion, but as you can see, that’s exactly the kind of assumption we can make by using data from Google Trends.
For example, event planning at 25.96 is fairly low in comparison to the peak, so we could conclude that this is a query without much popularity, right? Well, maybe. But remember that this is a mean; it could be a seasonal query with peaks closer to 100 in specific months.
Automating Conclusions With PyTrends
Here’s an exercise: we want to know how last year’s trend compares to the rest of the timeframe.
What we’re doing at this point is calculating the average interest without the last 52 weeks (Google Trends sends us weekly data, so those weeks represent the last year) and then expressing the change between the two periods as a percentage.
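A sketch of that comparison, again on made-up data: the yearly_change helper below (a hypothetical name) splits the weekly values into the last 52 weeks and everything before them, then reports the difference as a percentage. This is one reasonable reading of the calculation, not necessarily the exact formula behind the figures in this article.

```python
from statistics import fmean


def yearly_change(weekly_values, weeks=52):
    """Percent change of the last `weeks` values versus the earlier ones."""
    values = list(weekly_values)
    earlier, recent = values[:-weeks], values[-weeks:]
    if not earlier:
        raise ValueError("need more than `weeks` data points")
    baseline = fmean(earlier)
    if baseline == 0:
        raise ValueError("earlier period has no recorded interest")
    return round((fmean(recent) - baseline) / baseline * 100, 2)


# Four flat years at 50 followed by one year at 40 (illustrative numbers).
history = [50] * 208 + [40] * 52
print(yearly_change(history), "%")
```

A negative result means the last year sits below the earlier average; a positive one means interest has grown.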
This is what the outcome would look like:
So there’s a new conclusion to be made. Yes, the mean interest of the queries has been fairly stable (especially for ‘event management’), but the overall interest has been decreasing in the last year.
However, we can automate more conclusions faster using Python than just going through the data by hand. These are a few conclusions (but not the only ones) we could also run:
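As an illustration of what such automated conclusions might look like, the sketch below turns a mean and a year-over-year change into a short label. The describe_trend name and both thresholds (a mean of 50 for "popular", a 5% band for "stable") are arbitrary assumptions for this example and should be tuned to your own data.

```python
def describe_trend(mean, change_pct, popular_mean=50.0, stable_band=5.0):
    """Label a keyword from its mean interest and yearly change (in %)."""
    popularity = "popular" if mean >= popular_mean else "niche"
    if change_pct > stable_band:
        direction = "rising"
    elif change_pct < -stable_band:
        direction = "declining"
    else:
        direction = "stable"
    return f"{popularity} and {direction}"


# Illustrative inputs: a low mean with a drop over the last year.
print(describe_trend(25.96, -20.0))
```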
The power of automating Google Trends with PyTrends is that we can automate as many conclusions as we want, and by changing the list of keywords, we can pull a lot of conclusions with just the press of a button.
Before we say goodbye, we want to show you, fairly quickly, another route you can take to pull trends data using JavaScript this time.
Note: If you want to learn more about Python for web scraping, check our guides on how to scrape multiple pages with Python and Scrapy, and how to build a Beautiful Soup scraper from scratch.
Google Trends Scraping with Fetch and Cheerio: Alternative Route
Data is crucial for any business decision. For this use case, let’s say we want to start a new business but we don’t want to just depend on intuition. We want to create a product that has demand right now.
Setting Up Our Development Environment
To set up our development environment, follow these instructions:
- First, download and install Node.js on your machine: https://nodejs.org/en/download/.
- Next, create a new folder for your project (we’ll use the same one we used for our PyTrends project) and navigate to it from your terminal.
- Inside the folder, enter the command npm init -y to create the necessary files.
- To install our dependencies, let’s type npm install cheerio and then npm i node-fetch.
Pretty simple, right? So let’s move to the next step!
Fetching Our Target URL
Exploding Topics is a relatively new site that analyzes millions of web searches to identify rising (or exploding) queries and topics. This makes it a handy alternative to Google Trends, with one difference: we don’t need to analyze the data ourselves to draw conclusions, because the website is designed to show only trending and rising queries.
We’ll change the parameters to ‘1 month’ (to try to catch those new rising topics) and business, and grab the resulting URL ‘https://explodingtopics.com/business-topics-this-month’.
import fetch from 'node-fetch';
import { load } from 'cheerio';

(async function() {
  // Request the page and read the response body as text
  const res = await fetch('https://explodingtopics.com/business-topics-this-month');
  const text = await res.text();
})();
Our code is now fetching the URL and then storing the response in the text constant.
That said, there’s one more thing we need to do before parsing the response.
If we want our scraper to scale, we can’t just send all requests through our IP address. Websites will quickly figure out our script isn’t a human and will block it, making our web scraper useless.
Integrating Fetch with ScraperAPI
For the next step, we want to tell Fetch to send the request through ScraperAPI’s servers. There, it will change the IP address automatically for every request, handle any CAPTCHAs that might get in the way, and use years of statistical analysis to determine the best header to use for the request.
All of this will be handled for us by just adding a few parameters to our URL.
First, we’ll create a free ScraperAPI account to generate an API key.
With our API key, we can now build our target URL:
const res = await fetch('http://api.scraperapi.com?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://explodingtopics.com/business-topics-this-month');
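If you prefer to assemble that URL programmatically, for instance from the Python script in the first half of this article, a helper along these lines percent-encodes the target so its own characters cannot be mistaken for ScraperAPI parameters. The scraperapi_url name and the YOUR_API_KEY placeholder are assumptions for this sketch; the endpoint and the api_key and url parameters are the ones shown above.

```python
from urllib.parse import urlencode

SCRAPERAPI_ENDPOINT = "http://api.scraperapi.com"


def scraperapi_url(api_key, target_url):
    """Build a ScraperAPI request URL with the target safely encoded."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return SCRAPERAPI_ENDPOINT + "?" + query


print(scraperapi_url(
    "YOUR_API_KEY",
    "https://explodingtopics.com/business-topics-this-month",
))
```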
Parsing the HTML Response with Cheerio
Now that our scraper is much harder to block, we can send the response to Cheerio for parsing.
First of all, we need to declare Cheerio at the top of our file like this: import { load } from 'cheerio'. Then we add const $ = load(text) to our function.
Done! Our response is now in Cheerio and we can navigate it using CSS selectors.
Let’s say that we want to pull the title, description, and searches per month. For that, let’s grab the bigger element that wraps the elements we’re looking for, a <div> with the class “topicInfoContainer”:
const containers = $('.topicInfoContainer').toArray();
As you can see, we’re converting the matched elements into an array. This will allow us to select several elements inside the container we just grabbed.
Using CSS selectors, we can then get every element inside the array.
For the sake of simplicity and time, here’s the finished code:
If you run this code in your terminal (adding your own API key), the output should look like this:
Note: For a more in-depth guide on using JavaScript for web scraping, check our step-by-step tutorial on building a Node.js web scraper.
Congratulations! You have now successfully built not only one but two web scrapers that will bring more and better data to your business.
Whether you are using PyTrends to interact with Google Trends and automate conclusions or using Node.js to scrape Exploding Topics to find new business opportunities, ScraperAPI is ready to help you scrape the internet without getting blocked by anti-scraping techniques.
Until next time, happy scraping!
from Featured Blog Posts - Data Science Central https://ift.tt/3HmitR7
via RiYo Analytics