In this post, we’ll show you where to find datasets for various projects in the following areas:
- Excel
- Python
- R
- Data science
- Data visualization
- Data cleaning
- Machine learning
- Probability and statistics
Whether you want to strengthen your portfolio by showing that you can visualize data well, or you have a spare few hours and want to practice your machine learning skills, this article has everything you need.
Looking for Datasets to Build Projects? We’ve Got You Covered
If you’re trying to find free datasets so that you can learn by building projects, we have plenty of options for you.
Here at Dataquest, a majority of our courses contain projects for you to complete using real, high-quality datasets. The projects are designed to help you showcase your skills and give you something to add to your portfolio.
If you’re interested, check out some of the projects we have available below. Signing up is completely free and the datasets are downloadable.
Excel
- Identify Customers Likely to Churn: Use an Excel dataset to conduct an exploratory data analysis (EDA) for a telecommunications provider to identify customers that are at risk of churn.
- Analyze Retail Sales: Work with retail sales data to explore trends and relationships. Build basic models to confirm the statistical significance of your insights.
Our Data Analysis with Excel path contains 2 other projects. Sign up for free here.
Data Cleaning (Python)
- Analyze Star Wars Surveys: Use survey data to better understand Star Wars fans.
- Explore eBay Car Sales Data: Use a custom-scraped dataset of eBay’s used car listings to practice data cleaning and data exploration.
- Find Heavy Traffic Indicators on I-94: Use a dataset about traffic on an interstate highway and perform exploratory data visualization.
- Explore Hacker News Posts: Use a dataset from Hacker News submissions to practice using loops, cleaning strings, and dates in Python.
Our Data Cleaning with Python path contains 4 other projects. Sign up for free here.
Data Analysis and Visualization (Python)
- Create Data Visualization on Euro Exchange Rates: Use a dataset from the European Central Bank to create visualizations using Matplotlib.
- Determine Which Mobile Apps Attract More Users: Use two separate datasets to analyze Android and iOS apps to determine the types of apps that are likely to attract users.
Our Data Analysis and Visualization with Python path contains 3 other projects. Sign up for free here.
Data Analysis (R)
- Investigate COVID-19 Trends: Use a Kaggle dataset and RStudio to analyze important COVID-19 trends.
- Analyze Sales Data of a Bookstore: Use a dataset to apply control flow loops and functions to create a reusable data workflow.
Our R Basics for Data Analysis path contains 2 other projects. Sign up for free here.
Machine Learning (Python)
- Predict House Sale Prices: Use housing data from a city in the United States to build and improve linear regression models.
- Predict the Stock Market: Use historical data from the S&P 500 Index to make predictions about future prices.
- Predict Bike Rentals: Use a dataset of bike rentals and apply decision trees and random forests to predict the number of future bike rentals.
Our Machine Learning Intro with Python path contains 15 other projects. Sign up for free here.
Probability and Statistics (Python)
- Investigate Fandango Movie Ratings: Use a custom data set made by our team and perform practical analysis to determine if there’s a bias in Fandango’s movie rating system.
- Find the Best Markets to Advertise In: Use survey data from freeCodeCamp to determine the most effective advertising markets for an e-learning platform.
- Build a Spam Filter: Use an SMS spam collection dataset to build a spam filter using conditional probability and Naive Bayes.
Our Probability and Statistics with Python path contains 9 other projects. Sign up for free here.
Public Datasets for Data Visualization Projects
A typical data visualization project might be something along the lines of “I want to make an infographic about how income varies across the different states in the US.” There are a few considerations to keep in mind when looking for a good dataset for a data visualization project:
- It shouldn’t be messy, because you don’t want to spend a lot of time cleaning data.
- It should be nuanced and interesting enough to make charts about.
- Ideally, each column should be well-explained, so the visualization is accurate.
- The data set shouldn’t have too many rows or columns, so it’s easy to work with.
Good places to find good datasets for data visualization projects are news sites that release their data publicly. They typically clean the data for you and already have charts that you can replicate or improve.
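Before committing to a dataset, it can help to run a quick check against the criteria above. The following is a minimal sketch, assuming the data is a CSV with a header row; the `summarize_csv` helper and the sample data are hypothetical and only illustrate the idea of counting rows, columns, and empty cells.

```python
import csv
import io


def summarize_csv(text):
    """Return the row count, column names, and number of empty cells per column."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    missing = {name: 0 for name in columns}
    rows = 0
    for row in reader:
        rows += 1
        for name in columns:
            value = row.get(name)
            # Treat absent or whitespace-only cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return {"rows": rows, "columns": columns, "missing": missing}


# Hypothetical sample data, standing in for a downloaded file.
SAMPLE = "state,median_income\nOhio,58000\nUtah,\nTexas,64000\n"

summary = summarize_csv(SAMPLE)
print(summary)
```

A small row count, a handful of columns, and few missing cells suggest the dataset will be quick to chart. To check a real file, you could pass the result of `open(path).read()` instead of the sample string.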
1. FiveThirtyEight
FiveThirtyEight is an incredibly popular interactive news and sports site started by Nate Silver. They write interesting data-driven articles, like “Don’t blame a skills gap for lack of hiring in manufacturing” and “2016 NFL Predictions.”
FiveThirtyEight makes the datasets used in its articles available online on GitHub.
View the FiveThirtyEight Datasets
Here are some examples:
- Airline Safety — contains information on accidents from each airline.
- US Weather History — historical weather data for the US.
- Study Drugs — data on who’s taking Adderall in the US.
2. BuzzFeed
BuzzFeed started as a purveyor of low-quality articles, but has since evolved and now writes some investigative pieces, like “The court that rules the world” and “The short life of Deonte Hoard.”
BuzzFeed makes the data sets used in its articles available on GitHub.
Here are some examples:
- Federal Surveillance Planes — contains data on planes used for domestic surveillance.
- Zika Virus — data about the geography of the Zika virus outbreak.
- Firearm Background Checks — data on background checks of people attempting to buy firearms.
3. NASA
NASA is a publicly-funded government organization, and thus all of its data is public. It maintains websites where anyone can download its datasets related to earth science and datasets related to space. You can even sort by format on the earth science site to find all of the available CSV datasets, for example.
Public Datasets for Data Processing Projects
Sometimes you just want to work with a large dataset. The end result doesn’t matter as much as the process of reading in and analyzing the data. You might use tools like Spark or Hadoop to distribute the processing across multiple nodes. Things to keep in mind when looking for a good data processing dataset:
- The cleaner the data, the better — cleaning a large dataset can be very time consuming.
- The dataset should be interesting.
- There should be an interesting question that can be answered with the data.
Good places to find large public data sets are cloud-hosting providers like Amazon and Google. They have an incentive to host the data sets because they make you analyze them using their infrastructure (and pay them to use it).
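To get a feel for what distributed tools do with a large dataset, here is a single-machine sketch of the map and reduce pattern in plain Python. It is not the Spark or Hadoop API; the function names, chunk size, and sample lines are all assumptions made for illustration. Each chunk is counted independently (the step a cluster would spread across nodes), and the partial results are then merged.

```python
from collections import Counter
from functools import reduce


def map_chunk(lines):
    """Count words in one chunk; on a cluster, each node would handle a chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def word_count(lines, chunk_size=2):
    """Map every chunk to partial counts, then reduce them into one total."""
    partials = [map_chunk(chunk) for chunk in chunked(lines, chunk_size)]
    return reduce(lambda left, right: left + right, partials, Counter())


# Hypothetical sample standing in for a much larger text file.
LINES = ["big data big", "data tools", "big tools"]

totals = word_count(LINES)
print(totals.most_common())
```

Because each chunk is processed without looking at the others, the same logic scales out naturally once the data no longer fits on one machine.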
4. AWS Public Data sets
Amazon makes large datasets available on its Amazon Web Services platform. You can download the data and work with it on your own computer or analyze the data in the cloud using EC2 and Hadoop via EMR. You can read more about how the program works here.
Amazon has a page that lists all of the datasets for you to browse. You’ll need an AWS account, although Amazon provides a free access tier for new accounts that will enable you to explore the data without being charged.
Here are some examples:
- Lists of n-grams from Google Books — common words and groups of words from a huge set of books.
- Common Crawl Corpus — data from a crawl of over 5 billion web pages.
- Landsat Images — moderate resolution satellite images of the surface of the Earth.
5. Google Public Data sets
Much like Amazon, Google also has a cloud-hosting service, called Google Cloud Platform. With GCP, you can use a tool called BigQuery to explore large datasets.
Google lists all of the data sets on a page. You’ll need to sign up for a GCP account, but the first 1TB of queries you make are free.
Here are some examples:
- USA Names — contains all Social Security name applications in the US, from 1879 to 2015.
- GitHub Activity — contains all public activity on over 2.8 million public GitHub repositories.
- Historical Weather — data from 9000 NOAA weather stations from 1929 to 2016.
6. Wikipedia
Wikipedia is a free, online, community-edited encyclopedia. Wikipedia contains an astonishing breadth of knowledge, containing pages on everything from the Ottoman-Habsburg Wars to Leonard Nimoy. As part of Wikipedia’s commitment to advancing knowledge, they offer their content for free and regularly generate dumps of all the articles on the site. Additionally, Wikipedia offers edit history and activity, so you can track how a page on a topic evolves over time and who contributes to it.
You can find the various ways to download the data on the Wikipedia site. You’ll also find scripts to reformat the data in various ways.
Here are some examples:
- All Images and Other Media from Wikipedia — all the images and other media files on Wikipedia.
- Full Site Dumps — of the content on Wikipedia, in various formats.
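Site dumps are large XML files, so it usually makes sense to stream them rather than load everything into memory. Here is a minimal sketch using `xml.etree.ElementTree.iterparse` from the Python standard library. The sample document is a heavily simplified, hypothetical stand-in for a real dump, and the `iter_pages` helper is our own name, not part of any Wikipedia tooling.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' becomes 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs one page at a time from an XML stream."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the memory used by the finished page


# Simplified, hypothetical sample; real dumps carry many more fields.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Leonard Nimoy</title>
    <revision><text>Actor and director.</text></revision></page>
  <page><title>Ottoman-Habsburg wars</title>
    <revision><text>A series of conflicts.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for page_title, page_text in pages:
    print(page_title, len(page_text))
```

For a real dump, you would pass an open file object (or a decompressing stream such as one from `bz2.open`) instead of the in-memory sample.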
Public Datasets for Machine Learning Projects
When you’re working on a machine learning project, you want to be able to predict a column from the other columns in a dataset. In order to be able to do this, we need to make sure that:
- The dataset isn’t too messy — if it is, we’ll spend all of our time cleaning the data.
- There’s an interesting target column to make predictions for.
- The other variables have some explanatory power for the target column.
There are a few online repositories of datasets that are specifically for machine learning. These datasets are typically cleaned up beforehand, and allow for testing of algorithms very quickly.
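One rough way to check whether the other columns have explanatory power for the target is to compute each column's correlation with it. The sketch below implements the Pearson correlation by hand using only the standard library. The column names and numbers are made up for illustration, and correlation only captures linear relationships, so treat it as a first look rather than a verdict.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical housing data: two candidate features and a target column.
features = {
    "sqft": [800, 1200, 1500, 2000],
    "house_number": [12, 7, 33, 21],
}
price = [100, 150, 190, 250]

# Rank features by the strength of their linear relationship to the target.
ranked = sorted(
    ((name, pearson(values, price)) for name, values in features.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:.3f}")
```

Columns near the top of the ranking are the most promising predictors; columns with correlations close to zero may still matter in non-linear models.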
7. Kaggle
Kaggle is a data science community that hosts machine learning competitions. There are a variety of externally-contributed, interesting datasets on the site. Kaggle has both live and historical competitions. You can download data for either, but you have to sign up for Kaggle and accept the terms of service for the competition.
You can download data from Kaggle by entering a competition. Each competition has its own associated dataset. There are also user-contributed datasets found in the new Kaggle Datasets offering.
Here are some examples:
- Satellite Photograph Order — a dataset of satellite photos of Earth — the goal is to predict which photos were taken earlier than others.
- Manufacturing Process Failures — a dataset of variables that were measured during the manufacturing process. The goal is to predict faults with the manufacturing.
- Multiple Choice Questions — a dataset of multiple choice questions and the corresponding correct answers. The goal is to predict the answer for any given question.
8. UCI Machine Learning Repository
The UCI Machine Learning Repository is one of the oldest sources of datasets on the web. Although the datasets are user-contributed, and thus have varying levels of documentation and cleanliness, the vast majority are clean and ready for machine learning to be applied. UCI is a great first stop when looking for interesting datasets.
You can download data directly from the UCI Machine Learning repository, without registration. These datasets tend to be fairly small, and don’t have a lot of nuance, but are good for machine learning.
View UCI Machine Learning Repository
Here are some examples:
- Email Spam — contains emails, along with a label of whether or not they’re spam.
- Wine Classification — contains various attributes of 178 different wines.
- Solar Flares — attributes of solar flares, useful for predicting characteristics of flares.
9. Quandl
Quandl is a repository of economic and financial data. Some of this information is free, but many datasets require purchase. Quandl is useful for building models to predict economic indicators or stock prices. Due to the large number of available datasets, it’s possible to build a complex model that uses many datasets to predict values in another.
Here are some examples:
- Entrepreneurial Activity By Race and Other Factors — contains data from the Kauffman foundation on entrepreneurs in the US.
- US Federal Reserve Data — US economic indicators, from the Federal Reserve.
Public Datasets for Data Cleaning Projects
When looking for a good dataset for a data cleaning project, you want it to:
- Be spread over multiple files.
- Have a lot of nuance, and many possible angles to take.
- Require a good amount of research to understand.
- Be as “real-world” as possible.
These types of datasets are typically found on aggregators of datasets. These aggregators tend to have datasets from multiple sources, without much curation. Too much curation gives us overly neat datasets that are hard to do extensive cleaning on.
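To show what cleaning a dataset spread over multiple files can involve, here is a minimal sketch that combines two CSV files whose headers and spacing don't match. The file contents, column names, and helper functions are hypothetical; real aggregator data will have its own quirks.

```python
import csv
import io


def normalize_header(name):
    """Make header names comparable: trim, lowercase, use underscores."""
    return name.strip().lower().replace(" ", "_")


def load_rows(text, source):
    """Read one CSV file and tag each cleaned row with where it came from."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        row = {
            normalize_header(key): (value or "").strip()
            for key, value in raw.items()
            if key is not None
        }
        row["source"] = source
        rows.append(row)
    return rows


# Hypothetical files from two different years with inconsistent headers.
FILE_2014 = "School Name, Budget\nLincoln High, 1200000\n"
FILE_2015 = "school_name,budget\nLincoln High,1250000\nRoosevelt,\n"

combined = load_rows(FILE_2014, "2014") + load_rows(FILE_2015, "2015")
missing_budget = [row for row in combined if not row["budget"]]

print(len(combined), "rows,", len(missing_budget), "missing a budget")
```

Keeping a `source` field on every row makes it easier to trace odd values back to the file they came from, which matters when the "correct" version of a dataset is unclear.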
10. data.world
data.world describes itself as ‘the social network for data people,’ but could be more correctly described as ‘GitHub for data.’ It’s a place where you can search for, copy, analyze, and download datasets. In addition, you can upload your data to data.world and use it to collaborate with others.
In a relatively short time it has become one of the ‘go to’ places to acquire data, with lots of user contributed datasets as well as fantastic datasets through data.world’s partnerships with various organizations, including a large amount of data from the US Federal Government.
One key differentiator of data.world is that they have built tools to make working with data easier – you can write SQL queries within their interface to explore data and join multiple datasets. They also have SDKs for R and Python to make it easier to acquire and work with data in your tool of choice. (You might be interested in reading our tutorial on the data.world Python SDK.)
11. Data.gov
Data.gov is a relatively new site that’s part of a US effort towards open government. Data.gov makes it possible to download data from multiple US government agencies. Data can range from government budgets to school performance scores. Much of the data requires additional research, and it can sometimes be hard to figure out which dataset is the “correct” version. Anyone can download the data, although some datasets require additional hoops to be jumped through, like agreeing to licensing agreements.
You can browse the data sets on Data.gov directly, without registering. You can browse by topic area or search for a specific dataset.
Here are some examples:
- Food Environment Atlas — contains data on how local food choices affect diet in the US.
- School System Finances — a survey of the finances of school systems in the US.
- Chronic Disease Data — data on chronic disease indicators in areas across the US.
12. The World Bank
The World Bank is a global development organization that offers loans and advice to developing countries. The World Bank regularly funds programs in developing countries, then gathers data to monitor the success of these programs.
You can browse World Bank datasets directly, without registering. The datasets have many missing values, and sometimes take several clicks to actually get to data.
Here are some examples:
- World Development Indicators — contains country-level information on development.
- Educational Statistics — data on education by country.
- World Bank Project Costs — data on World Bank projects and their corresponding costs.
13. /r/datasets
Reddit, a popular community discussion site, has a section devoted to sharing interesting datasets. It’s called the datasets subreddit, or /r/datasets. The scope of these datasets varies a lot, since they’re all user-submitted, but they tend to be very interesting and nuanced.
You can browse the subreddit here. You can also see the most highly upvoted datasets here.
Here are some examples:
- All Reddit Submissions — contains Reddit submissions through 2015.
- Jeopardy Questions — questions and point values from the game show Jeopardy.
- New York City Property Tax Data — data about properties and assessed value in New York City.
14. Academic Torrents
Academic Torrents is a site geared around sharing the datasets from scientific papers. It’s a newer site, so it’s hard to tell what the most common types of datasets will look like. For now, it has tons of interesting datasets that lack context.
You can browse the datasets directly on the site. Since it’s a torrent site, all of the datasets can be immediately downloaded, but you’ll need a BitTorrent client. Deluge is a good free option.
View Academic Torrents Datasets
Here are some examples:
- Enron Emails — a set of many emails from executives at Enron, a company that famously went bankrupt.
- Student Learning Factors — a set of factors that measure and influence student learning.
- News Articles — contains news article attributes and a target variable.
Bonus: Streaming data
It’s very common when you’re building a data science project to download a dataset and then process it. However, as online services generate more and more data, an increasing amount is generated in real-time, and not available in dataset form. Some examples of this include data on tweets from Twitter, and stock price data. There aren’t many good sources to acquire this kind of data, but we’ll list a few in case you want to try your hand at a streaming data project.
15. Twitter
Twitter has a good streaming API, and makes it relatively straightforward to filter and stream tweets. You can get started here. There are tons of options here — you could figure out what states are the happiest, or which countries use the most complex language. We also recently wrote an article to get you started with the Twitter API here.
Get started with the Twitter API
16. GitHub
GitHub has an API that allows you to access repository activity and code. You can get started with the API here. The options are endless — you could build a system to automatically score code quality, or figure out how code evolves over time in large projects.
Get started with the GitHub API
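As a starting point for exploring repository activity, here is a minimal sketch that counts commits per author. It assumes the "list commits" endpoint of GitHub's REST API (`/repos/{owner}/{repo}/commits`); the `fetch_commits` and `commits_per_author` helpers are our own names, and the sample response is a trimmed, made-up example that keeps only the fields the code reads. Unauthenticated requests are rate limited, so the network call is defined but not run here.

```python
import json
import urllib.request
from collections import Counter

API_ROOT = "https://api.github.com"


def fetch_commits(owner, repo):
    """Download the most recent commits for a public repository."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/commits"
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "dataset-demo",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def commits_per_author(commits):
    """Count commits by the author name recorded in each commit."""
    return Counter(item["commit"]["author"]["name"] for item in commits)


# Trimmed, made-up sample shaped like the API response.
SAMPLE_RESPONSE = json.loads("""
[
  {"commit": {"author": {"name": "Ada", "date": "2024-01-01T10:00:00Z"},
              "message": "Add parser"}},
  {"commit": {"author": {"name": "Grace", "date": "2024-01-02T09:30:00Z"},
              "message": "Fix typo"}},
  {"commit": {"author": {"name": "Ada", "date": "2024-01-03T14:15:00Z"},
              "message": "Add tests"}}
]
""")

counts = commits_per_author(SAMPLE_RESPONSE)
print(counts.most_common())
```

To try it on live data, you could call `commits_per_author(fetch_commits(owner, repo))` with the owner and name of any public repository.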
17. Wunderground
Wunderground has an API for weather forecasts that is free for up to 500 API calls per day. You could use these calls to build up a set of historical weather data, and make predictions about the weather tomorrow.
Get started with the Wunderground API
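When an API caps you at a fixed number of calls per day, it helps to track your usage so a long-running collection script doesn't get cut off. Below is a minimal sketch of a daily quota counter. The `DailyQuota` class is hypothetical and not part of any weather API client; the only figure taken from above is the limit of 500 calls per day.

```python
from datetime import date


class DailyQuota:
    """Track how many calls have been made on the current day."""

    def __init__(self, limit=500):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today):
        """Return True and record a call if today's budget allows it."""
        if today != self.day:
            # A new day has started, so the counter resets.
            self.day, self.used = today, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


# Small limit so the behavior is easy to see.
quota = DailyQuota(limit=2)
day_one = date(2024, 1, 1)
results = [quota.try_acquire(day_one) for _ in range(3)]
next_day = quota.try_acquire(date(2024, 1, 2))

# Spreading 500 calls evenly over 24 hours means one call roughly
# every 173 seconds.
seconds_between_calls = 24 * 60 * 60 / 500

print(results, next_day, seconds_between_calls)
```

In a real script, you would call `try_acquire(date.today())` before each request and sleep or stop when it returns `False`.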
18. Global Health Observatory
The World Health Organization (WHO) maintains a large dataset on global health at the Global Health Observatory (GHO). The dataset includes all the WHO data on the COVID-19 global pandemic. The GHO offers a diverse range of data on topics such as antimicrobial resistance, dementia, air pollution, and immunization.
You can find data on pretty much any health-related topic at the GHO, making it an extremely valuable free dataset resource for data scientists working in the health field.
19. Pew Research Center
The Pew Research Center is well-known for political and social science research. In the interest of furthering research and public discourse, they make all of their datasets publicly downloadable for secondary analysis, after a set period of time elapses.
You can choose from datasets on US politics, journalism and media, internet and tech, science and society, religion and public life, amongst other topics.
20. National Climatic Data Center
Climate change is a hot topic at the moment, if you’ll pardon the pun. Data scientists who want to crunch the numbers on weather and climate can access large US datasets from the National Centers for Environmental Information (NCEI).
Bonus: Personal Data
The internet is full of cool datasets you can work with. But for something truly unique, what about analyzing your own personal data?
Here are some popular sites that make it possible to download and work with data you’ve generated.
21. Amazon
Amazon allows you to download your personal spending data, order history, and more. To access it, click this link (you’ll need to be logged in for it to work) or navigate to the Accounts and Lists button in the top right.
On the next page, look for the Ordering and Shopping Preferences section, and click on the link under that heading that says “Download order reports.” Here is a simple data project tutorial that you could do using your own Amazon data to analyze your spending habits.
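As a taste of what analyzing your spending habits might look like, here is a minimal sketch that totals order amounts by month. The column names (`order_date`, `total`), the date format, and the sample rows are assumptions made for illustration; check the header row of your own report and adjust the names to match.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal


def spending_by_month(text):
    """Sum order totals per 'YYYY-MM' month from CSV text."""
    totals = defaultdict(Decimal)
    for row in csv.DictReader(io.StringIO(text)):
        month = row["order_date"][:7]  # assumes ISO dates like 2023-01-05
        # Remove currency symbols and thousands separators before parsing.
        amount = Decimal(row["total"].replace("$", "").replace(",", ""))
        totals[month] += amount
    return dict(totals)


# Made-up sample with hypothetical column names.
SAMPLE = '''order_date,total
2023-01-05,$12.50
2023-01-20,$7.25
2023-02-02,"$1,030.00"
'''

monthly = spending_by_month(SAMPLE)
for month, amount in sorted(monthly.items()):
    print(month, amount)
```

Using `Decimal` instead of floating-point numbers avoids rounding surprises when adding up currency amounts.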
22. Facebook
Facebook also allows you to download your personal activity data. To access it, click this link (you’ll need to be logged in for it to work) and select the types of data you’d like to download. Here is an example of a simple data project you could build using your own personal Facebook data.
23. Netflix
Netflix allows you to request your own data for download, although it will make you jump through a few hoops, and will warn you that the process of collating your data may take 30 days. As of the last time we checked, the data they allow you to download is fairly limited, but it could still be suitable for some types of projects and analysis.
Extra Bonus: Powerful Dataset Search Tool
24. Google Dataset Search
OK, so this isn’t strictly a dataset – rather a search tool to find relevant datasets. As you already know, Google is a data powerhouse, so it makes sense that their search tool knocks the socks off of other ways to find specific datasets.
All you need to do is head over to Google Dataset Search and type a keyword or phrase related to the dataset you’re looking for in the search bar. The results will list all the datasets indexed on Google for that particular search term. The datasets are generally from high-quality sources, of which some are free and others available for a fee or subscription.
Next steps
In this post, we covered good places to find datasets for any type of data science project. We hope that you find something interesting that you want to sink your teeth into!
At Dataquest, our interactive guided projects are designed to help you start building a data science portfolio to demonstrate your skills to employers and get a job in data. If you’re interested, you can sign up and do our first module for free.
If you liked this, you might like to read the other posts in our ‘Build a Data Science Portfolio’ series:
- Storytelling with data
- How to set up a data science blog
- Building a machine learning project
- The key to building a data science portfolio that will get you a job
- How to present your data science portfolio on GitHub
from Dataquest https://ift.tt/8jz1VFs
via RiYo Analytics