According to recent estimates, around 403 million terabytes of data are produced by humans every day. As this volume of data continues to grow, so does the need for a convenient way to organize, manage, and store it. Relational databases are a popular way to collect and store data, and the Structured Query Language (SQL) is the standard programming language for managing and querying these databases.
Knowledge of SQL programming is a prerequisite skill for most data-related jobs. The hiring managers we spoke with shared that aspiring data professionals must demonstrate proficiency in SQL. Portfolio projects are the perfect opportunity to showcase your SQL skills to recruiters. When you enroll in our SQL Fundamentals skill path, you'll not only master the commonly used SQL commands, you'll also complete portfolio projects you can be proud of.
In this post, you'll learn how to do the following:
- Write SQL queries to extract data from two or more databases
- Use SQL dot commands to simplify tasks
- Perform aggregation with the `GROUP BY` and `PARTITION BY` clauses
- Use window functions to simplify complex SQL queries
- Read SQL queries from a file and send SQL query results to a file
The knowledge you'll gain from this post will enable you to write more efficient SQL queries and automate SQL tasks using the command line, and give you a solid foundation to build upon when learning more advanced SQL concepts. You can download all the files you'll need for this post here.
1. Aggregation Over the Whole Table
When an aggregate function is used without a `GROUP BY` clause, the aggregation is done over the entire table. In this task, we'll write a SQL query to perform aggregation over the `reviews` table.
Let's install `ipython-sql` and set up our notebook to capture information sent to standard output and standard error, then load the `sql` extension:
First, run the following in a notebook cell (the `!` prefix hands the command to your shell):
!pip install ipython-sql
Next, run the following commands in separate cells in your notebook:
%%capture
%load_ext sql
Next, let's connect to the `pitchfork.sqlite` file and run a query:
%sql sqlite:///pitchfork.sqlite
* sqlite:///pitchfork.sqlite
Done.
%%sql
SELECT name,
sql
FROM sqlite_master
WHERE type = 'table' AND name = 'reviews'
name | sql |
---|---|
reviews | CREATE TABLE reviews ( reviewid INTEGER, title TEXT, artist TEXT, url TEXT, score REAL, best_new_music INTEGER, author TEXT, author_type TEXT, pub_date TEXT, pub_weekday INTEGER, pub_day INTEGER, pub_month INTEGER, pub_year INTEGER) |
To derive key summary statistics, we can use aggregation over the whole table to better understand our data. The SQL query below calculates summary statistics for the `reviews` table:
%%sql
SELECT COUNT(*) AS total_row_count,
ROUND(AVG(score), 2) AS avg_review_score,
MAX(score) AS max_review_score,
       MIN(score) AS min_review_score
FROM reviews;
total_row_count | avg_review_score | max_review_score | min_review_score |
---|---|---|---|
18393 | 7.01 | 10.0 | 0.0 |
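The same whole-table aggregation pattern can be reproduced outside the notebook with Python's built-in `sqlite3` module. This is a minimal sketch using a small invented in-memory table, not the actual `pitchfork.sqlite` data:

```python
import sqlite3

# Build a small in-memory stand-in for the reviews table
# (the scores here are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (reviewid INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, 7.0), (2, 9.5), (3, 4.5), (4, 8.0)],
)

# With no GROUP BY, each aggregate collapses the whole table to one row.
row = conn.execute(
    """
    SELECT COUNT(*),
           ROUND(AVG(score), 2),
           MAX(score),
           MIN(score)
    FROM reviews;
    """
).fetchone()
print(row)  # (4, 7.25, 9.5, 4.5)
conn.close()
```

Because there is no grouping, the four aggregates all operate on every row and the query returns exactly one row.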
2. Understanding Aggregation with `GROUP BY` and `PARTITION BY`
One of the conditions for First Normal Form (1NF) is that the rows of a table in a database have no meaningful order. This means that we should use the `ORDER BY` clause instead of relying on the SQL engine to return data in a particular sequence.
Another important concept to understand when using `GROUP BY` is the idea of a bare column. When we group data, only the columns explicitly listed in the `GROUP BY` clause can be used directly in the `SELECT` statement. Any other column that is not used for grouping must be wrapped in an aggregate function such as `AVG()`, `SUM()`, or `COUNT()`. If a column appears in the `SELECT` statement without an aggregate function and is not included in the grouping, it's called a bare column.
SQLite, unlike some other databases, does not strictly enforce this rule and will return a seemingly arbitrary row's value for the bare column within each group. In practice, SQLite tends to return the first row it encounters in the group, but this behavior is not guaranteed and should not be relied upon for correctness.
Let's look at an example of this behavior in action:
%%sql
SELECT author,
pub_year,
title,
score
FROM reviews
GROUP BY author, pub_year
HAVING author = 'jonah bromwich'
ORDER BY reviewid ASC;
author | pub_year | title | score |
---|---|---|---|
jonah bromwich | 2012 | the crystal ark | 5.6 |
jonah bromwich | 2013 | sunset blvd. | 7.3 |
jonah bromwich | 2014 | the church | 4.9 |
jonah bromwich | 2015 | technoself | 7.2 |
jonah bromwich | 2016 | yes lawd! | 8.2 |
In this query, we're using `GROUP BY` to aggregate the data by `author` and `pub_year`. Since `title` and `score` are not part of the grouping, they are bare columns. SQLite resolves this by returning the first row it encounters for each group, based on the order of `reviewid`. This behavior can lead to inconsistent or unintended results when retrieving data.
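You can observe this bare-column behavior directly with a few rows of invented data and Python's `sqlite3` module; the same query would be rejected outright by stricter engines such as PostgreSQL:

```python
import sqlite3

# Toy data to illustrate SQLite's bare-column behavior (values invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (author TEXT, pub_year INTEGER, title TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [
        ("jonah", 2015, "first album", 7.2),
        ("jonah", 2015, "second album", 5.5),
        ("jonah", 2015, "third album", 7.7),
    ],
)

# title and score are bare columns: they appear in SELECT but are neither
# grouped nor aggregated. SQLite silently picks the value from one
# (effectively arbitrary) row of each group instead of raising an error.
rows = conn.execute(
    "SELECT author, pub_year, title, score FROM reviews GROUP BY author, pub_year"
).fetchall()
print(rows)  # one row per (author, pub_year) group
conn.close()
```

The query collapses the three rows into a single group, and the `title` and `score` values come from whichever row SQLite happened to pick.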
To confirm SQLite's handling of bare columns, we can explicitly assign row numbers to each subset of the data using `PARTITION BY`. Unlike `GROUP BY`, `PARTITION BY` does not collapse rows; it allows us to keep all records while applying window functions:
%%sql
SELECT author,
pub_year,
title,
score,
ROW_NUMBER() OVER (PARTITION BY author, pub_year) AS row_number
FROM reviews
WHERE author = 'jonah bromwich'
ORDER BY row_number ASC
LIMIT 5;
author | pub_year | title | score | row_number |
---|---|---|---|---|
jonah bromwich | 2012 | the crystal ark | 5.6 | 1 |
jonah bromwich | 2013 | sunset blvd. | 7.3 | 1 |
jonah bromwich | 2014 | the church | 4.9 | 1 |
jonah bromwich | 2015 | technoself | 7.2 | 1 |
jonah bromwich | 2016 | yes lawd! | 8.2 | 1 |
Since we now explicitly assign row numbers, we can see that SQLite is selecting the first row in each partition. Now, let's analyze a specific year in which jonah bromwich had multiple reviews published:
%%sql
SELECT author,
pub_year,
title,
score,
ROW_NUMBER() OVER (PARTITION BY author, pub_year) AS row_number
FROM reviews
WHERE author = 'jonah bromwich' AND pub_year = 2015
ORDER BY row_number ASC;
author | pub_year | title | score | row_number |
---|---|---|---|---|
jonah bromwich | 2015 | technoself | 7.2 | 1 |
jonah bromwich | 2015 | know it all | 5.5 | 2 |
jonah bromwich | 2015 | howl | 7.7 | 3 |
jonah bromwich | 2015 | timeline | 6.3 | 4 |
jonah bromwich | 2015 | four pink walls ep | 6.6 | 5 |
jonah bromwich | 2015 | the life and times of raphael de la ghett | 6.8 | 6 |
jonah bromwich | 2015 | begging please | 7.5 | 7 |
jonah bromwich | 2015 | dark red | 5.9 | 8 |
jonah bromwich | 2015 | dena tape 2 | 5.7 | 9 |
jonah bromwich | 2015 | sunset mountain | 6.4 | 10 |
In 2015, jonah bromwich published reviews of 10 records. The average score for these reviews, calculated by hand, is: $$ {(7.2+5.5+7.7+6.3+6.6+6.8+7.5+5.9+5.7+6.4) \over 10 } = {65.6 \over 10} = 6.56 $$ In the following SQL query, we'll use aggregate functions to derive the publication counts (`COUNT`) and the average score (`AVG`) per year, and arrive at the same result as our manual calculation above:
%%sql
SELECT author,
pub_year,
COUNT(*) AS count_pub,
ROUND(AVG(score), 2) AS avg_score
FROM reviews
GROUP BY author, pub_year
HAVING author = 'jonah bromwich';
author | pub_year | count_pub | avg_score |
---|---|---|---|
jonah bromwich | 2012 | 13 | 7.04 |
jonah bromwich | 2013 | 22 | 6.89 |
jonah bromwich | 2014 | 22 | 6.62 |
jonah bromwich | 2015 | 10 | 6.56 |
jonah bromwich | 2016 | 14 | 7.28 |
This ensures we correctly aggregate data without relying on SQLite's handling of bare columns. Using aggregate functions properly prevents data inconsistencies and makes results predictable.
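To double-check the arithmetic, here is a small sketch that loads the ten 2015 scores from the article into an in-memory SQLite table and confirms that `COUNT(*)` and `AVG(score)` match the manual calculation:

```python
import sqlite3

# The ten 2015 review scores from the article; the hand calculation
# gives 65.6 / 10 = 6.56.
scores = [7.2, 5.5, 7.7, 6.3, 6.6, 6.8, 7.5, 5.9, 5.7, 6.4]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (author TEXT, pub_year INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES ('jonah bromwich', 2015, ?)",
    [(s,) for s in scores],
)

# Grouping by author and year, the aggregates summarize each group.
count_pub, avg_score = conn.execute(
    """
    SELECT COUNT(*), ROUND(AVG(score), 2)
    FROM reviews
    GROUP BY author, pub_year;
    """
).fetchone()
print(count_pub, avg_score)  # 10 6.56
conn.close()
```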
3. Querying Two Databases and Truncating the Output
Sometimes, the question that we want to answer may require that we query two or more databases. In this example, we'll write a simple SQL query to return the number of NIPS papers published and number of Pitchfork reviews completed between 2010 and 2015.
To do so, we'll continue to work with the `pitchfork` database. This is our main database connection. However, we can attach several databases to our main connection with the `ATTACH DATABASE` SQL command.
Let's now attach the `nips` database to our main connection:
%%sql
ATTACH DATABASE 'nips.sqlite' AS nips;
* sqlite:///pitchfork.sqlite
Done.
If you make this database connection using the SQLite command-line tool and run the `.databases` command, you'll see both databases listed: the `pitchfork` database connection is named `main`, and the attached connection is named `nips`. To query the databases, we use the following syntax:
SELECT * FROM database_name.table_name
To accomplish our task, we'll use a Common Table Expression (CTE) to get our results from the two databases, and then combine the results in our main SQL query:
%%sql
-- Start of CTE
WITH
no_of_papers AS (
SELECT
year,
COUNT(*) AS number_of_papers
FROM nips.papers
GROUP BY year
),
no_of_reviews AS (
SELECT
pub_year AS year,
COUNT(*) AS number_of_reviews
FROM main.reviews
GROUP BY pub_year
)
-- End of CTE
SELECT *
FROM no_of_papers
FULL OUTER JOIN no_of_reviews
USING (year)
LIMIT 6 OFFSET 23;
year | number_of_papers | number_of_reviews |
---|---|---|
2010 | 292 | 1170 |
2011 | 306 | 1165 |
2012 | 368 | 1185 |
2013 | 360 | 1200 |
2014 | 411 | 1162 |
2015 | 403 | 1135 |
The `LIMIT X OFFSET Y` clause enables us to truncate our result. `LIMIT 6` means we want to return only 6 rows (2010 to 2015 inclusive), and `OFFSET 23` means the result set skips the first 23 rows, so our output starts at 2010.
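As a sketch of how `LIMIT ... OFFSET ...` truncates a result set, the following uses a throwaway in-memory table of years (the range is invented to mirror the example above):

```python
import sqlite3

# A small table of years to show how LIMIT/OFFSET truncates a result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE years (year INTEGER)")
conn.executemany("INSERT INTO years VALUES (?)", [(y,) for y in range(1987, 2017)])

# Skip the first 23 rows (1987-2009), then return the next 6 (2010-2015).
rows = conn.execute(
    "SELECT year FROM years ORDER BY year LIMIT 6 OFFSET 23"
).fetchall()
print([r[0] for r in rows])  # [2010, 2011, 2012, 2013, 2014, 2015]
conn.close()
```

Note the explicit `ORDER BY`: without it, the rows skipped by `OFFSET` are not guaranteed to be the ones you expect.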
Since we're done using this data, let's detach the `nips` database from the main connection:
%sql DETACH DATABASE nips;
And finally, close the main connection:
%sql --close sqlite:///pitchfork.sqlite
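The attach/detach workflow also works outside the notebook. Here's a sketch using Python's `sqlite3` module with two temporary database files; the file names and the `papers` table are invented stand-ins for `pitchfork.sqlite` and `nips.sqlite`:

```python
import os
import sqlite3
import tempfile

# Two throwaway database files standing in for the article's databases.
tmp = tempfile.mkdtemp()
main_path = os.path.join(tmp, "main.sqlite")
other_path = os.path.join(tmp, "other.sqlite")

# Populate the second database file.
con = sqlite3.connect(other_path)
con.execute("CREATE TABLE papers (year INTEGER)")
con.execute("INSERT INTO papers VALUES (2015)")
con.commit()
con.close()

# Connect to the "main" database, then attach the second file under a name.
conn = sqlite3.connect(main_path)
conn.execute("ATTACH DATABASE ? AS nips", (other_path,))

# Attached tables are addressed as schema_name.table_name.
year = conn.execute("SELECT year FROM nips.papers").fetchone()[0]
print(year)  # 2015

# Detach when finished, mirroring the notebook workflow above.
conn.execute("DETACH DATABASE nips")
conn.close()
```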
4. Run SQL Query from File and Send SQL Query Output to File
You may find that you need to execute a SQL query from a text file (e.g., a `.sql` file). In this example, we want to query the `chinook.db` database to get the top 5 spenders on Jazz music, using the query stored in the `top_spenders_jazz.sql` file.
We can achieve this in several ways. The first way is to run the following from the command line:
sqlite3 -column -header chinook.db < top_spenders_jazz.sql
full_name amount_spent
---------------- ------------
Fernanda Ramos 15.84
Enrique Muñoz 15.84
Hannah Schneider 14.85
Astrid Gruber 14.85
Kara Nielsen 3.96
The above command executes the query in the `top_spenders_jazz.sql` file against the `chinook.db` database file. In the second method, we'll first connect to the database file and execute the query using the sql magic function:
%sql sqlite:///chinook.db
Next, we'll use the sql magic equivalent of the `.read` command to execute the query:
%sql -f ./top_spenders_jazz.sql
* sqlite:///chinook.db
Done.
full_name | amount_spent |
---|---|
Fernanda Ramos | 15.84 |
Enrique Muñoz | 15.84 |
Hannah Schneider | 14.85 |
Astrid Gruber | 14.85 |
Kara Nielsen | 3.96 |
If you're wondering how the `.read` command works, it's similar to the second method shown above. In the SQLite command-line shell, you first connect to the database, then read the query with the `.read` command:
sqlite3 chinook.db
.read top_spenders_jazz.sql
What if you want to send the output of the query to a file? If you have some knowledge of Python, you can:
- Assign the SQL query output to a variable name
- Write the output to a file
# Assign the query output to the variable 'result'
result = %sql -f ./top_spenders_jazz.sql
# Print the result
print(result)
# Write the output to a file
with open('names_of_jazz_top_spenders.sql', 'w') as file:
    file.write(str(result))
* sqlite:///chinook.db
Done.
+------------------+--------------+
| full_name | amount_spent |
+------------------+--------------+
| Fernanda Ramos | 15.84 |
| Enrique Muñoz | 15.84 |
| Hannah Schneider | 14.85 |
| Astrid Gruber | 14.85 |
| Kara Nielsen | 3.96 |
+------------------+--------------+
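If you prefer not to rely on notebook magics, the same idea works with the standard `sqlite3` module: run the query, then write the rows to a file. The table, data, and output file name below are invented for illustration:

```python
import sqlite3

# An invented stand-in for the query result: a tiny spenders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spenders (full_name TEXT, amount_spent REAL)")
conn.executemany(
    "INSERT INTO spenders VALUES (?, ?)",
    [("Fernanda Ramos", 15.84), ("Hannah Schneider", 14.85)],
)

rows = conn.execute(
    "SELECT full_name, amount_spent FROM spenders ORDER BY amount_spent DESC"
).fetchall()

# Write one line per row, tab-separated, with a header.
with open("top_spenders.txt", "w") as f:
    f.write("full_name\tamount_spent\n")
    for name, amount in rows:
        f.write(f"{name}\t{amount}\n")

conn.close()
```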
It's even easier to write to a file using the `.output` command in the SQLite shell:
.output names_of_jazz_top_spenders_cli.sql
.read top_spenders_jazz.sql
.output stdout
The `.output name_of_file` command redirects standard output to the file, so the result of the `.read` command is saved in the file rather than printed to the screen. To discontinue writing to the file and resume writing to standard output (printing to the screen), run `.output stdout`. The file `names_of_jazz_top_spenders_cli.sql` will appear in your working directory with the result of the query inside.
5. Simplifying Aggregation with Window Functions
You may find yourself writing subqueries and using `GROUP BY` when aggregating with SQL. SQL window functions and the `PARTITION BY` clause can simplify these types of complex SQL queries.
In this example, we'll investigate the best-selling album in the `chinook` database. We'll see how to calculate the best-selling album using `GROUP BY` and then using window functions.
First, let's create a virtual table, `album_track_purchase`, with the `CREATE VIEW` command; it will contain only the data we require for our analysis:
%%sql
CREATE VIEW album_track_purchase AS
SELECT
al.album_id,
al.title AS album_title,
tr.name AS track_name,
il.unit_price * il.quantity AS amount
FROM album AS al
INNER JOIN track AS tr
USING (album_id)
INNER JOIN invoice_line AS il
USING (track_id)
ORDER BY 1;
Next, let's preview the view we just created:
%sql SELECT * FROM album_track_purchase LIMIT 5;
album_id | album_title | track_name | amount |
---|---|---|---|
1 | For Those About To Rock We Salute You | Night Of The Long Knives | 0.99 |
1 | For Those About To Rock We Salute You | For Those About To Rock (We Salute You) | 0.99 |
1 | For Those About To Rock We Salute You | Put The Finger On You | 0.99 |
1 | For Those About To Rock We Salute You | Let's Get It Up | 0.99 |
1 | For Those About To Rock We Salute You | Inject The Venom | 0.99 |
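As a minimal sketch of what `CREATE VIEW` gives us, the following builds a tiny invented `track` table, defines a view with a computed `amount` column, and queries the view like a table:

```python
import sqlite3

# A view is a stored query that behaves like a read-only table.
# The track table and its values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (name TEXT, unit_price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO track VALUES (?, ?, ?)",
    [("Song A", 0.99, 2), ("Song B", 0.99, 1)],
)

# The view exposes a derived column (amount) computed from the base table.
conn.execute(
    "CREATE VIEW track_purchase AS "
    "SELECT name, unit_price * quantity AS amount FROM track"
)

rows = conn.execute("SELECT * FROM track_purchase ORDER BY name").fetchall()
print(rows)  # [('Song A', 1.98), ('Song B', 0.99)]

# Views can be dropped when no longer needed.
conn.execute("DROP VIEW track_purchase")
conn.close()
```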
To find the best-performing album, we need to count how many tracks were sold from each album, and calculate the revenue generated and the percent of total revenue.
%%sql
SELECT album_title,
COUNT(*) AS count,
ROUND(SUM(amount), 2) AS revenue,
ROUND(100 * SUM(amount) / (SELECT SUM(amount) FROM album_track_purchase), 2) AS percent_revenue
FROM album_track_purchase
GROUP BY album_title
ORDER BY percent_revenue DESC
LIMIT 5;
album_title | count | revenue | percent_revenue |
---|---|---|---|
Are You Experienced? | 187 | 185.13 | 3.93 |
Faceless | 96 | 95.04 | 2.02 |
Mezmerize | 93 | 92.07 | 1.96 |
Get Born | 90 | 89.1 | 1.89 |
The Doors | 83 | 82.17 | 1.74 |
In the SQL query above, we grouped the table by `album_title`, so `COUNT(*)` counts all the rows for a particular album; the result is the total number of tracks sold from that album. `SUM(amount)` sums the total revenue generated by the sales of those tracks.
To calculate `percent_revenue`, we need the total revenue generated from a particular album divided by the total revenue generated from the sales of tracks from all the albums. Since we're grouping our data by `album_title`, we can calculate the former but not the latter.
To calculate the total revenue generated from the sales of tracks from all the albums, we use the subquery `(SELECT SUM(amount) FROM album_track_purchase)`. Did you notice that we're using the aggregation function `SUM` without a `GROUP BY` clause?
Remember that when we do this, we're aggregating over the entire table, so our subquery correctly calculates the grand total of revenue from the table.
Next, let's see how to do this with window functions and `PARTITION BY` instead of `GROUP BY`:
%%sql
SELECT DISTINCT album_title,
COUNT(*) OVER (PARTITION BY album_title) AS count,
ROUND(SUM(amount) OVER (PARTITION BY album_title), 2) AS revenue,
ROUND(100 * SUM(amount) OVER (PARTITION BY album_title) / SUM(amount) OVER(), 2) AS percent_revenue
FROM album_track_purchase
ORDER BY percent_revenue DESC
LIMIT 5;
album_title | count | revenue | percent_revenue |
---|---|---|---|
Are You Experienced? | 187 | 185.13 | 3.93 |
Faceless | 96 | 95.04 | 2.02 |
Mezmerize | 93 | 92.07 | 1.96 |
Get Born | 90 | 89.1 | 1.89 |
The Doors | 83 | 82.17 | 1.74 |
In the above SQL query, we're partitioning the table by `album_title`, so rows with the same album title fall into the same partition. Since we're performing aggregation per album, we want each album title returned only once, not every time it appears in the data, which is why we use `DISTINCT album_title`.
`COUNT(*) OVER (PARTITION BY album_title)` counts all the times (rows) a particular album title appears in the table, and `SUM(amount) OVER (PARTITION BY album_title)` sums all the amounts for those rows to get the revenue for that album.
But how can we calculate the total revenue for the entire table? Remember that we used a subquery to do this previously.
When we use a window function without specifying `PARTITION BY` inside `OVER()`, the window function performs the aggregation over the entire table. Therefore, `SUM(amount) OVER()` sums the revenue over the entire table. This is definitely easier than writing a subquery!
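The empty `OVER()` trick is easy to verify on a toy table (album names and amounts invented); note that window functions require SQLite 3.25+, which is bundled with Python 3.8 and later:

```python
import sqlite3

# Toy purchases table: album "A" sold twice for 1.0, album "B" once for 2.0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (album TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 2.0)],
)

# PARTITION BY gives per-album totals; the empty OVER() gives the grand
# total, replacing the subquery from the GROUP BY version.
rows = conn.execute(
    """
    SELECT DISTINCT album,
           SUM(amount) OVER (PARTITION BY album) AS revenue,
           ROUND(100 * SUM(amount) OVER (PARTITION BY album)
                     / SUM(amount) OVER (), 2) AS percent_revenue
    FROM purchases
    ORDER BY album;
    """
).fetchall()
print(rows)  # [('A', 2.0, 50.0), ('B', 2.0, 50.0)]
conn.close()
```

Each album earned 2.0 of the 4.0 grand total, so both end up with 50 percent of revenue, and `DISTINCT` collapses the duplicated rows for album "A".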
Since we're done with our data, let's drop the view we created and close the database connection:
%sql DROP VIEW album_track_purchase;
* sqlite:///chinook.db
Done.
%sql --close sqlite:///chinook.db
Takeaways
In this post, we learned:
- How to aggregate data over an entire table
- How to aggregate with `GROUP BY` and `PARTITION BY`
- How to use SQL dot commands
- How to read a SQL query from a file and write output to a file
- How to use SQL window functions
We have taken a deep dive into how to write efficient SQL queries and provided an introduction to some advanced SQL functions. What you've learned from this post will help you explore SQL more confidently.
If you want to learn more, enroll in our SQL Fundamentals skill path to bolster your skills. You'll learn how to:
- Perform joining operations
- Use aggregate functions
- Write subqueries and CTEs
- Filter and sort data
- And more...
In addition, you'll complete several SQL portfolio projects to demonstrate your SQL competencies to recruiters as you begin your job search. Knowledge of SQL programming will open a host of opportunities for you in the field of data analytics, data science, and data engineering. We encourage you to sign up for one of our SQL courses and begin your SQL Developer journey with us.
An amazing SQL Developer is hard to find, but impossible to forget