Page Nav

HIDE

Breaking News:

latest

Ads Place

SQL Queries You Need to Know (with Practical Examples)

https://ift.tt/wKdBpPx According to recent estimates, around 403 million terrabytes of data are produced by humans every day. As this volu...

https://ift.tt/wKdBpPx

According to recent estimates, around 403 million terrabytes of data are produced by humans every day. As this volume of data continues to grow, so does the need for a convenient way to organize, manage, and store it. Relational databases are a popular way to collect and store data. The Structured Query Language (SQL) is the standard programming language for managing and querying these databases.

Knowledge of SQL programming is a prerequisite skill for most data-related jobs. The hiring managers we spoke with shared that aspiring data professionals must demonstrate proficiency in SQL. Portfolio projects are the perfect opportunity to showcase your SQL skills to recruiters. When you enroll in our SQL Fundamentals skill path, you'll not only master the commonly used SQL commands, you'll also complete porfolio projects you can be proud of.

In this post, you'll learn how to do the following:

  • Write SQL queries to extract data from two or more databases
  • Use SQL dot commands to simplify tasks
  • Perform aggregation with GROUP BY and PARTITION BY commands
  • Use window functions to simplify complex SQL queries
  • Read SQL queries from file and send SQL query results to file

The knowledge you'll gain from this post will enable you to write more efficient SQL queries and automate SQL tasks using the command line, and give you a solid foundation to build upon when learning more advanced SQL concepts. You can download all the files you'll need for this post here.


1. Aggregation Over the Whole Table

When an aggregrate function is used without the GROUP BY clause, the aggregation is done over the entire table. In this task, we'll write a SQL query to perform aggregation over the reviews table.

Let's install ipython-sql and set up our notebook to capture information sent to the standard output and standard error, then load the sql extention:

First, in your terminal or shell, run:

!pip install ipython-sql

Next, run the following commands in separate cells in your notebook:

%%capture
%load_ext sql

Next, let's connect to the pitchfork.sqlite file and run a query:

%sql sqlite:///pitchfork.sqlite
* sqlite:///pitchfork.sqlite
Done.
%%sql
SELECT name, 
       sql
  FROM sqlite_master
 WHERE type = 'table' AND name = 'reviews'
name sql
reviews CREATE TABLE reviews (
reviewid INTEGER,
title TEXT,
artist TEXT,
url TEXT,
score REAL,
best_new_music INTEGER,
author TEXT,
author_type TEXT,
pub_date TEXT,
pub_weekday INTEGER,
pub_day INTEGER,
pub_month INTEGER,
pub_year INTEGER)

To derive key summary statistics, we can use aggregation over the whole table to better understand our data. The SQL query below calculates the reviews table stats:

%%sql
SELECT COUNT(*) AS total_row_count,
       ROUND(AVG(score), 2) AS avg_review_score,
       MAX(score) AS max_review_score,
       MIN(score) AS min_review_score,
  FROM reviews;
total_row_count avg_review_score max_review_score min_review_score
18393 7.01 10.0 0.0

2. Understanding Aggregation with GROUP BY and PARTITION BY

One of the conditions for the First Normal Form (1NF) is that the rows of a table in a database have no meaningful order. This means that we should use the ORDER BY command instead of relying on the SQL engine to return data in a particular sequence.

Another important concept to understand when using GROUP BY is the idea of a bare column. When we group data, only the columns explicitly listed in the GROUP BY clause can be used directly in the SELECT statement. Any other column that is not used for grouping must be wrapped in an aggregate function such as AVG(), SUM(), or COUNT(). If a column appears in the SELECT statement without an aggregate function and is not included in the grouping, it's called a bare column.

SQLite, unlike some other databases, does not strictly enforce this rule and will return a seemingly arbitrary row’s value for the bare column within each group. In reality, SQLite returns the first encountered row in the group, but this behavior is not guaranteed and should not be relied upon for correctness.

Let's look at an example of this behavior in action:

%%sql
SELECT author, 
       pub_year,
       title,
       score
  FROM reviews
 GROUP BY author, pub_year
HAVING author = 'jonah bromwich'
 ORDER BY reviewid ASC;
author pub_year title score
jonah bromwich 2012 the crystal ark 5.6
jonah bromwich 2013 sunset blvd. 7.3
jonah bromwich 2014 the church 4.9
jonah bromwich 2015 technoself 7.2
jonah bromwich 2016 yes lawd! 8.2

In this query, we're using GROUP BY to aggregate the data by author and pub_year. Since title and score are not part of the grouping, they are bare columns. SQLite resolves this by returning the first row it encounters for each group based on the order of reviewid. This behavior can lead to inconsistent or unintended results when retrieving data.

To confirm SQLite’s handling of bare columns, we can explicitly assign row numbers to each subset of the data using PARTITION BY. Unlike GROUP BY, PARTITION BY does not collapse rows—it allows us to maintain all records while applying window functions:

%%sql
SELECT author,
       pub_year,
       title,
       score,
       ROW_NUMBER() OVER (PARTITION BY author, pub_year) AS row_number
  FROM reviews
 WHERE author = 'jonah bromwich'
 ORDER BY row_number ASC
 LIMIT 5;
author pub_year title score row_number
jonah bromwich 2012 the crystal ark 5.6 1
jonah bromwich 2013 sunset blvd. 7.3 1
jonah bromwich 2014 the church 4.9 1
jonah bromwich 2015 technoself 7.2 1
jonah bromwich 2016 yes lawd! 8.2 1

Since we now explicitly assign row numbers, we can see that SQLite is selecting the first row in each partition. Now, let's analyze a specific year where jonah bromwich had multiple reviews published:

%%sql
SELECT author,
       pub_year,
       title,
       score,
       ROW_NUMBER() OVER (PARTITION BY author, pub_year) AS row_number
  FROM reviews
 WHERE author = 'jonah bromwich' AND pub_year = 2015
 ORDER BY row_number ASC;
author pub_year title score row_number
jonah bromwich 2015 technoself 7.2 1
jonah bromwich 2015 know it all 5.5 2
jonah bromwich 2015 howl 7.7 3
jonah bromwich 2015 timeline 6.3 4
jonah bromwich 2015 four pink walls ep 6.6 5
jonah bromwich 2015 the life and times of raphael de la ghett 6.8 6
jonah bromwich 2015 begging please 7.5 7
jonah bromwich 2015 dark red 5.9 8
jonah bromwich 2015 dena tape 2 5.7 9
jonah bromwich 2015 sunset mountain 6.4 10

In 2015, jonah bromwich published 10 records for review. The average for this review calculated by hand is: $$ {(7.2+5.5+7.7+6.3+6.6+6.8+7.5+5.9+5.7+6.4) \over 10 } = {65.6 \over 10} = 6.56 $$ In the following SQL query, we'll use aggregate functions to derive the publication counts (COUNT) and their average score (AVG) per year and arrive at the same results as our manual calculation above:

%%sql
SELECT author,
       pub_year,
       COUNT(*) AS count_pub,
       ROUND(AVG(score), 2) AS avg_score
  FROM reviews
 GROUP BY author, pub_year
HAVING author = 'jonah bromwich';
author pub_year count_pub avg_score
jonah bromwich 2012 13 7.04
jonah bromwich 2013 22 6.89
jonah bromwich 2014 22 6.62
jonah bromwich 2015 10 6.56
jonah bromwich 2016 14 7.28

This ensures we correctly aggregate data without relying on SQLite's handling of bare columns. Using aggregate functions properly prevents data inconsistencies and makes results predictable.

3. Querying Two Databases and Truncating the Output

Sometimes, the question that we want to answer may require that we query two or more databases. In this example, we'll write a simple SQL query to return the number of NIPS papers published and number of Pitchfork reviews completed between 2010 and 2015.

To do so, we'll continue to work with the pitchfork database. This is our main database connection. However, we can attach several databases to our main connection with the ATTACH DATABASE SQL command.

Let's now attach the nips database to our main connection:

%%sql
ATTACH DATABASE 'nips.sqlite' AS nips;
* sqlite:///pitchfork.sqlite
Done.

If you make this database connection using the SQLite command line tool and run the .databases command, you'll get the following list of databases:

1.PNG

Let's restate that the name of the pitchfork database connection is main, and the name of the attached connection is nips. To query the databases we use the following syntax:

SELECT * FROM database_name.table_name

To accomplish our task, we'll use a Common Table Expression (CTE) to get our results from the two databases, and then combine the results in our main SQL query:

%%sql
-- Start of CTE
WITH 
    no_of_papers AS (
        SELECT 
            year,
            COUNT(*) AS number_of_papers
        FROM nips.papers
        GROUP BY year
    ), 
    no_of_reviews AS (
        SELECT 
            pub_year AS year,
            COUNT(*) AS number_of_reviews
        FROM main.reviews
        GROUP BY pub_year
    )
-- End of CTE
%%sql
SELECT *
  FROM no_of_papers
  FULL OUTER JOIN no_of_reviews
 USING (year)
 LIMIT 6 OFFSET 23;
year number_of_papers number_of_reviews
2010 292 1170
2011 306 1165
2012 368 1185
2013 360 1200
2014 411 1162
2015 403 1135

The LIMIT X OFFSET Y command enables us to truncate our result. LIMIT 6 means we want to return only 6 rows (2010 to 2015 inclusive), and OFFSET 23 means that the first result should start 23 rows from the top (that's from 2010).

Since we're done using this data, let's detach the nips database connection from the main connection:

%sql DETACH DATABASE nips;

And finally, close the main connection:

%sql --close sqlite:///pitchfork.sqlite

4. Run SQL Query from File and Send SQL Query Output to File

You may find that you need to execute a SQL query from a text file (e.g., a .sql file). In this example, we want to query the chinook.db database to get the top 5 spenders of Jazz music from the top_spenders_jazz.sql file.

We can achieve this in several ways. The first way is to run the following from the command line:

sqlite3 -column -header chinook.db < top_spenders_jazz.sql
    full_name         amount_spent
    ----------------  ------------
    Fernanda Ramos    15.84       
    Enrique Muñoz     15.84       
    Hannah Schneider  14.85       
    Astrid Gruber     14.85       
    Kara Nielsen      3.96        

The above command executes the query in the top_spenders_jazz.sql file using the chinook.db database file. In the second method, we'll first connect to the database file and execute the query using the sql magic function:

%sql sqlite:///chinook.db

Next, we'll use the sql magic equivalent of the .read command to execute the query:

%sql -f ./top_spenders_jazz.sql
* sqlite:///chinook.db
Done.
full_name amount_spent
Fernanda Ramos 15.84
Enrique Muñoz 15.84
Hannah Schneider 14.85
Astrid Gruber 14.85
Kara Nielsen 3.96

If you're wondering how the .read command works, it's similar to the second method shown above. You first connect to the database, then read the query with the .read command:

2.PNG

What if you want to send the output of the query as a file? If you have some knowledge of Python, you can:

  • Assign the SQL query ouput to a variable name
  • Write the output to a file
# Assign the query output to the variable 'result'
result = 

# Print the result
print(result)

# Write the output to sql file format
with open('names_of_jazz_top_spenders.sql', 'w') as file:
    file.write(str(result))
     * sqlite:///chinook.db
    Done.
    +------------------+--------------+
    |    full_name     | amount_spent |
    +------------------+--------------+
    |  Fernanda Ramos  |    15.84     |
    |  Enrique Muñoz   |    15.84     |
    | Hannah Schneider |    14.85     |
    |  Astrid Gruber   |    14.85     |
    |   Kara Nielsen   |     3.96     |
    +------------------+--------------+

It's even easier to write to a file using the .output command:

3.PNG

The .output name_of_file redirects the standard output to the file. The output of the .read command is now saved in the file. It doesn't print anything to screen. To discontinue writing to the file and start writing to the standard output (or printing to screen), run .output stdout. The file names_of_jazz_top_spenders_cli.sql will appear in your working directory with the result of the query inside.

5. Simplifying Aggregation with Window Functions

You may find yourself writing subqueries and using GROUP BY when aggregating with SQL. The SQL window functions and the PARTITION BY clause simplify these types of complex SQL queries.

In this example, we'll investigate the best selling album in the chinook database. We'll see how to calculate the best selling album using GROUP BY and using window functions.

First, let's create a virtual table, album_track_purchase, with the CREATE VIEW command that will contain only the data we require for our analysis:

%%sql
CREATE VIEW album_track_purchase AS
    SELECT 
        al.album_id,
        al.title AS album_title,
        tr.name AS track_name,
        il.unit_price * il.quantity AS amount
    FROM album AS al
    INNER JOIN track AS tr
        USING (album_id)
    INNER JOIN invoice_line AS il
        USING (track_id)
        ORDER BY 1;

Next, let's preview the view we just created:

%sql SELECT * FROM album_track_purchase LIMIT 5;
album_id album_title track_name amount
1 For Those About To Rock We Salute You Night Of The Long Knives 0.99
1 For Those About To Rock We Salute You For Those About To Rock (We Salute You) 0.99
1 For Those About To Rock We Salute You Put The Finger On You 0.99
1 For Those About To Rock We Salute You Let's Get It Up 0.99
1 For Those About To Rock We Salute You Inject The Venom 0.99

To get the best performing album, we need to count how many tracks were sold from the album, the revenue generated, and the percent revenue.

%%sql
SELECT album_title,
       COUNT(*) AS count,
       ROUND(SUM(amount), 2) AS revenue,
       ROUND(100 * SUM(amount) / (SELECT SUM(amount) FROM album_track_purchase), 2) AS percent_revenue
  FROM album_track_purchase
 GROUP BY album_title
 ORDER BY percent_revenue DESC
 LIMIT 5;
album_title count revenue percent_revenue
Are You Experienced? 187 185.13 3.93
Faceless 96 95.04 2.02
Mezmerize 93 92.07 1.96
Get Born 90 89.1 1.89
The Doors 83 82.17 1.74

In the SQL query above, we grouped the table according to the album_title. So COUNT(*) counts all the rows in a particular album. The result is the total number of tracks sold from that album. The SQL query SUM(amount) sums the total revenue generated by the sales of the tracks for a particular album.

To calculate percent_revenue, we need the total revenue generated from a particular album, divided by the total revenue generated from the sales of tracks from all the albums. Since we're grouping our data using album_title, we can calculate the former but not the latter.

To calculate the total revenue generated from the sales of tracks from all the albums, we'll use the following subquery: SELECT SUM(amount) FROM album_track_purchase). Did you notice that we're using an aggregration function SUM without a GROUP BYclause?

Remember that when we do this, we're aggregating over the entire table. Our subsquery correctly calculates the grand total of revenue from the table.

Next, let's see how to do this with windows functions and PARTITION BY instead of GROUP BY:

%%sql
SELECT DISTINCT album_title,
       COUNT(*) OVER (PARTITION BY album_title) AS count,
       ROUND(SUM(amount) OVER (PARTITION BY album_title), 2) AS revenue,
       ROUND(100 * SUM(amount) OVER (PARTITION BY album_title) / SUM(amount) OVER(), 2) AS percent_revenue
  FROM album_track_purchase
 ORDER BY percent_revenue DESC
 LIMIT 5;
album_title count revenue percent_revenue
Are You Experienced? 187 185.13 3.93
Faceless 96 95.04 2.02
Mezmerize 93 92.07 1.96
Get Born 90 89.1 1.89
The Doors 83 82.17 1.74

In the above SQL query, we're partitioning the table with the album_title. Therefore, related album titles are listed next to each other. Since we'll be performing aggregation, we need to return only a single album title, not all the times the title appears in the data, which is why we use the DISTINCT album_title.

The SQL query COUNT (*) OVER (PARTITION BY album_title) counts all the times (rows) a particular album title appears in the table, and the query SUM(amount) OVER(PARTITION BY album_title) sums all the amounts for these rows to get the revenue for that album.

But how can we calculate the total revenue for the entire table? Remember that we used a subsquery to do this previously.

When we use a window function without specifying PARTITION BY inside OVER(), the window function performs the aggregation over the entire table. Therefore, the SQL query SUM(amount) OVER() sums the revenue over the entire table. This is definitely easier than writing a subquery!

Since we're done with our data, let's drop the view we created and close the database connection:

%sql DROP VIEW album_track_purchase;
* sqlite:///chinook.db
Done.
%sql --close sqlite:///chinook.db

Takeaways

In this post, we learned:

  • How to aggregate data over an entire table
  • How to aggregate with GROUP BY and PARTITION BY
  • How to use SQL dot commands
  • How to read SQL query from file and write output to file
  • How to use SQL windows functions

We have taken a deep dive into how to write efficient SQL queries and provided an introduction to some advanced SQL functions. What you've learned from this post will help you explore SQL more confidently.

If you want to learn more, enroll in our SQL Fundamentals skill path to bolster your skills. You'll learn how to:

In addition, you'll complete several SQL portfolio projects to demonstrate your SQL competencies to recruiters as you begin your job search. Knowledge of SQL programming will open a host of opportunities for you in the field of data analytics, data science, and data engineering. We encourage you to sign up for one of our SQL courses and begin your SQL Developer journey with us.

An amazing SQL Developer is hard to find, but impossible to forget



from Dataquest https://ift.tt/fVUzwlN
via RiYo Analytics

No comments

Latest Articles