According to recent estimates, around 403 million terabytes of data are produced by humans every day. As this volume of data continues to grow, so does the need for a convenient way to organize, manage, and store it. Relational databases are a popular way to collect and store data, and the Structured Query Language (SQL) is the standard programming language for managing and querying these databases.
Knowledge of SQL programming is a prerequisite skill for most data-related jobs. The hiring managers we spoke with shared that aspiring data professionals must demonstrate proficiency in SQL. Portfolio projects are the perfect opportunity to showcase your SQL skills to recruiters. When you enroll in our SQL Fundamentals skill path, you'll not only master the commonly used SQL commands, you'll also complete portfolio projects you can be proud of.
In this post, you'll learn how to do the following:
- Write SQL queries to extract data from two or more databases
- Use SQL dot commands to simplify tasks
- Perform aggregation with the `GROUP BY` and `PARTITION BY` clauses
- Use window functions to simplify complex SQL queries
- Read SQL queries from a file and send SQL query results to a file
The knowledge you'll gain from this post will enable you to write more efficient SQL queries and automate SQL tasks using the command line, and give you a solid foundation to build upon when learning more advanced SQL concepts. You can download all the files you'll need for this post here.
1. Aggregation Over the Whole Table
When an aggregate function is used without a `GROUP BY` clause, the aggregation is done over the entire table. In this task, we'll write a SQL query to perform aggregation over the `reviews` table.
Let's install `ipython-sql` and set up our notebook to capture information sent to standard output and standard error, then load the `sql` extension:
First, run the following in a notebook cell (the `!` prefix hands the command to your shell):
!pip install ipython-sql
Next, run the following commands in separate cells in your notebook:
%%capture
%load_ext sql
Next, let's connect to the `pitchfork.sqlite` file and run a query:
%sql sqlite:///pitchfork.sqlite
* sqlite:///pitchfork.sqlite
Done.
%%sql
SELECT name,
sql
FROM sqlite_master
WHERE type = 'table' AND name = 'reviews'
name | sql |
---|---|
reviews | CREATE TABLE reviews ( reviewid INTEGER, title TEXT, artist TEXT, url TEXT, score REAL, best_new_music INTEGER, author TEXT, author_type TEXT, pub_date TEXT, pub_weekday INTEGER, pub_day INTEGER, pub_month INTEGER, pub_year INTEGER) |
To derive key summary statistics, we can use aggregation over the whole table to better understand our data. The SQL query below calculates summary statistics for the `reviews` table:
%%sql
SELECT COUNT(*) AS total_row_count,
ROUND(AVG(score), 2) AS avg_review_score,
MAX(score) AS max_review_score,
       MIN(score) AS min_review_score
FROM reviews;
total_row_count | avg_review_score | max_review_score | min_review_score |
---|---|---|---|
18393 | 7.01 | 10.0 | 0.0 |
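The same whole-table aggregation pattern can be reproduced outside the notebook with Python's built-in `sqlite3` module. This is a minimal sketch using a small invented in-memory table, not the actual `pitchfork.sqlite` data:

```python
import sqlite3

# Build a small in-memory stand-in for the reviews table
# (the scores here are made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (reviewid INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?)",
    [(1, 7.0), (2, 9.5), (3, 4.5), (4, 8.0)],
)

# With no GROUP BY, each aggregate collapses the whole table to one row.
row = conn.execute(
    """
    SELECT COUNT(*),
           ROUND(AVG(score), 2),
           MAX(score),
           MIN(score)
    FROM reviews;
    """
).fetchone()
print(row)  # (4, 7.25, 9.5, 4.5)
conn.close()
```

Because there is no grouping, the four aggregates all operate on every row and the query returns exactly one row.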
2. Understanding Aggregation with `GROUP BY` and `PARTITION BY`
One of the conditions for First Normal Form (1NF) is that the rows of a table in a database have no meaningful order. This means that we should use the `ORDER BY` clause instead of relying on the SQL engine to return data in a particular sequence.
Another important concept to understand when using `GROUP BY` is the idea of a bare column. When we group data, only the columns explicitly listed in the `GROUP BY` clause can be used directly in the `SELECT` statement. Any other column that is not used for grouping must be wrapped in an aggregate function such as `AVG()`, `SUM()`, or `COUNT()`. If a column appears in the `SELECT` statement without an aggregate function and is not included in the grouping, it's called a bare column.
SQLite, unlike some other databases, does not strictly enforce this rule and will return a seemingly arbitrary row's value for the bare column within each group. In practice, SQLite tends to return the first row it encounters in the group, but this behavior is not guaranteed and should not be relied upon for correctness.
Let's look at an example of this behavior in action:
%%sql
SELECT author,
pub_year,
title,
score
FROM reviews
GROUP BY author, pub_year
HAVING author = 'jonah bromwich'
ORDER BY reviewid ASC;
author | pub_year | title | score |
---|---|---|---|
jonah bromwich | 2012 | the crystal ark | 5.6 |
jonah bromwich | 2013 | sunset blvd. | 7.3 |
jonah bromwich | 2014 | the church | 4.9 |
jonah bromwich | 2015 | technoself | 7.2 |
jonah bromwich | 2016 | yes lawd! | 8.2 |
In this query, we're using `GROUP BY` to aggregate the data by `author` and `pub_year`. Since `title` and `score` are not part of the grouping, they are bare columns. SQLite resolves this by returning the first row it encounters for each group, based on the order of `reviewid`. This behavior can lead to inconsistent or unintended results when retrieving data.
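You can observe this bare-column behavior directly with a few rows of invented data and Python's `sqlite3` module; the same query would be rejected outright by stricter engines such as PostgreSQL:

```python
import sqlite3

# Toy data to illustrate SQLite's bare-column behavior (values invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (author TEXT, pub_year INTEGER, title TEXT, score REAL)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [
        ("jonah", 2015, "first album", 7.2),
        ("jonah", 2015, "second album", 5.5),
        ("jonah", 2015, "third album", 7.7),
    ],
)

# title and score are bare columns: they appear in SELECT but are neither
# grouped nor aggregated. SQLite silently picks the value from one
# (effectively arbitrary) row of each group instead of raising an error.
rows = conn.execute(
    "SELECT author, pub_year, title, score FROM reviews GROUP BY author, pub_year"
).fetchall()
print(rows)  # one row per (author, pub_year) group
conn.close()
```

The query collapses the three rows into a single group, and the `title` and `score` values come from whichever row SQLite happened to pick.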
To confirm SQLite's handling of bare columns, we can explicitly assign row numbers to each subset of the data using `PARTITION BY`. Unlike `GROUP BY`, `PARTITION BY` does not collapse rows; it allows us to keep all records while applying window functions:
%%sql
SELECT author,
pub_year,
title,
score,
ROW_NUMBER() OVER (PARTITION BY author, pub_year) AS row_number
FROM reviews
WHERE author = 'jonah bromwich'
ORDER BY row_number ASC
LIMIT 5;
author | pub_year | title | score | row_number |
---|---|---|---|---|
jonah bromwich | 2012 | the crystal ark | 5.6 | 1 |
jonah bromwich | 2013 | sunset blvd. | 7.3 | 1 |
jonah bromwich | 2014 | the church | 4.9 | 1 |
jonah bromwich | 2015 | technoself | 7.2 | 1 |
jonah bromwich | 2016 | yes lawd! | 8.2 | 1 |
Since we now explicitly assign row numbers, we can see that SQLite is selecting the first row in each partition. Now, let's analyze a specific year in which jonah bromwich had multiple reviews published:
%%sql
SELECT author,
pub_year,
title,
score,
ROW_NUMBER() OVER (PARTITION BY author, pub_year) AS row_number
FROM reviews
WHERE author = 'jonah bromwich' AND pub_year = 2015
ORDER BY row_number ASC;
author | pub_year | title | score | row_number |
---|---|---|---|---|
jonah bromwich | 2015 | technoself | 7.2 | 1 |
jonah bromwich | 2015 | know it all | 5.5 | 2 |
jonah bromwich | 2015 | howl | 7.7 | 3 |
jonah bromwich | 2015 | timeline | 6.3 | 4 |
jonah bromwich | 2015 | four pink walls ep | 6.6 | 5 |
jonah bromwich | 2015 | the life and times of raphael de la ghett | 6.8 | 6 |
jonah bromwich | 2015 | begging please | 7.5 | 7 |
jonah bromwich | 2015 | dark red | 5.9 | 8 |
jonah bromwich | 2015 | dena tape 2 | 5.7 | 9 |
jonah bromwich | 2015 | sunset mountain | 6.4 | 10 |
In 2015, jonah bromwich published reviews of 10 records. The average score for these reviews, calculated by hand, is: $$ {(7.2+5.5+7.7+6.3+6.6+6.8+7.5+5.9+5.7+6.4) \over 10 } = {65.6 \over 10} = 6.56 $$ In the following SQL query, we'll use aggregate functions to derive the publication counts (`COUNT`) and the average score (`AVG`) per year, and arrive at the same result as our manual calculation above:
%%sql
SELECT author,
pub_year,
COUNT(*) AS count_pub,
ROUND(AVG(score), 2) AS avg_score
FROM reviews
GROUP BY author, pub_year
HAVING author = 'jonah bromwich';
author | pub_year | count_pub | avg_score |
---|---|---|---|
jonah bromwich | 2012 | 13 | 7.04 |
jonah bromwich | 2013 | 22 | 6.89 |
jonah bromwich | 2014 | 22 | 6.62 |
jonah bromwich | 2015 | 10 | 6.56 |
jonah bromwich | 2016 | 14 | 7.28 |
This ensures we correctly aggregate data without relying on SQLite's handling of bare columns. Using aggregate functions properly prevents data inconsistencies and makes results predictable.
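To double-check the arithmetic, here is a small sketch that loads the ten 2015 scores from the article into an in-memory SQLite table and confirms that `COUNT(*)` and `AVG(score)` match the manual calculation:

```python
import sqlite3

# The ten 2015 review scores from the article; the hand calculation
# gives 65.6 / 10 = 6.56.
scores = [7.2, 5.5, 7.7, 6.3, 6.6, 6.8, 7.5, 5.9, 5.7, 6.4]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (author TEXT, pub_year INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO reviews VALUES ('jonah bromwich', 2015, ?)",
    [(s,) for s in scores],
)

# Grouping by author and year, the aggregates summarize each group.
count_pub, avg_score = conn.execute(
    """
    SELECT COUNT(*), ROUND(AVG(score), 2)
    FROM reviews
    GROUP BY author, pub_year;
    """
).fetchone()
print(count_pub, avg_score)  # 10 6.56
conn.close()
```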
3. Querying Two Databases and Truncating the Output
Sometimes, the question that we want to answer may require that we query two or more databases. In this example, we'll write a simple SQL query to return the number of NIPS papers published and number of Pitchfork reviews completed between 2010 and 2015.
To do so, we'll continue to work with the `pitchfork` database. This is our main database connection. However, we can attach several databases to our main connection with the `ATTACH DATABASE` SQL command.
Let's now attach the `nips` database to our main connection:
%%sql
ATTACH DATABASE 'nips.sqlite' AS nips;
* sqlite:///pitchfork.sqlite
Done.
If you make this database connection using the SQLite command-line tool and run the `.databases` command, you'll see both databases listed: the `pitchfork` database connection is named `main`, and the attached connection is named `nips`. To query the databases, we use the following syntax:
SELECT * FROM database_name.table_name
To accomplish our task, we'll use a Common Table Expression (CTE) to get our results from the two databases, and then combine the results in our main SQL query:
%%sql
-- Start of CTE
WITH
no_of_papers AS (
SELECT
year,
COUNT(*) AS number_of_papers
FROM nips.papers
GROUP BY year
),
no_of_reviews AS (
SELECT
pub_year AS year,
COUNT(*) AS number_of_reviews
FROM main.reviews
GROUP BY pub_year
)
-- End of CTE
SELECT *
FROM no_of_papers
FULL OUTER JOIN no_of_reviews
USING (year)
LIMIT 6 OFFSET 23;
year | number_of_papers | number_of_reviews |
---|---|---|
2010 | 292 | 1170 |
2011 | 306 | 1165 |
2012 | 368 | 1185 |
2013 | 360 | 1200 |
2014 | 411 | 1162 |
2015 | 403 | 1135 |
The `LIMIT X OFFSET Y` clause enables us to truncate our result. `LIMIT 6` means we want to return only 6 rows (2010 to 2015 inclusive), and `OFFSET 23` means the result set skips the first 23 rows, so our output starts at 2010.
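As a sketch of how `LIMIT ... OFFSET ...` truncates a result set, the following uses a throwaway in-memory table of years (the range is invented to mirror the example above):

```python
import sqlite3

# A small table of years to show how LIMIT/OFFSET truncates a result set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE years (year INTEGER)")
conn.executemany("INSERT INTO years VALUES (?)", [(y,) for y in range(1987, 2017)])

# Skip the first 23 rows (1987-2009), then return the next 6 (2010-2015).
rows = conn.execute(
    "SELECT year FROM years ORDER BY year LIMIT 6 OFFSET 23"
).fetchall()
print([r[0] for r in rows])  # [2010, 2011, 2012, 2013, 2014, 2015]
conn.close()
```

Note the explicit `ORDER BY`: without it, the rows skipped by `OFFSET` are not guaranteed to be the ones you expect.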
Since we're done using this data, let's detach the `nips` database from the main connection:
%sql DETACH DATABASE nips;
And finally, close the main connection:
%sql --close sqlite:///pitchfork.sqlite
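The attach/detach workflow also works outside the notebook. Here's a sketch using Python's `sqlite3` module with two temporary database files; the file names and the `papers` table are invented stand-ins for `pitchfork.sqlite` and `nips.sqlite`:

```python
import os
import sqlite3
import tempfile

# Two throwaway database files standing in for the article's databases.
tmp = tempfile.mkdtemp()
main_path = os.path.join(tmp, "main.sqlite")
other_path = os.path.join(tmp, "other.sqlite")

# Populate the second database file.
con = sqlite3.connect(other_path)
con.execute("CREATE TABLE papers (year INTEGER)")
con.execute("INSERT INTO papers VALUES (2015)")
con.commit()
con.close()

# Connect to the "main" database, then attach the second file under a name.
conn = sqlite3.connect(main_path)
conn.execute("ATTACH DATABASE ? AS nips", (other_path,))

# Attached tables are addressed as schema_name.table_name.
year = conn.execute("SELECT year FROM nips.papers").fetchone()[0]
print(year)  # 2015

# Detach when finished, mirroring the notebook workflow above.
conn.execute("DETACH DATABASE nips")
conn.close()
```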
4. Run SQL Query from File and Send SQL Query Output to File
You may find that you need to execute a SQL query from a text file (e.g., a `.sql` file). In this example, we want to query the `chinook.db` database to get the top 5 spenders on Jazz music, using the query stored in the `top_spenders_jazz.sql` file.
We can achieve this in several ways. The first way is to run the following from the command line:
sqlite3 -column -header chinook.db < top_spenders_jazz.sql
full_name amount_spent
---------------- ------------
Fernanda Ramos 15.84
Enrique Muñoz 15.84
Hannah Schneider 14.85
Astrid Gruber 14.85
Kara Nielsen 3.96
The above command executes the query in the `top_spenders_jazz.sql` file against the `chinook.db` database file. In the second method, we'll first connect to the database file and execute the query using the sql magic function:
%sql sqlite:///chinook.db
Next, we'll use the sql magic equivalent of the `.read` command to execute the query:
%sql -f ./top_spenders_jazz.sql
* sqlite:///chinook.db
Done.
full_name | amount_spent |
---|---|
Fernanda Ramos | 15.84 |
Enrique Muñoz | 15.84 |
Hannah Schneider | 14.85 |
Astrid Gruber | 14.85 |
Kara Nielsen | 3.96 |
If you're wondering how the `.read` command works, it's similar to the second method shown above. In the SQLite command-line shell, you first connect to the database, then read the query with the `.read` command:
sqlite3 chinook.db
.read top_spenders_jazz.sql
What if you want to send the output of the query to a file? If you have some knowledge of Python, you can:
- Assign the SQL query output to a variable name
- Write the output to a file
# Assign the query output to the variable 'result'
result = %sql -f ./top_spenders_jazz.sql
# Print the result
print(result)
# Write the output to a file
with open('names_of_jazz_top_spenders.sql', 'w') as file:
    file.write(str(result))
* sqlite:///chinook.db
Done.
+------------------+--------------+
| full_name | amount_spent |
+------------------+--------------+
| Fernanda Ramos | 15.84 |
| Enrique Muñoz | 15.84 |
| Hannah Schneider | 14.85 |
| Astrid Gruber | 14.85 |
| Kara Nielsen | 3.96 |
+------------------+--------------+
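If you prefer not to rely on notebook magics, the same idea works with the standard `sqlite3` module: run the query, then write the rows to a file. The table, data, and output file name below are invented for illustration:

```python
import sqlite3

# An invented stand-in for the query result: a tiny spenders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spenders (full_name TEXT, amount_spent REAL)")
conn.executemany(
    "INSERT INTO spenders VALUES (?, ?)",
    [("Fernanda Ramos", 15.84), ("Hannah Schneider", 14.85)],
)

rows = conn.execute(
    "SELECT full_name, amount_spent FROM spenders ORDER BY amount_spent DESC"
).fetchall()

# Write one line per row, tab-separated, with a header.
with open("top_spenders.txt", "w") as f:
    f.write("full_name\tamount_spent\n")
    for name, amount in rows:
        f.write(f"{name}\t{amount}\n")

conn.close()
```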
It's even easier to write to a file using the `.output` command in the SQLite shell:
.output names_of_jazz_top_spenders_cli.sql
.read top_spenders_jazz.sql
.output stdout
The `.output name_of_file` command redirects standard output to the file, so the result of the `.read` command is saved in the file rather than printed to the screen. To discontinue writing to the file and resume writing to standard output (printing to the screen), run `.output stdout`. The file `names_of_jazz_top_spenders_cli.sql` will appear in your working directory with the result of the query inside.
5. Simplifying Aggregation with Window Functions
You may find yourself writing subqueries and using `GROUP BY` when aggregating with SQL. SQL window functions and the `PARTITION BY` clause can simplify these types of complex SQL queries.
In this example, we'll investigate the best-selling album in the `chinook` database. We'll see how to calculate the best-selling album using `GROUP BY` and then using window functions.
First, let's create a virtual table, `album_track_purchase`, with the `CREATE VIEW` command; it will contain only the data we require for our analysis:
%%sql
CREATE VIEW album_track_purchase AS
SELECT
al.album_id,
al.title AS album_title,
tr.name AS track_name,
il.unit_price * il.quantity AS amount
FROM album AS al
INNER JOIN track AS tr
USING (album_id)
INNER JOIN invoice_line AS il
USING (track_id)
ORDER BY 1;
Next, let's preview the view we just created:
%sql SELECT * FROM album_track_purchase LIMIT 5;
album_id | album_title | track_name | amount |
---|---|---|---|
1 | For Those About To Rock We Salute You | Night Of The Long Knives | 0.99 |
1 | For Those About To Rock We Salute You | For Those About To Rock (We Salute You) | 0.99 |
1 | For Those About To Rock We Salute You | Put The Finger On You | 0.99 |
1 | For Those About To Rock We Salute You | Let's Get It Up | 0.99 |
1 | For Those About To Rock We Salute You | Inject The Venom | 0.99 |
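As a minimal sketch of what `CREATE VIEW` gives us, the following builds a tiny invented `track` table, defines a view with a computed `amount` column, and queries the view like a table:

```python
import sqlite3

# A view is a stored query that behaves like a read-only table.
# The track table and its values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE track (name TEXT, unit_price REAL, quantity INTEGER)")
conn.executemany(
    "INSERT INTO track VALUES (?, ?, ?)",
    [("Song A", 0.99, 2), ("Song B", 0.99, 1)],
)

# The view exposes a derived column (amount) computed from the base table.
conn.execute(
    "CREATE VIEW track_purchase AS "
    "SELECT name, unit_price * quantity AS amount FROM track"
)

rows = conn.execute("SELECT * FROM track_purchase ORDER BY name").fetchall()
print(rows)  # [('Song A', 1.98), ('Song B', 0.99)]

# Views can be dropped when no longer needed.
conn.execute("DROP VIEW track_purchase")
conn.close()
```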
To find the best-performing album, we need to count how many tracks were sold from each album, and calculate the revenue generated and the percent of total revenue.
%%sql
SELECT album_title,
COUNT(*) AS count,
ROUND(SUM(amount), 2) AS revenue,
ROUND(100 * SUM(amount) / (SELECT SUM(amount) FROM album_track_purchase), 2) AS percent_revenue
FROM album_track_purchase
GROUP BY album_title
ORDER BY percent_revenue DESC
LIMIT 5;
album_title | count | revenue | percent_revenue |
---|---|---|---|
Are You Experienced? | 187 | 185.13 | 3.93 |
Faceless | 96 | 95.04 | 2.02 |
Mezmerize | 93 | 92.07 | 1.96 |
Get Born | 90 | 89.1 | 1.89 |
The Doors | 83 | 82.17 | 1.74 |
In the SQL query above, we grouped the table by `album_title`, so `COUNT(*)` counts all the rows for a particular album; the result is the total number of tracks sold from that album. `SUM(amount)` sums the total revenue generated by the sales of those tracks.
To calculate `percent_revenue`, we need the total revenue generated from a particular album divided by the total revenue generated from the sales of tracks from all the albums. Since we're grouping our data by `album_title`, we can calculate the former but not the latter.
To calculate the total revenue generated from the sales of tracks from all the albums, we use the subquery `(SELECT SUM(amount) FROM album_track_purchase)`. Did you notice that we're using the aggregation function `SUM` without a `GROUP BY` clause?
Remember that when we do this, we're aggregating over the entire table, so our subquery correctly calculates the grand total of revenue from the table.
Next, let's see how to do this with window functions and `PARTITION BY` instead of `GROUP BY`:
%%sql
SELECT DISTINCT album_title,
COUNT(*) OVER (PARTITION BY album_title) AS count,
ROUND(SUM(amount) OVER (PARTITION BY album_title), 2) AS revenue,
ROUND(100 * SUM(amount) OVER (PARTITION BY album_title) / SUM(amount) OVER(), 2) AS percent_revenue
FROM album_track_purchase
ORDER BY percent_revenue DESC
LIMIT 5;
album_title | count | revenue | percent_revenue |
---|---|---|---|
Are You Experienced? | 187 | 185.13 | 3.93 |
Faceless | 96 | 95.04 | 2.02 |
Mezmerize | 93 | 92.07 | 1.96 |
Get Born | 90 | 89.1 | 1.89 |
The Doors | 83 | 82.17 | 1.74 |
In the above SQL query, we're partitioning the table by `album_title`, so rows with the same album title fall into the same partition. Since we're performing aggregation per album, we want each album title returned only once, not every time it appears in the data, which is why we use `DISTINCT album_title`.
`COUNT(*) OVER (PARTITION BY album_title)` counts all the times (rows) a particular album title appears in the table, and `SUM(amount) OVER (PARTITION BY album_title)` sums all the amounts for those rows to get the revenue for that album.
But how can we calculate the total revenue for the entire table? Remember that we used a subquery to do this previously.
When we use a window function without specifying `PARTITION BY` inside `OVER()`, the window function performs the aggregation over the entire table. Therefore, `SUM(amount) OVER()` sums the revenue over the entire table. This is definitely easier than writing a subquery!
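The empty `OVER()` trick is easy to verify on a toy table (album names and amounts invented); note that window functions require SQLite 3.25+, which is bundled with Python 3.8 and later:

```python
import sqlite3

# Toy purchases table: album "A" sold twice for 1.0, album "B" once for 2.0.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (album TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("A", 1.0), ("A", 1.0), ("B", 2.0)],
)

# PARTITION BY gives per-album totals; the empty OVER() gives the grand
# total, replacing the subquery from the GROUP BY version.
rows = conn.execute(
    """
    SELECT DISTINCT album,
           SUM(amount) OVER (PARTITION BY album) AS revenue,
           ROUND(100 * SUM(amount) OVER (PARTITION BY album)
                     / SUM(amount) OVER (), 2) AS percent_revenue
    FROM purchases
    ORDER BY album;
    """
).fetchall()
print(rows)  # [('A', 2.0, 50.0), ('B', 2.0, 50.0)]
conn.close()
```

Each album earned 2.0 of the 4.0 grand total, so both end up with 50 percent of revenue, and `DISTINCT` collapses the duplicated rows for album "A".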
Since we're done with our data, let's drop the view we created and close the database connection:
%sql DROP VIEW album_track_purchase;
* sqlite:///chinook.db
Done.
%sql --close sqlite:///chinook.db
Takeaways
In this post, we learned:
- How to aggregate data over an entire table
- How to aggregate with `GROUP BY` and `PARTITION BY`
- How to use SQL dot commands
- How to read a SQL query from a file and write output to a file
- How to use SQL window functions
We have taken a deep dive into how to write efficient SQL queries and provided an introduction to some advanced SQL functions. What you've learned from this post will help you explore SQL more confidently.
If you want to learn more, enroll in our SQL Fundamentals skill path to bolster your skills. You'll learn how to:
- Perform joining operations
- Use aggregate functions
- Write subqueries and CTEs
- Filter and sort data
- And more...
In addition, you'll complete several SQL portfolio projects to demonstrate your SQL competencies to recruiters as you begin your job search. Knowledge of SQL programming will open a host of opportunities for you in the field of data analytics, data science, and data engineering. We encourage you to sign up for one of our SQL courses and begin your SQL Developer journey with us.
An amazing SQL Developer is hard to find, but impossible to forget