https://ift.tt/Z7CVHDs From someone who made most of the mistakes on the list Most of my data projects usually start the same way: a bunch...
From someone who made most of the mistakes on the list
Most of my data projects usually start the same way: a bunch of SQL queries. And after a few years of experience now, I learned — sometimes the hard way — that it is essential not to mess up this step as even a small mistake can quickly invalidate the results you get down the line.
And there are a lot of ways to make those small mistakes. I am not talking about forgetting a “group by” or dividing something by zero — those ones are easy to spot as your query tool will return an error. I am talking about those silent mistakes, the ones that don’t return any result at all, where you don’t have any warning at all. I am talking about those cases in which you end up believing that you got the output you wanted, but you actually didn’t— and now the query you made doesn’t allow you to properly answer the question you set for yourself.
Below is a non-exhaustive list of those little mistakes, ranked by order of frequency from my own experience. To illustrate them, I’ll use my usual example — Imagine:
- You are working for a gaming company that released multiple free-to-play games (with the possibility of removing ads for paid users).
- Your role is to understand user growth and to do so, you have access to the following 3 databases:
user_status_per_country
user_id
country
is_paid_user
daily_country_revenue
date
country
monetization_source
revenue
daily_country_users
date
country
game_id
users
Let’s dive in!
#1: Assuming the content of a database/field based on its name or description
Let’s start with an example: how would you go about pulling all the free users using the user_status_per_country?
It should be is_paid_user = FALSE, right?
Well, imagine this field is the result of a LEFT JOIN between the main user database and a historical database containing all users that became paid at any point in time and their current status, and as a result, it has been implemented in such a way that it can actually take 3 different values: TRUE (the user is a paid user), FALSE (the user used to be a paid user and is not anymore), and NULL (the user doesn’t exist in the right database — i.e. it has never been a paid user before).
Then is_paid_user = FALSE wouldn’t give you what you need — as that would be returning all the users that paid at least once in the past.
Generally speaking, since most of your work and your results will be dependent on the data source you use, it is important to not rely on assumptions regarding the content of a field/database — especially on assumptions simply based on the naming, and have ways to make sure the assumptions you have are correct.
#2: Over-counting due to duplicate
This one is a usual one. You can run into it when you have tables with multiple rows per value on the fields you are joining.
For instance, using the situation I mentioned in the introduction, if you try calculating the daily revenue per user split per country, you can’t directly join daily_country_revenue and daily_country_users, as you have multiple entries per day for one country in daily_country_revenue (due to the different monetization source) and multiple entries per day for one country in daily_country_users (due to the different game_id).
Concretely speaking, if you write the following:
SELECT
date,
country,
SUM(revenue) AS revenue,
SUM(users) AS users
FROM daily_country_users
JOIN daily_country_revenue
USING(date_id,country)
GROUP BY
date,
country
You will end up over-counting your users and your revenue.
#3: Messing up the conditions
The logical error
Do you know the joke of the programmer who goes to the store after their partner told them “go to the store, get a gallon of milk, and if they have eggs, get 6,” and they return with 6 gallons of milk?
It is the same concept here — you need to make sure the condition you give is very well-stated so that you end up with what you are looking for.
For instance, if you want to pull the paid users that are in France or in Spain, you should write your query as follow
is_paid_user AND (country_code = "FR" OR country_code = "ES")
And not
is_paid_user AND country_code = "FR" OR country_code = "ES"
as this last statement would give you all French paid users and all users from ES. (Side note: the IN statement could also have been used here — but that wouldn’t be helpful for me to illustrate my point)
(Shameless self-promotion: some of you might say at this point that instead of using “OR”, we could use a UNION ALL to improve performance. We’ll talk about that in a follow-up article so stay tuned!).
The “inattention” error
This one pertains to the “attention to details” category — it is important to make sure you are using the right conditions. Especially when dealing with strings, if the string is capitalized for instance. Using the previous example, the following statement:
is_paid_user AND (country_code = "fr" OR country_code = "ES")
…would only return the Spanish-paying users.
#4: Misunderstanding how some functions/operators work
A little quiz: do you know exactly what the usual aggregation functions would do, if given the following parameters?
1. AVG(1, 2, 3, 4, NULL, NULL)
2. AVG(1, 2, 3, 4, 0, 0)
3. COUNT("A", "B", "B", NULL, NULL)
4. COUNT("A", "B", "B", "", "")
5. STRING_AGG("A","B",NULL,"C",NULL)
Generally speaking, it is important to understand how the different functions you use handle NULL values, and how that can impact your results depending on your use case (as in some cases, you will want those NULL values not to be counted in the calculation, while in other you would want them to be considered as 0).
The same thing goes for operators —having a full understanding of how they behave can avoid bad surprises. For instance, BETWEEN is inclusive, but this can get a bit misleading when you start comparing different date formats (see this article for more information), so it is important to make sure you have a good understanding of what the condition should be and how the operator you will use will behave.
#5: Doing a LEFT/RIGHT JOIN with a non-null condition on the 2nd database
Basically, if you do a LEFT/RIGHT JOIN and add a non-NULL condition on a field in the 2nd table, you are kind of defeating the purpose of using the LEFT/RIGHT JOIN — as you are going to be filtering table 1 by a condition against table 2 (while most likely, if you were using a LEFT/RIGHT JOIN, you wanted to have all records from table 1 and having the condition only applied to table 2).
Ok that was maybe hard to understand — this question on StackOverflow illustrates pretty well what I am mentioning here, and what solutions can be used.
This was my top 5 of the best silent errors in SQL — the ones you might not see right away and that can badly impact your work. Hopefully, this story will serve as a cautionary tale and be helpful to some!
There was a 6th one that I came across multiple times and that I hesitated to add to this list: the faulty implementation of a window function. Ultimately, I believe window functions deserve their own separate article, so I will go back to this one in a follow-up story.
Hope you enjoyed reading this piece! Do you have any tips you’d want to share? Let everyone know in the comment section!
And If you want to read more of me, here are a few other articles you might like:
- 7 tips to make your data analysis more robust
- How to Build a Successful Dashboard
- How to… choose which data projects to work on
A Non-Exhaustive List Of ‘Silent’ Mistakes in SQL That Can Ruin Your Analysis was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
from Towards Data Science - Medium
https://towardsdatascience.com/a-non-exhaustive-list-of-silent-mistakes-in-sql-that-can-ruin-your-analysis-e5c62e6db489?source=rss----7f60cf5620c9---4
via RiYo Analytics
No comments