Overview of DST and the handling required for data/pipelines
Hey folks!
So, DST, or Daylight Saving Time, ended recently, on November 7, 2021. I was on extra alert that day and kept praying that nothing would break! While verifying the DST impact, I thought of writing an article on the DST considerations that need to be taken care of, twice a year, for pipelines, data, and jobs.
Also, don’t worry if you don’t know what DST is. I will explain it below, and trust me, it will be an interesting read :)
What exactly is Daylight Saving Time, or DST?
DST is the practice of setting the clock ahead of standard time for part of the year so that the waking hours of the day get more sunlight.
At the beginning of spring, on a specific day (the second Sunday of March in the United States and Canada), DST begins in the time zones that observe it. In the Southern Hemisphere, for example in parts of Australia, the seasons are flipped, so DST begins in October instead.
Meaning, the clock is set one hour forward and an hour is skipped. That day has only 23 hours instead of 24, and that, my friend, is where the trouble begins for a data engineer (in fact, for any pipeline owner), but I will come to that.
This year (2021), DST began in the US on March 14.
An example of US Eastern Time wrt UTC on the day DST begins (timezones, man!): 06:00 UTC is 1 AM EST, and one hour later, 07:00 UTC is already 3 AM EDT.
If you look closely, hour 2 is skipped and the clock is set forward. So in March, one day will have 23 hours. During spring and summer, DST leads to more light in the evening, and when winter comes the clock needs to go back to standard time. (Because what’s the point of DST in winter, when we get early evenings anyway?)
At the beginning of winter, DST ends and the clock is reset to standard time. This happened on November 7 this year, and that day has 25 hours: 05:00 UTC is 1 AM EDT, and one hour later, 06:00 UTC is 1 AM again, this time in EST.
If you look closely, 1 AM takes place twice when DST ends, so the day will have 25 hours but still only 24 unique hour labels (read it again).
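The two transitions described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming Python 3.9+ (for the standard-library zoneinfo module) and an available IANA time zone database; the helper name local_hours is made up for this illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # standard library since Python 3.9

NEW_YORK = ZoneInfo("America/New_York")


def local_hours(utc_start: datetime, count: int) -> list[str]:
    """Format `count` consecutive UTC hours as New York wall-clock times."""
    return [
        (utc_start + timedelta(hours=i)).astimezone(NEW_YORK).strftime("%H:%M %Z")
        for i in range(count)
    ]


# DST begins: the wall clock jumps from 01:00 EST straight to 03:00 EDT.
spring = local_hours(datetime(2021, 3, 14, 5, tzinfo=timezone.utc), 4)

# DST ends: 01:00 appears twice, first as EDT and then as EST.
autumn = local_hours(datetime(2021, 11, 7, 4, tzinfo=timezone.utc), 4)

print(spring)
print(autumn)
```

The UTC hours are perfectly regular in both cases; only the local labels skip or repeat.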
Okay, having explained in great detail what DST is, let’s talk about why it sucks! (Yes, I said it.)
How does DST impact the data?
On the days DST begins (March) and ends (November), the day won’t have the standard 24 hours, so any standard handling, checks, or comparisons might not work for those days. Some impact scenarios are listed below:
Let’s start with data quality checks. Suppose you have a check on raw data that expects every day to have 24 hours of data: the day is processed (or considered complete and of good quality) only if it does, and otherwise the data is considered “bad”.
Now when DST begins, as explained above, the day has 23 hours, so such a 24-hour check will fail. The check needs special handling for that day.
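One possible form of that special handling is to compute how many hours the local day really has instead of hard-coding 24. The sketch below assumes Python 3.9+ and hourly data keyed by a local date; expected_hours and is_complete are hypothetical names, and it assumes local midnight exists and is unambiguous in the zone (true for America/New_York, not for every zone).

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def expected_hours(day: date, tz_name: str) -> int:
    """Number of real hours between local midnight and the next local midnight."""
    tz = ZoneInfo(tz_name)
    start = datetime.combine(day, time.min, tzinfo=tz)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz)
    # Convert to UTC before subtracting: subtracting two datetimes that share
    # the same tzinfo compares wall-clock values and would always give 24 hours.
    elapsed = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return int(elapsed.total_seconds() // 3600)


def is_complete(day: date, hours_received: int, tz_name: str = "America/New_York") -> bool:
    """Completeness check that compares against the real length of the day."""
    return hours_received == expected_hours(day, tz_name)


print(expected_hours(date(2021, 3, 14), "America/New_York"))
print(expected_hours(date(2021, 11, 7), "America/New_York"))
```

With this, 23 hours of data on March 14, 2021 passes the check, while 23 hours on an ordinary day still fails.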
Another example of a data quality failure is a “Missing Data Alert”, where an alert is sent when data for a certain hour is missing. On the day DST begins, an alert can fire for hour 2 (which is skipped), so this might need special handling too.
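A missing-hour alert can avoid the false alarm by deriving the expected hour labels from the calendar rather than from range(24). This is a rough sketch under the same assumptions as before (Python 3.9+, local midnight exists in the zone); expected_local_hours and missing_hours are hypothetical helpers, not part of any alerting tool.

```python
from datetime import date, datetime, time, timedelta, timezone
from zoneinfo import ZoneInfo


def expected_local_hours(day: date, tz_name: str) -> list[int]:
    """Wall-clock hour labels that actually occur on `day`, in order."""
    tz = ZoneInfo(tz_name)
    cursor = datetime.combine(day, time.min, tzinfo=tz).astimezone(timezone.utc)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=tz).astimezone(timezone.utc)
    hours = []
    while cursor < end:
        # Step in UTC, label in local time, so skipped hours never appear
        # and repeated hours appear twice.
        hours.append(cursor.astimezone(tz).hour)
        cursor += timedelta(hours=1)
    return hours


def missing_hours(day: date, received: set[int], tz_name: str = "America/New_York") -> list[int]:
    """Hour labels that should exist on `day` but have no data."""
    return sorted(set(expected_local_hours(day, tz_name)) - received)


# Data arrived for every hour except the nonexistent 2 AM: nothing to alert on.
print(missing_hours(date(2021, 3, 14), set(range(24)) - {2}))
```

Note that this compares labels only. On the day DST ends, the label 1 appears twice in expected_local_hours, so telling the two 1 AM hours apart would need the UTC offset as well.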
Coming to threshold checks, which usually send an alert when the data volume exceeds a certain value: on the day DST ends, 1 AM will actually contain data for two hours (as described above), and the day as a whole will contain 25 hours of data. So any threshold checks on count, data volume, etc. might fail and require special handling.
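To see why the 1 AM bucket doubles, here is a small sketch with a made-up feed of exactly 100 events per UTC hour (the feed and the bucket formats are assumptions for illustration). Bucketing by wall-clock hour merges the two 1 AM hours; adding the UTC offset to the bucket key keeps them apart.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")

# Hypothetical feed: 100 events in each of four UTC hours around the end of DST.
utc_hours = [
    datetime(2021, 11, 7, 4, tzinfo=timezone.utc) + timedelta(hours=i)
    for i in range(4)
]
events = [hour for hour in utc_hours for _ in range(100)]

# Bucket by local wall-clock hour only: both 1 AM hours land in the same bucket.
by_wall_clock = Counter(e.astimezone(NEW_YORK).strftime("%H:00") for e in events)

# Bucket by local hour plus UTC offset: the two 1 AM hours stay separate.
by_wall_clock_and_offset = Counter(
    e.astimezone(NEW_YORK).strftime("%H:00 %z") for e in events
)

print(by_wall_clock)
print(by_wall_clock_and_offset)
```

A per-hour threshold of, say, 150 would fire on the first grouping and stay quiet on the second, even though the underlying traffic is flat.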
I think I have scared you enough with data quality failures. Now let’s discuss some of the logic and reporting handling that can be required.
If your logic requires data from the previous/next hour for any calculation, the DST start and end days need special handling, so that in March the hour after 1 AM is 3 AM, not 2 AM. Similarly, in November, the hour after the first 1 AM is 1 AM again.
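One way to get the next hour right is to do the arithmetic in UTC and convert back to local time afterwards. The sketch below assumes Python 3.9+ and timezone-aware inputs; next_local_hour is a hypothetical helper name. Adding a timedelta directly to a local datetime moves the wall clock, which is exactly what goes wrong on DST days.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

NEW_YORK = ZoneInfo("America/New_York")


def next_local_hour(local_dt: datetime) -> datetime:
    """Return the instant one real hour later, expressed in the same time zone."""
    if local_dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    one_hour_later_utc = local_dt.astimezone(timezone.utc) + timedelta(hours=1)
    return one_hour_later_utc.astimezone(local_dt.tzinfo)


one_am_march = datetime(2021, 3, 14, 1, tzinfo=NEW_YORK)
one_am_november = datetime(2021, 11, 7, 1, tzinfo=NEW_YORK)  # the first 1 AM (EDT)

# Wall-clock arithmetic lands on 2 AM, an hour that never happened that day.
naive_next = one_am_march + timedelta(hours=1)

print(naive_next.strftime("%H:%M"))
print(next_local_hour(one_am_march).strftime("%H:%M %Z"))
print(next_local_hour(one_am_november).strftime("%H:%M %Z"))
```

The same approach works for the previous hour by subtracting the timedelta instead.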
Also, because of these 25 hours in November, any comparisons in your reports, like hour-over-hour or day-over-day comparisons, can show higher numbers (unless you decide to drop data for an hour, which is not recommended).
Another area of impact is scheduling. If you schedule your jobs/pipelines in a specific time zone, then on the DST days the schedules or SLA times might be impacted. Do double-check.
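As part of that double-check, a scheduled local time can be classified as normal, nonexistent (it falls in the skipped hour), or ambiguous (it happens twice). This is a sketch using the fold attribute of Python datetimes, assuming Python 3.9+; classify_local_time is a hypothetical helper, and how a scheduler actually treats such times depends on the scheduler.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive_local: datetime, tz_name: str) -> str:
    """Classify a wall-clock time as 'normal', 'nonexistent', or 'ambiguous'."""
    tz = ZoneInfo(tz_name)
    first = naive_local.replace(tzinfo=tz, fold=0)
    second = naive_local.replace(tzinfo=tz, fold=1)
    # Outside a transition, both readings of the wall clock share one offset.
    if first.utcoffset() == second.utcoffset():
        return "normal"
    # A skipped time does not survive a round trip through UTC.
    round_trip = first.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) != naive_local:
        return "nonexistent"
    return "ambiguous"


print(classify_local_time(datetime(2021, 3, 14, 2, 30), "America/New_York"))
print(classify_local_time(datetime(2021, 11, 7, 1, 30), "America/New_York"))
print(classify_local_time(datetime(2021, 6, 1, 2, 30), "America/New_York"))
```

A job scheduled for 2:30 AM Eastern falls in the first category on March 14, 2021, and a job scheduled for 1:30 AM Eastern falls in the second on November 7, 2021.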
My suggestion: don’t run your pipelines in a local time zone if it’s not required; run them in UTC. For the final deliverable data, sure, do the time zone conversions, but for everything else, UTC is always consistent.
Well, that’s it! This is all the impact I could think of right now. If you have any other considerations or impacts that I have missed, I will be happy to know. Please do leave a comment.
Hope this helps!
Happy Coding,
JD
Why DST is a Data Engineering nightmare? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.