EDA for a Brighter Data Culture

It’s not just a step of model development, it’s perennial documentation

Explore that old data and find the light. Photo by Joshua Sortino on Unsplash

Exploratory Data Analysis (EDA) is the first look at a dataset: a way to understand the variables, their distributions, and their relationships. It should be the first step of a thorough Data Science project, and a handbook to easily understand the domain of the data.

Thing is, EDA is usually a one-and-done deal: create the report, compile our insights, and that’s it. We do it for a fleeting purpose, and the result becomes one of many Notebooks that end up in a lost repository, or a sprint card in your project management tool.

So, how can we wring out the maximum impact from our explorations? The answer does not lie in technicalities, but rather in a shift in your abstractions: EDA is onboarding and documentation.

The Bus Factor Hits Harder On Data

For the company, the most important asset of data scientists isn’t their modeling chops or their toolkit: it’s their domain knowledge. They’re the ones who should know the intricacies of all datasets under their command, and the ones in the best position to formulate insights about how that information is used.

By working daily with the many company-specific datasets, the data scientists start aggregating domain-specific insights that aren’t raw code artifacts, AI modeling, or architecture: they start understanding the connection between variables, the reasoning behind feature selection outputs, the joining points between all tables.

When a data scientist leaves a team and that knowledge isn’t adequately propagated, the data culture of the business is crippled. It’s a structural void that is incredibly hard to fill, triggering major refactor efforts before the team regains its old velocity.

This is a bitter reality in all areas of IT: the tale of key personnel leaving and everything they coded being rendered untouchable can be told everywhere from backend to frontend. The paltry documentation left by those who came before becomes the ancient scroll of truth revered by all.

But while the damage can be mitigated in other areas by layers of code documentation and process descriptions, the domain knowledge lost in data science isn’t easily recoverable.

I know that this dataset is skewed, and this needs this preprocessing function before that specific outlier detection… But how can I pass this know-how forward?
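One hedged way to pass that kind of know-how forward is to write it down next to the code that depends on it. The sketch below is only an illustration: the log transform, the IQR rule, the function name `find_outliers`, and the toy `order_values` column are all assumptions picked for the example, not a method taken from any specific project.

```python
import math
import statistics


def find_outliers(values, k=1.5):
    """Flag outliers in a right-skewed, strictly positive column.

    Domain note (hypothetical): this column has a long right tail, so the
    values are log-transformed BEFORE the IQR fences are computed.
    Keeping the two steps in one function records the required order.
    """
    # Step 1: preprocessing. The log compresses the long right tail.
    logged = [math.log(v) for v in values]

    # Step 2: outlier detection with IQR fences on the transformed scale.
    q1, _, q3 = statistics.quantiles(logged, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr

    # Report the offending rows in their original units.
    return [v for v, lv in zip(values, logged) if lv < low or lv > high]


# Hypothetical toy column with one extreme value.
order_values = [10, 12, 15, 18, 20, 25, 30, 40, 5000]
outliers = find_outliers(order_values)
```

The docstring is the part that matters here: it carries the "this before that" knowledge that would otherwise live only in someone’s head.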

Essential Documentation for Analysts (and Scientists)

The document that usually holds all secrets about the data itself is the result of Exploratory Data Analysis (EDA). It is the structured statistical investigation of a dataset where a data scientist analyzes the integrity of the data, finds relationships between variables, and summarizes the gathered insights in didactic graphs and plots. The article below has a practical example of what an EDA report usually contains.

What is Exploratory Data Analysis?
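As a rough sketch of the three ingredients described above (data integrity, relationships between variables, and summaries), here is a small Python example that uses only the standard library. The toy `dataset` and the helper names `summarize_column` and `pearson` are hypothetical, and a real report would normally lean on a dataframe library and plots instead.

```python
import math
import statistics


def summarize_column(values):
    """Integrity and distribution summary for one numeric column.

    Missing entries are represented as None.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    mean_x = statistics.mean([x for x, _ in pairs])
    mean_y = statistics.mean([y for _, y in pairs])
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in pairs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in pairs))
    return cov / (spread_x * spread_y)


# Hypothetical toy dataset: one missing age.
dataset = {
    "age": [23, 35, None, 41, 52, 29],
    "income": [21000, 38000, 30000, 45000, 60000, 27000],
}

# Per-variable summaries plus one pairwise relationship.
report = {name: summarize_column(col) for name, col in dataset.items()}
age_income_corr = pearson(dataset["age"], dataset["income"])
```

Even a summary this small answers the first questions a newcomer asks: how much is missing, what a typical value looks like, and which variables move together.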

Most articles and books around will tell you that the exploration of a dataset is an important step in the Data Science model development cycle. It definitely is important, but it shouldn’t be just a step! The EDA report contains information that is invaluable to keep the data culture of a company alive and kicking, and brings newcomers up to speed with all the intricacies of any dataset. It can be used:

  • to explain how an existing model works;
  • as a base for new prediction model creation;
  • as a reference point for Data Analysts.

So, all you need to do is repurpose any reports from data science projects as a documentation source. While this looks easy from a managerial standpoint, it’s very hard to internalize the practice in your company.

Present and Record

Making documentation is hard and boring. It’s the job you do when your cloud provider goes down or when your sprint goals need some padding. It’s also absolutely necessary for the continued velocity of any development team.

"Good Code Documents Itself" And Other Hilarious Jokes You Shouldn't Tell Yourself

Data science documentation is extra hard because it involves math and statistics. Putting data insights into coherent word strings is an art, and therefore it’s very much uncommon in our practical programming world.

But one thing we’re accustomed to is presentations. The job of the data scientist often involves explaining technical lingo to laypeople from sales and management, or directly to clients… This makes presentations a part of the daily life of the Data Science team.

Then let’s add another presentation to the mix: the dataset presentation. Every time there’s a new dataset or a new data domain in your company, grab the programmers that sourced it together with the Data team and make a small presentation covering the origin of the data, the architecture behind eventual updates, the business interest in this data, and the overall statistical structure.

Why would you mix the sourcing and the statistical studies? That’s a personal choice of mine, based on experience: when Acquisition and Analytics don’t talk, pipeline diagrams get muddy. And if there’s one connection you don’t want to break, it’s the one with the root of your analysis.

Record that presentation and link it from every related internal Wiki page in your organization. Make it an onboarding checkbox, so everyone involved in the process has seen it at least once. And always point people who are interested in the modeling side of the company to the video!

Culture is abstract. It’s hard to keep knowledge and process alive in your team, let alone in a big company. The only way to create a habit is through sheer force of will and repetition, and I believe that using EDA to persist Data insights is a valid strategy to help your team go faster and be resilient!


EDA for a Brighter Data Culture was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.



from Towards Data Science - Medium https://ift.tt/3rVfTfN
via RiYo Analytics
