Building robust and scalable data platforms using data mesh
There is a growing demand for harnessing business insights from data, forcing companies to invest in infrastructure and personnel to transform themselves into data-driven organizations. As a result, organizations are becoming more data-savvy, and some have even started to reap the benefits. Despite these promising initiatives, the much-needed cultural shift toward becoming data-driven is still lagging in many organizations. A recent survey of 94 firms across multiple sectors reported that only 26.4% of them have successfully created a data-driven organization, with over 90% of the executives identifying the shift in company culture as the bottleneck [1]. The focus of this article is to present some of the ideas, from an infrastructural perspective, that have been put forward in recent years to radically transform an organization to be more data-driven.
We will begin by discussing how present-day data infrastructures are built to fail, then look at what we can learn from modern software development methodology, and use that as a segue to introduce data mesh, a data architecture framework for future-proofing data infrastructures and helping organizations become more data-driven.
What are we doing now?
At a high level, an archetypal data infrastructure at many organizations these days is made of three constituent parts: data sources, a data repository and data users. Data sources are the microservices and databases that generate the data, i.e. the producers of data. The data repository is either a data warehouse or a data lake, into which data is loaded via data pipelines. Data users are generally individuals such as business intelligence analysts and data scientists who are looking to generate value from these data repositories, i.e. the consumers of data. This has been the standard mode of operation since the advent of large-scale centralized data repositories, with modifications as technology evolved.
To an extent, the conventional data infrastructure framework has worked for a very long time. In fact, centralized data repositories brought more rigor to how data is managed within an organization. We have also witnessed an explosion in technical tools for creating and maintaining these data infrastructures. However, the lack of flexibility in these centralized data repositories impedes operating at scale as businesses take on new and diverse sources of data. It also creates silos within the organization, with task-oriented teams working on a specific task without any understanding of the upstream and downstream processes. This ultimately results in disconnected execution, a slower pace of innovation, and a lack of accountability within the organization, which is not an ideal outcome when the objective is to foster a data-driven culture. Lastly, as a result of this centralized framework, both producers and consumers of data end up relying on a rich and diverse tool stack to provide data to, and query data from, these centralized repositories, which substantially increases the total cost of building and maintaining data infrastructures. In summary, many of today's data platforms are becoming an exercise in moving data around, carried out by highly specialized data professionals working in silos, instead of a cohesive effort to extract value and business insights from the data.
This has happened before
Before we introduce the solution to this problem, let's look at a similar problem that existed in the software engineering realm. Just over a decade ago, it was common practice among software developers to use a monolithic architecture, in which an application is built as a single unit comprising three layers: database, user interface and server-side application. In other words, applications were built using one large codebase that captured all of the business logic. This practice had been around for decades and had proven effective for building functional and secure applications. However, as businesses grew, it turned out to be difficult to scale, both because of the effort of updating an extremely interlinked codebase and because of the financial cost of having to scale vertically. There could also be external factors, such as updates to a framework used in the application, that inadvertently create compatibility issues across multiple parts of the codebase, ultimately resulting in application failure.
To circumvent these issues, microservice-based software architecture emerged, in which an application is composed of a number of self-contained microservices, each encapsulating a piece of business logic. Microservice-based architecture was motivated by Domain-Driven Design (DDD) by Eric Evans, which in a nutshell says that each microservice in an application is created by the people who understand it, i.e. the domain experts. The benefit of this architecture is that it allows horizontal scaling, which is much more efficient than vertical scaling in terms of both cost and time. More importantly, microservice-based architecture means you can apply agile product management frameworks to deploy and maintain applications in a reliable and efficient manner. One of the early adopters of microservice-based architecture was Netflix, which migrated to a cloud-based microservice architecture to keep up with the growing demand on its video streaming services [2].
Paradigm Shift in Designing Data Infrastructures
A few years ago, Zhamak Dehghani presented a conceptual framework for designing the next generation of data infrastructure called Data Mesh [3–4]. Data Mesh was presented as a sociotechnical approach to revolutionize how data infrastructures are designed. Going back to our analogy of how the emergence of microservice architecture transformed software development, data mesh was proposed to harness the same paradigm for building the data platforms of the future. It is a confluence of distributed domain-driven architecture, self-serve platform design and product thinking. To put that in simple words, the idea is to transition from centralized data repositories to cross-functional, domain-oriented teams where each business unit (human resources, marketing, sales etc.) treats its data as a product, with the common objective of serving secure data to the rest of the organization in a usable manner via a common infrastructural platform.
The underlying benefit of data mesh is that data ownership is decentralized, with accountability placed on the team (business unit) that is responsible for each data product. This ultimately increases the accessibility of high-quality data, giving data consumers an opportunity to gather insights more efficiently and accelerating the pace of innovation in the organization. Furthermore, the flexible and decentralized nature of data mesh means that the platform can scale as the organization and its sources of data grow. To sum it all up, through product thinking and domain-driven architecture, data mesh empowers individual business units to be more data-driven and lets data consumers create business value through rapid experimentation.
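To make the idea of treating data as a product a little more concrete, here is a minimal sketch in Python. It is not taken from Dehghani's articles or from any particular data mesh tool; the names DataProduct, MeshCatalog, owner_domain and the sales.daily_orders example are all hypothetical, and the in-memory registry only stands in for the common infrastructural platform. The point is simply that each product carries its owning business unit and a published schema, so a contract violation is traced back to the accountable team.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class DataProduct:
    """A dataset published by one domain team (hypothetical structure)."""
    name: str
    owner_domain: str              # business unit accountable for this data
    schema: Dict[str, type]        # column name -> expected Python type
    load: Callable[[], List[Row]]  # how the owning team serves its rows

    def read(self) -> List[Row]:
        """Return the rows, failing loudly if they break the published schema."""
        rows = self.load()
        for row in rows:
            for column, expected in self.schema.items():
                if not isinstance(row.get(column), expected):
                    raise ValueError(
                        f"{self.owner_domain}/{self.name}: column "
                        f"{column!r} violates the published schema"
                    )
        return rows


@dataclass
class MeshCatalog:
    """Stand-in for the shared self-serve platform: a registry of products."""
    products: Dict[str, DataProduct] = field(default_factory=dict)

    def publish(self, product: DataProduct) -> str:
        """Register a product under '<domain>.<name>' and return that key."""
        key = f"{product.owner_domain}.{product.name}"
        if key in self.products:
            raise KeyError(f"{key} is already published")
        self.products[key] = product
        return key

    def discover(self, owner_domain: str) -> List[str]:
        """List the keys of all products owned by one business unit."""
        return sorted(
            key for key, product in self.products.items()
            if product.owner_domain == owner_domain
        )

    def read(self, key: str) -> List[Row]:
        """Consumers read through the catalog, never from the source system."""
        return self.products[key].read()


# The sales team publishes a product; any consumer can then read it.
catalog = MeshCatalog()
catalog.publish(DataProduct(
    name="daily_orders",
    owner_domain="sales",
    schema={"order_id": int, "amount": float},
    load=lambda: [{"order_id": 1, "amount": 19.99}],
))
rows = catalog.read("sales.daily_orders")
```

In a real platform the load callable would be a query against storage owned by the domain team, and the schema check would more likely live in a governance layer than in the product itself; the sketch keeps both in one place only to stay short.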
Currently, there are a handful of organizations that have adopted some elements of data mesh in their practice. For instance, Zalando, Europe's biggest online platform for fashion, adopted the principles of data mesh to address some of the issues it was facing with centralized data lakes and to simplify data sharing [5]. In that effort, they introduced a concept called 'Bring Your Own Bucket', where data producers can plug their data into the centralized data lake. Data consumers, on the other end, can access this data through a centralized data infrastructure layer. At the same time, the flow of data within the organization is overseen by a centralized governance layer. In addition to these infrastructural changes, Zalando also introduced behavioral changes to decentralize data ownership while ensuring data quality within the organization. Similar outcomes were also reported at Saxo Bank, the United States Department of Veterans Affairs and Roche [6].
Looking Ahead
Data mesh is not a technical product that you can implement out of the box. It is a set of principles that serve as a blueprint for building robust and scalable data platforms. It presents an opportunity for executives and data professionals to apply software engineering rigor to building the next generation of data infrastructures, fostering a data-driven culture across the organization. However, every organization's data needs are distinct, and the proposed data mesh model will need modifications to fit each organization's data strategy. At the same time, if an organization is in the early stages of its data journey, i.e. it has low data and/or engineering talent maturity, implementing data mesh may not have a tangible business case. Nevertheless, it might still be worthwhile to keep the principles of data mesh in mind as a means to preempt and address potential bottlenecks as data needs evolve. Lastly, given that data mesh is still in its early stages, operationalizing it requires considerable capital and personnel investment, alongside coordinated central efforts to build and maintain the required distributed infrastructure.
References
[1] NewVantage Partners. The Quest to Achieve Data-Driven Leadership: A Progress Report on the State of Corporate Data Initiatives, in Data and AI Leadership Executive Survey 2022.
[2] Harris, C. Microservices vs. monolithic architecture. Available from: https://www.atlassian.com/microservices/microservices-architecture/microservices-vs-monolith.
[3] Dehghani, Z. How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh. 2019; Available from: https://martinfowler.com/articles/data-monolith-to-mesh.html.
[4] Dehghani, Z. Data Mesh Principles and Logical Architecture. 2020; Available from: https://martinfowler.com/articles/data-mesh-principles.html.
[5] Databricks. Data Mesh in Practice: How Europe's Leading Online Platform for Fashion Goes Beyond the Data Lake. 2020.
[6] Harvard Business Review. Beyond Technology: Creating Business Value with Data Mesh. 2022; Available from: https://www.thoughtworks.com/en-au/what-we-do/data-and-ai/data-mesh/creating-business-value-with-data-mesh-whitepaper.
Data Mesh: Future Proofing Data Infrastructures was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.