Data Mesh
Musings / 08.03.2021
Data mesh is currently a hot topic among data architects and is generating a lot of interest all over the globe. We are brave enough to call it the next generation of enterprise data platform architecture. In this blog we will describe WHAT data mesh is, WHY it should be considered and HOW it should be implemented.
WHAT IS DATA MESH?
To quote Zhamak Dehghani, who came up with the terminology: “a data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design.” In our own very modest wording, data mesh is a type of data architecture that federates data ownership among domains, whose data owners are held responsible for providing their data as products, while enabling communication between data distributed across different locations. To be fair, there are a lot of definitions to be found online, but pretty much all of them share the same idea, which is good. In other words, we are all on the same page.
Data mesh pillars
Like every popular architecture, data mesh rests on a set of pillars: guiding principles that describe how data should be owned, packaged, served and governed across the company.
The four pillars are:
- Domain-oriented, decentralized data ownership and architecture
- Creating data products
- A self-service data infrastructure platform
- Federated governance
What do data mesh pillars actually say?
Well, although it is obvious, the first data mesh pillar advises us not to centralize data ownership. In most companies, a data warehouse or data lake team is responsible for the data, its usage and access provisioning. Data mesh instead stresses that ownership of data goes to the team that knows the data best and can package it for others in the best possible way. In short: decentralization of data ownership.
The second pillar states that there must be data products. In more or less every company, data is used by data scientists, data engineers, data analysts and business analysts, which is in fact quite a lot of people with very diverse requirements. Hence, for each kind of data usage, it is good to provide a dedicated data product through which a specific team can access the data it needs.
A self-service data infrastructure platform is the third data mesh pillar. Well, it just makes sense. Having multiple data products can be messy and confusing, not to mention a lot of work. A more practical solution, which data mesh advocates, is to have a unified platform where data products are created and consumed, so that the overall data platform consists of one to many data nodes.
The fourth pillar, federated governance, keeps all of this from falling apart. The domains agree on a shared set of rules and standards so that their data products can work together, while each domain stays in charge of its own data.
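To make the pillars a bit more tangible, here is a minimal Python sketch of how a data product and a self-serve platform could relate to each other. Everything in it is our own assumption for illustration: the names `DataProduct`, `SelfServePlatform`, `publish`, `discover` and `owner_of`, the example products and the field layout do not come from any data mesh standard or tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataProduct:
    """A dataset packaged by a domain team for others (hypothetical model)."""
    name: str
    owner_domain: str                 # the team that knows the data best
    description: str
    output_port: str                  # where consumers read it, e.g. a table name
    consumers: Tuple[str, ...] = ()   # intended audiences


@dataclass
class SelfServePlatform:
    """One shared place where data products are published and discovered."""
    catalog: Dict[str, DataProduct] = field(default_factory=dict)

    def publish(self, product: DataProduct) -> None:
        # Product names must be unique so consumers can find them reliably.
        if product.name in self.catalog:
            raise ValueError(f"data product {product.name!r} already exists")
        self.catalog[product.name] = product

    def discover(self, consumer: str) -> List[str]:
        # Return the names of all products aimed at the given audience.
        return sorted(
            p.name for p in self.catalog.values() if consumer in p.consumers
        )

    def owner_of(self, name: str) -> str:
        # Ownership stays with the publishing domain, not with the platform.
        return self.catalog[name].owner_domain


# Example usage with made-up domains and products.
platform = SelfServePlatform()
platform.publish(DataProduct(
    name="orders_daily",
    owner_domain="sales",
    description="Daily order totals",
    output_port="sales.orders_daily",
    consumers=("analysts", "data scientists"),
))
platform.publish(DataProduct(
    name="campaign_clicks",
    owner_domain="marketing",
    description="Clicks per campaign",
    output_port="marketing.campaign_clicks",
    consumers=("analysts",),
))
print(platform.discover("analysts"))
print(platform.owner_of("orders_daily"))
```

Note that the platform only stores and serves the catalog. Responsibility for each product stays with its `owner_domain`, which is the point of the first pillar.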
Data mesh concept
Anyone who has ever expressed any interest in data mesh has probably come across the following diagram.
As the popular saying goes, a picture is worth a thousand words, and this diagram is probably why. Simple yet informative, it shows how a data mesh works in a specific enterprise. If the four pillars are followed, one should end up with an organization that is aligned around creating data products, which are built on actual consumption by analysts and data scientists and which bring data engineering and data analysis closer together.
WHY DATA MESH?
We are sure that, after understanding the data mesh concept, the reader already has plenty of questions (and perhaps answers) on whether data mesh would be a good or bad fit for their company. Still, let’s go through why we should at least consider a data mesh architecture.
There is a popular notion called the curse of the data lake monster. Heard about it? Good! Now let’s dig deeper. A single unified data model sounds practical. If we spend enough time building a robust data infrastructure, use cases will present themselves later. If we build it, they will come! But will they? Often, data users such as data scientists or data analysts sit in different teams in a company’s hierarchy than the data engineers. While scientists and analysts are more aligned with the business perspective, the data engineers are much closer to the infrastructure behind it. This already hints at a problematic way of thinking: the tendency to consider a data lake purely as an infrastructure problem. Also, we can’t forget about Conway’s law: organizations tend to design systems that mirror their own communication structures.
Let’s take a look at the transition diagram. Technology is evolving over time, but it looks like structuring and managing data is staying the same. Whether we are talking about Warehouses, Data Lakes or the Cloud, we keep building a centralized platform to store the data and thinking that is good enough. Someone will use it if we just collect and organize the data for them, so why is that not working out quite as imagined?
Never before has so much money and time been invested in building data platforms that can provide data-driven solutions. But where are the results? A centralized data infrastructure often causes far more complexity than it should, and it also prevents management from seeing the true value of the investment. Knowledge? Well, we can also say goodbye to that. There is always a person who has the most experience and knows every single detail in the backend, and surprise, surprise… they often quit. The knowledge leaves with them, and you are stuck with months of learning whatever is in the background. And while we are at it, how about we rewrite it or migrate the data somewhere else?
With all that being said, a shift to a modern distributed architecture sounds quite good now, doesn’t it?
This would not be a data engineering blog if we did not mention the word monolithic at least a few times. Well, a centralized architecture is of course followed by huge monolithic structures on top of it. And building microservices for accessing the centralized data will not help you there.
So, how do we shift?
You guessed it. By switching to a data mesh architecture. But how do we implement it?
DATA MESH IMPLEMENTATION
To mesh or not to mesh: that is the question indeed. We are not saying everyone should switch, but who should?
Ask yourself the following:
- How many data sources does your company have?
- How many analysts, engineers, scientists, product managers do you have on your data team(s)?
- How many functional teams rely on your data for decision-making? Sales, operations, marketing, procurement…?
- How often is the data engineering team a bottleneck for the implementation of new data products?
- How high of a priority is data governance for your organization? (That low? Think again!)
If you score high on most of these questions, then your organization is in the data mesh sweet spot and you would be wise to join the data revolution.
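The questionnaire above can be turned into a rough self-assessment. The sketch below is only one possible reading of it: the 1 to 5 scale, the threshold of 4 for a "high" score, the rule that "most" means more than half, and all names such as `QUESTIONS` and `in_sweet_spot` are our assumptions, not part of any official data mesh method.

```python
# Hypothetical keys, one per question from the list above.
QUESTIONS = (
    "data_sources",
    "data_team_size",
    "dependent_functional_teams",
    "engineering_bottleneck_frequency",
    "governance_priority",
)


def in_sweet_spot(scores: dict, high: int = 4) -> bool:
    """Return True if more than half of the questions are scored as high.

    `scores` maps each question key to a self-rating from 1 (low) to 5 (high).
    """
    missing = [q for q in QUESTIONS if q not in scores]
    if missing:
        raise ValueError(f"missing answers for: {', '.join(missing)}")
    high_count = sum(1 for q in QUESTIONS if scores[q] >= high)
    return high_count > len(QUESTIONS) / 2


# Example: a company with many sources and teams, but relaxed governance.
example = {
    "data_sources": 5,
    "data_team_size": 4,
    "dependent_functional_teams": 4,
    "engineering_bottleneck_frequency": 3,
    "governance_priority": 2,
}
print(in_sweet_spot(example))
```

Treat the result as a conversation starter rather than a verdict. The threshold is arbitrary and should be tuned to your own organization.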
Moving from data lake to data mesh
Implementation for companies starting fresh is easier than a digital transformation of existing infrastructure. When starting fresh, you follow mesh guidelines and organize your data accordingly, creating a culture in your company based on the mesh principles of data democratization. Existing enterprises need to go through a transformation. That takes time, energy and focus, but there is a way. There are a couple of options (more details can be found in the blog post Data mesh applied).
You can go for a full data mesh and decentralize ownership of both raw data and transformed data, or you can use a hybrid approach and ‘mesh’ only one of the two. It depends on your resources. In time, you can make changes and decentralize the part that is left.
Decisions, decisions… Yes, this one is a hard one. But let us help you to some extent. First, rate your company by the previously described standards. Then ask yourself:
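The difference between the options comes down to who owns which layer of data. As a small illustration, the sketch below classifies a setup from the owners of the raw and the transformed data. The names `Owner` and `classify` and the three labels are hypothetical and only mirror the wording used here.

```python
from enum import Enum


class Owner(Enum):
    CENTRAL_TEAM = "central team"
    DOMAIN_TEAMS = "domain teams"


def classify(raw_owner: Owner, transformed_owner: Owner) -> str:
    """Name the setup by how many data layers are owned by the domains."""
    decentralized = [raw_owner, transformed_owner].count(Owner.DOMAIN_TEAMS)
    return {0: "centralized", 1: "hybrid", 2: "full data mesh"}[decentralized]


# Example: raw data stays central, transformed data moves to the domains.
print(classify(Owner.CENTRAL_TEAM, Owner.DOMAIN_TEAMS))
```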
Do we value serving over ingesting? Discovering and using over extracting and loading? Do we prefer publishing events as streams over centralized pipelines? And finally, do we want an ecosystem of data products over a centralized data platform?
If yes, then data mesh is the right concept for you. You end up with a paradigm that breaks down the big data monolith into a harmonized, collaborative and distributed ecosystem. It enables close collaboration between data scientists and data engineers, which leads to getting results faster and keeping them more aligned with your business goals.