Data engineer’s guide to data governance (part 2/3)

Tihana Britvic, Ana Marija Galic

DATA ENGINEERS

In our previous blog, we talked about what data governance is and why we do it, and gave you some “spicy” details on how much it may cost and how long it can take. You can check it out here. Now it’s time to talk about who does data governance and how they do it.

WHO is involved?
HOW to do Data Governance?
- Are there any tools for data governance?

Who does data governance?

Accountability for data is a key concept in data governance (DG). It explains why responsibility and accountability within the organization are the key factors in making data governance work. The organization around DG requires a hierarchy of some sort to enable issue resolution, monitoring, and direction setting, which leads to the existence of various designated roles for data governance. Some roles that are found everywhere, like data owners and data stewards, will be discussed in further detail. There are many more that we will not go into, such as information owners and data custodians.

Although a “data governance department” sounds nice and desirable, it does not exist, and should not. The reason is that we want DG to be a part of everyday business operations, and by creating a dedicated team, we risk segregating it from them. Most of the time, the DG organization is a virtual organization made up of business and IT personnel. What we do need to emphasize here is the need for communication between employees with DG roles. Please note that we are not covering the full hierarchy of DG roles here, so we will not talk about things like DG councils, forums, or data management, but please be aware that they do exist.

Data Owners

In short, data owners, sometimes referred to as data sponsors, are (usually senior) people within the organization who are accountable for the quality of one or more defined datasets. They should make sure that definitions are set and that action is taken on data quality issues. They are also responsible for putting data quality reporting in place (something we will discuss in our next blog). Data owners should be able to fill in or update values in data. For that, they will need detailed knowledge about the data, as well as access to the current correct values, even if that sometimes means contacting a customer or carrying out a deep investigation.

The reason why it is recommended for data owners to be senior employees in the organization is their authority. However, this level of seniority often means that they are unlikely to have the time to be involved in daily activities related to data quality. Therefore, they can be supported by data stewards and data quality managers.

Data Stewards

Data stewards (or data champions) create policies, put them into place, and enforce them, and they correct data quality issues on a daily basis, since data owners don’t have the time. Data stewards don’t need to do all the work themselves (e.g., data engineers help with automation), but they are the ones who should advise others on what to do. Data stewardship is a key part of any data governance program, which needs the right combination of processes, technology, and people to be effective.

So, what is the difference between data owners and data stewards? Well, as stated above, the data owner will take on the overall ownership of a dataset, but they will not have the time to be involved with the specific activities for keeping the data clean on a regular basis.

A data steward, on the other hand, will be heavily involved in the specifics of how to achieve whatever data objectives were set up, but they need to consult the data owners for the specifics.

Data Producers

A data producer is anyone (a person or a department) who creates, updates, or deletes data. Normally, they should assure data quality within source systems (e.g., making sure there are no empty fields where they shouldn’t be), but they cannot be responsible for the accuracy of the data.
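As a rough illustration of the “no empty fields” check mentioned above, here is a minimal Python sketch. The field names and sample records are hypothetical, and a real source system would more likely enforce this through its own constraints or validation rules. Note that the check only covers completeness: a filled-in email address can still be wrong, which is exactly why producers cannot vouch for accuracy.

```python
from typing import Any

# Hypothetical list of fields that the source system must always fill in.
REQUIRED_FIELDS = ("customer_id", "email", "country")


def find_empty_required_fields(record: dict[str, Any]) -> list[str]:
    """Return the required fields that are missing, None, or blank in a record."""
    empty = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            empty.append(field)
    return empty


# Made-up sample records, for illustration only.
records = [
    {"customer_id": 1, "email": "ana@example.com", "country": "HR"},
    {"customer_id": 2, "email": "   ", "country": None},
]

for record in records:
    problems = find_empty_required_fields(record)
    if problems:
        print(f"record {record['customer_id']}: empty required fields {problems}")
```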

Data Consumers

A data consumer is anyone who consumes data, whether raw, enriched, or provided in the form of reports. Data consumers then use this data for planning, decision making, or creating machine learning algorithms, for example. Consumers should know who to contact if the data quality is poor.

How?

Before we end this blog, let us look briefly at how to do data governance and list some tools that might help with it. We mentioned in our previous blog that this is a tailored process, so there is no exact algorithm that explains how to implement data governance. However, there are some proposals that we recommend following. We will not go into details, since this could be a blog post in itself, but we would recommend reading the books written by John Ladley.

The entire process starts, naturally, with an assessment of the maturity of the organization wishing to implement DG, followed by providing a clear vision to the whole organization (including metrics, etc.). In the next step, we would map the business and financial values tog. When that part is done, we can start with a functional design, where the outcomes are policies, principles, and process designs. After that, we can create a governing framework design, where we place the functional design from the previous step into the organizational framework with complete roles, etc. Only then can we go into the road map step, where the details around the “go live” events of DG are planned (basically, how we are going to get from a non-governed to a governed state of datasets).

To add to this, here is a vivid picture of how we see this process. Please remember that DG is a circular process, so after its rollout, we can go back to square one to adjust our process. In this picture, you can see some tools (like the RACI model or surveys) that can be used in specific steps, as well as the outcomes of each phase (or concerns in the beginning). This image might be messy, but it creates lots of (nervous) laughs, so it is a good conversation piece when we talk about DG in our company. 😊

Are there any tools for data governance?

There are tools that support some parts of data governance, but to be blunt, there are no tools for the whole data governance process.

To name a few, there is Apache Gobblin for data integration, Informatica MDM or The Profisee Platform for master data management (MDM), as well as data catalogs like Alation or Lumada (Waterline) Data Catalog. From our experience, these tools can often become quite expensive, are not always user-friendly, and on top of that, they do not cover all the needs for data governance. Take data catalog tools, for example: they can make it easier to bring the data’s metadata into one place (table names, schemas, attributes, descriptions), and can even provide fresh data samples, but in the end, it is up to data owners or stewards to explain the data, recognize and manage the confidential attributes, and do all the other work that makes data easily accessible and usable to others in the organization (and sometimes beyond).
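To make the split between what a catalog can collect and what people still have to add a bit more concrete, here is a small Python sketch of a catalog entry. It is not modeled on the API of any of the tools named above; the class names, fields, and sample values are all hypothetical. Table names, schemas, and attribute types are the kind of metadata a tool can gather, while descriptions and confidentiality flags are left for owners or stewards to fill in.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    data_type: str
    description: str = ""       # to be written by a data owner or steward
    confidential: bool = False  # to be flagged by a data owner or steward


@dataclass
class CatalogEntry:
    table_name: str
    schema: str
    owner: str
    steward: str
    attributes: list[Attribute] = field(default_factory=list)

    def undocumented_attributes(self) -> list[str]:
        """Names of attributes that nobody has described yet."""
        return [a.name for a in self.attributes if not a.description.strip()]


# Made-up entry, for illustration only.
entry = CatalogEntry(
    table_name="customers",
    schema="crm",
    owner="Head of Sales",
    steward="CRM data steward",
    attributes=[
        Attribute("customer_id", "integer", "Internal customer key"),
        Attribute("email", "string", "Primary contact email", confidential=True),
        Attribute("segment", "string"),
    ],
)

print(entry.undocumented_attributes())
```

A list like this gives a steward a simple to-do list of attributes that still need an explanation.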

Other tools and models are often used when establishing data governance processes, such as surveys, SWOT (strengths, weaknesses, opportunities, threats) analysis, RACI (responsible, accountable, consulted, informed) responsibility assignment matrices, and anything else we think could help us with DG.
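A RACI matrix is simple enough to sketch in a few lines of Python. The activities and role assignments below are hypothetical examples based on the roles described in this blog, not a recommended setup. The validation assumes a common RACI convention: every activity has exactly one Accountable role and at least one Responsible role.

```python
# Hypothetical RACI matrix: activity -> {role: one of "R", "A", "C", "I"}.
RACI = {
    "Set dataset definitions": {
        "data owner": "A",
        "data steward": "R",
        "data producer": "C",
        "data consumer": "I",
    },
    "Fix daily data quality issues": {
        "data owner": "A",
        "data steward": "R",
        "data producer": "R",
        "data consumer": "I",
    },
}

VALID_CODES = {"R", "A", "C", "I"}


def validate_raci(matrix):
    """Return a list of problems; an empty list means the matrix passes the checks."""
    problems = []
    for activity, assignments in matrix.items():
        codes = list(assignments.values())
        unknown = sorted(set(codes) - VALID_CODES)
        if unknown:
            problems.append(f"{activity}: unknown codes {unknown}")
        # Assumed convention: exactly one Accountable role per activity.
        if codes.count("A") != 1:
            problems.append(f"{activity}: expected exactly one Accountable role")
        # Assumed convention: somebody has to do the actual work.
        if "R" not in codes:
            problems.append(f"{activity}: no Responsible role assigned")
    return problems


print(validate_raci(RACI))
```

Running a check like this whenever the matrix changes helps catch activities that nobody is accountable for.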

Data governance is not an easy task, and its key component is human. In the end, people are the ones producing, using, or analyzing data.

In our next blog we will talk more about MDM, data quality, data lineage, and data catalogs, so stay tuned!