Why your data pipelines need a fail-safe
Product / 28.01.2021
Elaborate data processing pipelines are relied upon to work like clockwork. But much like clocks, they always have a tendency to break.
The importance and sheer volume of data being produced on a daily basis is nothing short of incredible. We are at a point where we can tell an entire person’s story through it – their likes, their dislikes, the people they interact with, the things they want, the things they don’t even KNOW they want – and that’s just one person. When we scale it up to billions, the fact that data exchange flows as well as it does is nothing short of an engineering miracle.
There is always a but…
Of course, what separates an engineering miracle from an actual miracle is its imperfections. And it has quite a few. Elaborate data processing pipelines are relied upon to work like clockwork. But much like clocks, they always have a tendency to break. Be it miscommunication between teams, badly formatted data, unpredictable server failures or just good old-fashioned bugs, the fact of the matter is that, in a field like data engineering, something going wrong is almost a given.
And as these pipelines slowly migrate to platforms that are evolving faster than they are, our priorities should shift from standard prevention of failures to mechanisms allowing swift and accurate recovery from them.
Persistor is just such a tool.
Tackling the root cause
We have worked as a consultancy firm for several years now, for many different clients. In that time, we came to realize there were a lot of overlapping – if not downright identical – issues between all of them. And instead of giving the same advice and rebuilding the same tools over and over again, we decided to strike the issues at their core.
With Persistor in particular, the issue was the loss of messages – raw data – due to some part of the pipeline failing. These were messages that, for one reason or another, could not simply be re-sent, meaning the loss was permanent. And given the aforementioned importance and volume of data, even a “minor” loss amount-wise could turn out to be a major one in the grand scheme of things. And that was unacceptable.
Sometimes it’s just better to K.I.S.S.
The ultimate solution to the problem is really as simple as the Persistor itself. If we don’t want to lose messages – we’ll back them up somewhere as they arrive in the pipeline!
All the Persistor does, at its core, is store the raw message payloads to storage blobs as they arrive through your cloud messaging services. It’s cheap, it’s fast, it’s efficient and it’s modular. The last part is of particular importance to us: you should think of Persistor as an addition to already existing pipelines. It works completely independently and does not disrupt whatever workflows are already running.
It does not alter your data: Persistor reads. Persistor writes. That’s all there is to it.
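To make the read-then-write idea concrete, here is a minimal sketch in Python. It is not Persistor’s actual code: the `persist_raw` function, the date-partitioned `blob_name` layout and the use of a local directory as a stand-in for a cloud storage bucket are all assumptions made purely for illustration.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def blob_name(received_at: datetime, message_id: str) -> str:
    """Build a date-partitioned blob name (hypothetical layout)."""
    return f"{received_at:%Y/%m/%d/%H}/{message_id}.bin"


def persist_raw(
    payload: bytes,
    root: Path,
    message_id: Optional[str] = None,
    received_at: Optional[datetime] = None,
) -> Path:
    """Write the payload byte-for-byte under `root` and return its path.

    `root` is a local directory standing in for a storage bucket
    (an assumption, so the sketch runs without any cloud account).
    """
    message_id = message_id or uuid.uuid4().hex
    received_at = received_at or datetime.now(timezone.utc)
    target = root / blob_name(received_at, message_id)
    target.parent.mkdir(parents=True, exist_ok=True)
    # No parsing, no validation, no transformation: the bytes go out as they came in.
    target.write_bytes(payload)
    return target
```

The point of the sketch is what it leaves out. Because the payload is never decoded or parsed, a badly formatted message is backed up just as reliably as a valid one.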
For data’s sake…
While the obvious benefit is that it makes for an efficient fail-safe, the practice of storing raw data has undeniable benefits to begin with. Data analysis and data science rely on extracting and processing certain parts of the data, tossing the other ones aside as redundant. However, years, or maybe even just months, down the line, as your business and clientele expand, you may come to realize that redundant data may not have been so redundant after all. And if you don’t happen to have the original data – well, it’s unfortunate, to say the least.
That’s not even mentioning the benefit of smoother communication between clients and developers. As a developer myself, I know that I can understand and act much faster on a client’s demands when I have a firmer grasp of what is and isn’t possible to do. And seeing the actual data is the most important part of that! Not to mention, clients themselves can be advised on how their information gathering could be expanded in useful ways.
Maybe we’re stating the obvious here, but it’s important to understand that a lot of this stuff is still in its infancy, and a lot of the more “obvious” things still haven’t been recognized and set in stone. But we’re hoping that, with Persistor, at least one of those things now will be.
It should also be noted that Persistor itself is highly scalable, because it’s built as a serverless component. We designed it specifically around Cloud Functions for GCP and Azure Functions so the provided service can take full advantage of their scaling mechanisms. And as we slowly work towards a standalone Docker component, users themselves will be able to define and control the level of scalability, depending on their needs and systems.
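As a rough illustration of what a serverless entry point of this kind can look like, here is a small Python sketch. It assumes the `(event, context)` signature of a Pub/Sub-triggered background function on GCP, where the message body arrives base64-encoded under `event["data"]`. The `make_handler` factory and the `write_blob` callback are hypothetical stand-ins for a real storage client call, and none of this is taken from Persistor’s source.

```python
import base64
from typing import Any, Callable


def make_handler(write_blob: Callable[[str, bytes], None]):
    """Return a function with the (event, context) background-function shape.

    `write_blob(name, data)` is a hypothetical callback; in a real
    deployment it would wrap an upload to a storage bucket.
    """

    def handler(event: dict, context: Any) -> None:
        # Decode only the transport encoding; the payload itself is left untouched.
        payload = base64.b64decode(event.get("data", ""))
        # Name the blob after the event ID, falling back to a fixed
        # placeholder name when the context does not carry one.
        event_id = getattr(context, "event_id", None) or "unknown"
        write_blob(f"{event_id}.bin", payload)

    return handler
```

Since the handler keeps no state between calls, the platform is free to run as many copies in parallel as the incoming message rate requires, which is where the scaling comes from.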
Finally – and perhaps most importantly – Persistor is an open-source project.
Persistor is just the first step in a much larger goal of advancing our community of data engineers. And, in turn, we hope that they’ll reach back and help us improve wherever possible. That is why the open-source approach was so important to us.
Of course, one of the most common misconceptions about open source is that the code is free until we decide it suddenly isn’t. We want to assure everyone that is not the case – we are using the Apache license, which guarantees you can use the Persistor from now until the end of time without paying us a dime. This is a tool built for you. Stopping or disincentivizing you from using it defeats the point.
The Persistor is designed for GCP and Azure messaging services. Both variants are available on our GitHub!