Tackling the challenge of ever-changing data

Ivan Vidosevic

DATA ENGINEER

Having the right schema validation service is crucial to getting your data processing right.

In today’s world, there are huge amounts of data being exchanged every second.

Just look at our smartphones, whose apps generate data even when not in use, car sensors sending vehicle information to the cloud, and weather stations that constantly collect and process large amounts of data.

How to play when the rules are constantly changing

Having the right schema validation service is crucial to getting your data processing right. This process makes sure that every little chunk of information coming from your app, website, machinery, etc. is valid, sorted into the right box, and then sent on to the correct processing step.
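To make the idea concrete, here is a minimal, hypothetical sketch of such a validation gate in Python. The schema format and field names are invented for illustration and are not Janitor's actual interface:

```python
# Illustrative only: a schema modeled as a dict of field name -> expected type.
SCHEMA_V1 = {"device_id": str, "temperature": float, "timestamp": int}

def validate(message: dict, schema: dict) -> bool:
    """Return True only if every schema field is present with the expected type."""
    return all(
        field in message and isinstance(message[field], expected)
        for field, expected in schema.items()
    )

# A well-formed message passes; an incomplete one is rejected
# before it reaches the next processing step.
ok = validate({"device_id": "m-1", "temperature": 21.5, "timestamp": 1700000000}, SCHEMA_V1)
bad = validate({"device_id": "m-1"}, SCHEMA_V1)
print(ok, bad)  # -> True False
```

In a real pipeline this check would sit between ingestion and processing, routing invalid messages to a dead-letter destination instead of silently dropping them.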

This schema always needs to be up to date, so you won’t miss anything.

New attributes are generated by your data producers regularly. Think of how often the applications on your phone need an update. All these updates mean that new data attributes become available to the data platforms behind those apps.

This is a good thing, as more data means more insights and more ways to make better business decisions. Getting those new attributes ready for processing on your platform is crucial, because by ignoring them you could lose valuable time, and in a worst-case scenario, even a lot of money.

This is where things can sometimes get a bit tricky.

Notifications can arrive late or not at all, and the purpose of certain attributes can be unclear. Say you are working with a lot of machine-generated data and these machines have just received a firmware upgrade. As a result, they start sending your data platform a lot of new attributes whose use is completely unclear. If you hold off updating your schema until you have a complete mapping of the data, you run the risk that vital information on the state of these machines is unavailable, maintenance is done too late, and huge problems follow.
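One simple way to spot such surprises early is to diff each incoming message against the fields the current schema knows about. This is a hedged sketch only; the schema representation is an assumption, not Janitor's internals:

```python
# Illustrative only: the current schema modeled as a dict of field -> type.
SCHEMA_V1 = {"device_id": str, "temperature": float}

def unknown_attributes(message: dict, schema: dict) -> set:
    """Fields the message carries that the current schema does not know about."""
    return set(message) - set(schema)

# After a firmware upgrade, machines start sending an extra field:
msg = {"device_id": "m-1", "temperature": 21.5, "vibration_hz": 50.1}
print(unknown_attributes(msg, SCHEMA_V1))  # -> {'vibration_hz'}
```

Flagging these fields instead of rejecting the whole message lets you keep processing known data while the new attributes are investigated.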

“Turn and face the strange Ch-Ch-Changes”

In our work with big data, we noticed that this pattern of issues kept recurring, so we decided to design a product that would put an end to it at its very core.

That’s how the idea of Janitor was born.

At a very high level, Janitor is a mechanism that makes sure that your data schema is always up to date, so that every new schema is automatically recognized, and the data successfully validated. A schema defines the structure and the type of contents that each data element within the message can contain. For a number of supported data formats, Janitor registers the schema information, keeps track of all different versions, and uses them to validate the messages.
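The core bookkeeping described above, registering each schema and tracking its versions, can be sketched roughly as follows. The class and method names here are invented for illustration and are not Janitor's actual API:

```python
import hashlib
import json

class SchemaRegistry:
    """Toy registry: fingerprints each schema and assigns version numbers."""

    def __init__(self):
        self._versions = []   # schemas in registration order
        self._by_fp = {}      # fingerprint -> version number (1-based)

    @staticmethod
    def _fingerprint(schema: dict) -> str:
        # Canonical JSON so field order does not change the fingerprint.
        return hashlib.sha256(json.dumps(schema, sort_keys=True).encode()).hexdigest()

    def register(self, schema: dict) -> int:
        """Return the schema's version, registering it first if it is new."""
        fp = self._fingerprint(schema)
        if fp not in self._by_fp:
            self._versions.append(schema)
            self._by_fp[fp] = len(self._versions)
        return self._by_fp[fp]

registry = SchemaRegistry()
v1 = registry.register({"device_id": "string"})
v2 = registry.register({"device_id": "string", "vibration_hz": "float"})
again = registry.register({"device_id": "string"})  # already known
print(v1, v2, again)  # -> 1 2 1
```

Because registration is automatic, a message arriving with a previously unseen schema simply creates a new version rather than failing validation, which is the behavior the article attributes to Janitor.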

What are the benefits of Janitor?

First of all, it makes sure your schema is always up to date. Janitor recognizes changes in the data schema, so data will never be lost.

Secondly, it offers automatic schema registration. Each new schema is automatically added to the database, so the user can keep track of all schema versions.

Finally, ease of use: you simply run the deployment script and all the necessary resources are created, with Janitor ready for use.

Open-source strikes again!

We decided to open source Janitor to spread the word and to see whether others have encountered the same challenges we have. This way the community can give us feedback and ideas to improve the solution.

People usually don’t associate open-source products with professional developers, but this is not the case with Janitor. At Syntio, both beginners and professional developers participate in the creation and development of Janitor, because we believe in helping people develop their skills by learning from more experienced coworkers. The exchange of knowledge, and the wide spectrum of ideas that comes with it, is an important part of our team dynamics.

Other misconceptions are that open sourcing is less secure and a legal and licensing nightmare. In practice, the project is only as secure as the code that is developed, and an open-source license spells out exactly how the code may be used, protecting the project from misappropriation.

Go Fish

Janitor is developed for the Google Cloud Platform, with new versions coming for Amazon Web Services and Microsoft Azure. Janitor is available on our GitHub right now!

For more information, check out our webpage and follow us on LinkedIn for the latest news on the product.