Marija Sokcevic, Dominik Kos
DATA ENGINEERS
Introduction
We’re thrilled to announce the first stable release of our product, Publisher. Publisher is a powerful and innovative solution designed to help any company with large amounts of data cut the cost and latency of moving data from various systems to any of our supported message brokers, while giving you a constant, real-time flow of ready-to-digest data packages. In this technical blog, we’ll share the details of our new product and explain why we believe it can make a difference for our customers.
What is Publisher?
Publisher is a patent-pending software solution that transforms your data into business objects at the source and delivers them to your downstream systems or analytics platforms in near real time, at a fraction of the cost of comparable software on the market. It lets you quickly create value from data that you own, and it fits seamlessly into an Event-Driven (Event Mesh) or Data Mesh platform architecture.
Publisher is built on cloud-native technology and, as a result, works on all the major cloud providers. In fact, since Publisher runs on container technology, as long as you can support containers, Publisher is just as happy running in your data centre, whether on Kubernetes or as a lightweight standalone Docker deployment.
What can Publisher help you with?
Moving data from on-premise systems to the cloud is a costly, time-consuming process. Expensive CDC (change data capture) solutions are the current go-to for this kind of operation. These solutions, however, don’t offer further data processing, such as the creation of business objects. Most of them are also very slow and come with a certain level of instability. To put Publisher’s speed in perspective, we ran several benchmark tests against the leading CDC solution on the market, and they showed Publisher to be approximately 50 times faster, all while being much more lightweight.
Publisher is the “middleman” between on-premise systems and the cloud, and it also does the required processing for you. It provides a stable, fast, and efficient way to format your data and publish it to the cloud for further consumption. It can process and transport historical data as well as real-time data.
As such, Publisher can be a good starting point for building your data platform, since it provides scalability, security, and global accessibility while enabling business continuity.
If your data is already in the cloud but you need a way to transfer it to a different cloud because the company or the client has opted for a different cloud solution, Publisher has got you covered.
Migrating data from SQL databases or APIs to your NoSQL databases has never been easier. Simply connect Publisher to your source, and it delivers your data to a message broker of your choosing; from there, sync the values into your NoSQL databases. It’s that simple!
How does it work?
Publisher is developed as a combination of decoupled microservices. These microservices represent the following components: Metadata database, InitDB script, Manager, Worker, Data Fetcher, Scheduler, Avro Schema Generator, and Web UI.
The following diagram gives an overview of Publisher’s deployment process and the end result of said process:
Publisher Architecture
These components get packaged as microservices using Docker. These Docker images then get stored in the Docker registry of your choice, e.g. Docker Hub. The deployment scripts use these images to deploy all the components to a Kubernetes cluster, or run them as a standalone Docker deployment.
Overview of the Components
Metadata database
Publisher stores its configurations in a Postgres relational database. This includes the configurations for the data source (mainly connection details for a database of a supported type), the destination (message broker configurations), and the Publisher instance, as well as information about each run of all active Publisher instances.
InitDB script
InitDB is a script that is executed upon Publisher’s deployment. Internally, it’s implemented as a Kubernetes job. It creates the metadata database tables, views, and constants (supported source types, destination types, and admin user credentials for Web UI), which are required for Publisher’s normal execution.
It doesn’t create any source, destination, or instance configurations. These need to be explicitly added by the user.
Manager
The Manager component is a REST server that exposes API endpoints used for configuration management.
It supports CRUD (Create, Read, Update, Delete) operations on the metadata database for the three configuration types: source, destination, and instance. More on setting up the configuration files needed to start a Publisher instance can be found in the Publisher Configuration Files section of Publisher’s official documentation page.
It communicates with Publisher’s Web UI and Java Fetcher. Publisher’s Web UI uses Manager’s API to get statistical data about the running Publisher instances. On startup, the Manager sends connection details for all source databases to Java Fetcher. Java Fetcher stores these connections for future use.
Worker
The Worker component consists of several subcomponents. These subcomponents represent the transformations of data that are done by Publisher. They are logical units in a single microservice.
Worker subcomponents
One Worker component is created for each active Publisher configuration. There can be multiple active workers with different configurations simultaneously. Once a Worker is created, it processes data in a loop until a previously defined stopping point or until stopped by the user.
Data Fetcher
The Data Fetcher subcomponent represents the starting point in the data pipeline and is essentially used for fetching data from a database source or an API. The source database or API is queried and the results of this query are stored, along with the relevant metadata. The query itself is defined during the instance’s creation. The query needs to include a time interval, which is used to internally calculate what data to fetch in a particular Worker run. Users can specify extra parameters to fine-tune the performance of data fetching.
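To make the time-interval idea more concrete, here is a minimal sketch of how a time range could be split into consecutive fetch windows, one per Worker run. The function name `fetch_windows`, the one-hour step, and the query text (table and column names included) are assumptions made for this illustration; they are not Publisher’s actual implementation or configuration syntax.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple


def fetch_windows(
    start: datetime, end: datetime, step: timedelta
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive half-open [lo, hi) windows that cover [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    lo = start
    while lo < end:
        hi = min(lo + step, end)  # the last window may be shorter than step
        yield lo, hi
        lo = hi


# Hypothetical parameterized query; "orders" and "updated_at" are made up.
QUERY = "SELECT * FROM orders WHERE updated_at >= ? AND updated_at < ?"

windows = list(
    fetch_windows(
        datetime(2023, 1, 1, 0, 0),
        datetime(2023, 1, 1, 2, 30),
        timedelta(hours=1),
    )
)

# One (query, parameters) pair per Worker run.
runs = [(QUERY, (lo, hi)) for lo, hi in windows]
```

Half-open windows mean that a row whose timestamp falls exactly on a boundary is fetched once, by the later window, rather than twice.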
Payload Formatter
Payload Formatter formats data fetched from a database into business objects. The business object definition is specified during the instance’s creation and describes the business object’s structure and fields. For more information on how to properly format business objects using Publisher, please view the Usage – Instance section on Publisher’s official documentation page.
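As an illustration of what turning fetched rows into business objects can look like, the sketch below groups flat order-line rows into one nested object per order. The row layout, the field names, and the `to_business_objects` function are hypothetical; in Publisher the structure comes from the business object definition in the instance configuration, not from hand-written code like this.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def to_business_objects(rows: List[Row]) -> List[Dict[str, Any]]:
    """Group flat order-line rows into one nested object per order_id."""
    orders: Dict[Any, Dict[str, Any]] = {}
    for row in rows:
        # Create the parent object the first time an order_id is seen.
        order = orders.setdefault(
            row["order_id"],
            {"order_id": row["order_id"], "customer": row["customer"], "lines": []},
        )
        order["lines"].append(
            {"product": row["product"], "quantity": row["quantity"]}
        )
    # Dicts preserve insertion order, so orders come out in first-seen order.
    return list(orders.values())


rows = [
    {"order_id": 1, "customer": "Ana", "product": "pen", "quantity": 2},
    {"order_id": 1, "customer": "Ana", "product": "ink", "quantity": 1},
    {"order_id": 2, "customer": "Ivo", "product": "pad", "quantity": 5},
]
objects = to_business_objects(rows)
```

Here three rows become two business objects, the first of which carries two order lines.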
Serializer
The Serializer subcomponent is responsible for data serialization. Publisher offers support for two commonly used serialization types: Avro and JSON.
Encrypter
The Encrypter subcomponent uses the AES (Advanced Encryption Standard) algorithm in GCM (Galois/Counter Mode) to encrypt data that will be sent to the message broker. Encryption can also be disabled in the instance configuration, in which case plain-text data is sent further down the pipeline.
Publisher
This is the subcomponent that gave Publisher its name. It takes formatted messages and sends them to a message broker topic. The supported message brokers are GCP PubSub, Azure Service Bus, Apache Kafka, Solace PubSub+, Pulsar, MQTT, Core NATS, and NATS JetStream.
Users should be aware of specific message broker limitations concerning message size and processing speed. Users can provide broker-specific parameters for performance fine-tuning.
In case of an API source, one Publisher run produces exactly one message which contains all the data fetched in that run.
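The Worker subcomponents described above can be pictured as stages chained in a loop. The sketch below is only a mental model: `run_worker` and its callback parameters are hypothetical names, the stages are passed in as plain functions, and the demo uses JSON serialization with encryption disabled. It publishes one message per business object, which models a database source rather than the one-message-per-run API case.

```python
import json
from typing import Any, Callable, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def run_worker(
    batches: Iterable[List[Row]],
    format_fn: Callable[[List[Row]], List[Dict[str, Any]]],
    serialize_fn: Callable[[Dict[str, Any]], bytes],
    publish_fn: Callable[[bytes], None],
    encrypt_fn: Optional[Callable[[bytes], bytes]] = None,
) -> int:
    """Push every fetched batch through the stages; return the message count."""
    published = 0
    for rows in batches:                   # Data Fetcher: one batch per run
        for obj in format_fn(rows):        # Payload Formatter
            payload = serialize_fn(obj)    # Serializer
            if encrypt_fn is not None:     # Encrypter (can be disabled)
                payload = encrypt_fn(payload)
            publish_fn(payload)            # Publisher
            published += 1
    return published


sent: List[bytes] = []  # stands in for a message broker topic
count = run_worker(
    batches=[[{"id": 1}, {"id": 2}], [{"id": 3}]],
    format_fn=lambda rows: rows,  # identity formatter, for the demo only
    serialize_fn=lambda obj: json.dumps(obj, sort_keys=True).encode("utf-8"),
    publish_fn=sent.append,
)
```

Passing the stages in as functions keeps the loop itself independent of the chosen serialization format, encryption setting, and broker.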
Scheduler
The Scheduler component is in charge of managing Workers. It creates one Kubernetes Pod for each active Publisher instance configuration. It destroys these Pods when the Worker’s job is complete or an error state is reached. This makes Publisher a true Kubernetes-native solution: dynamic and scalable.
During its periodic checks of Pod status, the Scheduler ensures that all the Pods that should be active are active. If a Pod crashes for some reason, the Scheduler creates it again if it is supposed to be running.
Publisher can either publish data constantly or publish increments of data on a given cron-based schedule. For more information, please view the Usage tab of Publisher’s official documentation page.
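The periodic check can be thought of as a reconciliation step that compares what should be running with what is running. The sketch below is a simplified model under that assumption: the `reconcile` function and the instance names are hypothetical, and a real Scheduler would get both sets from the metadata database and the Kubernetes API rather than from hard-coded values.

```python
from typing import List, Set, Tuple


def reconcile(desired: Set[str], running: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (to_create, to_delete) so that running would match desired."""
    to_create = sorted(desired - running)  # should be active but is not
    to_delete = sorted(running - desired)  # finished, failed, or deactivated
    return to_create, to_delete


# Hypothetical state: instance-b crashed, instance-c was never started,
# and instance-d is no longer supposed to be running.
desired = {"instance-a", "instance-b", "instance-c"}
running = {"instance-a", "instance-d"}

to_create, to_delete = reconcile(desired, running)
```

Running this comparison on every check means a crashed Pod is simply seen as missing and gets created again.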
Avro Schema Generator
The Avro Schema Generator generates the Avro schema of the messages during the first run of a new Publisher instance that uses Avro serialization.
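To give a feel for what generating an Avro schema involves, the sketch below infers a record schema from one sample message. It is an assumption-laden illustration, not the component’s actual logic: the `avro_type` function is hypothetical, it covers only a small subset of Avro types, it maps every integer to `long` and every float to `double`, and it assumes lists are non-empty and homogeneous.

```python
import json
from typing import Any


def avro_type(value: Any, name: str) -> Any:
    """Map a sample Python value to an Avro type (small subset only)."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, bytes):
        return "bytes"
    if isinstance(value, list):
        if not value:
            raise ValueError(f"cannot infer item type of empty list {name!r}")
        # Assumption: the first element is representative of all items.
        return {"type": "array", "items": avro_type(value[0], f"{name}_item")}
    if isinstance(value, dict):
        # Nested records need their own name, derived here from the field path.
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": key, "type": avro_type(val, f"{name}_{key}")}
                for key, val in value.items()
            ],
        }
    raise TypeError(f"unsupported type for {name!r}: {type(value).__name__}")


sample = {
    "order_id": 1,
    "customer": "Ana",
    "total": 12.5,
    "lines": [{"product": "pen", "quantity": 2}],
}
schema = avro_type(sample, "Order")
schema_json = json.dumps(schema, indent=2)  # Avro schemas are written as JSON
```

Inferring from a single sample cannot detect optional fields, which is one reason a real generator needs more information than this sketch uses.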
Web UI
The Web UI is a visual tool used to monitor the performance of Publisher instances, as well as to create new Publisher instances and update or delete existing ones.
Publisher’s Web UI home page
Publisher’s Web UI is an exposed component, meaning it can be accessed outside the cluster. The communication between Publisher’s Web UI and Manager is secured by using the HTTPS communication protocol.
This is just a brief overview of the UI component. If you wish to learn more about it, please view the Web UI usage section on Publisher’s official documentation page.
Benefits of Using Publisher
- Speed & Simplicity: While developing Publisher, we always kept the user in mind. We wanted to develop a component that is easy to install and easy to operate. As a result, Publisher can be added to your platform architecture with great ease, so you lose as little time as possible.
- Maintenance & Uptime: By decoupling your data from your systems, Publisher makes your organization less dependent on these systems. No more worries about putting too much strain on your database, system downtime, or limited windows for deployment: the data needed is always available and ready to use. This removes the strenuous task of working with legacy systems and lets developers focus their time on the new features they are developing. It also leads to a vast improvement in the time to market for new features.
- Cost of Ownership: Using Publisher can drastically reduce the day-to-day costs of your data migration. Because Publisher transforms and encrypts your data at the source, it can send these data packages over the public internet, essentially removing the need for customer server setups or private network communication. This also means that no additional server space is needed to run Publisher, reducing the initial investment and maintenance costs.
- Enablement: The way Publisher transforms your data and makes it available enables a loosely coupled architecture and prepares your organization for the future. Besides needing less maintenance thanks to its small number of dependencies, Publisher also improves your ability to meet future use cases. Being able to develop applications regardless of legacy system restrictions adds business value. It also creates the opportunity to reevaluate current use cases, improve on them, and increase their longevity.
Don’t Wait – Begin Your Business Transformation Journey Today!
Ready to experience the benefits of Dataphos Publisher for yourself? To learn more about Syntio and how we can help you in your data journey, head over to our website (Homepage). To learn more about Publisher and how to get started, visit Publisher’s documentation page (Publisher).
Both community and enterprise versions of Publisher are at your disposal, with the enterprise edition offering round-the-clock support, access to new feature requests, and valuable assistance for developing use cases to drive your business forward.
We’re excited about the potential of Publisher being able to help companies maintain a continuous flow of easily consumable data packets across their cloud infrastructure, and we hope you are too. Thanks for taking the time to read about our new product, and we look forward to hearing about how Publisher has made a difference in your life!