Labs & musings
Code / 17.02.2022
Nowadays, modern web applications are read-heavy, interactive, and have significant skew. We need to read the same queries multiple times, and because of that we are going to introduce you to Noria; a data-flow system for the storage, query processing, and caching needs of web applications.
Noria is implemented in Rust. Its interface resembles parameterized SQL queries. Its data-flow system pre-computes results for reads and incrementally applies writes.
As opposed to relational databases, Noria has a cache that evicts old infrequently accessed results and writes new ones.
Noria can be compared to popular high-performance key-value store Redis. One of the key motivators for creating Noria as a caching alternative was the fact that, at the time of its conception (2018), Redis did not have multi-threading support (which came with the release of Redis 6). As a result, a key principle built into the very foundation of its design was parallelism, which was enabled through its approach of using a dataflow to in conjunction with materialized views.
We will explain the terms partial state, materialized views, upqueries, and the concept of deltas, which are the main concepts of Noria.
There are two main parts that Noria consists of: base tables and views. Base tables are the equivalent to standard tables in relational databases and views are queries that will be materialized for later use. The concept of materialization will be explained in the following chapter.
Here we have a closer look at how Noria works:
Let’s explain the concept of materialization in the context of Noria. We have already seen that Noria differs from normal relational databases. The primary difference lies in the fact that it doesn’t need to compute the exact same query in isolation again and again, and instead uses a collection of materialized views and the intermediate results that form them, consistently updating them through a dataflow engine (more on this last point later).
Put simply, we can look at materialized views as derived computation results which Noria stores explicitly. These materialized structures are what the user will be querying over.
On this image we can see some of the queries that can be performed in Noria. The views are created here: VoteCount and StoriesWithVC are ones with explicitly given names. VoteCount view is an explicit intermediate result because it is used inside another view, while StoriesWithVC is an external view because it is not contained in any other view. However, under the hood, Noria also stores additional, automatically generated, intermediate results, which are essential for ensuring its caching capability.
However, when we say derived computations, we are not only referring to materialized views but the results of any sub-queries the led to the materialized view’s creation. It is important to point out that these intermediate materializations are also views but their name is assigned automatically to them, and they don’t reflect any named views that we have created in the Noria application. Noria materializes the content of created views and is serving queries directly from these caches.
By doing this, Noria is shifting execution costs from reading to writing which is a less frequent operation.
Now that we have introduced the components of Noria, and explained the concept of materialization, the time has come to present the most crucial part of the Noria application – the dataflow concept.
The one thing we certainly don’t want to happen is inconsistency of data. We don’t want our data to be stale, so we always need to keep our materialized views updated. To achieve this, the creator of Noria introduced the concept of a dataflow into Noria’s core workflow.
First, to understand this concept, we need to understand how Noria looks on the inside. We can imagine it as an acyclic directed graph where base tables are on the top of that graph.
Then we have internal views which are between the base tables and external views. For a better understanding we can say that internal views are those used within other views while external views can be interpreted as leaves in that Noria dataflow tree.
There is one more part of dataflow worth: operators. Inside the graph, we have different kinds of operators through which updated rows are traveling.
Noria supports stateless and stateful operators. Stateful operators have additional indexes to speed up operation and stateless operators avoid re-computation from scratch.
On these images we can see Noria’s dataflow which can be visible in Noria-UI. We created three base tables (stories, votes, users). Here we can see different kind of operators and two external views. A red folder is visible, which is storage for the count operator and is helping Noria to incrementally calculate the new count operator state.
Within the application, we can distinguish two types of actions. The first type are writes or updates, which are done by adding or deleting new records into base tables on the top of the Noria tree. When speaking about updates we come to the concept of positive and negative delta.
Materialized views are based on deltas query which are updates. On a directed acyclic graph, messages that flow over the edges of the dataflow are going to be deltas. A delta is a full row, it has columns just like the rows that are coming out of the operator above, but they also have a sign which is positive or negative. Positive for adding, negative for removing. We can relate Deltas to mathematical multisets.
When an application does write, Noria must insert it in the base table and inject in the dataflow as an update.
An example of Deltas being sent in the dataflow can be seen in Figure 3.:
If a story has 42 votes, and then one vote more comes in, we do not want to execute from scratch each time, we want to have the system which can realize that it can just increment the count of 42 by one rather than recomputing all.
The second action that can be performed is reading. This type of action is performed from the bottom of the tree as we query external views.
We can imagine the Noria dataflow as a graph through which Deltas are flowing.
On this image we can see that Noria is not only creating the materialized views that we are explicitly telling it to create (VoteCount and StoriesWithVC) but also is creating some internal views to speed up the process.
Partial state (‘learning to forget’)
As we already mentioned, by using materialized views, we are caching the data for later use. The problem here is that we don’t have an unlimited amount of memory, and cache eviction is needed. It can happen that Noria decides to evict some data from external views that is used frequently and then we are forced to re-compute these missing states in cache for later use.
This is where the main concept in Noria comes into play, the partial state. A partial state allows materialized views to function like caches.
The partial state is a term which is describing a state with supported missing states inside. Because of eviction, a state that is accessed frequently could potentially be left empty and may need to be recomputed again. This is the part where all the concepts of Noria combine.
First, we will start from external views and have an example to make things clearer. Let’s assume we want to read something from an external view, but that result is missing.
The second step is to look it up in the dataflow and try to find already existing results in the dataflow to update the external view. Nodes in the dataflow with the missing state are generating things called upqueries which are propagated through the graph from the bottom-up. An upquery is a query that goes up to the dataflow graph.
An upquery’s goal is to find a state in the operator that is not missing, and that operator must respond with an upquery response which will then be propagated all the way back to the external leaves. Then, external views will be updated, and we will have our cache ready for incoming queries over the data.
It is Noria’s job to determine which upquery path is most suitable for every materialized view. To determine which upqueries can reconstitute missing states, Noria must trace the view’s parameter back to the upstream state.
In other words, Noria must determine where that parameter comes from.
Once Noria has obtained all the available paths it must decide which path is most suitable for each external view. Noria is using its own heuristics to determine this. It would ideally want to go for points of origin with a smaller amount of data but sometimes ends up choosing bigger ones to ensure there are no misses that would cause slower execution.
On this image we can see that Noria has created paths which it will be using to fill missing states in views. Here we can see that title parameter in external view StoriesWithVC will be obtained directly from base table stories while vcount parameter will be filled from VoteCount internal view.
Partial state and concept of missing states are mostly connected with reading actions while doing write actions.
Disadvantages of Noria
Noria doesn’t support:
- Min/max aggregates
- Correlated subqueries
- Result set offsets
- Non-equality join
- Non-equality query parameters
- Time-windowed queries (now, current time…)
- Top-k queries
Noria’s in-memory storage is unoptimized. Every row in every materialization is allocated in its own vector.
Potential read inconsistency is worth mentioning which may occur while having a view that consists of a table joined with itself, and would be the last disadvantage of Noria.
To help you run Noria locally, we prepared a set of commands you should execute. To fully run Noria, 4 components are required: Noria, Noria UI, Noria MySQL and Zookeeper.
First you need to build the Noria server:
$ cargo build --release --bin noria-server
After a successful build, ensure that ZooKeeper is running:
$ docker run -p 2181:2181 zookeeper
In a new shell run Noria-server:
$ cargo r --release --bin noria-server -- --deployment myapp --no-reuse –address 127.0.0.1 --shards 0 --zookeeper 127.0.0.1:2181 -v
The creators of Noria have created a UI to give a better understanding of the whole dataflow and you can run it by executing following commands from new shell in the noria-ui repository.
Make function is used to generate static HTML files and second command is bash script to start UI on localhost.
The developers of Noria have created a mysql adapter through which queries will be sent to Noria. To run the mysql adapter you should run the following command in noria-myqsl repository.
$ cargo run –release -- --deployment myapp -z 127.0.0.1 :2181
And in a new shell:
$ mysql -h 127.0.0.1 to write sql queries.
Because of the partial state and missing state implemented as a part of it, Noria shifted execution costs from reading data to writing data, making reads much faster. This is main advantage of Noria since it reduces memory usage using eviction on infrequently accessed results. Its usage is appropriate when an application is read-heavy and when the application can tolerate eventual consistency. Caching is better when the data and access distribution are skewed.
Noria can help developers because it is a plug and play application. It is built for general purpose and users don’t need to implement their own caching system for working with a database.
Thanks to Tomislav Domanovac to told us about Noria, and Filip Milić for helpful discussions around finding the best options.
- Jon Ferdinard Ronge Gjengset - Partial State in Dataflow-Based Materialized Views, Februar 2021, MSc, University College London
- Rust at speed — building a fast concurrent database - YouTube
- Thesis: Partial State in Dataflow-Based Materialized Views - YouTube
- mit-pdos/noria: Fast web applications through dynamic, partially-stateful dataflow (github.com)