Materialize is a really interesting solution, and I love for what it stands. But...

benesch · on March 1, 2021

What you're asking about is the magic at the heart of Materialize. We're built atop an open-source incremental compute framework called Differential Dataflow [0] that one of our co-founders has been working on for ten years or so.

The basic insight is that for many computations, when an update arrives, the amount of incremental compute that must be performed is tiny. If you're computing `SELECT count(1) FROM relation`, a new row arriving just increments the count by one. If you're computing a `WHERE` clause, you just need to check whether the update satisfies the predicate or not. Of course, things get more complicated with operators like `JOIN`, and that's where Differential Dataflow's incremental join algorithms really shine.

It's true that there are some computations that are very expensive to maintain incrementally. For example, maintaining an ordered query like

    SELECT * FROM relation ORDER BY col

would be quite expensive, because the arrival of a new value will change the ordering of all values that sort greater than the new value.

Materialize can still be quite a useful tool here, though! You can use Materialize to incrementally-maintain the parts of your queries that are cheap to incrementally maintain, and execute the other parts of your query ad hoc. This is in fact how `ORDER BY` already works in Materialize. A materialized view never maintains ordering, but you can request a sort when you fetch the contents of that view by using an `ORDER BY` clause in your `SELECT` statement. For example:

    CREATE MATERIALIZED VIEW v AS SELECT complicated FROM t1, t2, ... -- incrementally maintained
    SELECT * FROM v ORDER BY col LIMIT 5                              -- order and limit computed ad hoc, but still fast

[0]: https://github.com/TimelyDataflow/differential-dataflow

Nican · on March 1, 2021

Thank you for taking your time to write that up. To illustrate my point a little more:

Suppose a customer has several resources, and each resource has several metrics. From my understanding, Materialized could be used to have an aggregated view of metrics per customer.

The problem is that resources can also be migrated between customers. When a resource migrates between customers, the whole history of the customer changes. This could cause huge updates depended on how many resources are moved, or how many metrics per resource are being collected.

I have a conundrum between doing the "customer-resource join" late, and causing huge CPU cost when running queries. Or making aggregates early, and then having huge Disk cost when migrating resources. At the moment, we just have daily jobs that aggregates the TBs of customer data daily, because there is no way to do the joins in real-time.

Is Materialize designed to be able to handle something like this?

benesch · on March 2, 2021

So, roughly speaking, I think you're describing a view that could be defined in Materialize like so:

    CREATE MATERIALIZED VIEW customer_avg_latency
    SELECT customer_id, metric_id, avg(metric_val)
    FROM metrics
    JOIN resources ON metrics.resource_id = resources.id
    JOIN customers ON resources.customer_id = customer.id
    GROUP BY customer_id, metric_id

There are various ways to slice and dice that query as a user that'll allow you to choose whether to do the aggregate early or late. The details, I think, depend on the specifics of your data model.

> When a resource migrates between customers, the whole history of the customer changes.

I'm a bit confused about this part. In the query I posted above, Differential's incremental join algorithm would spare you from recomputing the entire join. (As written the query would have to re-aggregate the metrics associated with that resource that moved, but that could be solved by manually pushing the avg down.)

If you'd like to drop by our Community Slack (https://materialize.com/s/chat) or GitHub issues (https://github.com/MaterializeInc/materialize/issues) we'd be happy to talk through things in more detail!

glogla · on March 1, 2021

At the moment, it does it by not allowing updates at all.

It also forbids some parts of SQL that could get you update-like functionality, like appending new versions of records to a stream and then running

    select * from (select *, row_number() over (partition by pk order by updated_at desc) from stream) where row_number = 1

or something.

(Boy I wish there was less awkward way to do this.)

EDIT: I'd love to ship bunch of data from few CDCs to Materialize and then run realtime reporting on that but without updates or window functions, Materialize can't do that just yet.

EDIT2: The part about updates isn't true, see below.

jldlaughlin · on March 1, 2021

We (I work at Materialize) actually do support updates! Our CDC sources support them, as well as any source using an UPSERT envelope (more info here: https://materialize.com/docs/sql/create-source/text-kafka/#u...).

As per your second point, I have a less awkward way for you! Materialize supports a top-k idiom (https://materialize.com/docs/sql/idioms/#top-k-by-group) that is hopefully a bit more clear.

glogla · on March 1, 2021

Right, I saw no update in the SQL keyword list, and saw some (I guess old now) talk about Materialize where the speaker mentioned no updates, sorry!

jldlaughlin · on March 1, 2021

No apologies necessary, we're an active work in progress! (And, hopefully, moving quickly!)