Introducing dbt + Materialize (materialize.com)
109 points by jldlaughlin on March 1, 2021 | 25 comments


I have been waiting for this since the moment I first read about Materialize a year or two ago. I think there's still a lot of work to be done, but at heart, if you can pair technology like Materialize with an orchestration system like dbt, you can use dbt to keep your business logic extremely well organized, yet have all of your dependent views up to date all of the time, and even use the same layered analytical views for both analytical AND operational purposes.

The biggest issue I see is that it requires you to be all-in on Materialize, and as a warehouse (or as a database for that matter), it's surely not as mature as Snowflake or Postgres.


Thank you for your kind words! We indeed have plenty of work to do (and are thus hiring)! I'm curious, however, why you think this requires you to be all-in on Materialize. As you said better than I could have, dbt is amazing at keeping your business logic organized. Our intention is very much for dbt to standardize the modeling/business logic layer, which lets you use multiple backends as you see fit in a way that shares the catalog layer cleanly.

Our hope is that you have some BigQuery/Snowflake job that's running up the bill because you hit redeploy 5 times a day, and that you can cleanly port it over to Materialize with little work because the adapter takes care of any small semantic differences in date handling, null handling, etc. So Materialize sits cleanly side by side with Snowflake/BigQuery, and you're choosing whether you want things incrementally maintained with a few seconds of latency by Materialize, or once a day by the batch systems.
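As a rough sketch of how that port might look with the dbt-materialize adapter (the materialization name and model contents below are illustrative assumptions, not taken from the announcement):

    -- models/user_event_counts.sql, a hypothetical dbt model
    -- moving it from Snowflake to Materialize is ideally just a profile change
    -- plus this materialization config
    {{ config(materialized='materializedview') }}

    select user_id, count(*) as event_count
    from {{ ref('events') }}
    group by user_id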

My view is that you're likely going to want to do data science with a batch system (when you're in "learning mode" you try to keep as many things fixed as possible, including not updating the dataset), and then, if the model becomes a critical automated pipeline, rather than rerunning the model every hour and uploading results to a Redis cache or something, you switch it over to Materialize and never have to worry about cache invalidation.


In that situation (dual usage modes) I think I'd rather have the primary data store be Materialize, and just snapshot Materialize views back to your warehouse (or even just to an object store).

Then you could use that static store for exploration/fixed analysis or even initial development of dbt models for the Materialize layer, using the Snowflake or Spark connectors at first. When something's ready for production use, migrate it to your Materialize dbt project.

Given the way dbt currently handles backend switching (and the divergence of SQL dialects with respect to things like date functions and unstructured data), maintaining the batch and streaming layers side by side in dbt would be less wasteful than the current paradigm of completely separate tooling, but still a big source of overhead and synchronization errors.

If the community comes up with a good narrative for CI/CD and data testing in flight with the above, I don't think I'd even hesitate to pull the trigger on a migration. The best part is half of your potential customers already have their business logic in dbt.


I should clarify; I don't think that for the general case you have to go all-in on Materialize, but the case in my comment--where you are effectively using business logic within Materialize views as the "source of truth" of logic across both your analytics and your operations--does require buy-in. Additionally, if I'm _already_ sending all of my data to a database or to my data warehouse, ETLing all of that data to Materialize is also rather burdensome. Just because I technically could run Materialize side by side with something doesn't mean I necessarily want to, especially given that the streaming use case requires a lot more maintenance to get right and keep running in production.

I fully agree with you that for many data science cases, you're likely to stick with batching. Where I see Materialize as most useful, and where I'd be inspired to use it and transform how we do things, is the overlap where the analytics team is writing definitions (e.g., what constitutes an "active user"?), typically on the warehouse, but I want those definitions to be used, up to date, and available everywhere in my stack, including analytics, my operational database, and third-party tools like marketing tools.

Personally, I'm less interested in one-off migrations like you're suggesting. What I really want is to have something like Materialize embedded in my Postgres. (Such a thing should be doable at minimum by running Materialize + Debezium side-by-side with Postgres and then having Postgres interact with Materialize via foreign data wrappers. It would need some fancy tooling to make it simple, but it would work.) In such a scenario, a Postgres + Materialize combo could serve as the "center of the universe" for all the data AND business definitions for the company, and everything else stems from there. Even if we used a big data warehouse in parallel for large ad hoc queries (which I imagine Materialize wouldn't handle well, not being OLAP), I would ETL my data to the warehouse from Materialize--and I'd even be able to include ETLing the data from the materialized views, pre-calculated. If I wanted to send data to third-party tools, I'd use Materialize in conjunction with Hightouch.io to forward the data, including hooking into subscriptions when rows in the materialized views change.
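A rough sketch of the foreign-data-wrapper half of that idea, run inside Postgres (host, port, and object names are hypothetical; this leans on Materialize speaking the Postgres wire protocol, so postgres_fdw can in principle point at it):

    -- all names and hosts below are illustrative
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER materialize
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'mz.internal', port '6875', dbname 'materialize');

    CREATE USER MAPPING FOR app_user
        SERVER materialize
        OPTIONS (user 'materialize');

    -- expose an incrementally maintained Materialize view as a local relation
    CREATE FOREIGN TABLE active_users (user_id bigint, last_seen timestamptz)
        SERVER materialize
        OPTIONS (table_name 'active_users');

    SELECT * FROM active_users;  -- always-up-to-date results, queried from Postgres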

For what I propose, there are some open questions about data persistence, high availability, the speed to materialize an initial view for the first time, and backfilling data, among other things. But I think this is where Materialize has a good chance of fundamentally changing how analytical and operational data are managed, and I think there's a world where data warehouses would go away and you'd just run everything on Postgres + Materialize + S3 (+ Presto or similar for true OLAP queries). I could see myself using Materialize for log management, or log alerting. I'm just as excited to see pieces of it embedded in other pieces of infrastructure as I am to use it as a standalone product.


Thank you very much for the elaboration, I really appreciate the thinking!

> Personally, I'm less interested in one-off migrations like you're suggesting. What I really want is to have something like Materialize embedded in my Postgres.

We're about to launch "Materialize as a Postgres read-replica", where Materialize connects to a Postgres leader just as a Postgres read replica would, using the built-in streaming replication in newer versions of Postgres. It's currently in final testing before being released in the next month or two.

https://github.com/MaterializeInc/materialize/issues/5370
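For context, the Postgres half of that setup is just standard logical replication; the Materialize half below is a guess at what the syntax might look like once the feature ships, so treat it as illustrative only:

    -- Postgres: requires wal_level = logical; publish the tables of interest
    CREATE PUBLICATION mz_source FOR ALL TABLES;

    -- Materialize (hypothetical syntax for the upcoming feature):
    -- CREATE SOURCE pg_repl
    --   FROM POSTGRES CONNECTION 'host=pg.internal dbname=app user=materialize'
    --   PUBLICATION 'mz_source';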

Also on our roadmap for Materialize Cloud are plug-and-play connections with Fivetran, Hightouch, and Census (and more) to bring in the business data and allow you to, as you put it, make Materialize the central collection point for keeping all your views up to date.


Exciting! Does this already have the concept of backfilling, as mentioned by the GP? Interested to know as well.


I agree with you. If you are focused on PostgreSQL, you might find the work being done on IVM interesting: https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc... - https://github.com/sraoss/pgsql-ivm
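For reference, that patch set extends CREATE MATERIALIZED VIEW with an INCREMENTAL keyword; a minimal sketch with made-up table names:

    -- with the IVM patch applied, the view is kept up to date as orders changes
    CREATE INCREMENTAL MATERIALIZED VIEW order_totals AS
    SELECT customer_id, sum(amount) AS total
    FROM orders
    GROUP BY customer_id;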


Relevant related post: A data pipeline is just a materialized view: https://nchammas.com/writing/data-pipeline-materialized-view


Neat! We're totally on the same page--incremental view maintenance not only makes materialized views a useful building block for data pipelines, it can make them much simpler, too!


So it is CTEs all the way down, a bag of dags. A raw table is just a view over the raw data, a cooked view is just a view of views, repeat.

Do you use any of the ideas of Noria? Cadence?

This is great, but at what COST?!


Hmm, I've been thinking a lot about generated/computed columns, as I wanted to use them in pg. They were introduced in pg12, but only as stored (materialized) columns.


Materialize seems really, really interesting. To the point where my crotchety old man senses are telling me that I should be careful about getting too excited about new technology.

Are there any interesting case studies of people pushing the edge of what's possible? For my use case, I have max throughput of 10k updates/second potentially materialized into tens of thousands of different views (with some good partition keys available, if needed).


Nothing we can share publicly at the moment, but if you reach out and chat with us, we're more than happy to give you some numbers that I think will address what you're looking for!


AWS launched a feature last re:Invent for materialized views - https://aws.amazon.com/glue/features/elastic-views/


Is it possible to backfill a materialized view? I'm maybe confused, but does this potentially replace a traditional data lake or is it specifically for streaming applications?


It is! We recently added support for S3 sources [0], which you could use to backfill data and union with a stream.

To your other question, we're currently well-suited for streaming applications. Moving forward, as we add support for features like persistence, we could certainly replace at least parts of a traditional data lake.

[0]: https://materialize.com/docs/sql/create-source/json-s3/
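A rough sketch of that backfill-plus-stream pattern (source names, brokers, and options are illustrative; see the linked docs for the exact S3 source syntax):

    -- historical data, scanned once from S3
    CREATE SOURCE events_backfill
    FROM S3 DISCOVER OBJECTS MATCHING '**/*.json' USING BUCKET SCAN 'my-bucket'
    WITH (region = 'us-east-1')
    FORMAT TEXT;

    -- live data from Kafka
    CREATE SOURCE events_stream
    FROM KAFKA BROKER 'kafka:9092' TOPIC 'events'
    FORMAT TEXT;

    -- one logical relation over both
    CREATE MATERIALIZED VIEW all_events AS
    SELECT text FROM events_backfill
    UNION ALL SELECT text FROM events_stream;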


Materialize is a really interesting solution, and I love what it stands for.

But the documentation is missing details about the architecture. A single update could cause many gigabytes of data to shift in the materialized view, and I don't understand how Materialize would handle that scale.


What you're asking about is the magic at the heart of Materialize. We're built atop an open-source incremental compute framework called Differential Dataflow [0] that one of our co-founders has been working on for ten years or so.

The basic insight is that for many computations, when an update arrives, the amount of incremental compute that must be performed is tiny. If you're computing `SELECT count(1) FROM relation`, a new row arriving just increments the count by one. If you're computing a `WHERE` clause, you just need to check whether the update satisfies the predicate or not. Of course, things get more complicated with operators like `JOIN`, and that's where Differential Dataflow's incremental join algorithms really shine.
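To make the count example concrete, a toy sketch (names are made up, and this assumes Materialize's table support purely for illustration):

    CREATE TABLE relation (id int);
    CREATE MATERIALIZED VIEW total AS SELECT count(*) AS n FROM relation;
    INSERT INTO relation VALUES (1);
    -- the maintained count is bumped by a delta of +1; nothing is rescanned
    SELECT n FROM total;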

It's true that there are some computations that are very expensive to maintain incrementally. For example, maintaining an ordered query like

    SELECT * FROM relation ORDER BY col
would be quite expensive, because the arrival of a new value will change the ordering of all values that sort greater than the new value.

Materialize can still be quite a useful tool here, though! You can use Materialize to incrementally-maintain the parts of your queries that are cheap to incrementally maintain, and execute the other parts of your query ad hoc. This is in fact how `ORDER BY` already works in Materialize. A materialized view never maintains ordering, but you can request a sort when you fetch the contents of that view by using an `ORDER BY` clause in your `SELECT` statement. For example:

    CREATE MATERIALIZED VIEW v AS SELECT complicated FROM t1, t2, ... -- incrementally maintained
    SELECT * FROM v ORDER BY col LIMIT 5                              -- order and limit computed ad hoc, but still fast
[0]: https://github.com/TimelyDataflow/differential-dataflow


Thank you for taking your time to write that up. To illustrate my point a little more:

Suppose a customer has several resources, and each resource has several metrics. From my understanding, Materialize could be used to have an aggregated view of metrics per customer.

The problem is that resources can also be migrated between customers. When a resource migrates between customers, the whole history of the customer changes. This could cause huge updates depending on how many resources are moved, or how many metrics per resource are being collected.

I have a conundrum between doing the "customer-resource join" late and incurring huge CPU cost when running queries, or aggregating early and then incurring huge disk cost when migrating resources. At the moment, we just have jobs that aggregate the TBs of customer data daily, because there is no way to do the joins in real time.

Is Materialize designed to be able to handle something like this?


So, roughly speaking, I think you're describing a view that could be defined in Materialize like so:

    CREATE MATERIALIZED VIEW customer_avg_latency AS
    SELECT customer_id, metric_id, avg(metric_val)
    FROM metrics
    JOIN resources ON metrics.resource_id = resources.id
    JOIN customers ON resources.customer_id = customers.id
    GROUP BY customer_id, metric_id;
There are various ways to slice and dice that query as a user that'll allow you to choose whether to do the aggregate early or late. The details, I think, depend on the specifics of your data model.

> When a resource migrates between customers, the whole history of the customer changes.

I'm a bit confused about this part. In the query I posted above, Differential's incremental join algorithm would spare you from recomputing the entire join. (As written, the query would have to re-aggregate the metrics associated with the resource that moved, but that could be addressed by manually pushing the aggregation down.)
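To sketch what "pushing the aggregation down" could look like here (all names are assumed from the example query above, not a real schema): keep per-resource sums and counts, so an ownership change only moves a handful of pre-aggregated rows.

    -- small per-resource partial aggregates
    CREATE MATERIALIZED VIEW metrics_by_resource AS
    SELECT resource_id, metric_id,
           sum(metric_val) AS sum_val,
           count(*) AS cnt
    FROM metrics
    GROUP BY resource_id, metric_id;

    -- combine the partial aggregates per customer; moving a resource between
    -- customers now touches only its pre-aggregated rows, not every metric row
    CREATE MATERIALIZED VIEW customer_avg_latency AS
    SELECT resources.customer_id, metric_id,
           sum(sum_val) / sum(cnt) AS avg_val
    FROM metrics_by_resource
    JOIN resources ON metrics_by_resource.resource_id = resources.id
    GROUP BY resources.customer_id, metric_id;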

If you'd like to drop by our Community Slack (https://materialize.com/s/chat) or GitHub issues (https://github.com/MaterializeInc/materialize/issues) we'd be happy to talk through things in more detail!


At the moment, Materialize does it by not allowing updates at all.

It also forbids some parts of SQL that could get you update-like functionality, like appending new versions of records to a stream and then running

    select * from (
      select *, row_number() over (partition by pk order by updated_at desc) as rn
      from stream
    ) t
    where rn = 1
or something.

(Boy, I wish there were a less awkward way to do this.)

EDIT: I'd love to ship a bunch of data from a few CDC feeds to Materialize and then run real-time reporting on it, but without updates or window functions, Materialize can't do that just yet.

EDIT2: The part about updates isn't true, see below.


We (I work at Materialize) actually do support updates! Our CDC sources support them, as well as any source using an UPSERT envelope (more info here: https://materialize.com/docs/sql/create-source/text-kafka/#u...).

As for your second point, I have a less awkward way for you! Materialize supports a top-k idiom (https://materialize.com/docs/sql/idioms/#top-k-by-group) that is hopefully a bit clearer.
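For the CDC dedup case above, that idiom looks roughly like this (column names borrowed from the grandparent's query):

    -- latest version of each record, keyed by pk
    SELECT latest.*
    FROM (SELECT DISTINCT pk FROM stream) keys,
         LATERAL (
             SELECT * FROM stream
             WHERE stream.pk = keys.pk
             ORDER BY updated_at DESC
             LIMIT 1
         ) latest;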


Right, I saw no UPDATE in the SQL keyword list, and saw a (now-outdated, I guess) talk about Materialize where the speaker mentioned no updates. Sorry!


No apologies necessary, we're an active work in progress! (And, hopefully, moving quickly!)


If I have a batch file in EDI format, how can I use dbt and/or Materialize?



