Hacker News | kwillets's comments

One of the founders of Antithesis gave a talk about this problem last week; diversity in test cases is definitely an issue they're trying to tackle. The example he gave was Spanner tests failing to fill the cache because random inputs jittered near zero. Avoiding that kind of coverage blind spot appears to be a company goal.

https://github.com/papers-we-love/san-francisco/blob/master/...


Glad you enjoyed the talk! Making Bombadil able to take advantage of the intelligence in the Antithesis platform is definitely a goal, but we wanted to get a great open source tool into people's hands ASAP first.

This site is underweighted on OLAP. Columnstores were invented for precisely this use case; nobody in the field wants to normalize everything.

Which brings me to the question, why a rowstore? Are Z-sets hard to manage otherwise?

Another aspect of wide tables is that they tend to have a lot of dependencies, i.e., different columns come from different aggregations, and the whole table gets held up if one of them is late. IVM seems like a good solution for that problem.


Good questions!

Feldera tries to be row- and column-oriented in the respective parts that matter. E.g. our LSM trees only store the set of columns that are needed, and we need to be able to pick up individual rows from within those columns for the different operators.

I don't think we've converged on the best design yet here though. We're constantly experimenting with different layouts to see what performs best based on customer workloads.


Can you be more specific about what type of data mining you're doing?


He doesn't want to manage the database the way he manages the rest of his infrastructure. All of his bullet points apply to other components as well, but he's absorbed the cost of managing them and assigning responsibilities.

- Crud accumulates in the [infrastructure thingie], and it’s unclear if it can be deleted.

- When there are performance issues, infrastructure (without deep product knowledge) has to debug the [infrastructure thingie] and figure out who to redirect to.

- [infrastructure thingie] users can push bad code that does bad things to the [infrastructure thingie]. These bad things may PagerDuty alert the infrastructure team (since they own the [infrastructure thingie]). It feels bad to wake up one team for another team’s issue. With application owned [infrastructure thingies], the application team is the first responder.


The object storage stuff is new, but it mostly confirms that the older architecture works: MPP with shared (S3) storage, and everything above that layer on local SSD and compute, delivers the best performance. Even Snowflake finally came out with "interactive" warehouses with this architecture.

Parquet, Iceberg, and other open formats seem good, but they may hit a complexity wall. There's already some inconsistency between platforms, e.g., with delete vectors.

Incremental view maintenance interests me as well, and I would like to see it more available on different platforms. It's ironic that people use dbt etc. to test every little edit of their manually coded delta pipelines, but don't look at IVM.
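For the unfamiliar, the core of IVM is a small idea: apply deltas to the materialized result instead of recomputing it from the base table. A hand-rolled sketch for a SELECT key, SUM(val) view (illustrative only, not how any particular IVM engine is built):

```python
# Toy incremental maintenance of a SUM-per-key view:
# feed it (key, delta) changes; negative deltas model deletes.
from collections import defaultdict

view = defaultdict(int)           # materialized: key -> running SUM(val)

def apply_delta(changes):
    """Apply a batch of (key, delta) changes to the view in place."""
    for key, delta in changes:
        view[key] += delta
        if view[key] == 0:
            del view[key]          # the group vanished

apply_delta([("a", 5), ("b", 3)])   # two inserts
apply_delta([("a", 2), ("b", -3)])  # an insert and a delete
# view is now {"a": 7}
```

The contrast with the dbt-style approach is that the delta logic above is derived mechanically from the view definition, rather than hand-coded and re-tested on every edit.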


Over time I've realized that the best abstraction for managing a computer is a computer.


This is a fairly simple idea: index each character by column/offset and compress the resulting bitmaps. Simple is good, since the overhead of more sophisticated approaches (e.g., suffix sorting) is often prohibitive.

One suggestion is to index the end-of-string as a character as well; then you don't need negative offsets. That turns suffix search into a wildcard-style scan where you have to try all starting offsets, but the '%pat%' searches already do that, so it's probably fine.
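A toy sketch of both ideas, with plain Python sets standing in for the compressed bitmaps (hypothetical code, not the article's implementation):

```python
# Per-(offset, char) row-id "bitmaps", plus an end-of-string sentinel
# so suffix searches need no negative offsets.
from collections import defaultdict

rows = ["cat", "car", "dog", "cart"]
index = defaultdict(set)                # (offset, char) -> {row ids}
for rid, s in enumerate(rows):
    for off, ch in enumerate(s):
        index[(off, ch)].add(rid)
    index[(len(s), "$")].add(rid)       # end-of-string sentinel

def prefix_match(pat):
    """'pat%' : AND the bitmaps at fixed offsets."""
    hits = set(range(len(rows)))
    for off, ch in enumerate(pat):
        hits &= index.get((off, ch), set())
    return hits

def contains_match(pat):
    """'%pat%' : try every starting offset and union the results.
    Passing pat + '$' turns this into a suffix search."""
    max_len = max(len(s) for s in rows) + 1   # +1 for the sentinel slot
    out = set()
    for start in range(max_len - len(pat) + 1):
        hits = set(range(len(rows)))
        for i, ch in enumerate(pat):
            hits &= index.get((start + i, ch), set())
        out |= hits
    return out
```

So `contains_match("ar$")` matches only "car", while `contains_match("ar")` also picks up "cart".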


AFAIK the most common design for these kinds of systems is trigram posting lists with position information, i.e., where in the string the trigram occurs. (It's the extra position information that means you don't need to re-check the string itself.) No need for many different bitmaps; you just take an existing GIN-like design, remove deduplication and add some side information.
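A rough sketch of the positional-posting idea, with plain lists standing in for the GIN machinery (patterns assumed to be at least 3 chars; names are illustrative):

```python
# Positional trigram index: trigram -> [(doc id, offset), ...].
# Intersecting postings with consistent relative offsets verifies
# the match without re-reading the string.
from collections import defaultdict

docs = ["incremental", "increment", "cremation"]
index = defaultdict(list)
for did, s in enumerate(docs):
    for off in range(len(s) - 2):
        index[s[off:off + 3]].append((did, off))

def search(pat):
    """Return doc ids containing pat (len(pat) >= 3)."""
    grams = [pat[i:i + 3] for i in range(len(pat) - 2)]
    cands = set(index[grams[0]])          # candidate (doc, start) pairs
    for i, g in enumerate(grams[1:], start=1):
        # the i-th trigram must occur exactly i positions later
        shifted = {(d, off - i) for d, off in index[g]}
        cands &= shifted
    return {d for d, _ in cands}
```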


It's been around for quite a while, but DB people hate to explain where they got an idea. For all I know Vertica got it from somewhere else; I think postgres got jsonb around the same time.


My last dive into matrix computations was years ago, but the need was the same back then. We could sparsify matrices pretty easily, but the infrastructure was lacking. Some things never change.


On the software side I can recommend https://github.com/DrTimothyAldenDavis/GraphBLAS. It is hard to make a sparse linear algebra framework, but Tim Davis has been doing a great job collecting the various optimal algorithms into a single framework that acts more like an algebra than a collection of kernels.
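For a flavor of the algebraic view it's built on (a pure-Python toy, nothing like GraphBLAS's actual API): graph traversal as repeated sparse matrix-vector "multiplication" over a boolean semiring, with a mask for visited vertices.

```python
# BFS as semiring matvec: OR-AND replaces +-*, and the visited set
# acts as the output mask. Adjacency stored sparsely, one row per vertex.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}

def bfs_levels(src):
    """Return {vertex: BFS depth}; one 'matvec' per level."""
    level = {src: 0}
    frontier = {src}
    depth = 0
    while frontier:
        depth += 1
        nxt = set()
        for u in frontier:            # "multiply" the frontier vector by adj
            for v in adj[u]:
                if v not in level:    # mask out already-visited vertices
                    level[v] = depth
                    nxt.add(v)
        frontier = nxt
    return level
```

GraphBLAS generalizes exactly this: swap the semiring and the same skeleton computes shortest paths, reachability, etc.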


I wrote part of that -- it needed a tight summary that covers all the main components, so I packed as much as I could into a sentence or two. I believe the second paragraph is mostly from me, but it was some time ago, back when Hanok revival was fairly new, and a lot of books were coming out on the subject.

A few years after that one of the Korean dailies wrote an English article and copied my wording. :(

