What could be useful here is if Postgres provided a way to determine the latest frozen UUID. This could be a few ms behind the last committed UUID, but it should guarantee that no new rows will land before the frozen UUID. Then we could use a single cursor to track what we have already seen.
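A sketch of what that single-cursor loop could look like, assuming a hypothetical `get_frozen_watermark()` that returns an id below which no in-flight transaction can still commit. Everything here is illustrative, not an existing Postgres API:

```python
def poll_new_rows(fetch_rows, get_frozen_watermark, cursor):
    """Hypothetical polling step over rows keyed by a time-ordered id
    (e.g. UUIDv7). get_frozen_watermark() returns an id below which no
    in-flight transaction can still commit a row, so a single cursor
    never misses a row and never re-reads one."""
    frozen = get_frozen_watermark()
    new_rows = [r for r in fetch_rows() if cursor < r["id"] <= frozen]
    return new_rows, frozen  # the frozen watermark becomes the next cursor

# Toy run: ids 1..5 exist, but only ids <= 4 are "frozen" so far.
rows = [{"id": i} for i in range(1, 6)]
batch, cursor = poll_new_rows(lambda: rows, lambda: 4, cursor=2)
```

Row 5 is deliberately left for the next poll, since a transaction with a smaller id could in principle still commit ahead of it.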
We have been using pex to deploy code to running Docker containers in ECS to avoid the cold-start delay. It cuts the development iteration loop significantly.
The question is whether the concept of data is essential to how we structure computation.
Computation is a physical process and any model we use to build or describe this process is imposed by us. Whether this model should include the concept of data (and their counterparts "functions") is really the question here. While I don't think the data/function concept is essential to modeling computation, I also have a hard time diverging too far from these ideas because that is all I have seen for decades. I believe Kay is challenging us to explore the space of other concepts that can model computation.
IMHO the article is about "data" as in "personal information", but let's indulge in your generalization.
In logic, definitions are either extensional (a list of the objects of the defined type) or intensional (conditions that objects of the type must satisfy). Perhaps you can think of the first representation as data, and the second as a model or program.
But converting between the two representations is not trivial. Going from extensional to intensional is machine learning, and going the other way is a constraint satisfaction problem.
If we could somehow do both efficiently enough, then perhaps we could represent everything intensionally, as generative programs, and get rid of "data". But we don't know how to do this or whether it is possible.
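A toy sketch of the two representations and the two (hard, in general) conversion directions; the names are made up for illustration:

```python
# Extensional: the set is given by listing its members.
evens_ext = {0, 2, 4, 6, 8}

# Intensional: the set is given by a condition, i.e. a small program.
def evens_int(x):
    return x % 2 == 0

# Intensional -> extensional over a finite domain is search
# (constraint satisfaction in general).
enumerated = {x for x in range(10) if evens_int(x)}

# Extensional -> intensional is induction from examples
# (machine learning in general); here just hypothesis checking.
def consistent(hypothesis, examples):
    return all(hypothesis(x) for x in examples)
```

The asymmetry is the point: enumeration is easy over ten integers but intractable in general, and inducing a correct condition from examples alone is the whole field of ML.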
> IMHO the article is about "data" as in "personal information", but let's indulge in your generalization.
That is weird on its own, because I don't see how Kay's quote could be about that in any way. I appreciate the article on its own, but in the context of the quote and the discussion it's taken from, it feels related only because it uses the word "data".
I would suggest the author also look at Amazon Ion:
* it can be used schema-less,
* it allows attaching metadata tags to values (which can serve as type hints[1]), and
* it encodes blobs efficiently.
I have not used it, but in the space of flexible formats it appears to have other interesting properties. For instance, it can encode a symbol table, making symbols very compact in the rest of the message; symbol tables can also be shared out of band.
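A small Ion text sketch (field names are made up) showing the annotation and blob syntax the bullets refer to:

```
// Annotations (account_id::, usd::) act as metadata/type tags;
// blobs are base64 between double braces.
{
  id: account_id::"abc-123",
  amount: usd::10.50,
  payload: {{aGVsbG8gd29ybGQ=}}  // the bytes of "hello world"
}
```

In the binary encoding, repeated symbols like the field names collapse to symbol-table ids, which is where the compactness comes from.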
While Shannon is well known for this paper and "information theory", his master's thesis* is worth checking out as well. It demonstrated an equivalence between electrical circuits and Boolean algebra, and was one of the key ideas that enabled digital computers.
The funny thing is that, at the time, digital logic circuits were made with relays. For most of the 20th century you could hear relays clacking away at street junctions, inside the metal boxes controlling traffic lights.
Then you got bipolar junction transistors (BJTs), and most digital logic, such as ECL and TTL, was based on a different paradigm for a few decades.
Then came the MOS revolution, allowing large-scale integration. MOS logic worked the way relays used to, but Shannon's work was mostly forgotten by then.
> Then you got bipolar junction transistors (BJTs), and most digital logic, such as ECL and TTL, was based on a different paradigm for a few decades.
I think the emphasis is misplaced here. It is true that a single BJT, considered as a three-terminal device, does not operate in the same “gated” way that a relay or a CMOS gate does.
But BJTs were still integrated into chips, or assembled into standard design blocks, that implemented recognizable Boolean operations, and synthesis of the desired logical functions would use tools like Karnaugh maps, which were (as I understand it) outgrowths of Shannon’s approach.
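As a rough illustration of that Shannon-style synthesis step: derive a sum-of-products expression directly from the truth table, one product term per true row. This is deliberately unminimized; shrinking it is exactly the job a Karnaugh map does by hand.

```python
from itertools import product

def sum_of_products(f, n):
    """Build an (unminimized) sum-of-products expression for an
    n-input Boolean function f: one AND term per true truth-table row,
    OR'd together. '.' is AND, '+' is OR, '~' is NOT."""
    terms = []
    for bits in product([0, 1], repeat=n):
        if f(*bits):
            term = ".".join(f"x{i}" if b else f"~x{i}"
                            for i, b in enumerate(bits))
            terms.append(f"({term})")
    return " + ".join(terms) or "0"

# XOR of two inputs
print(sum_of_products(lambda a, b: a ^ b, 2))
# -> (~x0.x1) + (x0.~x1)
```

Each term here maps directly onto a recognizable gate-level block, which is the sense in which the chips "implemented recognizable Boolean operations".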
I don't think traffic light control systems were designed with Boolean logic before Shannon, nor were they described as 'logic' (or, for that matter, 'digital').
> The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning...
This is my candidate for the sickest burn in a mathematical journal paper...
Aaah, that's a good point. Hehe :) But I mean, he does really delve into information in his exposition. Very clear.
Tho I’m intrigued by the idea that there’s more to information than Shannon. Any additional hints??
I’ve often thought about a dual metric to entropy called ‘organization’ that doesn’t measure disorder/surprise but instead measures structure and coherence. But I don’t think it’s exactly a dual.
> Instead, we can check whether any of the writes between the begin timestamp and commit timestamp overlap with our transaction’s read set.
Do you handle the case where the actual objects don't overlap but result of an aggregate query is still affected? For instance a `count(*) where ..` query is affected by an insert.
Yes, the read sets aren't based on lists of objects but rather on the data ranges that a query touches, i.e., they don't suffer from phantom read anomalies.
If you fetch all items from an empty shopping cart, that query will automatically be invalidated if any item is added to the cart, even though the original query returned no documents. Query intersection always works.
The aggregate example isn't a great one in Convex since it doesn't currently support built-in aggregates - a full-table aggregate involves a table scan and therefore the readset would be the entire index. We may add built-in aggregates that are incrementally computed if there's enough demand.
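A minimal sketch of the idea (not Convex's actual implementation; the key encoding is made up): record the ranges a query scanned, and conflict any write that lands inside one of them, even if the scan returned nothing.

```python
def conflicts(read_ranges, write_keys):
    """Range-based read-set check: a transaction is invalidated if any
    concurrent write key falls inside a range its queries scanned."""
    return any(lo <= k <= hi
               for (lo, hi) in read_ranges
               for k in write_keys)

# Scanning an empty cart still records the scanned key range...
empty_cart_scan = [("cart/42/", "cart/42/~")]

# ...so an insert into that range conflicts (no phantom read),
insert_hits = conflicts(empty_cart_scan, ["cart/42/item1"])
# while a write to some other cart does not.
insert_misses = conflicts(empty_cart_scan, ["cart/7/item1"])
```

This is also why a full-table scan makes the read set the whole index: the scanned range is everything, so every write conflicts.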
Global consistency is expensive, both latency-wise and cost-wise. In reality most apps don't need global serializability across all objects. For instance, you probably don't need serializability across different tenants, organizations, workspaces, etc. Spanner provides serializability across all objects IIUC - so you pay for it whether you need it or not.
The other side of something like Spanner is that the quorum-based latency is often optimized by adding another cache on top, which instantly defeats the original consistency guarantees. The consistency of (Spanner + my_cache) is not the same as the consistency of Spanner alone. So if we're back to app-level consistency guarantees anyway, it turns out the "managed" solution is only partial.
Ideally, managed DB systems would offer flexible consistency, letting me configure not just which object sets need consistency but also which caches can tolerate how much lag. That would let me choose trade-offs without having to implement consistent caching and other optimization tricks on top of a globally consistent/serializable database.
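A sketch of the kind of knob I mean (all names hypothetical): each read declares how much lag it can tolerate, and the cache answers only when it can honor that, otherwise falling through to the consistent store.

```python
import time

class StalenessAwareCache:
    """Hypothetical cache wrapper over a consistent store: a cached
    value is served only if it is fresher than the caller's declared
    staleness tolerance; otherwise we fall through to the store."""

    def __init__(self, backing_read):
        self._backing_read = backing_read   # authoritative, consistent read
        self._entries = {}                  # key -> (value, fetched_at)

    def read(self, key, max_staleness_s):
        entry = self._entries.get(key)
        if entry is not None:
            value, fetched_at = entry
            if time.monotonic() - fetched_at <= max_staleness_s:
                return value                # within declared lag tolerance
        value = self._backing_read(key)     # consistent read; repopulate cache
        self._entries[key] = (value, time.monotonic())
        return value
```

The per-read tolerance is the point: a dashboard can accept 30 s of lag on the same cache where a checkout path demands a fresh read, without two separate consistency stacks.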
While it doesn't help much with the costs of replication, Spanner can be configured with read only replicas that don't participate in voting for commits, so they don't impact the quorum latency.
Reads can then be done with different consistency requirements, e.g., bounded staleness (which guarantees the data is no staler than the requested time bound).
It goes into not only the different isolation levels, but also some of the ambiguity in the traditional ACID definitions.
I believe a 2nd edition is imminent.