> IIRC we modified our database to run off ramdisk
If a ramdisk is sufficient for your use case, why would you use a Raft-based distributed networked consensus database like etcd in the first place?
Its whole point is to protect you not only from power failure (like fsync does) but even from entire machine failure.
And the network is usually higher latency than a local SSD anyway.
> 1000 fsyncs/second is a tiny fraction of the real world production load we required
Kubernetes uses etcd for storing the cluster state. Do you update the cluster state more than 1000 times per second? Curious what operation needs that.
kubelet constantly checks in to report "I am alive, and here is the state of affairs" so that kube-apiserver can make informed decisions about whether a Pod needs attention to align with its desired state.
So, I don't know off the top of my head what the formal reconciliation loop is called for that Node->apiserver handshake, but it is a polling operation in that the connection isn't left open all the time, and it is a status-reporting operation. That's how I ended up calling it "status polling." It is a write operation because whichever etcd member is assigned to track the current state of the Events needs to be mutated to record them.
It actually could be that even a smaller cluster could get into this same situation if there were a bazillion tiny Pods (i.e. it's not directly correlated with the physical size of the cluster), but since the limits say one cannot have more than 110 Pods per Node, I'm guessing the Event magnification is easier to see with physically wider clusters.
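As a rough sketch of why steady-state Event churn can dwarf explicit cluster changes: every interval and per-Pod rate below is an illustrative assumption, not a measured value, but the shape of the arithmetic is the point.

```python
# Back-of-envelope estimate of steady-state etcd write load.
# All rates here are illustrative assumptions, not measured values.
nodes = 200
pods_per_node = 110          # the per-Node Pod limit mentioned above
heartbeat_interval_s = 10    # assumed node status/lease report interval
events_per_pod_per_min = 1   # assumed average Event churn per Pod

heartbeat_writes = nodes / heartbeat_interval_s
event_writes = nodes * pods_per_node * events_per_pod_per_min / 60
total = heartbeat_writes + event_writes

print(f"heartbeats: {heartbeat_writes:.0f} writes/s")
print(f"events:     {event_writes:.0f} writes/s")
print(f"total:      {total:.0f} writes/s")
```

Even with these modest assumed rates, Pod-driven Events (nodes × pods) scale an order of magnitude faster than node heartbeats, which is why a "wide" cluster hits the write wall first.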
> And the network is usually higher latency than a local SSD anyway.
Until your writes outrun the disk performance/page cache and your disk I/O latency spikes. Linux used to be really bad at this when memory cgroups were involved, until a couple of years ago.
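The gap is easy to see locally: buffered writes land in the page cache at memory speed, while fsync forces each record to stable storage. A minimal sketch (absolute numbers depend entirely on your disk; the record size and count are arbitrary):

```python
import os
import tempfile
import time

def time_writes(n: int, sync: bool) -> float:
    """Write n small records, optionally fsyncing after each one;
    return elapsed wall-clock seconds."""
    fd, path = tempfile.mkstemp()
    start = time.perf_counter()
    for _ in range(n):
        os.write(fd, b"x" * 512)
        if sync:
            os.fsync(fd)  # force the write to stable storage
    elapsed = time.perf_counter() - start
    os.close(fd)
    os.unlink(path)
    return elapsed

n = 500
buffered = time_writes(n, sync=False)
synced = time_writes(n, sync=True)
print(f"buffered: {n / buffered:,.0f} writes/s")
print(f"fsynced:  {n / synced:,.0f} writes/s")
```

On a typical SSD the fsynced rate is orders of magnitude lower, which is exactly the budget an etcd-style write-ahead log has to live within.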
> If a ramdisk is sufficient for your use case, why would you use a Raft-based distributed networked consensus database like etcd in the first place?
Because at the time Kubernetes required it. If the adapters to other databases existed at the time I would have tested them out.
> Kubernetes uses etcd for storing the cluster state. Do you update the cluster state more than 1000 times per second? Curious what operation needs that.
Steady state in a medium to large cluster exceeds that. At the time I was looking at these etcd issues I was running fleets of 200+ node clusters and hitting a scaling wall around 200-300. These days I use a major Kubernetes service that does not use etcd behind the scenes and my fleets can scale up to 15000 nodes at the extreme end.