Agreed, one of the craziest bugs I had to deal with was we had a distributed sys...

gfv · on Nov 22, 2022

Yeah, network scaling bugs are the most fun. The one I liked the most was when after expanding a pool of servers, they started to lose connectivity for a few minutes and then come back a minute or so later as if nothing happened.

Turns out we accidentally stretched one server VLAN too wide, to roughly 600 devices within one VLAN within one switch. The servers had more-or-less all-to-all traffic, and that was enough to generate so many ARP requests and replies that the switch's supervisor policer started dropping them at random, and after ten failed retries for one server the switch just gave up and dropped it from the ARP table.

Of course the control plane policer is global for the switch, so every device connected to the switch was susceptible, not just the ones in the overextended VLAN.
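
For a rough sense of scale, here is a back-of-the-envelope sketch of how all-to-all traffic inflates ARP volume. It assumes every host resolves every other host and refreshes each entry once per interval. The 60-second interval is a made-up illustrative value, not something measured on that switch, and the result is an upper bound on ARP packets in the VLAN rather than what the supervisor actually has to process.

```python
def arp_pairs(hosts):
    """Directed (resolver, target) pairs when every host talks to every other host."""
    return hosts * (hosts - 1)


def arp_packets_per_second(hosts, refresh_seconds):
    """Rough upper bound: one request plus one reply per pair per refresh interval."""
    return 2 * arp_pairs(hosts) / refresh_seconds


if __name__ == "__main__":
    hosts = 600      # VLAN size mentioned above
    refresh = 60.0   # assumed refresh interval in seconds (hypothetical)
    print(arp_pairs(hosts))
    print(round(arp_packets_per_second(hosts, refresh)))
```

The point is only that the pair count grows quadratically: 600 hosts gives 359,400 directed pairs, so doubling the VLAN roughly quadruples the ARP load.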

dekhn · on Nov 22, 2022

VLANs are a convenience that is the enemy of performance and understandability.

pjc50 · on Nov 22, 2022

They're a great alternative to having to go down the datacentre mines and replug a few thousand cables, though.

Varriount · on Nov 22, 2022

Goodness, what kind of process/tools did you use to track that problem down?

katekarin · on Nov 22, 2022

My team had a similar issue with the ARP cache on AWS when we used Amazon Linux as the OS for cluster nodes and Debian for the database host. When new tasks started, some had random timeouts when connecting to the database.

It turned out that the Debian database host had bad ARP entries (an IP address was pointing to a nonexistent MAC address) caused by frequent reuse of the same IP addresses.

Debian has a larger default ARP cache size than Amazon Linux (I think it's entirely disabled on AL?).

As for the tooling we used to track it down, it was tcpdump. We saw SYNs being sent but no ACKs coming back. A few more tcpdump flags (-e shows the hardware addresses) and we discovered mismatched MAC addresses.
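
The same kind of mismatch can also be checked from the cache side. Below is a minimal sketch, assuming a Linux host where `/proc/net/arp` is readable and assuming you already have a trusted IP-to-MAC mapping from somewhere else. The sample table, the addresses, and the `EXPECTED` mapping are all hypothetical, and this is not what we actually ran.

```python
def parse_proc_net_arp(text):
    """Return {ip: mac} from the contents of /proc/net/arp (header line skipped)."""
    entries = {}
    for line in text.splitlines()[1:]:
        fields = line.split()
        # Columns: IP address, HW type, Flags, HW address, Mask, Device
        if len(fields) >= 6:
            entries[fields[0]] = fields[3].lower()
    return entries


def mismatched(cached, expected):
    """Return {ip: (cached_mac, expected_mac)} for every IP where the two differ."""
    return {
        ip: (cached[ip], mac.lower())
        for ip, mac in expected.items()
        if ip in cached and cached[ip] != mac.lower()
    }


# Hypothetical sample data: these addresses are made up for illustration.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
10.0.1.15        0x1         0x2         0a:00:00:00:00:01     *        eth0
10.0.1.16        0x1         0x2         0a:00:00:00:00:02     *        eth0
"""
EXPECTED = {"10.0.1.15": "0a:00:00:00:00:01", "10.0.1.16": "0a:00:00:00:00:99"}

if __name__ == "__main__":
    print(mismatched(parse_proc_net_arp(SAMPLE), EXPECTED))
```

On a real host you would pass in the text read from `/proc/net/arp` instead of `SAMPLE`. The hard part is knowing the right MAC for each IP, which is why watching the wire with tcpdump was the quicker route for us.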

lcw · on Nov 22, 2022

We didn't have much tooling beyond the typical things you would find in a Linux distro. It started with trying to isolate a node having issues, then looking at application and kernel logs, then testing the connection to the other node via ping or by telnetting to a port I knew should be open. Found out I couldn't route, and from there it was just process of elimination until we worked our way to a full ARP cache. We tested that increasing the ARP cache size fixed the issue, then figured out why the kernel wasn't releasing entries correctly by reading the source code for the release we were using. I'm simplifying some of the discovery, but there was no magic, unfortunately.
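
For anyone who wants to check for a full ARP cache directly, here is a minimal sketch, assuming a Linux host. It compares the number of rows in `/proc/net/arp` against the `gc_thresh1`/`gc_thresh2`/`gc_thresh3` sysctls under `/proc/sys/net/ipv4/neigh/default`. The function names and status labels are my own, and the row count is only an approximation of what the kernel counts against those limits.

```python
from pathlib import Path

NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
ARP_TABLE = Path("/proc/net/arp")


def read_thresholds(base=NEIGH_DIR):
    """Read the three neighbour-table GC thresholds as integers."""
    names = ("gc_thresh1", "gc_thresh2", "gc_thresh3")
    return {name: int((base / name).read_text().strip()) for name in names}


def count_entries(proc_net_arp_text):
    """Count data rows in /proc/net/arp contents (first non-empty line is the header)."""
    lines = [line for line in proc_net_arp_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)


def cache_pressure(entries, thresholds):
    """Classify how close the cache is to its limits (labels are illustrative)."""
    if entries >= thresholds["gc_thresh3"]:
        return "at hard limit"    # new entries are likely to be refused
    if entries >= thresholds["gc_thresh2"]:
        return "over soft limit"  # garbage collector gets aggressive
    if entries >= thresholds["gc_thresh1"]:
        return "gc eligible"      # garbage collector may start running
    return "ok"


if __name__ == "__main__" and NEIGH_DIR.exists() and ARP_TABLE.exists():
    thresholds = read_thresholds()
    entries = count_entries(ARP_TABLE.read_text())
    print(entries, thresholds, cache_pressure(entries, thresholds))
```

If it reports the hard limit, raising `net.ipv4.neigh.default.gc_thresh3` with sysctl is the usual way to test whether cache size is the culprit, which is roughly the experiment described above.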