This mentions Jupiter generations; Jupiter itself is, I think, about 10-15 years old at this point. It doesn't really talk about what existed before, so it's not really 25 years of history here. I want to say "Watchtower" came before Jupiter? But honestly it's been about a decade since I read anything about it.
Google's DC networking is interesting because of how deeply integrated it is into the entire software stack. Click on some of the links and you'll see it mentions SDN (Software-Defined Networking). This is so Borg instances can talk to each other within the same service at high throughput and low latency. 8-10 years ago these were (IIRC) 40Gbps connections. They're probably 100Gbps now, but that's just a guess.
But the networking is also integrated into global services like traffic management to handle, say, DDoS attacks.
Anyway, from reading this it doesn't sound like Google is abandoning their custom TPU silicon (i.e. it talks about the upcoming A3 Ultra and Trillium). So where does Nvidia ConnectX fit in? AFAICT that's just the NIC they're plugging into Jupiter. That's probably what enables (or will enable) 100Gbps connections between servers. Yes, 100GbE optical NICs have existed for a long time. I would assume that Nvidia produces better ones in terms of price, performance, size, power usage and/or heat produced.
Disclaimer: Xoogler. I didn't work in networking though.
For the past few years there has been a weird situation where Google and AWS have had worse GPUs than smaller providers like CoreWeave and Lambda Labs. This is because they didn't want to buy into Nvidia's proprietary InfiniBand stack for GPU-to-GPU networking, and instead wanted to make it work on top of their own (but still pretty proprietary) Ethernet stack.
The outcome was really bad GPU-to-GPU latency & bandwidth between machines. My understanding is that ConnectX is Nvidia's supported (and probably still very profitable) way for these hyperscalers to use their own proprietary networks without buying InfiniBand switches and without paying the latency cost of moving bytes from the GPU to the CPU.
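To make the "not moving bytes through the CPU" part concrete, here's a rough sketch (my own, not anything Google- or article-specific) of what GPUDirect RDMA looks like from the application side: you register a CUDA buffer directly with the RDMA NIC via verbs, and if the peer-memory plumbing (e.g. nvidia-peermem) is in place, the NIC can DMA straight to/from GPU memory:

    // Rough sketch, not from the article: register GPU memory directly with an
    // RDMA NIC (GPUDirect RDMA). Assumes libibverbs + CUDA on the build host.
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <infiniband/verbs.h>

    int main() {
        int n = 0;
        ibv_device **devs = ibv_get_device_list(&n);
        if (!devs || n == 0) { std::fprintf(stderr, "no RDMA devices\n"); return 1; }
        ibv_context *ctx = ibv_open_device(devs[0]);
        ibv_pd *pd = ibv_alloc_pd(ctx);

        void *gpu_buf = nullptr;
        size_t len = 64 << 20;                 // 64 MiB of GPU memory
        cudaMalloc(&gpu_buf, len);

        // If GPUDirect RDMA is available, this registration lets the NIC DMA
        // to/from the GPU buffer directly; otherwise it fails and the data
        // would have to bounce through a host (CPU) buffer instead.
        ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                IBV_ACCESS_LOCAL_WRITE |
                                IBV_ACCESS_REMOTE_READ |
                                IBV_ACCESS_REMOTE_WRITE);
        std::printf("GPU buffer registration %s\n", mr ? "succeeded" : "failed");

        if (mr) ibv_dereg_mr(mr);
        cudaFree(gpu_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }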
Your understanding is correct. The other part of the issue is that at one point there was a huge shortage of IB switches... lead times of 1+ years... another solution had to be found.
RoCE is IB over Ethernet. All the underlying documentation and settings to put this stuff together are the same. It doesn't require ConnectX NICs though. We do the same with 8x Broadcom Thor 2 NICs (into a Broadcom Tomahawk 5-based Dell Z9864F switch) for our own 400G cluster.
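For what it's worth, the vendor neutrality is visible right at the verbs API level. A minimal sketch (mine, not from the parent): enumerate the RDMA devices and check each port's link layer; a RoCE setup just reports Ethernet instead of InfiniBand, and the code is identical whether the NIC is a ConnectX or a Thor 2:

    // Minimal sketch: the same libibverbs calls work for ConnectX, Broadcom
    // Thor, etc.; RoCE just means the port's link layer is Ethernet.
    #include <cstdio>
    #include <infiniband/verbs.h>

    int main() {
        int n = 0;
        ibv_device **devs = ibv_get_device_list(&n);
        for (int i = 0; i < n; i++) {
            ibv_context *ctx = ibv_open_device(devs[i]);
            if (!ctx) continue;
            ibv_port_attr port;
            if (ibv_query_port(ctx, 1, &port) == 0) {
                std::printf("%-16s %s\n", ibv_get_device_name(devs[i]),
                            port.link_layer == IBV_LINK_LAYER_ETHERNET
                                ? "Ethernet link layer (RoCE)"
                                : "InfiniBand link layer");
            }
            ibv_close_device(ctx);
        }
        if (devs) ibv_free_device_list(devs);
        return 0;
    }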
Nvidia got ConnectX from their Mellanox acquisition -- they were experts in RDMA, particularly with InfiniBand but eventually pushing Ethernet (RoCE). These NICs have hardware acceleration of RDMA. Over the RDMA fabric, GPUs can communicate with each other without much CPU usage (the "GPU-to-GPU" mentioned in the article).
[I know nothing about Jupiter, and little about RDMA in practice, but used ConnectX for VMA, its hardware-accelerated, kernel-bypass tech.]
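For context on what "GPU-to-GPU over the RDMA fabric" usually looks like from application code, here's a generic NCCL sketch (nothing Google-specific; Google's own collective library and transport selection are their own business). The app issues a collective, and NCCL picks NVLink, PCIe, or the RDMA NIC for the inter-node hops, without staging data through host memory when GPUDirect RDMA is available:

    // Generic sketch (not Google's stack): single-process all-reduce across the
    // local GPUs with NCCL. In a multi-node job the same ncclAllReduce call
    // runs over the RDMA fabric (IB or RoCE) via the NIC. Error checks omitted.
    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        if (ndev == 0) { std::fprintf(stderr, "no GPUs\n"); return 1; }

        const size_t count = 1 << 20;          // 1M floats per GPU
        std::vector<ncclComm_t> comms(ndev);
        std::vector<float*> sendbuf(ndev), recvbuf(ndev);
        std::vector<cudaStream_t> streams(ndev);

        for (int i = 0; i < ndev; i++) {
            cudaSetDevice(i);
            cudaMalloc(&sendbuf[i], count * sizeof(float));
            cudaMalloc(&recvbuf[i], count * sizeof(float));
            cudaStreamCreate(&streams[i]);
        }

        // One communicator per local GPU; a multi-node job would use
        // ncclCommInitRank with a ncclUniqueId shared out-of-band (e.g. MPI).
        ncclCommInitAll(comms.data(), ndev, nullptr);

        // The collective itself: NCCL chooses the transport (NVLink, PCIe, or
        // the RDMA NIC for inter-node hops) without bouncing data through the
        // CPU when GPUDirect RDMA is available.
        ncclGroupStart();
        for (int i = 0; i < ndev; i++)
            ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                          comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; i++) {
            cudaSetDevice(i);
            cudaStreamSynchronize(streams[i]);
            ncclCommDestroy(comms[i]);
            cudaFree(sendbuf[i]);
            cudaFree(recvbuf[i]);
        }
        std::printf("all-reduce across %d GPUs done\n", ndev);
        return 0;
    }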
I would guess the Nvidia ConnectX is part of a secondary networking plane, not plugged into Jupiter. Current-gen Google NICs are custom hardware with a _lot_ of Google-specific functionality, such as running the borglet on the NIC to free up all CPU cores for guests.