The last-hop switches facing the servers are layer 3 in the direction of the aggregation switches and layer 2 in the direction of the servers, so ARP is handled normally and DHCP can be handled with standard DHCP-helper mechanisms. This lets the host networking stack stay vanilla (if you want it to).
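As a sketch, that last-hop boundary might look something like the following in IOS-style config. The interface names, VLAN number, and addresses are all made up for illustration:

```
! Layer 2 toward the servers: access port in the rack VLAN
interface Ethernet1/1
 switchport
 switchport access vlan 100
!
! Layer 3 toward the aggregation switches: routed uplink
interface Ethernet1/48
 no switchport
 ip address 10.255.0.1 255.255.255.254
!
! The SVI terminates the rack subnet: ARP is answered here, and
! DHCP broadcasts are relayed to a central server
interface Vlan100
 ip address 10.1.100.1 255.255.255.0
 ip helper-address 10.0.0.53
```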
One scaling problem with layer 2 networks is that you have to go to layer 3 at some point, so you end up with aggregation node(s) that have to handle a considerable amount of ARP request/reply traffic. That request/reply traffic is processed by the node's CPU. This can run the CPU hotter than you really want, and if the aggregation nodes are control-plane policing ARP (most switch vendors do), they will intermittently drop ARP requests.
Once the request/reply is processed, the node pushes the ARP entry into hardware. ARP hardware tables are a limited resource as well; if you exceed them, traffic will either be black-holed or forwarded by the node's CPU. Both are undesirable.
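Some rough arithmetic on why this hurts the aggregation node. Every number below is an assumption for illustration, not a vendor spec:

```python
# Rough arithmetic on ARP load at an aggregation node in one big L2
# domain. All numbers are assumptions for illustration only.

HOSTS_PER_RACK = 40
RACKS = 500                    # one L2 domain behind the agg pair
ARP_TABLE_CAPACITY = 16_000    # assumed hardware ARP/adjacency entries
COPP_ARP_PPS = 500             # assumed control-plane policer rate

hosts = HOSTS_PER_RACK * RACKS
print("hosts in the L2 domain:", hosts)
print("ARP table overflow:", hosts > ARP_TABLE_CAPACITY)

# If something (a scan, a failover) touches every host at once, the
# policer serializes resolution and drops the excess:
burst_drain_s = hosts / COPP_ARP_PPS
print("seconds to re-ARP the whole domain at the policer rate:",
      burst_drain_s)
```

With these assumed numbers the domain overflows the hardware table outright, and a burst that touches every host sits behind the policer for tens of seconds, which is where the intermittent ARP drops come from.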
Additionally, it becomes difficult to scale the forwarding capacity of layer 2 networks based on spanning tree (or any loop-free scheme) beyond two aggregation nodes, since you will either block ports or create loops.
To run BGP, you would not have to run /32s everywhere. You could have multiple tiers of aggregation that keep your routing tables manageable. It would look something like this.
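For example (a hedged sketch with made-up ASNs and prefixes, not a recommendation): each ToR originates only its rack supernet, and each aggregation tier summarizes the tier below it, so the core never sees per-rack routes:

```
! ToR (AS 65101): originate only the rack supernet upward
router bgp 65101
 network 10.1.4.0 mask 255.255.252.0
 neighbor 10.255.0.0 remote-as 65001
!
! Pod aggregation (AS 65001): summarize the pod's racks toward the
! core, suppressing the more-specific per-rack routes
router bgp 65001
 aggregate-address 10.1.0.0 255.255.192.0 summary-only
 neighbor 10.254.0.0 remote-as 65000
```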
If you do need some specifics to be advertised, scale in the case above would depend on the number of specifics you want advertised out of the rack, so it would look something like this.
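A back-of-the-envelope count of how leaked specifics dominate the table. The topology sizes and specific counts below are assumptions for illustration:

```python
# Route counts for the tiered design above, with and without leaked
# specifics. Topology sizes and counts are assumptions only.

RACKS_PER_POD = 48
PODS = 16
SPECIFICS_PER_RACK = 4   # assumed, e.g. a few service VIPs leaked up

# Pure aggregation: the core only ever sees one aggregate per pod.
core_routes_aggregated = PODS

# Each specific leaked out of a rack shows up at every tier above it,
# so the core table grows with the total number of leaked specifics.
core_routes_with_specifics = (PODS
                              + RACKS_PER_POD * PODS * SPECIFICS_PER_RACK)

print(core_routes_aggregated, core_routes_with_specifics)
```

Even a handful of specifics per rack moves the core table from tens of routes to thousands, which is why the number of specifics is the scaling knob.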
On the note about label switching versus IP forwarding, I would not think of it as either/or; both could be used, since you eventually need to serve pages to users.
Yeah, I was specifically thinking of the prefixes in the ToR and agg layer. I read it as having all the VIPs in the rack as next-hop /32s. Add a bunch of intra-rack ECMP groups and it's not insignificant. I'm not sure I buy the feasibility of a /27 per rack or a simple /17 supernet at the agg layer long term. Rack address counts will not be uniform; I'd expect everything from /29s to /23s depending on host types. Assuming uniform allocation at the pod/DC level also seems like a stretch. Renumbering and super/subnetting is going to fragment the IP locality over time. ECMPing through the core eats yet more prefixes and (I'd guess) forces hot ones to be deagg'd.
All that said, you do get a couple thousand prefixes on your $10,000 Trident box. And I'm obviously not an NE. So maybe it would work inside a small cluster.
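The back-of-the-envelope above can be sketched out. The rack and VIP counts are assumptions; the "couple thousand" LPM figure is the Trident capacity mentioned in the thread:

```python
# Rough FIB pressure from reading the design as next-hop /32s per VIP.
# Rack, VIP, and ECMP counts are assumptions; the LPM capacity is the
# "couple thousand prefixes" Trident figure from the thread.

RACKS = 64
VIPS_PER_RACK = 20           # assumed next-hop /32s per rack
ECMP_GROUPS_PER_RACK = 8     # assumed intra-rack ECMP groups
LPM_CAPACITY = 2_000         # "couple thousand prefixes"

vip_routes = RACKS * VIPS_PER_RACK
ecmp_groups = RACKS * ECMP_GROUPS_PER_RACK
print("VIP /32 routes:", vip_routes)
print("ECMP groups:", ecmp_groups)
print("fits in LPM:", vip_routes < LPM_CAPACITY)
```

With these assumed numbers a small cluster fits, but well over half the table is already consumed by VIP /32s alone, before ECMP groups and deaggregated hot prefixes, which matches the "maybe inside a small cluster" conclusion.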