I thought about that, and just thought it read better the other way, but also I'm the oldest 40-something on Hacker News so "class C" and "class B" might just be more vivid for me than for normal people.
If there is something with a socket of any sort that you can't do on Fly, I'd like to know about it, so I can make Kurt write some BPF code for a change.
In the spirit of nitpicking, class A, B, and C referred to specific address blocks, not network sizes. A /24 in the class A range was still class A rather than class C.
I worked on network/firewall at a rather large bank from 2009-2019 and we used the class labels and "slash 24s" etc. interchangeably when talking. Not that you're incorrect, but that the slang was used and everyone knew what people meant by it.
If I were to run a container that handled tcp connections, is there any sort of max connection duration?
One of the projects I have worked on involved accepting tcp connections from cellular connected race timing boxes, dealing with their protocol and sending the resulting data into another system.
I guess you could think along the lines of syslog or MQTT.
This isn't very bandwidth or CPU intensive, but does require keeping the sockets open for hours at a time.
This has come up before! If you like, you can hit me up at thomas@fly.io and I can take a whack at telling you how well this'll work out-of-the-box. What I can say is that "ordinary" TCP connections at Fly run through our Anycast proxy infrastructure, which does expect that it can bounce connections whenever it needs to --- but we can route around that for long-term connections, which is a thing that people doing, for instance, WebRTC want.
From my understanding, WireGuard makes it so that all traffic between two physical nodes shares the same 5-tuple. In addition, neither ECN nor DSCP bits are copied into or out of the encapsulated packet. Do you ever run into trouble with ensuring your flows are well balanced across ECMP?
> ...when Fly.io launched, it had a pretty simple use case: taking conventional web applications and speeding them up, by converting them into Firecracker MicroVMs that we can run close to users on a network of servers around the world.
I'm a little confused as to why you don't route a GUA /128 to each container (out of a per-customer GUA /64, say). If this is from an address range you own, you can filter and drop packets at ingress/egress from the customer address space in case traffic is somehow improperly routed (not via the WG tunnel). As containers come on- and offline, you can distribute this state through the system, and if a user reaches an IP that you don't know to be assigned, you can consult an oracle as to its location.
That was the first design I considered. The "distribute the state through the system" thing is the hard part.
(There's no possibility in our system of routing customer traffic outside of WireGuard, because our entire connectivity fabric is WireGuard; nothing but WireGuard, for instance, would know what to do with an fdaa address, or, for that matter, how to reach an instance's IPv4 addresses.)
We can filter and drop packets at ingress/egress with this design as well.
What's the advantage you see to having assignments out of per-customer /64s?
This sounds quite a bit more complicated than what we're doing, so the answer is "we angrily avoid complexity".
One interesting problem we have is that apps span regions, so the less information we have to propagate when VMs come up, the better. The 6PN routing is already established. When a VM boots it's available immediately because there's nothing to propagate.
I love the writing style here. It seems to have the passion and perspective of a founder but is super technical. I’m curious how writing like this gets executed at a company like this. What’s the workflow?
This is from a super technical founder, so we're lucky! The process isn't much of a process. Extracting this type of content from other technical folks is a nut we haven't cracked yet.
The workflow is literally: tptacek writes a post, we fix inevitable typos, and then we merge it.
I guess there's no reason you can't have defense in depth, but I'm wondering if this private virtual IPv6 network for your whole organization might end up being considered "secure enough" by the complacent? Like being behind the firewall in the old days.
Careful; "organization" here is just a Fly.io account term. You can have multiple organizations, and segment however you'd like. What we're doing is analogous to AWS VPCs, just with IPv6 and static routing. (Probably, we should revisit the terminology).
> Our WireGuard mesh sees IPv6 addresses that look like fdaa:host:host::/48 but the rest of our system sees fdaa:net:net::/48.
Is it not feasible to make wireguard more flexible with a bit mask? The annoying thing with having to use 1:1 NAT in a packet pipeline (which is what you’ve done here) is that logs, metrics, etc get all screwed up for correlation because wireguard doesn’t see the same things everyone else does.
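The rewrite being discussed — swapping one /48 for another while leaving the low bits alone — is a pure bit operation. A minimal sketch in Python with the stdlib `ipaddress` module (the function name and addresses are illustrative, not Fly's actual code):

```python
import ipaddress

def rewrite_prefix(addr: str, new_prefix: str) -> str:
    """Swap the top 48 bits of an address for another /48,
    keeping the low 80 bits: the 1:1 NAT described above."""
    old = int(ipaddress.IPv6Address(addr))
    new = int(ipaddress.IPv6Network(new_prefix).network_address)
    low_mask = (1 << 80) - 1  # everything below the /48 boundary
    return str(ipaddress.IPv6Address((new & ~low_mask) | (old & low_mask)))

# A host-indexed address as WireGuard sees it, rewritten to the
# network-indexed /48 the rest of the system expects:
print(rewrite_prefix("fdaa:0:2::3", "fdaa:0:18::/48"))  # fdaa:0:18::3
```

This is why logs and metrics disagree across the boundary: the 5-tuple WireGuard observes has different top bits than the one everything downstream of the rewrite observes.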
I guess putting wireguard in the kernel wasn’t such a hot idea after all?
It's totally feasible; I could probably write the PR for that in a day, and I'm not a kernel dev. But it's an oddball feature, and one of the big reasons WireGuard works in the kernel is that it's deliberately tiny.
Think about how big the BPF program is that implements the switcheroo we do to get around this. It's not, like, a big ask.
> To lock an instance into a 6PN network, all we really need is a trivial BPF program that enforces the “don’t cross the streams” rule: you can’t send packets between different 6PN prefixes.
Are you using eBPF for this egress filtering? My (hopefully mis)understanding is that XDP programmes, the higher-performance cousin of eBPF, can only work on ingress packets currently.
> The unlucky bit is WireGuard. XDP doesn't really work on WireGuard; it only pretends to (with the "xdpgeneric" interface that runs in the TCP/IP stack, after socket buffers are allocated). Among the problems: WireGuard doesn't have link-layer headers, and XDP wants it to; the discrepancy jams up the socket code if you try to pass a packet with XDP_PASS. We janked our way around this with XDP_REDIRECT, and Jason Donenfeld even wrote a patch, but the XDP developers were not enthused, just about the concept of XDP running on WireGuard at all, and so we ended up implementing the worker side of this in TC BPF.
If only fearsome-bagel is supposed to access serf, can serf do a reverse lookup on the source IP and then make sure it matches regexp /^fearsome-bagel-\d+\.internal$/?
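A sketch of that check in Python, with the classic caveat that a reverse lookup alone is spoofable by whoever controls the reverse zone for the source address; you also want to forward-confirm that the returned name resolves back to the IP (FCrDNS). Names and the helper are illustrative:

```python
import re
import socket

ALLOWED = re.compile(r"^fearsome-bagel-\d+\.internal$")

def peer_allowed(ip: str) -> bool:
    """Reverse-resolve the source IP, check the name against the
    allowed pattern, then forward-confirm the name maps back to ip."""
    try:
        name, _, _ = socket.gethostbyaddr(ip)
    except OSError:
        return False
    if not ALLOWED.match(name):
        return False
    try:
        addrs = {info[4][0] for info in socket.getaddrinfo(name, None)}
    except OSError:
        return False
    return ip in addrs
```

Note the anchors in the regexp matter: without `$`, `fearsome-bagel-1.internal.attacker.example` would pass.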
Fascinating — I have considered something similar. How are you dealing with the difficulties for core load balancing and ECMP that WG's information hiding creates?
Sorry: The above was meant as a reply to this question [0].
> I love the writing style here. It seems to have the passion and perspective of a founder but is super technical. I’m curious how writing like this gets executed at a company like this. What’s the workflow?
Well, 6PNs are effectively VPCs, and within-org ACLs could be analogous to security groups - not every app within an org needs to be able to talk to every other app, but there are no clear "islands" of communication within the graph, so using separate orgs for distinct communities of apps isn't a tenable strategy. Segmenting the bottom bits into app ID / app instance ID and then doing more shenanigans at the WG routing layer would provide you with some of this functionality. This is something I am thinking about for my employer (Netflix), which does not offer a public PaaS, but is also trying to scale and secure a big flat network.
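The encoding being suggested is simple to sketch: pack an app ID and instance ID into the low bits under the org's ULA prefix, so an ACL check at the routing layer can recover the app from the address alone. A hypothetical layout (24 bits each, under a made-up org /48 — not Fly's actual scheme):

```python
import ipaddress

ORG_NET = ipaddress.IPv6Network("fdaa:0:18::/48")  # hypothetical org prefix

def instance_addr(app_id: int, instance_id: int) -> ipaddress.IPv6Address:
    """Encode app and instance IDs into the low 48 bits of the org
    prefix (24 bits each -- an illustrative layout)."""
    assert app_id < 2**24 and instance_id < 2**24
    low = (app_id << 24) | instance_id
    return ipaddress.IPv6Address(int(ORG_NET.network_address) | low)

def app_of(addr: ipaddress.IPv6Address) -> int:
    """Recover the app ID, e.g. for a security-group-style ACL check."""
    return (int(addr) >> 24) & (2**24 - 1)
```

With this, "may app 7 talk to app 9?" becomes a lookup keyed on two integers extracted from the packet header, with no state to propagate when instances come and go.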
I do believe that the dual goals of network identity and scalability are difficult to solve for simultaneously (in the old world we relied on subnets, but those are the opposite of multi-tenant container hosts), so I think we want simple physical networks with routing, and then some other means of trusted network identity, which leads you to crypto-derived solutions. (Really, it seems like WG and IPsec are the biggest contenders here, and both have pros and cons.)
Anyway, seeing your post was validating about this line of reasoning and I am curious to follow along to see how you are thinking about it more.
You're not wrong. The article is trying to do two things at once: convey to Fly users a new feature we launched and what to do with it, and also talk to the Internet about how we build stuff just as a part of, like, a broader tech industry conversation. The two goals ask for different tones, and they clash.
The right thing to do would be to write two blog posts, but there's no better way to get me not to write something than to tell me I have to write two somethings.
Tons of PaaS providers are no help at all if you want to run an app that uses ZeroMQ or basically anything that isn't HTTP.
One super small nitpick though, and maybe it's just me, but
> Technically, what we end up delegating to each instance is a /112, which is the IPv6 equivalent of an IPv4 Class B address
> WireGuard peers get /120 delegations (the equivalent of an IPv4 class C)
Should be
> Technically, what we end up delegating to each instance is a /112, which is the IPv6 equivalent of an IPv4 /16 network.
> WireGuard peers get /120 delegations (the equivalent of an IPv4 /24)
Classes haven't really been a thing for almost 30 years.
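The arithmetic behind the correction is just host-bit counting: a prefix of length n leaves 128 − n host bits in IPv6 and 32 − n in IPv4, so the honest equivalence is between prefixes that leave the same number of host bits, not between classes:

```python
def hosts_v6(prefix_len: int) -> int:
    """Addresses under an IPv6 prefix of the given length."""
    return 2 ** (128 - prefix_len)

def hosts_v4(prefix_len: int) -> int:
    """Addresses under an IPv4 prefix of the given length."""
    return 2 ** (32 - prefix_len)

# A /112 leaves 16 host bits, like an IPv4 /16 (the old class B *size*)...
assert hosts_v6(112) == hosts_v4(16) == 65536
# ...and a /120 leaves 8, like an IPv4 /24 (the old class C *size*).
assert hosts_v6(120) == hosts_v4(24) == 256
```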