
What kind of workloads spawn so many processes that saving microseconds becomes relevant?

I guess shells (eg with xargs) would be one example.

Edit: the talk mentions build systems as a motivating use case.



Spawning a process consists of basically three steps: create a new process, twiddle attributes of the new process, and load the new process image. And for this, we (presently) have three choices: fork, vfork, and posix_spawn.

The cost of fork is proportional to the complexity of the spawning process (e.g., the size of its page tables), not the spawned process. The talk gives an example where fork takes 7ms, although I'm not entirely clear what it's doing (I only read the slides; I didn't listen to the talk).

vfork and posix_spawn allow you to avoid that cost, but at a price: posix_spawn permits only a very limited set of process-attribute twiddling, and vfork is so unsafe that it's difficult to sanction its use in general.

From the slides, the intention is to provide something like vfork but without the inherent unsafety of vfork. It describes vfork as "Effectively a thread with no synchronization running with the same stack as the parent". Meanwhile, you can look at io_uring as describing a kernel thread that does a limited set of tasks... so why not implement something like vfork on top of io_uring? Of course, the interesting question is what process bits can be twiddled in io_uring_spawn, but unfortunately there are no examples, from what I can see, to get a sense of how powerful it is or isn't.


> The cost of fork is proportional to the complexity of the spawning process (e.g., the size of its page tables), not the spawned process. The talk gives an example where fork takes 7ms, although I'm not entirely clear what it's doing (I only read the slides; I didn't listen to the talk).

I allocated 1GB of memory, wrote a byte to every 4KB page of that memory, then did the same fork benchmark. The kernel has to do much more work to copy the page metadata associated with the resulting 262144 page frames.

> Meanwhile, you can look at io_uring as describing a kernel thread that does a limited set of tasks... so why not implement something like vfork on top of io_uring? Of course, the interesting question is what process bits can be twiddled in io_uring_spawn, but unfortunately there are no examples, from what I can see, to get a sense of how powerful it is or isn't.

We're going to need to add a number of additional io_uring operations to provide the full capabilities of posix_spawn, but I'm expecting to do so incrementally. Eventually: everything you can do by making a syscall, at a minimum.


I've watched the talk now.

The last benchmark column touches 1 GB of memory, while the second just allocates without modifying it, so I guess all that overhead is page-table CoW mapping.

During the Q&A they also talk about exposing as many configuration calls as possible via io_uring.


Just something as simple as setting up a thread pool adds up time-wise if you do it serially. Making it faster could easily open up use cases where processes aren't viable today because they would be too slow to set up compared to the alternatives.

What I missed from the presentation is some discussion of the asynchronicity of the clone calls. Yes, the average time of cloning once is 30us, but if you queue up two of them at once, how long does that take? Will the times just sum, or will the total be shorter? Today there are applications that set up a thread pool by recursively opening new threads, both in the original thread and in the spawned ones. That is more than a little bit icky and could perhaps be simplified with this.


> Yes, the average time of cloning once is 30us, but if you queue up two of them at once, how long does that take?

I'm working on a new architecture for this right now that should make this case work well.


> What kind of workloads spawn so many processes that saving microseconds becomes relevant?

With a huge process, there is a window between when the child is spawned and when it calls exec*(), where you typically "do stuff" (such as closing a lot of fds).

During this window the parent process has its entire address space CoW-mapped, and each write triggers a page fault.

The performance impact can be significant in the _parent_ process.


IDEs can go insane with process executions. Visual Studio shells out like crazy when doing builds.

Also, processes are an awesome abstraction - they give security and reliability. Reducing their cost is nice.


Making process spawning faster would let workloads spawn "so many" processes/threads far more practically, and therefore more simply, than with userspace async-style runtimes.

(Note that threads on Linux are just processes which share memory and some other properties.)

Merely halving the cost may not be enough to totally replace those runtimes, but it's a good step in the right direction.


This kind of optimisation becomes more important as the parent process's address space grows, e.g. a browser wanting to spawn a process per origin, or even worse, an in-memory database spawning a tiny shell script as a hook. The existing (v)fork()+exec() syscalls have to mark the existing address space copy-on-write, which can involve shooting down TLB entries on all other CPU cores with IPIs (inter-processor interrupts), and set up new page tables for the child only to tear them down immediately and bring in a new executable. It also doesn't play nicely with unified event loops built on the common epoll()/kqueue() etc. syscalls, which is a problem for languages like Go that depend on event loops for their runtimes.


I think vfork is intended to avoid the CoW overhead (at the cost of having really bizarre other semantics that make it tricky to use).

Since it pauses the execution of the parent process (or at least the parent thread), it gives the child a window to use the current stack frame to set up an exec() call.

When I first learned about vfork() (early 2000s) I had the impression it had become a bit pointless, since CoW sharing had made fork() really fast. But with modern-sized address spaces it is (or has become?) horribly slow again.


For example, Rust doctests consist of compiling each little doc snippet and executing it to ensure that the asserts are true. That's two processes (rustc + the compiled piece of code) per test, which is an order of magnitude slower than multi-threaded in-process tests. Process startup and teardown is a significant fraction of that overhead.

Now, fork+exec isn't the biggest chunk of that, but it hammers the memory subsystem. On many-core systems this leads to lock contention. So the cheaper starting a new process gets in terms of virtual-memory operations, the less lock contention you get overall, which also speeds up other parts.


Forking a process per request in a request-handling server would be one example where fork time adds directly to request latency (this would happen during worker-pool expansion).


I'm not sure if it's really used anymore, but inetd is a perfect use case.


If you're writing anything with performance considerations using shell scripts, you're almost certainly doing something wrong. If it includes spawning thousands of processes, you're definitely doing something wrong.


Isn't spawning a POSIX thread the same thing?


Threads are much cheaper to spawn than processes: threads share memory, while forked processes are copy-on-write, which requires updating the parent's page tables and copying the kernel's bookkeeping entries over to the child.


fork bomb



