
The long road to lazy preemption in the Linux CPU scheduler

219 points by chmaynard 4 days ago | 50 comments

Hendrikto 4 days ago

> It all adds up to a lot to be done still, but the end result of the lazy-preemption work should be a kernel that is a bit smaller and simpler while delivering predictable latencies without the need to sprinkle scheduler-related calls throughout the code. That seems like a better solution, but getting there is going to take some time.

Sounds promising. Just like EEVDF, this both simplifies and improves the status quo. It does not get better than that.

amelius 4 days ago

> A higher level of preemption enables the system to respond more quickly to events; whether an event is the movement of a mouse or an "imminent meltdown" signal from a nuclear reactor, faster response tends to be more gratifying. But a higher level of preemption can hurt the overall throughput of the system; workloads with a lot of long-running, CPU-intensive tasks tend to benefit from being disturbed as little as possible. More frequent preemption can also lead to higher lock contention. That is why the different modes exist; the optimal preemption mode will vary for different workloads.

Why isn't the level of preemption a property of the specific event, rather than of some global mode? Some events need to be handled with less latency than others.

btilly 4 days ago

You need CPU time to evaluate the priority of the event. This can't happen until after you've interrupted whatever process is currently on the CPU. And so the fastest response an event can get is limited by how short a time slice a program gets before it has to go through a context switch.

To stand ready to reliably respond to any one kind of event with low latency, every CPU-intensive program must suffer a performance penalty all the time. And this is true no matter how rare those events may be.

zeusk 4 days ago

That is not true of quite a few multi-core systems. A lot of them, especially those that really care about performance, will pin all interrupts to core 0 and only interrupt other cores via IPI when necessary.
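
For the curious, on Linux this pinning is just a procfs write. A minimal sketch in C (the IRQ number 24 is made up; pick a real one from /proc/interrupts, and it needs root):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical IRQ 24: pin it to CPU 0 by writing a hex CPU mask. */
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs("1\n", f);  /* mask 0x1 = CPU 0 only */
        fclose(f);
        return 0;
    }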

xedrac 4 days ago

I learned this when I pegged core 0 with an intensive process on a little quad-core ARM device, and all of my interrupts started behaving erratically.

btilly 4 days ago

This strategy minimizes the impact by sacrificing one core to it. But it does not eliminate it.

zeusk 4 days ago

Sure, which is a perfectly fine trade-off; almost all recent CPUs have enough multi-core capacity to make this trade favorable.

Someone 4 days ago

> You need CPU time to evaluate the priority of the event.

Not necessarily. The CPU can do it in hardware. As a simple example, the 6502 had separate “interrupt request” (IRQ) and “non-maskable interrupt” (NMI) pins, supporting two interrupt levels. The former could be disabled; the latter could not.

A programmable interrupt controller (https://en.wikipedia.org/wiki/Programmable_interrupt_control...) could also ‘know’ that it need not immediately handle some interrupts.

themulticaster 4 days ago

The user you replied to likely means something different: the priority of the event often depends on the exact contents of the event, not just on the hardware event source. For example, say you receive a "read request completed" interrupt from a storage device. The kernel now needs to pass the data on to the process which originally requested it. To know how urgent the original request, and thus the handling of the interrupt, is, the kernel needs to check which sector was read and associate it with a process. Merely knowing that the interrupt came from a specific storage device is not sufficient.
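
A sketch of that point in C (every name here is invented for illustration, not a real kernel API):

    struct task;  /* the process that issued the read */

    struct io_request {
        struct task *submitter;
        int urgent;  /* only knowable by inspecting the request itself */
    };

    void wake_up_eager(struct task *t);  /* hypothetical: preempt now */
    void wake_up_lazy(struct task *t);   /* hypothetical: wait for a tick */

    void storage_irq_handler(struct io_request *done)
    {
        /* The IRQ line only says "the disk finished something".
         * Which request, for which process, and how urgent it is
         * can only be decided after software inspects the data. */
        if (done->urgent)
            wake_up_eager(done->submitter);
        else
            wake_up_lazy(done->submitter);
    }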

By the way, NMIs still exist on x86 to this day, but AFAIK they're only used for serious machine-level issues and watchdog timeouts.

wizzwizz4 4 days ago

This, too, can be done in hardware (if nothing else, with a small coprocessor).

refulgentis 4 days ago

This doesn't shed much light.

Generally, any given software can be done in hardware.

Specifically, we could attach small custom coprocessors to everything for the Linux kernel, and Linux could require them to do any sort of multitasking.

In practice, software allows us to customize these things and upgrade them and change them without tightly coupling us to a specific kernel and hardware design.

btilly 4 days ago

Exactly the point. We can compile any piece of software that we want into hardware, but after that it is easier to change in software. Given the variety of unexpected ways in which hardware gets used, in practice we wound up moving some of what we expected to do in hardware back into software.

This doesn't mean that moving logic into hardware can't be a win. It often is. But we should also expect that what has tended to wind up in software, will continue to do so in the future. And that includes complex decisions about the priority of interrupts.

wizzwizz4 4 days ago

We already have specialised hardware for register mapping (which could be done in software, by the compiler, but generally isn't) and resolving instruction dependency graphs (which again, could be done by a compiler). Mapping interrupts to a hardware priority level feels like the same sort of task, to me.

sroussey 4 days ago

> We already have specialised hardware for register mapping (which could be done in software, by the compiler, but generally isn't)

Wait, what? I’ve been out of compiler design for a couple decades, but that definitely used to be a thing.

namibj 4 days ago

They're probably referring to AMD Zen's speculative lifting of stack slots into physical registers (due to x86, phased out with Zen3 though), and more generally to OoO cores with far more physical than architectural registers.

wizzwizz4 3 days ago

We do register allocation in compilers, yes, but that has surprisingly little bearing on the actual microarchitectural register allocation. The priority when allocating registers these days is, iirc, avoiding false dependencies, not anything else.

amluto 4 days ago

Linux runs actual C code when an event occurs — this is how it queues up a wake up of the target task and optionally triggers preemption.
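
Roughly this shape, heavily simplified (the invented names below stand in for the real try_to_wake_up() and resched_curr() in kernel/sched/core.c):

    struct task;

    void enqueue_runnable(struct task *t);        /* hypothetical runqueue op */
    int preempts(struct task *a, struct task *b); /* should a run before b? */
    void mark_need_resched(struct task *t);       /* sets TIF_NEED_RESCHED */
    struct task *current_task(void);

    void wake_up_sketch(struct task *woken)
    {
        enqueue_runnable(woken);

        /* If the newly runnable task should run before whatever is
         * on this CPU now, flag the current task for preemption. */
        if (preempts(woken, current_task()))
            mark_need_resched(current_task());
    }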

RandomThoughts3 4 days ago

> Why isn't the level of preemption a property of the specific event, rather than of some global mode?

There are two different notions which are easy to confuse here: when a process can be preempted, and when a process will actually be preempted.

Potential preemption points are a property of the scheduler, and they are what the global mode discussed here controls. More preemption points obviously mean more chances for a process to be preempted at an inconvenient time, but also more chances to prioritise properly.

What you call the level of preemption, which is to say the priority given by the scheduler, absolutely is a property of the process and can definitely be set. The default Linux scheduler will indeed do its best to allocate more time slices to processes which have priority, and to preempt them less.
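
Setting such a priority from user space is indeed straightforward; a minimal example using the real sched_setscheduler() API (the priority value is arbitrary, and it needs root or CAP_SYS_NICE):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* Make the calling process SCHED_FIFO at priority 10. */
        struct sched_param sp = { .sched_priority = 10 };

        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
            perror("sched_setscheduler");
            return 1;
        }
        puts("now SCHED_FIFO: normal tasks will be preempted for us");
        return 0;
    }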

biorach 4 days ago

Arguably PREEMPT_VOLUNTARY, as described in the article, is an attempt in this direction, one which is being deprecated.

jabl 4 days ago

It's sort of what this patch does, from https://lwn.net/ml/all/20241008144829.GG14587@noisy.programm... :

> SCHED_IDLE, SCHED_BATCH and SCHED_NORMAL/OTHER get the lazy thing, FIFO, RR and DEADLINE get the traditional Full behaviour.
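
In other words, the choice between lazy and traditional preemption is made per scheduling policy. A rough sketch of that split (not the actual patch code; the flag-setting helpers are invented):

    #include <sched.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6  /* kernel UAPI value; missing from some libc headers */
    #endif

    struct task { int policy; };

    void set_need_resched(struct task *t);       /* hypothetical: eager flag */
    void set_need_resched_lazy(struct task *t);  /* hypothetical: lazy flag */

    void resched_sketch(struct task *curr)
    {
        switch (curr->policy) {
        case SCHED_FIFO:
        case SCHED_RR:
        case SCHED_DEADLINE:
            /* Real-time policies keep the traditional eager behaviour:
             * preempt at the next preemption point. */
            set_need_resched(curr);
            break;
        default:
            /* SCHED_OTHER, SCHED_BATCH, SCHED_IDLE: lazy flag, honoured
             * at the next timer tick or return to user space. */
            set_need_resched_lazy(curr);
            break;
        }
    }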

weinzierl 4 days ago

"Current kernels have four different modes that regulate when one task can be preempted in favor of another"

Is this about kernel tasks, user tasks or both?

GrayShade 4 days ago

Kernel code. User-space code is always preemptible.

fguerraz 4 days ago

Not true when the user-space thread has RT priority.

temac 4 days ago

RT threads can be preempted by higher-priority RT threads, and IIRC some kernel threads run at the highest priority. Plus you can be preempted by an SMI, a hypervisor, etc.

simfoo 4 days ago

Can't find any numbers in the linked thread with the patches. Surely some preliminary benchmarking must have been performed that could tell us something about the real-world potential of the change?

biorach 4 days ago

From the article, second-to-last paragraph:

> There is also, of course, the need for extensive performance testing; Mike Galbraith has made an early start on that work, showing that throughput with lazy preemption falls just short of that with PREEMPT_VOLUNTARY.

spockz 4 days ago

How would you benchmark something like this? Run multiple processes concurrently and then sort by total run time? Or measure individual process wait time?

hifromwork 4 days ago

I guess both make sense, along with a lot of other things (synthetic benchmarks, microbenchmarks, real-world benchmarks, best/average/worst-case latency comparisons, best/average/worst-case throughput comparisons...)
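
For the latency side, the standard tool is cyclictest from the rt-tests suite, and its core idea is tiny: sleep until an absolute deadline and measure how late the wakeup was. A minimal sketch of that idea (interval and iteration count arbitrary):

    #include <stdio.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000L

    int main(void)
    {
        struct timespec target, now;
        long worst = 0;

        for (int i = 0; i < 1000; i++) {
            clock_gettime(CLOCK_MONOTONIC, &target);
            target.tv_nsec += 1000000;  /* ask to wake 1 ms from now */
            if (target.tv_nsec >= NSEC_PER_SEC) {
                target.tv_sec += 1;
                target.tv_nsec -= NSEC_PER_SEC;
            }
            clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &target, NULL);
            clock_gettime(CLOCK_MONOTONIC, &now);

            /* How far past the requested deadline did we wake? */
            long late = (now.tv_sec - target.tv_sec) * NSEC_PER_SEC
                      + (now.tv_nsec - target.tv_nsec);
            if (late > worst)
                worst = late;
        }
        printf("worst wakeup latency: %ld ns\n", worst);
        return 0;
    }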