sched_ext Core Implementation (Patches 08–12)
Overview
This group contains the actual sched_ext implementation — the new scheduler class, its BPF interface, its safety infrastructure, and the first example schedulers. If the foundational refactoring patches (01–07) cleared the terrain, this group builds the structure. The cover letter (patch-08.md) contextualizes the submission as the v7 patchset, the version that was ultimately merged into Linux 6.11.
The five patches in this group are not equal in size or complexity. The boilerplate patch (patch-08) establishes the file skeleton and class registration. The core implementation patch (patch-09, ~4000 lines) is where almost all of the scheduling logic lives — it is the patch a maintainer must understand most deeply. Patches 10–12 add the safety envelope: example schedulers for verification, a sysrq escape hatch, and a watchdog timer.
Why These Patches Are Needed
A BPF-based scheduler class requires:
- A kernel-side implementation of `struct sched_class` that delegates decisions to BPF programs through a well-defined callback interface (`struct sched_ext_ops`).
- Per-task state to track where each task sits in the scheduler's queues.
- Dispatch queues — the mechanism by which BPF programs move tasks to CPUs.
- BPF helper functions — the kernel-side API that BPF programs call to perform actions.
- Graceful failure — when the BPF program misbehaves, the kernel must recover without a panic.
- Example code that demonstrates the interface is usable and correctly designed.
This group delivers all six. Understanding why each piece exists requires understanding the problem sched_ext is solving: allow arbitrary scheduling policy in BPF while keeping the kernel safe from misbehaving BPF programs.
Key Concepts
The Cover Letter (patch-08.md) — v7 and the Merge Path
The v7 patchset was submitted after six earlier rounds of review. The cover letter is important reading for a maintainer because it documents the design decisions that were debated and resolved during review — the reasons certain choices were made over alternatives. It also announces that sched_ext would be merged into Linux 6.11 via the tip/sched/core tree.
Key design decisions the cover letter defends:
- Why `SCHED_EXT` is a new scheduling policy (not a flag on `SCHED_NORMAL`).
- Why BPF programs communicate through `sched_ext_ops` callbacks rather than a ring buffer.
- Why the dispatch queue abstraction (DSQ) exists as an intermediary rather than having BPF programs directly set `curr` on CPUs.
The Boilerplate Patch — File Structure and Class Registration
The boilerplate patch creates the scaffolding that the core patch fills in:
- `include/linux/sched/ext.h` (userspace-visible header): the `SCHED_EXT` policy number, the `struct sched_ext_ops` declaration (the BPF program fills this struct), and the `SCX_DSQ_*` constants.
- `kernel/sched/ext.h` (kernel-internal header): `scx_entity` (per-task state embedded in `task_struct`), internal DSQ structures, and declarations for functions `core.c` calls.
- `kernel/sched/ext.c`: the implementation file, initially stub functions.
The critical registration step: ext_sched_class is defined with .next = &idle_sched_class
and fair_sched_class.next is changed to &ext_sched_class. This inserts the new class at
the correct priority level. Hooks are added to kernel/fork.c (sched_ext_free() is called
from free_task()), kernel/sched/core.c (for class transitions), and kernel/sched/idle.c
(so idle CPUs check the SCX global DSQ).
PATCH 09 — The Core Implementation
This is the heart of sched_ext. The key data structures and mechanisms are:
struct scx_dispatch_q (DSQ)
A DSQ is an ordered list of tasks waiting to run. There are three kinds:
- `SCX_DSQ_GLOBAL` (ID 0): A single global FIFO queue. Any CPU can dispatch from it. This is the simplest possible dispatch model: a BPF scheduler that only uses the global DSQ is functionally similar to a simple FIFO scheduler.
- `SCX_DSQ_LOCAL` (ID `SCX_DSQ_LOCAL_ON | cpu`): Per-CPU local queues. Tasks in a CPU's local DSQ will run on that specific CPU. The kernel dispatches from the local DSQ before looking elsewhere.
- User-defined DSQs: BPF programs can create arbitrary DSQs with `scx_bpf_create_dsq(id, node)`. These allow BPF programs to implement complex queuing policies (priority queues, round-robin groups, deadline queues) entirely in BPF.
struct scx_entity
Embedded in `task_struct` (alongside `struct sched_entity se` for CFS). Tracks:
- Which DSQ the task is currently in (the `dsq` pointer).
- The task's position in the DSQ (`dsq_node`).
- Slice remaining (`slice`: how many nanoseconds the task can run before it is preempted).
- Virtual time for vtime-ordered DSQs (`dsq_vtime`).
- Various flags: `SCX_TASK_QUEUED`, `SCX_TASK_RUNNING`, `SCX_TASK_DISALLOW`, etc.
The dispatch path
When the kernel needs to pick the next task for a CPU, `pick_next_task_scx()` is called:
1. Check the CPU's local DSQ (`rq->scx.local_dsq`). If non-empty, take the first task.
2. Call `ops.dispatch(cpu, prev)`: give the BPF program a chance to fill the local DSQ.
3. After `ops.dispatch()` returns, check the local DSQ again.
4. Check `SCX_DSQ_GLOBAL` as a fallback.
5. If still empty, return NULL (the kernel will try `idle_sched_class`).
The enqueue path
When a task becomes runnable, `enqueue_task_scx()` is called:
1. Call `ops.select_cpu(p, prev_cpu, wake_flags)`: the BPF program can return a preferred CPU.
2. If the preferred CPU's local DSQ is empty and the CPU is idle, dispatch there directly (this is the "direct dispatch" optimization that avoids an extra `ops.dispatch()` round).
3. Otherwise, call `ops.enqueue(p, enq_flags)`. The BPF program must call `scx_bpf_dispatch(p, dsq_id, slice, enq_flags)` to place the task in a DSQ.
BPF helper functions (scx_bpf_*)
These are the kernel-side API that BPF programs call:
- `scx_bpf_dispatch(p, dsq_id, slice, enq_flags)`: move task `p` to DSQ `dsq_id` with time slice `slice`.
- `scx_bpf_dispatch_vtime(p, dsq_id, vtime, slice, enq_flags)`: dispatch with virtual time ordering.
- `scx_bpf_consume(dsq_id)`: in `ops.dispatch()`, pull the next task from a user DSQ into the local DSQ.
- `scx_bpf_create_dsq(dsq_id, node)`: create a new DSQ.
- `scx_bpf_destroy_dsq(dsq_id)`: destroy a DSQ (BPF programs clean up in `ops.exit()`).
- `scx_bpf_task_running(p)`: is the task currently executing on a CPU?
- `scx_bpf_cpu_rq(cpu)`: get the `struct rq *` for a CPU (used to enqueue to local DSQs).
Error exit (scx_ops_error())
If the BPF program misbehaves (e.g., tries to dispatch to a non-existent DSQ, returns an
invalid CPU from select_cpu, or causes a kernel BUG), the kernel calls scx_ops_error().
This:
1. Sets an error state in `scx_ops_exit_kind`.
2. Schedules `scx_ops_disable_workfn()` on a workqueue.
3. The work function disables the BPF scheduler: it moves all tasks back to CFS, calls `ops.exit()`, and clears `ext_sched_class` from the class list.
The error state captures a human-readable reason string and is readable from debugfs, which feeds the monitoring tooling in later patches.
PATCH 10 — Example Schedulers
Two example BPF schedulers are provided in tools/sched_ext/:
scx_simple: The simplest possible sched_ext scheduler. It implements only ops.enqueue()
and ops.dispatch(). Every task is dispatched to SCX_DSQ_GLOBAL. This is the "hello world"
of sched_ext and demonstrates the minimum viable implementation.
scx_example_qmap: A more sophisticated scheduler using 5 per-priority queues implemented
as a BPF array map. ops.enqueue() places tasks in the queue corresponding to their nice value.
ops.dispatch() drains queues in priority order. This demonstrates:
- How to create and manage BPF-defined DSQs.
- How to create and manage BPF-defined DSQs.
- How `scx_bpf_consume()` is used to pull from custom DSQs into the local DSQ.
- How per-task data can be stored in BPF maps keyed by `task_struct *`.
These examples are not just demos — they are the primary validation that the API is usable and that the kernel-side implementation correctly handles the BPF callbacks.
PATCH 11 — sysrq-S Emergency Fallback
The keyboard shortcut Alt+SysRq+S is registered to call scx_ops_disable(). This provides
a human-accessible escape hatch: if a BPF scheduler causes the system to become unresponsive
(tasks not scheduled, UI frozen), a user at the console can press Alt+SysRq+S to forcibly
kill the BPF scheduler and return all tasks to CFS.
Mechanically, this calls scx_ops_error() with a "sysrq" reason, which triggers the same
disable path as an internal error. The distinction is in the exit reason recorded: SCX_EXIT_SYSRQ
vs SCX_EXIT_ERROR. This is important for post-mortem analysis — was the scheduler disabled
intentionally or due to a bug?
The sysrq handler is registered in scx_init() and deregistered in scx_exit().
PATCH 12 — Watchdog Timer
The watchdog timer addresses a failure mode that sysrq cannot catch: the system is technically running but the BPF scheduler is not scheduling tasks in a timely manner. For example, a BPF program might have a bug where certain tasks are never dequeued from a user-defined DSQ, starving them indefinitely.
The watchdog implementation:
- A per-CPU `struct scx_watchdog` contains an `hrtimer` and tracks when the CPU's SCX tasks were last dispatched.
- The timer fires every `scx_watchdog_timeout / 2` (default: 15 seconds, half the 30-second timeout).
- On each fire, it checks whether any runnable SCX task has not been scheduled within `scx_watchdog_timeout`.
- If a stall is detected, `scx_ops_error()` is called with "runnable task stall detected".
The watchdog relies on a per-task timestamp (`scx_entity.runnable_at`) that is updated each time the task is dispatched. On each timer fire, it checks whether `now - runnable_at > timeout` for any runnable task.
The watchdog is enabled when scx_ops_enable() succeeds and disabled as part of
scx_ops_disable_workfn(). It cannot fire during class transitions, avoiding a window
where a task might appear stalled simply because it is being migrated.
Connections Between Patches
The patches in this group build on each other in a specific order:
PATCH 08 (boilerplate)
└─→ Creates the files and stubs that PATCH 09 fills in
└─→ Registers ext_sched_class, making the class visible to the scheduler
PATCH 09 (core)
└─→ Implements the full dispatch/enqueue/select_cpu machinery
└─→ Provides scx_ops_error() which PATCH 11 and PATCH 12 use
└─→ Provides scx_ops_enable/disable which PATCH 12's watchdog wraps
PATCH 10 (examples)
└─→ Validates that PATCH 09's API is correct and complete
└─→ The simplest test that the dispatch path works end-to-end
PATCH 11 (sysrq)
└─→ Uses scx_ops_error() from PATCH 09
└─→ Provides human-accessible escape hatch tested before watchdog
PATCH 12 (watchdog)
└─→ Uses scx_ops_error() from PATCH 09
└─→ Closes the failure mode that PATCH 11 cannot catch (no human present)
Connections to Foundational Patches
This group directly consumes every change from patches 01–07:
- `sched_class_above()` (PATCH 01) is called in `ext.c` wherever class priority comparisons occur.
- Fallible fork (PATCH 02) enables `scx_cgroup_can_attach()` and `ops.enable()` during fork to propagate `ENOMEM`.
- `reweight_task()` (PATCH 03) is implemented in `ext.c` to call `ops.reweight_task()`.
- `switching_to()` (PATCH 04) is implemented in `ext.c` to call `ops.enable()` before enqueue.
- Cgroup/PELT helpers (PATCHES 05–06) are called from `scx_bpf_task_cgroup_weight()` and the PELT integration in `scx_entity`.
- `normal_policy()` (PATCH 07) is used in several places in `ext.c` to check task policy.
What to Focus On
For a maintainer, the critical areas to understand in this group:
- The dispatch contract. The rule is: a task must end up in a DSQ (via `scx_bpf_dispatch()`) before the kernel can run it. If `ops.enqueue()` is called and the BPF program does not call `scx_bpf_dispatch()`, the task is considered to have been "consumed" without being dispatched, which triggers an error exit. Understanding this contract is necessary to evaluate any future change to the enqueue/dispatch path.
- DSQ lifecycle. User-defined DSQs are created with `scx_bpf_create_dsq()` and must be destroyed in `ops.exit()`. If a BPF program forgets to destroy a DSQ, the kernel detects leaked DSQs during `scx_ops_disable_workfn()` and logs an error. Future patches touching DSQ lifecycle (e.g., per-NUMA DSQs, per-cgroup DSQs) must respect this cleanup contract.
- The error exit machinery. `scx_ops_error()` can be called from any context (interrupt, softirq, process context), so it must be lock-free in its initial steps. The actual disable work is deferred to a workqueue. Understanding this deferred-disable pattern is essential for reviewing any change that adds new error conditions.
- `scx_entity` in `task_struct`. Adding fields to `task_struct` is always scrutinized heavily in kernel review because it increases memory usage for every task on the system, not just SCX tasks. Study how `scx_entity` is conditionally compiled (`CONFIG_SCHED_CLASS_EXT`) and what techniques are used to minimize its footprint.
- The select_cpu / direct dispatch optimization. The path where `ops.select_cpu()` returns a CPU whose local DSQ is empty, and the task is dispatched there without going through `ops.enqueue()` at all, is a critical performance optimization. Any change to wakeup paths must preserve this fast path or explicitly justify removing it.
- Example schedulers as specification. The example schedulers in PATCH 10 are not just documentation: they are the canonical test of whether a proposed API change is usable. When reviewing future sched_ext API changes, check whether the examples would need to change and whether the changes make the examples simpler or more complex.