Task and Operation Management (Patches 20–23)
Overview
This group of four patches refines how sched_ext manages the lifecycle of individual tasks and tracks in-flight operations. The core implementation (patch 09) established the basic enqueue and dispatch mechanism, but several important task state transitions were underspecified: What exactly is the BPF program notified of when a task's CPU run begins or ends? Can the kernel suppress scheduling ticks when the BPF scheduler doesn't need them? What prevents races when the BPF program and kernel code concurrently touch the same task? And how does a CPU coordinate with another to ensure a kicked CPU has actually rescheduled?
Patches 20–23 answer each of these questions, completing the task lifecycle interface and adding the concurrency controls needed for robust BPF scheduler implementation.
Why These Patches Are Needed
Fine-Grained Task State Visibility
The base sched_ext implementation notifies BPF programs of two task events: ops.enqueue()
(task becomes runnable) and dispatch (task is placed on a CPU's local DSQ). But between
"task is runnable" and "task is actually running on a CPU" there is a gap that matters for
scheduling algorithms:
- Work-conserving schedulers need to know when a CPU actually begins executing a task (not just when the task was dispatched) to update utilization estimates.
- Gang schedulers need to know when all members of a group of related tasks become runnable, since they wait for the whole gang to be ready before dispatching any member.
- Preemption-aware schedulers need to know when a task stops executing, distinguishing between "stopped voluntarily" (syscall, I/O wait) and "stopped involuntarily" (preempted).
The base implementation does not expose these transitions to BPF programs.
Tick Suppression
Linux's scheduler tick fires periodically (typically every 1–4ms) on each CPU to implement
time-slicing. The tick calls scheduler_tick(), which checks whether the current task has
exhausted its time slice and sets TIF_NEED_RESCHED if so.
Many latency-sensitive workloads benefit from nohz_full (tickless operation): when a CPU
runs a single task with no pending work, the tick can be suppressed, eliminating jitter from
interrupt-driven preemption. CFS supports this natively. sched_ext needs the same capability.
In-Flight Operation Tracking
When a BPF program calls scx_bpf_dispatch(), the kernel places the task in a DSQ. But
between the BPF program deciding to dispatch a task and the kernel actually executing the
dispatch, the task's state might change: the task could be migrated, preempted, or even exit.
The kernel needs a mechanism to track tasks that are subjects of an in-flight SCX operation
and ensure the operation completes safely or is cancelled.
Synchronous Kick Confirmation
scx_bpf_kick_cpu() (patch 17) sends an IPI and returns immediately — it does not wait for
the target CPU to actually reschedule. For most use cases this is fine (fire and forget), but
for some coordination patterns, the calling CPU needs a guarantee that the target CPU has
processed the kick. Without this, race conditions are possible.
Key Concepts
PATCH 20 — Task Lifecycle Callbacks
Patch 20 adds four new sched_ext_ops callbacks that give BPF programs fine-grained visibility
into task state:
ops.runnable(p, enq_flags): Called when a task transitions from sleeping/waiting to
runnable. This happens just before ops.enqueue() is called. The distinction from
ops.enqueue() is subtle but important: runnable() is called once per sleep-to-wake
transition, while enqueue() may be called multiple times (e.g., after preemption and
re-queueing). BPF schedulers use runnable() to track the number of runnable tasks (for
load estimation) without double-counting re-queued tasks.
ops.running(p): Called immediately before a task begins executing on a CPU
(just before context_switch()). This is the notification that the task is now consuming CPU
cycles. BPF schedulers use this to start per-task CPU time accounting, update utilization
estimates, and record task.start_time for scheduling analytics.
ops.stopping(p, runnable): Called when a task is about to stop running on a CPU, before
it is removed from rq->curr. The runnable argument indicates whether the task will remain
runnable (true: preempted or yielded) or become blocked (false: waiting for I/O or sleeping).
This is the symmetric counterpart to ops.running().
ops.quiescent(p, deq_flags): Called when a task transitions from runnable to not runnable
(the task has gone to sleep, waiting for I/O, or exited). This is the symmetric counterpart to
ops.runnable(). BPF schedulers use this to decrement their runnable task count.
The relationship between these four callbacks traces a complete task lifecycle:
ops.runnable() ← task wakes up
ops.enqueue() ← task placed in a DSQ
ops.running() ← task begins executing on CPU
ops.stopping() ← task stops executing (may still be runnable)
ops.enqueue() ← (if still runnable: re-queued after preemption)
ops.quiescent() ← task goes to sleep
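The key property of this lifecycle is that the per-wakeup pair (runnable/quiescent) and the per-run pair (running/stopping) balance independently: a preemption cycles running/stopping and enqueue without touching the runnable count. A minimal plain-C model of that invariant (illustrative only, not real BPF code; the counters and the stub callbacks are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model: nr_runnable changes only on runnable()/quiescent(),
 * never on the (possibly repeated) enqueue() that follows a preemption. */
static int nr_runnable;   /* tasks between runnable() and quiescent() */
static int nr_running;    /* tasks between running() and stopping()   */

static void ops_runnable(void)  { nr_runnable++; }        /* wake-up        */
static void ops_enqueue(void)   { /* no count change */ } /* (re)queue      */
static void ops_running(void)   { nr_running++; }         /* starts on CPU  */
static void ops_stopping(bool runnable)
{
        nr_running--;
        if (runnable)
                ops_enqueue();  /* preempted: re-queued, still runnable */
}
static void ops_quiescent(void) { nr_runnable--; }        /* goes to sleep  */
```

A scheduler that incremented its load estimate in enqueue() instead of runnable() would double-count every preempted task; keeping the counters on the outer pair avoids that.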
Beyond these four, patch 20 also adds:
ops.enable(p): Called when a task first adopts SCHED_EXT scheduling policy (or on
system init for existing tasks). This is where BPF programs initialize per-task state
(e.g., allocating a BPF_MAP_TYPE_TASK_STORAGE entry). Called from switching_to() (patch 04),
before the task is enqueued.
ops.disable(p): Called when a task leaves SCHED_EXT policy (either by calling
sched_setscheduler() to switch to another policy, or when the BPF scheduler is being disabled
and all tasks are returned to CFS). This is where BPF programs free per-task state.
The enable()/disable() pair guarantees balanced allocation/deallocation. The kernel ensures
disable() is called exactly once for each enable() call, even during error exits.
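The pairing guarantee is what lets a BPF scheduler treat disable() as its single free site. A plain-C sketch of that discipline (in a real scheduler the per-task state would live in a BPF_MAP_TYPE_TASK_STORAGE map; struct task_ctx and the counter here are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

struct task_ctx { long runtime_ns; };   /* hypothetical per-task state */
static int nr_live_ctxs;                /* must return to 0 on unload  */

/* Modeled ops.enable(): the only allocation site for per-task state. */
static struct task_ctx *ops_enable(void)
{
        struct task_ctx *ctx = calloc(1, sizeof(*ctx));

        if (ctx)
                nr_live_ctxs++;
        return ctx;
}

/* Modeled ops.disable(): called exactly once per successful enable(),
 * even on SCX_EXIT_ERROR, so this is the only free site needed. */
static void ops_disable(struct task_ctx *ctx)
{
        free(ctx);
        nr_live_ctxs--;
}
```

If the kernel could skip disable() on an error path, every scheduler would need its own leak-tracking; the exactly-once guarantee pushes that burden onto one reviewed kernel path instead.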
PATCH 21 — Tickless Support (nohz_full Integration)
When a BPF scheduler runs a task on a CPU and that CPU has no other runnable tasks, the
scheduler tick can be suppressed. Patch 21 integrates sched_ext with nohz_full:
The relevant function is task_tick_scx(rq, curr, queued), sched_ext's implementation of
sched_class->task_tick(). The CFS equivalent checks whether the current task's time slice
has expired. In sched_ext, the tick is suppressed if all of the following are true:
- The CPU's local DSQ is empty (no other SCX tasks waiting).
- The global DSQ is empty.
- The BPF scheduler has not set any "tick needed" flag.
- The BPF scheduler's ops.dispatch() would return immediately (no pending work).
When these conditions hold, sched_can_stop_tick(rq) reports to the nohz_full
infrastructure that the tick can be stopped, and the timer interrupt is suppressed
on this CPU until a new task becomes runnable or the CPU is kicked.
Interaction with time slices: sched_ext tasks have a scx_entity.slice field that holds
the remaining time slice. When the tick fires and slice reaches zero, the task is preempted.
With tickless operation, preemption is driven by timer expiry rather than periodic ticks,
which may result in longer time slices than configured — this is intentional for nohz_full
workloads where jitter reduction is more important than strict time-slice enforcement.
BPF program responsibility: A BPF scheduler that dispatches tasks with scx_bpf_dispatch()
using SCX_SLICE_INF (infinite slice) is declaring that it will handle preemption itself (via
scx_bpf_kick_cpu() with SCX_KICK_PREEMPT). For such schedulers, tick suppression is
always appropriate. A scheduler that uses finite slices should not enable tick suppression
unless it can tolerate some slice overshoot.
PATCH 22 — scx_entity::in_op_task
This patch adds in_op_task, a flag in scx_entity that marks tasks that are currently the
subject of an in-flight SCX operation. The primary use case is scx_bpf_dispatch(): when the
BPF program calls scx_bpf_dispatch(p, dsq_id, slice, enq_flags), the kernel must:
1. Validate that p is still in a valid state for dispatch.
2. Acquire the appropriate locks (the task's DSQ lock, possibly the runqueue lock).
3. Move the task to the target DSQ.
Between steps 1 and 2, the task could be migrated, preempted, or exit. Without in_op_task,
the kernel would have to re-validate every condition after acquiring the locks and abort the
dispatch if the state had changed; a missed re-check could corrupt the DSQ.
in_op_task works as a serialization token:
- It is set when a BPF callback begins processing a task (e.g., at the start of ops.enqueue()).
- It is cleared when the callback returns.
- Any kernel code path that needs to modify a task's scheduling state checks in_op_task first. If set, it waits (via scx_task_iter_wait()) for the in-flight operation to complete.
This avoids TOCTOU (Time Of Check, Time Of Use) races where the BPF program checks a task's state, decides to dispatch it, and the kernel changes the state before the dispatch completes.
Relationship to BPF verifier: The BPF verifier cannot prevent all races — it can prevent
BPF programs from accessing invalid memory, but it cannot reason about kernel-side state
changes. in_op_task is the kernel-side mechanism that complements the verifier's static
analysis by providing runtime serialization.
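The set/clear/wait protocol can be sketched as a token discipline. This is a deliberately single-threaded plain-C illustration; real kernel code would use proper atomics and block via the scx_task_iter_wait() primitive rather than returning a boolean:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for the relevant part of scx_entity. */
struct scx_entity {
        bool in_op_task;    /* task is the subject of an in-flight op */
};

/* BPF callback entry/exit: bracket the operation with the token. */
static void scx_op_begin(struct scx_entity *se) { se->in_op_task = true; }
static void scx_op_end(struct scx_entity *se)   { se->in_op_task = false; }

/* Kernel side: may this task's scheduling state be modified right now?
 * Real code would wait for the operation to finish instead of bailing. */
static bool scx_task_state_change_allowed(const struct scx_entity *se)
{
        return !se->in_op_task;
}
```

The point of the token is ordering, not mutual exclusion of data: any kernel-side state change is deferred until the BPF callback has finished reasoning about the task, which is exactly the TOCTOU window described above.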
PATCH 23 — SCX_KICK_WAIT
Patch 23 adds the SCX_KICK_WAIT flag to scx_bpf_kick_cpu(). When specified:
- The kick is sent as usual (IPI to target CPU).
- The calling CPU then waits (via wait_event() on a per-CPU completion) until the target CPU has completed one full scheduling round (i.e., schedule() has returned and a new task has been selected).
Use case: A BPF scheduler that dispatches a task to CPU A's local DSQ and then needs to know that CPU A has actually picked up the task (not just received the kick) before proceeding. For example, a gang scheduler waiting for all CPUs in a gang to have their tasks running.
Implementation: The target CPU sets a completion event at the end of __schedule() when
it detects a pending SCX_KICK_WAIT. The calling CPU's scx_bpf_kick_cpu() waits on this
completion with a timeout. If the timeout expires (e.g., the target CPU is stuck), the wait
returns an error but does not trigger a watchdog event — it is the calling CPU's BPF program's
responsibility to handle the timeout.
Cost: SCX_KICK_WAIT adds synchronization overhead: the calling CPU blocks until the
target CPU reschedules. For high-frequency scheduling decisions, this is too expensive. It is
intended for coarse-grained coordination (gang startup, barrier-style synchronization) rather
than per-task dispatch.
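The kick-then-wait handshake is a classic completion pattern. A pthread-based userspace sketch of its shape, under the assumption that the kernel pairs an IPI with a per-CPU completion (all names here are illustrative; the real code uses wait_event() with a timeout rather than an untimed condition wait):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Userspace stand-in for the per-CPU completion behind SCX_KICK_WAIT. */
struct kick_completion {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        bool            done;
};

/* "Target CPU": runs where __schedule() would finish a round and
 * notices a pending SCX_KICK_WAIT. */
static void *target_cpu(void *arg)
{
        struct kick_completion *kc = arg;

        /* ... pick next task, context switch ... */
        pthread_mutex_lock(&kc->lock);
        kc->done = true;                 /* scheduling round complete */
        pthread_cond_signal(&kc->cond);
        pthread_mutex_unlock(&kc->lock);
        return NULL;
}

/* "Kicking CPU": send the kick, then block until the round finishes. */
static void kick_cpu_and_wait(struct kick_completion *kc)
{
        pthread_t t;

        pthread_create(&t, NULL, target_cpu, kc);   /* models the IPI */
        pthread_mutex_lock(&kc->lock);
        while (!kc->done)
                pthread_cond_wait(&kc->cond, &kc->lock);
        pthread_mutex_unlock(&kc->lock);
        pthread_join(&t, NULL);
}
```

The blocking wait in kick_cpu_and_wait() is precisely the cost described above: the caller makes no progress until the target completes a round, which is why this belongs in coarse-grained coordination, not hot dispatch paths.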
Connections Between Patches
PATCH 20 (lifecycle callbacks)
└─→ ops.enable/disable: required foundation for per-task BPF state
└─→ ops.running/stopping: consumed by PATCH 21 (tick suppression) to determine
whether tick can be stopped based on actual execution state
└─→ ops.runnable/quiescent: enable accurate task count tracking
PATCH 21 (tickless)
└─→ Depends on ops.stopping (PATCH 20): tick is only suppressed after stopping
└─→ Interacts with scx_entity.slice: inf-slice tasks are primary beneficiaries
PATCH 22 (in_op_task)
└─→ Serializes against PATCH 20 callbacks: in_op_task prevents ops.disable() from
racing with ops.enqueue() for the same task
└─→ Interacts with PATCH 23: in-flight dispatch (in_op_task) and synchronous kick
(KICK_WAIT) together allow BPF programs to implement deterministic scheduling rounds
PATCH 23 (SCX_KICK_WAIT)
└─→ Extends PATCH 17 (scx_bpf_kick_cpu) with a synchronous variant
└─→ Used alongside PATCH 22 to confirm task was dispatched and CPU has rescheduled
What to Focus On
For a maintainer, the critical lessons from this group:
- The enable/disable pairing guarantee. ops.enable() and ops.disable() must be called in perfectly matched pairs. The kernel's guarantee: disable() is called exactly once for every enable(), even during error exit (SCX_EXIT_ERROR). When reviewing any change to the class transition or disable path, verify this invariant is preserved. A missed disable() call leaks BPF per-task resources; a spurious disable() call on a task that never had enable() called is undefined behavior.
- runnable/quiescent vs running/stopping. The new lifecycle callbacks form two parallel pairs: runnable/quiescent track the task's runnable state (independent of any CPU), while running/stopping track CPU execution. These are distinct events and must not be confused. A preempted task fires stopping(runnable=true) (it is still runnable) and will fire enqueue() again when re-queued, but does not fire quiescent(). When reviewing BPF schedulers that implement these callbacks, verify they correctly distinguish preemption from blocking.
- Tick suppression and time-slice accuracy. nohz_full integration means tick-based preemption may fire late. A BPF scheduler that relies on precise time slices (e.g., for real-time guarantees) should not use tickless mode. When reviewing changes to the tick path in task_tick_scx(), verify that finite-slice tasks still receive timely preemption.
- in_op_task as a liveness concern. in_op_task serializes kernel code against in-flight BPF operations. If a BPF program holds in_op_task indefinitely (e.g., an infinite loop in ops.enqueue()), kernel code waiting on the serialization will also wait indefinitely. The dispatch watchdog (patch 19) catches the infinite-loop case, but in_op_task itself has no timeout. When reviewing future uses of in_op_task, ensure the in-flight operation is always bounded in time.
- SCX_KICK_WAIT and priority inversion. If CPU A calls scx_bpf_kick_cpu(B, SCX_KICK_WAIT) and CPU B is running a high-priority RT task (which sched_ext cannot preempt), CPU A waits until that RT task voluntarily yields. This is a form of priority inversion: the BPF scheduler on CPU A (which may be handling a high-priority SCX task) is blocked on an RT task on CPU B. Future changes to SCX_KICK_WAIT should consider a BPF-configurable timeout to avoid unbounded waits.