Process Scheduling in the Kernel

Back to main page

Emanuele Altieri (
Prof. Nicholas Howe (
Smith College, June 2002


  1. The process descriptor
  2. Process lists
  3. Scheduling algorithms
  4. The scheduler function
  5. The goodness of a process

Questions: Q1, Q2, Q3, Q4, Q5, Q6

The Process Descriptor

In the Linux kernel, processes are defined as task_struct structures in include/linux/sched.h, line 281. This structure contains every relevant information about a process.

The first field in the task_structure structure identifies the state of a process:

volatile long state;	/* -1 unrunnable, 0 runnable, >0 stopped */

There are five possible states for a process, desribed at line 86 of sched.h. These states are the following:

#define TASK_RUNNING		0  /* 00000000b */
#define TASK_INTERRUPTIBLE	1  /* 00000001b */
#define TASK_UNINTERRUPTIBLE	2  /* 00000010b */
#define TASK_ZOMBIE		4  /* 00000100b */
#define TASK_STOPPED		8  /* 00001000b */

TASK_RUNNING identifies a process that is executing on a CPU, or waiting to be executed.

TASK_INTERRUPTIBLE identifies a process that is suspended (sleeping) until some condition becomes true. Raising an interrupt, releasing a system resource the process is waiting for, or delivering a signal are examples of conditions that might wake up the process, that is put its state back to TASK_RUNNNING.

TASK_UNINTERRUPTIBLE identifies a process that is suspended like in the TASK_INTERRUPTIBLE state, except that in this case delivering a signal will not wake up the process. This process state is seldom used.

TASK_STOPPED identifies a process whose execution has been stopped. The process enters this state upon delivery of a SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU signal (see signal(7)). Notice that the SIGSTOP signal cannot be ignored or caught. These signals are useful for debugging a program. Process execution can be continued by sending a SIGCONT signal to the process.

TASK_ZOMBIE. Process execution is terminated, but the parent process has not yet issued a wait()-like system call (see wait(2)) to return information about the dead process. Before the wait() is issued, the process cannot discard the data contained in the dead process descriptor because the parent could need it.

question Q1. Based on the definition of zombie, write a C program that creates a zombie. You can verify the existence of the zombie by printing the process list after the child process terminates. This is done by inserting the line system("ps -l") at the right point in your code. On success the ps command should output the following: (assume the program's name is a.out)
000 S   500  1589  1587  0  71   0    -   640 11cf00 pts/1    00:00:00 bash
000 S   500  1619  1589  1  69   0    -  2564 147d0d pts/1    00:00:00 emacs
000 S   500  1630  1589  0  74   0    -   338 11cf00 pts/1    00:00:00 a.out              Parent process
044 Z   500  1631  1630  0  73   0    -     0 11cda5 pts/1    00:00:00 a.out <defunct>    Child process
000 R   500  1632  1630  0  76   0    -   658      - pts/1    00:00:00 ps
Notice the Z state of the zombie process.

Also notice in the task_struct the process identifiers, such as the PID the PPID, and the process credentials, such as the user and the group to which the process belongs to (UID and GID):

uid_t uid,euid,suid,fsuid;
gid_t gid,egid,sgid,fsgid;
gid_t groups[NGROUPS];

The meaning of these fields is explained below:

uid, gidUser and group real identifiers
euid, egidUser and group effective identifiers
fsuid, fsgidUser and group effective identifiers for file access
groupsSupplementary group identifiers
suid, sgidUser and group saved identifiers

Real and effective identifiers are usually the same. They may differ only when a user executes a setuid program, that is, an executable file whose setuid flag is on. In this case, the effective user ID of the new process becomes the UID of the owner of the file. The following example shows how to set the setuid flag in a program:

[ealtieri@italia os]$ id
uid=500(ealtieri) gid=500(engineer) groups=500(engineer)
[ealtieri@italia os]$ ls
total 0
-rwx--x--x    1 ealtieri engineer        0 Jul  9 11:03 hello
[ealtieri@italia os]$ sudo chown root:root hello
[ealtieri@italia os]$ ls
total 0
-rwx--x--x    1 root     root            0 Jul  9 11:03 hello
[ealtieri@italia os]$ sudo chmod u+s hello 
[ealtieri@italia os]$ ls
total 0
-rws--x--x    1 root     root            0 Jul  9 11:03 hello

The chown command changes the owner and group of the file "hello" to root. The chmod command sets the setuid flag of the file (u stands for user, and +s turns on the setid flag). Notice the "s" character in the file permission codes of the last listing. At this point, every time the "hello" file is executed, the resulting process will run with root privileges (EUID=0), indipendently from the user who started the program.

question Q2. The getuid() and geteuid() functions retrieve the UID and EUID of the caller process. Use these two functions to implement a simple C program that prints out its own UID and EUID. Change the owner/group of the program to root and set its setuid flag as shown above. Run the program and report the output that you get.

The counter, nice, policy and processor fields are grouped together and are all involved in process scheduling, as explained later. The rt_priority field is also used for scheduling but it only applies to real-time processes.

Process Lists

To allow an efficient search through processes of a given type (for instance, all processes in a runnable state) the kernel creates several lists of processes. Each list consists of pointers to process descriptors. A list pointer (struct task_struct*) is embedded in the process descriptor's data structure.

A circular doubly linked list links together all existing process descriptors. This list is called process list. The prev_task and next_task fields of each process descriptor are used to implement this list. The head of the list is the init_task descriptor: it is the ancestor of all processes and it is called process 0 or swapper.

Process List

The Process List

The Linux kernel defines a useful macro to parse the process list:

for_each_task(p) { ... }

The macro expands to a for loop where, in every iteration, p points to a task descriptor of the process list. for_each_task() is defined in include/linux/sched.h, line 870:

#define for_each_task(p) \
	for (p = &init_task ; (p = p->next_task) != &init_task ; )

Notice that the head of the list, init_task, is skipped during the parsing because it does not correspond to any real process.

source The tasks.c module creates the /proc/tasks entry. Reading from this file will output the process list. Notice the use of the for_each_task() macro in the tsk_proc_read() function.

The list of running processes

The kernel keeps track of the runnable processes (those in the TASK_RUNNING state) using another circular, doubly-linked list called runqueue. The process descriptor include the next_run and prev_run fields to implement the runqueue list. As in the previous case, the init_task process descriptor plays the role of the header.

question Q3. Show how you would implement a loop that goes through the runqueue.
question Q4. Modify tasks.c so that it only shows the runnable processes. You can use the loop from the previous question, or add a few lines to the existing loop to check the process state at each iteration.

Process scheduling

Process scheduling is the part of an operating system that controls the order in which processes are executed and how long they can execute. A particular scheduling method is called a scheduling policy, and its implementation a scheduling algorithm. Linux offers three scheduling policies, defined in sched.h, line 115: SCHED_OTHER, SCHED_FIFO and SCHED_RR. All of these scheduling policies are preemptive: if a process with a higher priority gets ready to run, the current process will be preempted. The scheduling policy that is in effect for a process is given by the policy field of the process descriptor.


This is the default scheduling policy on Linux. The operating system assigns a certain number of clock ticks (quantum) to each process. This value is stored in the counter field of the process descriptor. The number of ticks assigned to each task is based on the nice property of the process:

p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);

where p is the process descriptor (task_struct) of the new task. The value of nice is usually 0 (default), which identifies a process with normal priority. nice can range from -20 (highest priority) to 19 (lowest priority). With nice<0, the value of counter will increase with respect to a normal priority process. With nice>0, the value of counter will decrease.

The "niceness" of a process can be seen with the ps command:

[ealtieri@italia ealtieri]$ ps -l
000 S   500  1694  1693  0  75   0    -   644 11cf00 pts/2    00:00:00 bash
000 S   500  2207  1694 12  69   0    -  2755 147d0d pts/2    00:00:00 emacs
000 R   500  2209  1694  0  78   0    -   662      - pts/2    00:00:00 ps

All of the processes above have a default (0) nice value. To change the niceness of emacs, for example, use the renice bash command:

[ealtieri@italia ealtieri]$ renice 10 -p 2207
2207: old priority 0, new priority 10
[ealtieri@italia ealtieri]$ ps -l
000 S   500  1694  1693  0  75   0    -   644 11cf00 pts/2    00:00:00 bash
000 S   500  2207  1694 12  69  10    -  2755 147d0d pts/2    00:00:00 emacs
000 R   500  2209  1694  0  78   0    -   662      - pts/2    00:00:00 ps

Only the super-user can decrease the niceness of a process.

At every clock tick the operating system decrements the counter field of the running process. This is done in kernel/timer.c, line 586:

void update_process_times(int user_tick)
        struct task_struct *p = current;   /* current running process */
        int cpu = smp_processor_id(), system = user_tick ^ 1;

        update_one_process(p, user_tick, system, cpu);
        if (p->pid) {
                if (--p->counter <= 0) {
                        p->counter = 0;
                        p->need_resched = 1;
                if (p->nice > 0)
                        kstat.per_cpu_nice[cpu] += user_tick;
                        kstat.per_cpu_user[cpu] += user_tick;
                kstat.per_cpu_system[cpu] += system;
        } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
                kstat.per_cpu_system[cpu] += system;

When p->counter becomes 0, the running time of the current process is expired and another task is selected for running. The process will not be selected again for running until all of the other running processes have exhausted their time quantum. At this point the counter field of every process is recalculated according to the formula above.


This is a First-In-First-Out implementation of the scheduling algorithm. A process with this scheduling policy, when selected for running, will continue to use the CPU until either it is blocked by an I/O request, it is preempted by a higher priority process (FIFO or RR), or it calls sched_yield().

SCHED_FIFO tasks belong to the cathegory of the real-time (RT) processes. Real-time processes are in charge of critical tasks whose execution cannot be interrupted. These tasks are usually involved in multimedia processing. Because RT processes prevent any other SCHED_OTHER process from running, their life-time must be very short.

The priority of SCHED_FIFO processes is determined by the rt_priority field of the task descriptor ("rt" stands for real-time). This value ranges from 1 (lowest priority) to 99 (highest priority).


A Round Robin real-time process. Everything described for SCHED_FIFO also applies to SCHED_RR, except that each process is only allowed to run for a maximum time quantum. When the quantum expires, the RR process is moved to the end of the run-queue and another RR process is selected. The length of a quantum is given by NICE_TO_TICKS(p->nice).

The priority of a SCHED_RR process is given by rt_priority.

question Q5. Write a short C program that prints out the type of scheduling policy, niceness and real-time priority used by the process specified in the command line (as PID). Use sched_getscheduler(pid) to get the scheduling policy of a process, sched_getparam(pid) to get its real-time priority, and getpriority(PRIO_PROCESS, pid) to get the niceness of a process.

The scheduler function

On Linux, the scheduler() function is in charge of scheduling processes, that is deciding which process to run. It is defined in kernel/sched.c, line 549.

The function starts by copying the value of current into the prev pointer. current is the descriptor of the last running process.

asmlinkage void schedule(void)
        struct schedule_data * sched_data;
        struct task_struct *prev, *next, *p;
        struct list_head *tmp;
        int this_cpu, c;


        prev = current;
        this_cpu = prev->processor;

At this point the function checks if the scheduling policy of the current process is SCHED_RR, and if its time quantum has expired. If both these conditions ar true, the process quantum is recalculated and its descriptor moved to the end of the run-queue. This allows another round-robin process to be selected for running next. If the process descriptor weren't moved to the end of the run-queue list, the same process would have been selected for running again. This happens with the SCHED_FIFO policy.

        /* move an exhausted RR process to be last.. */
        if (unlikely(prev->policy == SCHED_RR))
                if (!prev->counter) {
                        prev->counter = NICE_TO_TICKS(prev->nice);

The proper scheduling process starts a bit futher in the function. The scheduling consists of a while loop which goes through all the processes in the run-queue. To each process in the queue is assigned a "goodness" value which determines how good a process is for running. This value is calculated by the goodness() function. The process with higher goodness will be the next process to run. If no process is available for running, then the operating system selects a special idle task.

        next = idle_task(this_cpu);
        c = -1000;
        list_for_each(tmp, &runqueue_head) {
                p = list_entry(tmp, struct task_struct, run_list);
                if (can_schedule(p, this_cpu)) {
                        int weight = goodness(p, this_cpu, prev->active_mm);
                        if (weight > c)
                                c = weight, next = p;

Notice from the code above that the order of processes in the run-queue is important. Given two or more processes with the same priority, the process that comes first in the queue will be selected for running. This is because the next pointer, which points to the task descriptor of the next process to run, is only changed if the goodness of process p is greater than any other goodness found up to that point in the loop.

It may happen that all of the processes in the run-queue have a goodness of zero. As we will see in the next section, this means that all of the processes in the queue have exhausted their quantum (SCHED_OTHER policy). If this happens, the quanta are recalculated and the processes reevaluated for running.

        /* Do we need to re-calculate counters? */
        if (unlikely(!c)) {
                struct task_struct *p;

                        p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
                goto repeat_schedule;

From this point on nothing can prevent the scheduler from switching to the next task, which is stored in the next pointer. The task switch is carried out by the following line at the end of the function:

        switch_to(prev, next, prev);

The goodness of a process

The goodness of a process is calculated by the goodness() function, in kernel/sched.c, line 144. This value indicates how good a process is for running.

The goodness() function can return one of the following values:

Return ValueDescription
-1000Never select this process
-1The process wants to yield
0The process is out of time (its quantum has expired). Recalculate its counter field.
0 < x < 1000Goodness value for a SCHED_OTHER process. The larger, the more desirable the process is for running.
> 1000Goodness value for a real time process (SCHED_FIFO or SCHED_RR)

Because the scheduler selects the process with highest goodness value for running, you can see from the return values above how real-time processes (SCHED_FIFO or SCHED_RR) will always be selected before normal processes (SCHED_OTHER).

The goodness() function starts by testing the SCHED_YIELD bit in the policy field of the process descritor. This bit is set by the sched_yield() function and indicates that the process wants to yield. In the code below, p is a pointer to the task descriptor of the process to be evaluated.

static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
        int weight;

         * select the current process after every other
         * runnable process, but before the idle thread.
         * Also, dont trigger a counter recalculation.

        weight = -1;
        if (p->policy & SCHED_YIELD)
                goto out;

Following, the function calculates the goodness value in the case of a normal process (SCHED_OTHER). A first-approximation is calculated according to the number of ticks left in the process quantum. This means that the goodness of a process decreases with time, until it becomes zero when the time quantum of the task expires.

        if (p->policy == SCHED_OTHER) {
                 * Give the process a first-approximation goodness value
                 * according to the number of clock-ticks it has left.
                 * Don't do any other calculations if the time slice is
                 * over..
                weight = p->counter;
                if (!weight)
                        goto out;

Still in the SCHED_OTHER case, the final goodness is calculated in function of the niceness of the process. Recall that the value of nice can range from -20 (highest priority) to +19 (lowest priority).

                weight += 20 - p->nice;
                goto out;

        /* code for real-time processes goes here */

        return weight;

In the case of real-time processes, things are a bit simpler because the function skips the whole if statement described above and jumps directly to the following lines:

         * Realtime process, select the first one on the
         * runqueue (taking priorities within processes
         * into account).
        weight = 1000 + p->rt_priority;

The goodness of a real-time process does not depend on nice but on the rt_priority field of task descritor. The minimum value of rt_priority is 1, the maximum is 99. Therefore, the goodness of a real-time process ranges from 1001 to 1099.

question Q6. What is the order of execution for the following processes?
process_A = { policy: SCHED_FIFO, nice: 0, rt_priority: 1 }
process_B = { policy: SCHED_OTHER, nice: -20, rt_priority: 0 }
process_C = { policy: SCHED_RR, nice: 0, rt_priority: 5 }
process_D = { policy: SCHED_FIFO, nice: 0, rt_priority: 1 }
process_E = { policy: SCHED_OTHER, nice: 0, rt_priority: 0 }


kernelUnderstanding the Linux Kernel, by Daniel P. Bovet & Marco Cesati, published by O'Reilly.

	XHTML 1.0!   Powered by RedHat Linux