Deep Dive into Linux 0.12 System Call Mechanism

Kernel mode and user mode

On early systems without hardware protection, a user program could read or write another program's memory, or even the operating system's own, so a single bug could crash the whole machine. Core resources such as memory, I/O ports, and privileged machine instructions therefore have to be protected and access-controlled.

CPUs provide privilege levels in hardware for this purpose. For example, x86 divides CPU privilege into four rings, Ring0 through Ring3. Ring0 has the highest privilege and can execute all instructions; Ring3 has the lowest privilege and cannot execute privileged instructions that manipulate hardware directly, nor can code running in Ring3 access the Ring0 address space.

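Segment registers (CS, DS, SS, ES, FS, GS) hold segment selectors. The low two bits of a selector are its requested privilege level (RPL); the low two bits of the selector currently loaded in CS give the current privilege level (CPL) of the CPU. When a selector is loaded, the CPU uses it to index the GDT or LDT and checks the descriptor privilege level (DPL) of the descriptor it finds. For a data segment, access is allowed only if DPL >= max{CPL, RPL}, where a numerically larger value means less privilege.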
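The selector layout and the data-segment check can be made concrete with a small user-space sketch. The struct and function names below are hypothetical and exist only for illustration; the bit layout (index in bits 15..3, table indicator in bit 2, RPL in bits 1..0) is the standard x86 one. The values 0x10 and 0x17 are the two selectors that the Linux 0.12 syscall entry code loads.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit x86 segment selector (illustrative struct). */
struct selector_fields {
    unsigned index; /* bits 15..3: descriptor index within the table */
    unsigned ti;    /* bit 2: 0 = GDT, 1 = LDT */
    unsigned rpl;   /* bits 1..0: requested privilege level */
};

/* Split a selector value into its three fields. */
struct selector_fields decode_selector(uint16_t sel)
{
    struct selector_fields f;
    f.index = (unsigned)sel >> 3;
    f.ti    = ((unsigned)sel >> 2) & 1u;
    f.rpl   = (unsigned)sel & 3u;
    return f;
}

/* Data-segment rule from the text: allowed only if DPL >= max(CPL, RPL). */
int data_access_allowed(unsigned cpl, unsigned rpl, unsigned dpl)
{
    assert(cpl <= 3 && rpl <= 3 && dpl <= 3);
    unsigned effective = cpl > rpl ? cpl : rpl; /* weaker of the two levels */
    return dpl >= effective;
}
```

Decoding 0x10 gives index 2 in the GDT with RPL 0, and 0x17 gives index 2 in the LDT with RPL 3. With these helpers, data_access_allowed(3, 3, 0) is 0: Ring3 code cannot load a Ring0 data segment.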

Linux reuses the hardware mechanism and uses Ring0 as kernel mode and Ring3 as user mode. Linux typically uses only Ring0 and Ring3 because the intermediate rings are rarely useful for operating system design and complicate portability to architectures with fewer privilege levels.

From a high level, Linux is divided into user mode (Ring3) and kernel mode (Ring0). Kernel mode has full access to hardware and memory and manages core resources; faults in kernel mode are catastrophic. User mode runs regular applications with limited privileges; faults are isolated to the process.

What is a system call?

When Linux boots, the CPU initially executes at the highest privilege level. The bootloader loads the kernel into memory and starts it (this discussion uses Linux 0.12 as an example). The kernel reserves part of memory for itself (kernel space) and leaves the main memory area for user-space applications.

After initialization, the system switches CPU execution to Ring3 and runs user-space programs. User-space code cannot directly access hardware or unrestricted memory. To perform privileged operations such as writing to disk or accessing I/O, a process must request the kernel to perform the operation by invoking a system call. A system call is the interface for user processes to interact with hardware and kernel services; it is a controlled way to switch from user mode to kernel mode.

How are system calls implemented?

The classic implementation in Linux 0.12 uses the int 0x80 software interrupt. The following steps summarize the flow:

1. User code invokes a library wrapper (for example, write).
2. The wrapper executes int 0x80 with the syscall number in EAX and arguments in EBX, ECX, EDX, etc.
3. The CPU transfers control to the kernel interrupt handler for vector 0x80, performs a stack and privilege switch, and the kernel dispatches to the corresponding syscall implementation.
4. On completion the kernel returns to user mode and the wrapper returns the syscall result to the caller.
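The round trip described in these steps can be modelled entirely in user space, which makes the register convention easier to see. This is a toy sketch, not kernel code: struct regs, model_int80, model_write, and model_sys_write are hypothetical names, the "trap" is an ordinary function call, and no privilege or stack switch takes place.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file: only the registers the convention uses. */
struct regs {
    long eax; /* in: syscall number, out: result */
    long ebx; /* argument 1 */
    long ecx; /* argument 2 */
    long edx; /* argument 3 */
};

typedef long (*handler_fn)(long, long, long);

/* Toy handler standing in for a kernel routine: pretends every byte was written. */
long model_sys_write(long fd, long buf, long count)
{
    (void)buf;
    if (fd < 0 || count < 0)
        return -EINVAL;
    return count;
}

#define MODEL_NR_WRITE 4

/* Sparse toy table: slot 4 is the only populated entry. */
static const handler_fn model_table[] = {
    NULL, NULL, NULL, NULL, model_sys_write,
};

/* Last two steps: validate the number, dispatch, leave the result in eax. */
void model_int80(struct regs *r)
{
    assert(r != NULL);
    size_t n = sizeof(model_table) / sizeof(model_table[0]);
    if (r->eax < 0 || (size_t)r->eax >= n || model_table[r->eax] == NULL) {
        r->eax = -ENOSYS;
        return;
    }
    r->eax = model_table[r->eax](r->ebx, r->ecx, r->edx);
}

/* First two steps: the wrapper loads the registers and "traps". */
long model_write(int fd, const void *buf, long count)
{
    struct regs r = { MODEL_NR_WRITE, fd, (long)(intptr_t)buf, count };
    model_int80(&r);
    return r.eax;
}
```

Calling model_write(1, "hi", 2) returns 2, and an out-of-range number passed to model_int80 comes back as -ENOSYS in eax.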

Library wrapper: write

    /* lib/write.c */
    #define __LIBRARY__
    #include <unistd.h>

    /* Define the write wrapper using the syscall macro */
    _syscall3(int, write, int, fd, const char *, buf, off_t, count)

Defining __LIBRARY__ before including <unistd.h> exposes the syscall numbers and the inline-assembly syscall macros. The _syscall3 line then expands into a complete definition of write() whose body places the syscall number and arguments in registers and executes int 0x80.

Syscall inline-assembly macro (excerpt)

    /* <unistd.h> (excerpt) */
    #ifdef __LIBRARY__

    #define __NR_write 4    /* syscall number for write */

    /* Macro to define a syscall wrapper with 3 arguments */
    #define _syscall3(type,name,atype,a,btype,b,ctype,c) \
    type name(atype a,btype b,ctype c) \
    { \
        long __res; \
        __asm__ volatile ("int $0x80" \
            : "=a" (__res) \
            : "0" (__NR_##name), "b" ((long)(a)), "c" ((long)(b)), "d" ((long)(c))); \
        if (__res >= 0) \
            return (type) __res; \
        errno = -__res; \
        return -1; \
    }

    #endif /* __LIBRARY__ */

    int write(int fildes, const char *buf, off_t count);

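The inline assembly executes int $0x80. The token-pasting expression __NR_##name expands to __NR_write, and the "0" constraint places it in the same register as output operand 0, that is EAX; the three arguments go in EBX, ECX, and EDX. On return, the result is read back from EAX: a non-negative value is returned as is, while a negative value is negated and stored in errno, and the wrapper returns -1.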

int 0x80 interrupt handler and dispatch

During early initialization the kernel installs the system call interrupt gate:

    // kernel/sched.c (excerpt)
    void sched_init(void)
    {
        ...
        set_system_gate(0x80, &system_call); /* bind int 0x80 to system_call handler */
    }

The interrupt handler for int 0x80 is implemented in assembly. It saves segment registers and general registers, sets DS/ES to kernel data segments, sets FS to the task's data segment (so FS can reference user-space), checks that EAX contains a valid syscall number, and then performs an indirect call through the syscall table:

    // kernel/sys_call.s (excerpt)
    _system_call:
        push %ds
        push %es
        push %fs
        pushl %eax
        pushl %edx
        pushl %ecx
        pushl %ebx                      /* parameters EBX, ECX, EDX pushed for the C call */
        movl $0x10, %edx                /* set DS, ES to kernel data segment */
        mov %dx, %ds
        mov %dx, %es
        movl $0x17, %edx                /* FS to task local data segment (LDT) */
        mov %dx, %fs
        cmpl _NR_syscalls, %eax
        jae bad_sys_call
        call _sys_call_table(,%eax,4)   /* indirect call to syscall handler */
        pushl %eax                      /* push syscall return value */
        ...
    ret_from_sys_call:
        movl _current, %eax
        cmpl _task, %eax
        ...

_sys_call_table is an array of function pointers; the syscall number in EAX indexes this table. For example, __NR_write indexes the sys_write entry.

System call table (excerpt)

    // include/linux/sys.h (excerpt)
    extern int sys_write();

    fn_ptr sys_call_table[] = {
        sys_setup, sys_exit, sys_fork, sys_read, sys_write, sys_open,
        sys_close, sys_waitpid, sys_creat, sys_link, sys_unlink, sys_execve,
        sys_chdir, sys_time, sys_mknod, sys_chmod, sys_chown, sys_break,
        sys_stat, sys_lseek, sys_getpid, sys_mount,
        /* ... more entries ... */
    };

    int NR_syscalls = sizeof(sys_call_table) / sizeof(fn_ptr);
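Because the table is indexed by position, the order of its entries is what defines the syscall numbers. The following sketch replays the entries listed in the excerpt as strings so the correspondence can be checked in user space; the array and both lookup helpers are hypothetical, and entries beyond those shown above are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names in the order shown in the excerpt; later entries are omitted. */
static const char *const excerpt_names[] = {
    "sys_setup", "sys_exit", "sys_fork", "sys_read", "sys_write", "sys_open",
    "sys_close", "sys_waitpid", "sys_creat", "sys_link", "sys_unlink",
    "sys_execve", "sys_chdir", "sys_time", "sys_mknod", "sys_chmod",
    "sys_chown", "sys_break", "sys_stat", "sys_lseek", "sys_getpid",
    "sys_mount",
};

/* Same sizeof idiom the kernel uses to compute NR_syscalls. */
#define EXCERPT_COUNT (sizeof(excerpt_names) / sizeof(excerpt_names[0]))

/* Number -> name; returns NULL for numbers outside the excerpt. */
const char *excerpt_syscall_name(unsigned long nr)
{
    if (nr >= EXCERPT_COUNT)
        return NULL;
    assert(excerpt_names[nr] != NULL); /* every listed slot is populated */
    return excerpt_names[nr];
}

/* Name -> number; returns -1 if the name is not in the excerpt. */
long excerpt_syscall_nr(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < EXCERPT_COUNT; i++)
        if (strcmp(excerpt_names[i], name) == 0)
            return (long)i;
    return -1;
}
```

Here excerpt_syscall_nr("sys_write") returns 4, which matches the __NR_write value defined in <unistd.h>.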

Execution of sys_write

The syscall dispatch invokes the kernel implementation of write, sys_write in fs/read_write.c:

    // fs/read_write.c (excerpt)
    /* write syscall: write `count` bytes from `buf` to file descriptor `fd` */
    int sys_write(unsigned int fd, char *buf, int count)
    {
        struct file *file;
        struct m_inode *inode;

        if (fd >= NR_OPEN || count < 0 || !(file = current->filp[fd]))
            return -EINVAL;
        if (!count)
            return 0;
        inode = file->f_inode;
        if (inode->i_pipe)
            return (file->f_mode & 2) ? write_pipe(inode, buf, count) : -EIO;
        if (S_ISCHR(inode->i_mode))
            return rw_char(WRITE, inode->i_zone[0], buf, count, &file->f_pos);
        if (S_ISBLK(inode->i_mode))
            return block_write(inode->i_zone[0], &file->f_pos, buf, count);
        if (S_ISREG(inode->i_mode))
            return file_write(inode, file, buf, count);
        printk("(Write)inode->i_mode=%06o\n", inode->i_mode);
        return -EINVAL;
    }

For regular files sys_write calls file_write. That function performs block-level writes and copies data from user space into kernel buffers.
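The order of the type checks in sys_write can be sketched as a small classifier. This is an illustration under stated assumptions: the enum, the MODEL_ constants, and classify_write are hypothetical names, and the constants reproduce the conventional Unix file-type bit values rather than being taken from a header.

```c
#include <assert.h>

/* Conventional Unix file-type bits (octal), redefined locally for the sketch. */
#define MODEL_S_IFMT  0170000u
#define MODEL_S_IFREG 0100000u
#define MODEL_S_IFBLK 0060000u
#define MODEL_S_IFDIR 0040000u
#define MODEL_S_IFCHR 0020000u

enum write_path { PATH_PIPE, PATH_CHAR, PATH_BLOCK, PATH_FILE, PATH_INVALID };

/*
 * Mirror the order of the checks in sys_write: the pipe flag first, then
 * character device, block device, and regular file. Anything else
 * (a directory, for example) falls through to the error case.
 */
enum write_path classify_write(unsigned mode, int is_pipe)
{
    assert(is_pipe == 0 || is_pipe == 1);
    if (is_pipe)
        return PATH_PIPE;
    switch (mode & MODEL_S_IFMT) {
    case MODEL_S_IFCHR: return PATH_CHAR;   /* rw_char */
    case MODEL_S_IFBLK: return PATH_BLOCK;  /* block_write */
    case MODEL_S_IFREG: return PATH_FILE;   /* file_write */
    default:            return PATH_INVALID; /* -EINVAL */
    }
}
```

A mode of MODEL_S_IFREG | 0644 selects PATH_FILE, the branch that leads to file_write.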

Kernel-user data exchange

Kernel and user code run on different stacks and through different segments, so the kernel cannot simply dereference a pointer it receives from a process: in Linux 0.12 such a pointer is an offset within the task's own data segment, not within the kernel data segment that DS and ES select during a syscall. The syscall handler therefore sets DS/ES to the kernel data segment and FS to the task's data segment descriptor, and the kernel uses FS-based addressing whenever it needs to reach user-space buffers.

file_write and get_fs_byte

    // fs/file_dev.c (excerpt)
    int file_write(struct m_inode *inode, struct file *filp, char *buf, int count)
    {
        off_t pos;
        int block, c;
        struct buffer_head *bh;
        char *p;
        int i = 0;

        if (filp->f_flags & O_APPEND)
            pos = inode->i_size;
        else
            pos = filp->f_pos;
        while (i < count) {
            /* obtain block, read block into bh */
            c = pos % BLOCK_SIZE;
            p = c + bh->b_data;         /* write start position */
            bh->b_dirt = 1;
            c = BLOCK_SIZE - c;
            if (c > count - i)
                c = count - i;
            pos += c;
            if (pos > inode->i_size) {
                inode->i_size = pos;
                inode->i_dirt = 1;
            }
            i += c;
            while (c-- > 0)
                *(p++) = get_fs_byte(buf++);    /* copy a byte from user space */
            brelse(bh);
        }
        inode->i_mtime = CURRENT_TIME;
        if (!(filp->f_flags & O_APPEND)) {
            filp->f_pos = pos;
            inode->i_ctime = CURRENT_TIME;
        }
        return (i ? i : -1);
    }
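The arithmetic inside the loop decides how many bytes land in each block. The sketch below reproduces only that calculation so it can be tried with concrete numbers; split_into_chunks is a hypothetical helper, and MODEL_BLOCK_SIZE assumes the 1024-byte BLOCK_SIZE used by Linux 0.12.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BLOCK_SIZE 1024 /* assumed: BLOCK_SIZE in Linux 0.12 */

/*
 * Fill `chunks` with the number of bytes the file_write loop would copy in
 * each iteration for a write of `count` bytes starting at offset `pos`.
 * Returns the number of iterations, or -1 if `chunks` is too small.
 */
int split_into_chunks(long pos, long count, long *chunks, size_t max_chunks)
{
    assert(pos >= 0 && count >= 0 && chunks != NULL);
    size_t n = 0;
    long done = 0;
    while (done < count) {
        /* room left between pos and the end of its block */
        long c = MODEL_BLOCK_SIZE - (pos % MODEL_BLOCK_SIZE);
        if (c > count - done)
            c = count - done; /* last chunk: only what remains */
        if (n == max_chunks)
            return -1;
        chunks[n++] = c;
        pos += c;
        done += c;
    }
    return (int)n;
}
```

For a 2000-byte write starting at offset 1000, the loop runs three times and copies 24, 1024, and 952 bytes: the first chunk fills the rest of block 0, the second covers all of block 1, and the third is the remainder.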

get_fs_byte reads a byte from the FS-segmented user address; put_fs_byte writes a byte to a user address. Their implementations use inline assembly referencing the FS segment:

    // include/asm/segment.h (excerpt)
    /* Read a byte from FS:address */
    extern inline unsigned char get_fs_byte(const char *addr)
    {
        unsigned register char _v;

        __asm__("movb %%fs:%1, %0" : "=r" (_v) : "m" (*addr));
        return _v;
    }

    /* Write a byte to FS:address */
    extern inline void put_fs_byte(char val, char *addr)
    {
        __asm__("movb %0, %%fs:%1" :: "r" (val), "m" (*addr));
    }

By setting FS to the user data segment inside the syscall entry and restoring registers on return, the kernel can efficiently copy data between user and kernel address spaces during the syscall without complex address arithmetic.
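The idea of segment-relative access can be imitated in ordinary C by treating a segment as a base pointer plus a limit. This is only a model: struct model_segment, model_get_byte, and model_copy_from_segment are hypothetical, and the explicit error return is an assumption of the sketch, since the real get_fs_byte has no return code and relies on the CPU to enforce the segment limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a segment: a base address plus a byte limit. */
struct model_segment {
    const uint8_t *base;
    size_t limit; /* number of valid bytes starting at base */
};

/*
 * Model of get_fs_byte: read the byte at `offset` within the segment.
 * Returns 0 on success, -1 if the offset lies outside the segment.
 */
int model_get_byte(const struct model_segment *seg, size_t offset, uint8_t *out)
{
    assert(seg != NULL && out != NULL);
    if (offset >= seg->limit)
        return -1;
    *out = seg->base[offset];
    return 0;
}

/*
 * Byte-by-byte copy in the style of the inner loop of file_write.
 * Returns the number of bytes actually copied.
 */
size_t model_copy_from_segment(const struct model_segment *seg, size_t offset,
                               uint8_t *dst, size_t count)
{
    assert(dst != NULL);
    size_t i;
    for (i = 0; i < count; i++)
        if (model_get_byte(seg, offset + i, &dst[i]) != 0)
            break; /* stop at the segment limit */
    return i;
}
```

The caller passes an offset, never an absolute address, which is the same contract the kernel has with a user pointer: it is meaningful only relative to the segment selected by FS.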