How is the system call in Linux implemented?

Linux, Operating System

Linux Problem Overview


When I invoke a system call in user mode, how does the call get processed in the OS?

Does it invoke some executable binary or some standard library?

If yes, what does it need to complete the call?

Linux Solutions


Solution 1 - Linux

Have a look at this.

> Starting with version 2.5, linux kernel introduced a new system call entry mechanism on Pentium II+ processors. Due to performance issues on Pentium IV processors with existing software interrupt method, an alternative system call entry mechanism was implemented using SYSENTER/SYSEXIT instructions available on Pentium II+ processors. This article explores this new mechanism. Discussion is limited to x86 architecture and all source code listings are based on linux kernel 2.6.15.6.

  1. What are system calls?

     > System calls provide userland processes a way to request services from the kernel. What kind of services? Services which are managed by operating system like storage, memory, network, process management etc. For example if a user process wants to read a file, it will have to make 'open' and 'read' system calls. Generally system calls are not called by processes directly. C library provides an interface to all system calls.

  2. What happens in a system call?

     > A kernel code snippet is run on request of a user process. This code runs in ring 0 (with current privilege level -CPL- 0), which is the highest level of privilege in x86 architecture. All user processes run in ring 3 (CPL 3).
     >
     > So, to implement system call mechanism, what we need is
     >
     > 1) a way to call ring 0 code from ring 3.
     >
     > 2) some kernel code to service the request.

  3. Good old way of doing it

     > Until some time back, linux used to implement system calls on all x86 platforms using software interrupts. To execute a system call, user process will copy desired system call number to %eax and will execute 'int 0x80'. This will generate interrupt 0x80 and an interrupt service routine will be called. For interrupt 0x80, this routine is an "all system calls handling" routine. This routine will execute in ring 0. This routine, as defined in the file /usr/src/linux/arch/i386/kernel/entry.S, will save the current state and call appropriate system call handler based on the value in %eax.

  4. New shiny way of doing it

     > It was found out that this software interrupt method was much slower on Pentium IV processors. To solve this issue, Linus implemented an alternative system call mechanism to take advantage of SYSENTER/SYSEXIT instructions provided by all Pentium II+ processors. Before going further with this new way of doing it, let's make ourselves more familiar with these instructions.

Solution 2 - Linux

It depends on what you mean by system call. Do you mean a C library call (through glibc) or an actual system call? C library calls that need a service from the kernel end up making system calls.

The old way of doing system calls was through a software interrupt, i.e., the int instruction. Windows had int 0x2e while Linux had int 0x80. The OS sets up an interrupt handler for 0x2e or 0x80 in the Interrupt Descriptor Table (IDT). This handler then performs the system call. It copies the arguments from user-mode to kernel-mode (this is controlled by an OS-specific convention). On Linux, the system call number goes in eax and the arguments are passed using ebx, ecx, edx, esi, and edi. On Windows, the arguments are copied from the stack. The handler then performs some sort of lookup (to find the address of the function) and executes the system call. After the system call is completed, the iret instruction returns to user-mode.

The new way is sysenter and sysexit. Instead of going through the IDT, sysenter loads the kernel's code segment, instruction pointer, and stack pointer directly from Model Specific Registers (MSRs), which the OS sets up in advance. After that it's practically the same as using int.

Solution 3 - Linux

It goes through glibc, which issues a 0x80 interrupt after filling registers with parameters. The kernel's interrupt handler then looks up the syscall in the syscall table and invokes the relevant sys_*() function.

Solution 4 - Linux

Vastly simplified, but what happens is that the user code triggers a trap (on x86, a software interrupt or a dedicated instruction such as sysenter). The trap switches the context to kernel mode and executes the kernel code (the actual system call) on the user's behalf. Once the call is completed, control is returned to the user code.

Solution 5 - Linux

The instruction int n raises software interrupt n. On 32-bit x86 Linux the system call entry is int 0x80, and the system call number itself is passed in eax.
For example, read is system call number 3 on 32-bit x86.
At system startup, the OS builds a table called the interrupt descriptor table (IDT), which holds the addresses of the interrupt handlers (including the system call entry point) along with the privilege level needed to invoke them.
The Current Privilege Level (CPL) is stored in the low two bits of the CS register.
These are the steps followed by an int instruction:
• Fetch the n’th descriptor from the IDT, where n is the argument of int.
• Check that the CPL in %cs is <= DPL, where DPL is the privilege level in the descriptor.
• If not, the user code does not have enough privilege to use this gate, and the CPU raises a general protection fault (vector 13).
• If yes, the user code has enough privilege to make this system call, and the current execution context (registers, flags, and so on) is saved, because we now switch to kernel mode and want to continue execution from where we left off when the system call finishes.
• The parameters to the system call are saved on the kernel stack, because system calls are executed in kernel mode.

VSYSCALL (FAST SYSTEM CALL)
Every time the user executes a system call, the OS saves the current state of the machine (i.e. the registers, stack pointer, etc.) and switches to kernel mode for execution. For some system calls this full switch is unnecessary. For example, the gettimeofday system call just reads the current time and returns. So some system calls are implemented through what are called vsyscalls: when such a call is made, it is executed in user space itself, without ever switching to the kernel, which saves time.
See here for details on vsyscall http://www.trilithium.com/johan/2005/08/linux-gate/
and here https://stackoverflow.com/questions/7266813/anyone-can-understand-how-gettimeofday-works

Solution 6 - Linux

A syscall is made of a special trap instruction, a syscall number and arguments.

  1. The special trap instruction is used to switch from user mode to kernel mode which has unlimited privilege.
  2. The syscall number and arguments are passed by register.

Attributions

All content for this solution is sourced from the original question on Stackoverflow.

The content on this page is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

Content Type | Original Author | Original Content on Stackoverflow
--- | --- | ---
Question | MainID | View Question on Stackoverflow
Solution 1 - Linux | GregD | View Answer on Stackoverflow
Solution 2 - Linux | wj32 | View Answer on Stackoverflow
Solution 3 - Linux | Eduard - Gabriel Munteanu | View Answer on Stackoverflow
Solution 4 - Linux | tvanfosson | View Answer on Stackoverflow
Solution 5 - Linux | Deepthought | View Answer on Stackoverflow
Solution 6 - Linux | Chris Tsui | View Answer on Stackoverflow