INTRODUCTION :
Operating systems offer processes
running in User Mode a set of interfaces to interact with hardware
devices such as the CPU, disks, and printers.
Putting an extra layer between the
application and the hardware has several advantages.
Advantages :
1 > Programmers do not need to study the low-level programming details of hardware devices.
2 > System security increases greatly, because the kernel can check the validity of a request at the interface level before attempting to satisfy it.
Unix systems implement most interfaces
between User Mode processes and hardware devices by means of system calls issued
to the kernel. We will examine in detail how Linux implements system
calls that User Mode programs issue to the kernel.
1. What are system calls?
----> System calls provide
user-space processes a way to request services from the kernel.
Now, what kind of services?
----> Services that are managed by the operating system, such as storage, memory, networking, and process management.
For example, if a user process wants to read a file, it has to make the 'open' and 'read' system calls. Processes generally do not invoke system calls directly; the C library provides a wrapper interface to the system calls.
2. What happens in a system call?
-----> A piece of kernel code is run on request of a user process. This code runs in ring 0 (current privilege level, CPL, 0), which is the highest level of privilege in the x86 architecture. All user processes run in ring 3 (CPL 3). So, to implement the system call mechanism, what we need is
1) A way to call ring 0 code from ring 3, and
2) Some kernel code to service the request.
(NOTE : all kernel code runs with privilege level 0 and all user-space code runs with privilege level 3.)
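3. So, can I consider invoking a system call to be nothing but interrupting the kernel?
----> Yes. With the mechanism described here, invoking a system call raises a software interrupt that is handled by the kernel.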
4. How are system calls identified in the kernel?
-----> System calls are identified by their numbers. The number of the call fun() is __NR_fun. For example, the number of getpid (used in the sample program further down) is __NR_getpid, defined as 20 in /usr/src/linux-3.2.29/arch/x86/include/asm/unistd_32.h. Different architectures have different numbers.
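Often, the kernel routine that handles the call getpid is called sys_getpid. One finds the association between numbers and names in the system call table: sys_call_table in /usr/src/linux-3.2.29/arch/x86/kernel/syscall_table_32.S for 32-bit kernels, or ia32_sys_call_table in /usr/src/linux-3.2.29/arch/x86/ia32/ia32entry.S for 32-bit emulation on 64-bit kernels.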
5. How exactly does this mechanism work?
---> On x86 platforms, Linux implements system calls with a software interrupt, int 0x80, which transfers control from user space to kernel space.
Fig : Invoking a system call
NOTE : Here SYSCALL is invoked by int 0x80.
Before going into the interrupt
mechanism we should have a proper knowledge of registers being used in a processor.
So in a 32-bit architecture we have the following general-purpose registers :
EAX (%eax) ---> Accumulator
ECX (%ecx) ---> Counter
EDX (%edx) ---> Data
EBX (%ebx) ---> Base
ESP (%esp) ---> Stack Pointer
EBP (%ebp) ---> Stack base pointer (Frame Pointer)
ESI (%esi) ---> Source
EDI (%edi) ---> Destination
(NOTE : Different architectures may possess different registers.)
On i386, the parameters of a system
call are transported via registers. The system call number goes into
%eax, the first parameter in %ebx, the second in %ecx,
the third in %edx, the fourth in %esi, the fifth in
%edi, the sixth in %ebp.
To execute a system call, the user process copies the desired system call number into %eax and executes 'int 0x80'. This generates interrupt 0x80, and an interrupt service routine is called. For interrupt 0x80, this routine is the common handler for all system calls, and it executes in ring 0. The routine, defined in the file /usr/src/linux/arch/i386/kernel/entry.S, saves the current state and calls the appropriate system call handler based on the value in %eax.
Use gdb to find out how the system call is implemented.
Sample Program :
#include <stdio.h>
#include <unistd.h>
int main(void)
{
    printf("PID : %d\n", getpid());
    return 0;
}
Compile as follows :
root@ges:/home# gcc -g getpid.c
Output :
root@ges:/home# ./a.out
PID : 25906
Use GDB to debug :
root@ges:/home# gdb -q a.out
Reading symbols from /home/a.out...done.
(gdb) b main
Breakpoint 1 at 0x8048455: file getpid.c, line 5.
(gdb) r
Starting program: /home/a.out
warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?
Breakpoint 1, main () at getpid.c:5
5 printf("PID : %d\n",getpid());
(gdb) b getpid
Breakpoint 2 at 0xb7ed3a00
(gdb) disassemble
Dump of assembler code for function main:
   0x0804844c <+0>:   push   %ebp
   0x0804844d <+1>:   mov    %esp,%ebp
   0x0804844f <+3>:   and    $0xfffffff0,%esp
   0x08048452 <+6>:   sub    $0x10,%esp
=> 0x08048455 <+9>:   call   0x80482f0 <getpid@plt>
   0x0804845a <+14>:  mov    %eax,0x4(%esp)
   0x0804845e <+18>:  movl   $0x8048550,(%esp)
   0x08048465 <+25>:  call   0x80482e0 <printf@plt>
   0x0804846a <+30>:  mov    $0x0,%eax
   0x0804846f <+35>:  leave
   0x08048470 <+36>:  ret
End of assembler dump.
(gdb) s
Breakpoint 2, 0xb7ed3a00 in getpid () from /lib/libc.so.6
(gdb) disassemble
Dump of assembler code for function getpid:
=> 0xb7ed3a00 <+0>:   mov    %gs:0x6c,%edx
   0xb7ed3a07 <+7>:   cmp    $0x0,%edx
   .
   .
   .
   0xb7ed3a1a <+26>:  lea    0x0(%esi),%esi
   0xb7ed3a20 <+32>:  jne    0xb7ed3a0e <getpid+14>
   0xb7ed3a22 <+34>:  mov    $0x14,%eax        # move system call number (20) to %eax
   0xb7ed3a27 <+39>:  int    $0x80
   0xb7ed3a29 <+41>:  test   %edx,%edx
   0xb7ed3a2b <+43>:  mov    %eax,%ecx
   0xb7ed3a2d <+45>:  lea    0x0(%esi),%esi
   0xb7ed3a30 <+48>:  jne    0xb7ed3a0e <getpid+14>
   0xb7ed3a32 <+50>:  mov    %ecx,%gs:0x68
   0xb7ed3a39 <+57>:  ret
End of assembler dump.
(gdb)
After setting a breakpoint at getpid(), the disassembly shows that the system call number 0x14 is moved into the EAX register (%eax) and that an interrupt is then generated with the instruction int $0x80. This confirms that the system call is implemented using the int 0x80 mechanism.
Thus, the assembler instruction int 0x80 causes a programmed exception and calls the kernel system_call routine.
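6. Now, how exactly does int 0x80 take control to kernel space?
----> While booting, the kernel sets up an IDT (Interrupt Descriptor Table) with 256 entries, one for each vector 0-255. In 32-bit protected mode each entry is 8 bytes long, 2048 bytes in total (the figure of 4 bytes per entry, 1024 bytes in total, applies to the real-mode interrupt vector table). It should be noted that the IDT contains vectors to both interrupt handlers and exception handlers.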
When start_kernel (found in /usr/src/linux/init/main.c) is called, it invokes trap_init (found in /usr/src/linux-3.2.29/arch/x86/kernel/traps.c). trap_init initializes the interrupt descriptor table using helpers such as set_intr_gate (in the 3.2 tree these are defined in arch/x86/include/asm/desc.h).
So for int 0x80
=> set_system_trap_gate(SYSCALL_VECTOR, &system_call)
SYSCALL_VECTOR is defined in arch/x86/include/asm/irq_vectors.h
Older version of Linux (2.6 era and earlier) :
ENTRY(system_call)
    pushl %eax
    SAVE_ALL
    movl $0xffffe000, %ebx      /* or 0xfffff000 for 4-KB stacks */
    andl %esp, %ebx
    cmpl $NR_syscalls, %eax
    jb nobadsys
    movl $(-ENOSYS), 24(%esp)
    jmp resume_userspace
Newer version of Linux (later than 2.6) :
/*
 * syscall stub including irq exit should be protected against kprobes
 */
    # system call handler stub
ENTRY(system_call)
    RING0_INT_FRAME                 # can't unwind into user space anyway
    pushl_cfi %eax                  # save orig_eax
    SAVE_ALL                        # save all the user-space registers on the kernel stack
    GET_THREAD_INFO(%ebp)
                                    # system call tracing in operation / emulation
    testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
    jnz syscall_trace_entry
    cmpl $(nr_syscalls), %eax       # compare the system call number present in %eax
    jae syscall_badsys              # if not found, bad system call
syscall_call:
    call *sys_call_table(,%eax,4)   # call the particular service routine
    movl %eax,PT_EAX(%esp)          # store the return value on the kernel stack
syscall_exit:
    LOCKDEP_SYS_EXIT
    DISABLE_INTERRUPTS(CLBR_ANY)    # make sure we don't miss an interrupt
    TRACE_IRQS_OFF
    movl TI_flags(%ebp), %ecx
    testl $_TIF_ALLWORK_MASK, %ecx  # current->work
    jne syscall_exit_work

syscall_trace_entry:
    movl $-ENOSYS,PT_EAX(%esp)
    movl %esp, %eax
    call syscall_trace_enter
    /* What it returned is what we'll actually use. */
    cmpl $(NR_syscalls), %eax
    jnae syscall_call
    jmp syscall_exit
END(syscall_trace_entry)

syscall_badsys:
    movl $-ENOSYS,PT_EAX(%esp)      # return with error number
    jmp resume_userspace            # resume to the userspace
Explanation :
int $0x80 transfers control to the kernel entry point system_call. This entry point is the same for all system calls. It is responsible for saving all registers, checking that a valid system call was invoked, and then transferring control to the actual system call code through the entries of sys_call_table. It is also responsible for running the return path (ret_from_sys_call in older kernels) when the system call has completed, but before returning to user space.
The next instruction the CPU executes after the int $0x80 is the pushl %eax in entry.S:system_call. There, we first save all user-space registers, then we check %eax and call sys_call_table[%eax], which is the actual system call. In between, GET_THREAD_INFO(%ebp) loads the address of the current thread_info structure into %ebp, as we can read in arch/x86/include/asm/thread_info.h.
/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg) \
    movl $-THREAD_SIZE, reg; \
    andl %esp, reg
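That moves -THREAD_SIZE into 'reg' and then ANDs it with ESP. Because the kernel stack is aligned to THREAD_SIZE, the result is the base address of the stack, which is where the thread_info structure is stored.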
A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to the number of entries in the system call dispatch table, the system call handler terminates.
If the system call number is not valid, the function stores the -ENOSYS value in the stack location where the eax register has been saved, that is, at offset 24 from the current stack top. It then jumps to resume_userspace. In this way, when the process resumes its execution in User Mode, it will find a negative return code in eax.
Finally, the specific service routine associated with the system call number contained in eax is invoked:
call *sys_call_table(,%eax,4)
Because each entry in the dispatch
table is 4 bytes long, the kernel finds the address of the service routine to be invoked by
multiplying the system call number by 4, adding the initial address of the
sys_call_table dispatch table, and extracting a pointer to the service routine from that slot
in the table.
Exiting from system call:
Older version of Linux :
ENTRY(ret_from_sys_call)
    movl %eax, 24(%esp)
    cli
    movl 8(%ebp), %ecx
    testw $0xffff, %cx
    je restore_all
Newer version of Linux :
/* Now return from system call */
    mov %esp, %edx              /* load kernel esp */
    mov PT_OLDESP(%esp), %eax   /* load userspace esp */
    mov %dx, %ax                /* eax: new kernel esp */
    sub %eax, %edx              /* offset (low word is 0) */
    .
    .
    .
ENDPROC(system_call)
Explanation :
On return from the system call, the code saves the kernel stack pointer, loads the user-space stack pointer, and resumes execution in user space.
Code for many of the system calls can
be found in /usr/src/linux/kernel/sys.c. Code for the rest is
distributed throughout the source files. Some system calls, like
fork, have their own source file (e.g., kernel/fork.c).
Observations may vary with the Linux version.
If you find anything that needs modification, please feel free to modify and circulate.
References :
Understanding the Linux Kernel