Sunday 10 April 2016

Understanding Of System Call Mechanism Using INT80

INTRODUCTION :
Operating systems offer processes running in User Mode a set of interfaces to interact with hardware devices such as the CPU, disks, and printers.
Putting an extra layer between the application and the hardware has several advantages.

Advantages :
1 > Programmers do not need to study the low-level programming characteristics of hardware devices
2 > System security increases greatly, because the kernel can check the validity of each request at the interface level before attempting to satisfy it

Unix systems implement most interfaces between User Mode processes and hardware devices by means of system calls issued to the kernel. We will examine in detail how Linux implements system calls that User Mode programs issue to the kernel.

1. What are system calls?
----> System calls provide user-space processes a way to request services from the kernel.

Now, what kind of services?
----> Services that are managed by the operating system, such as storage, memory, network, and process management.
For example, if a user process wants to read a file, it has to make the 'open' and 'read' system calls. Processes generally do not invoke system calls directly; the C library provides a wrapper interface to all of them.

2. What happens in a system call?
-----> A piece of kernel code runs at the request of a user process. This code runs in ring 0 (current privilege level, CPL, 0), which is the highest level of privilege in the x86 architecture. All user processes run in ring 3 (CPL 3). So, to implement the system call mechanism, what we need is
1) A way to call ring 0 code from ring 3 and
2) Some kernel code to service the request.
(NOTE : all kernel code runs with privilege level 0 and all user-space code runs with privilege level 3)



To get more of it go through following url



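3. So, can I think of invoking a system call as interrupting the kernel?
----> Yes. With int 0x80, invoking a system call raises a software interrupt (a programmed exception) that transfers control to the kernel.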

4. How are system calls identified in the kernel?
-----> System calls are identified by their numbers. The number of the call fun() is __NR_fun. For example, the number of getpid (used in the sample program later in this post) is __NR_getpid, defined as 20 in /usr/src/linux-3.2.29/arch/x86/include/asm/unistd_32.h. Different architectures have different numbers.

Often, the kernel routine that handles the call getpid is called sys_getpid. One finds the association between numbers and names in the sys_call_table, for example in
/usr/src/linux-3.2.29/arch/x86/ia32/ia32entry.S

5. How exactly does this mechanism work?
---> On 32-bit x86, Linux traditionally implements system calls with a software interrupt, int 0x80, which transfers control from user space to kernel space.

Fig : Invoking a system call
NOTE : Here SYSCALL is invoked by int 0x80.

Before going into the interrupt mechanism, we should know the registers the processor uses.

On a 32-bit x86 architecture we have the following general purpose registers :
EAX (%eax) ---> Accumulator
ECX (%ecx) ---> Counter
EDX (%edx) ---> Data
EBX (%ebx) ---> Base
ESP (%esp) ---> Stack Pointer
EBP (%ebp) ---> Stack base pointer (Frame Pointer)
ESI (%esi) ---> Source index
EDI (%edi) ---> Destination index

(NOTE : Different architectures may have different registers.)

On i386, the parameters of a system call are transported via registers. The system call number goes into %eax, the first parameter in %ebx, the second in %ecx, the third in %edx, the fourth in %esi, the fifth in %edi, the sixth in %ebp.

To execute a system call, the user process copies the desired system call number into %eax and executes 'int 0x80'. This generates interrupt 0x80, and the corresponding interrupt service routine is called. For interrupt 0x80, this routine handles all system calls, and it executes in ring 0. As defined in the file /usr/src/linux/arch/i386/kernel/entry.S (arch/x86/kernel/entry_32.S in the 3.2 tree), it saves the current state and calls the appropriate system call handler based on the value in %eax.

Now, to see exactly how this happens, we will write a program that prints the PID of a process and
use gdb to find out how the system call is implemented.

Sample Program :

#include <stdio.h>
#include <unistd.h>
int main(void)
{
    printf("PID : %d\n", getpid());
    return 0;
}

Compile as follows :
root@ges:/home# gcc -g getpid.c
Output :
root@ges:/home# ./a.out
PID : 25906

Use GDB to debug :

root@ges:/home# gdb -q a.out
Reading symbols from /home/a.out...done.
(gdb) b main
Breakpoint 1 at 0x8048455: file getpid.c, line 5.
(gdb) r
Starting program: /home/a.out
warning: Could not load shared library symbols for linux-gate.so.1.
Do you need "set solib-search-path" or "set sysroot"?

Breakpoint 1, main () at getpid.c:5
5 printf("PID : %d\n",getpid());
(gdb) b getpid
Breakpoint 2 at 0xb7ed3a00
(gdb) disassemble
Dump of assembler code for function main:
0x0804844c <+0>: push %ebp
0x0804844d <+1>: mov %esp,%ebp
0x0804844f <+3>: and $0xfffffff0,%esp
0x08048452 <+6>: sub $0x10,%esp
=> 0x08048455 <+9>: call 0x80482f0 <getpid@plt>
0x0804845a <+14>: mov %eax,0x4(%esp)
0x0804845e <+18>: movl $0x8048550,(%esp)
0x08048465 <+25>: call 0x80482e0 <printf@plt>
0x0804846a <+30>: mov $0x0,%eax
0x0804846f <+35>: leave
0x08048470 <+36>: ret
End of assembler dump.
(gdb) s
Breakpoint 2, 0xb7ed3a00 in getpid () from /lib/libc.so.6
(gdb) disassemble
Dump of assembler code for function getpid:
=> 0xb7ed3a00 <+0>: mov %gs:0x6c,%edx
0xb7ed3a07 <+7>: cmp $0x0,%edx
                     .
                     .
                     .
0xb7ed3a1a <+26>: lea 0x0(%esi),%esi
0xb7ed3a20 <+32>: jne 0xb7ed3a0e <getpid+14>
0xb7ed3a22 <+34>: mov $0x14,%eax # move system call number (20) to %eax
0xb7ed3a27 <+39>: int $0x80
0xb7ed3a29 <+41>: test %edx,%edx
0xb7ed3a2b <+43>: mov %eax,%ecx
0xb7ed3a2d <+45>: lea 0x0(%esi),%esi
0xb7ed3a30 <+48>: jne 0xb7ed3a0e <getpid+14>
0xb7ed3a32 <+50>: mov %ecx,%gs:0x68
0xb7ed3a39 <+57>: ret
End of assembler dump.
(gdb)

After setting the breakpoint at getpid(), the disassembly shows that the system call number 0x14 is moved into the EAX register (%eax) and that the interrupt is generated with the instruction int $0x80. This confirms that the system call is implemented using the int 0x80 mechanism.

Thus, assembler instruction INT 0x80 causes a programmed exception and calls the kernel system_call routine.

6. Now, how exactly does int 0x80 take control to kernel space?

----> While booting, the kernel sets up an IDT (Interrupt Descriptor Table) with 256 entries, one for each vector 0-255. In 32-bit protected mode each entry is an 8-byte gate descriptor, so the table occupies 2048 bytes (4-byte entries and a 1024-byte total describe the real-mode interrupt vector table instead). It should be noted that the IDT contains vectors to both interrupt handlers and exception handlers.
When start_kernel (found in /usr/src/linux/init/main.c) is called, it invokes trap_init (found in /usr/src/linux-3.2.29/arch/x86/kernel/traps.c). trap_init initializes the interrupt descriptor table using macros such as set_intr_gate (defined in arch/x86/include/asm/desc.h in the 3.2 tree; very old kernels kept it in include/asm/system.h).

So, for int 0x80 :
=> set_system_trap_gate (SYSCALL_VECTOR, &system_call)

SYSCALL_VECTOR is defined in arch/x86/include/asm/irq_vectors.h

Older version of Linux (2.6 and earlier) :

ENTRY(system_call)
pushl %eax
SAVE_ALL
movl $0xffffe000, %ebx /* or 0xfffff000 for 4-KB stacks */
andl %esp, %ebx
cmpl $NR_syscalls, %eax
jb nobadsys
movl $(-ENOSYS), 24(%esp)
jmp resume_userspace


Newer version of Linux (after 2.6) :
/*
* syscall stub including irq exit should be protected against kprobes
*/
# system call handler stub
ENTRY(system_call)
RING0_INT_FRAME # can't unwind into user space anyway
pushl_cfi %eax # save orig_eax
SAVE_ALL # save all the user-space registers on the kernel stack
GET_THREAD_INFO(%ebp) # %ebp = address of the current thread_info
# system call tracing in operation / emulation
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
jnz syscall_trace_entry
cmpl $(nr_syscalls), %eax # is the system call number in %eax within range?
jae syscall_badsys # no: bad system call
syscall_call:
call *sys_call_table(,%eax,4) # call the service routine for this number
movl %eax,PT_EAX(%esp) # store the return value in the saved eax slot on the kernel stack

syscall_exit:
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_ANY) # make sure we don't miss an interrupt
TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
testl $_TIF_ALLWORK_MASK, %ecx # current->work
jne syscall_exit_work

syscall_trace_entry:
movl $-ENOSYS,PT_EAX(%esp)
movl %esp, %eax
call syscall_trace_enter
/* What it returned is what we'll actually use. */
cmpl $(NR_syscalls), %eax
jnae syscall_call
jmp syscall_exit
END(syscall_trace_entry)

syscall_badsys:
movl $-ENOSYS,PT_EAX(%esp) # return with error number
jmp resume_userspace # resume to the userspace

Explanation : 
int $0x80 transfers control to the kernel entry point _system_call. This entry point is the same for all system calls. It is responsible for saving all registers, checking that a valid system call was invoked, and then transferring control to the actual system call code via the offsets in the _sys_call_table. It is also responsible for calling _ret_from_sys_call when the system call has completed, before returning to user space.

The next instruction the CPU executes after the int $0x80 is the pushl %eax in entry.S:system_call. There, we first save all user-space registers. GET_THREAD_INFO() then loads the address of the current thread_info structure into the given register; its definition can be read in arch/x86/include/asm/thread_info.h. Finally we check %eax and call sys_call_table[%eax], which is the actual system call.

/* how to get the thread information struct from ASM */
#define GET_THREAD_INFO(reg) \
movl $-THREAD_SIZE, reg; \
andl %esp, reg

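That moves -THREAD_SIZE into 'reg' and then ANDs it with ESP. Because THREAD_SIZE is a power of two, the result is the stack pointer rounded down to a THREAD_SIZE boundary, which is where the thread_info structure sits at the bottom of the kernel stack.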

A validity check is then performed on the system call number passed by the User Mode process. If it is greater than or equal to the number of entries in the system call dispatch table, the system call handler terminates.

If the system call number is not valid, the handler stores the -ENOSYS value in the stack location where the eax register has been saved (offset 24 from the current stack top in the older code, PT_EAX(%esp) in the newer code). It then jumps to resume_userspace. In this way, when the process resumes its execution in User Mode, it will find a negative return code in eax.

Finally, the specific service routine associated with the system call number contained in eax is invoked:
call *sys_call_table(, %eax, 4)
Because each entry in the dispatch table is 4 bytes long, the kernel finds the address of the service routine to be invoked by multiplying the system call number by 4, adding the initial address of the sys_call_table dispatch table, and extracting a pointer to the service routine from that slot in the table.

Exiting from system call:

Older version of Linux :
ENTRY(ret_from_sys_call)
movl %eax, 24(%esp)
cli
movl 8(%ebp), %ecx
testw $0xffff, %cx
je restore_all

Newer version of Linux :
/* Now return from system call */
mov %esp, %edx /* load kernel esp */
mov PT_OLDESP(%esp), %eax /* load userspace esp */
mov %dx, %ax /* eax: new kernel esp */
sub %eax, %edx /* offset (low word is 0) */
.
.
.
ENDPROC(system_call)

Explanation :
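When returning from a system call, the kernel stores the return value in the saved eax slot on the kernel stack, switches from the kernel stack pointer back to the user-space stack pointer, and resumes execution in user space.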

Code for many of the system calls can be found in /usr/src/linux/kernel/sys.c. Code for the rest is distributed throughout the source files. Some system calls, like fork, have their own source file (e.g., kernel/fork.c).


Observations may vary with the Linux version.
If you find anything that needs correcting, please feel free to modify and circulate.


References :
Understanding the Linux Kernel