
# Chapter 5 System Calls

The kernel provides a set of interfaces by which processes running in user-space can interact with the system.

## Communicating with the Kernel

In Linux, system calls are the only means user-space has of interfacing with the kernel; they are the only legal entry point into the kernel other than exceptions and traps.

## APIs, POSIX, and the C Library

![](https://i.imgur.com/hjPAdyS.png)

## Syscalls

When a system call returns an error, the C library writes a special error code into the global errno variable. This variable can be translated into a human-readable message via library functions such as perror().

For example, getpid() is implemented as:

```
SYSCALL_DEFINE0(getpid)
{
        return task_tgid_vnr(current); // returns current->tgid
}
```

SYSCALL_DEFINE0 is simply a macro that defines a system call with no parameters.

```
#define SYSCALL_DEFINE0(sname)                  \
        SYSCALL_METADATA(_##sname, 0);          \
        asmlinkage long sys_##sname(void)

/* SYSCALL_DEFINE0(getpid) therefore declares: */
asmlinkage long sys_getpid(void)
```

The asmlinkage modifier:

* A directive that tells the compiler to look only on the stack for this function’s arguments.
* A required modifier for all system calls.

The function returns a long:

* For compatibility between 32- and 64-bit systems, system calls defined to return an int in user-space return a long in the kernel.

The getpid() system call is defined as sys_getpid() in the kernel:

* This is the naming convention used for all system calls in Linux.
* System call bar() is implemented in the kernel as function sys_bar().

### System Call Numbers

Each system call is assigned a unique syscall number, which is used to reference that specific system call. Once a syscall number is assigned:

* It cannot change, or compiled applications will break.
* If a system call is removed, its number cannot be recycled, or previously compiled code would aim to invoke one system call but would in fact invoke another.

The kernel keeps a list of all registered system calls in the system call table, stored in sys_call_table.
This table is architecture-dependent; on x86-64 it is defined in arch/x86/kernel/syscall_64.c.

### System Call Performance

System calls in Linux are faster than in many other operating systems. This is partly because of Linux’s fast context switch times; entering and exiting the kernel is a streamlined and simple affair.

## System Call Handler

1. User-space applications signal the kernel with a software interrupt. The defined software interrupt on x86 is interrupt number 128, which is incurred via the int $0x80 instruction. It triggers a switch to kernel mode and the execution of exception vector 128, which is the system call handler.
2. Later x86 processors added a feature known as sysenter. This feature provides a faster, more specialized way of trapping into the kernel to execute a system call than using the int instruction.

Regardless of how the system call handler is invoked, the important notion is that user-space causes an exception or trap to enter the kernel.

### Denoting the Correct System Call

Simply entering kernel-space alone is not sufficient because multiple system calls exist, all of which enter the kernel in the same manner. Thus, the system call number must be passed into the kernel.

1. On x86, the syscall number is passed to the kernel via the **eax** register.
2. Before causing the trap into the kernel, user-space places in **eax** the number corresponding to the desired system call.
3. The system call handler then reads the value from **eax**.

Other architectures do something similar.

![](https://i.imgur.com/ddQgr4s.png)

### Parameter Passing

The easiest way is to store the **parameters** in registers. On x86-32, the registers **ebx, ecx, edx, esi, and edi** contain, in order, the first five arguments. In the unlikely case of six or more arguments, a single register is used to hold a pointer to user-space where all the parameters are stored.

The **return value** is also sent to user-space via a register.
On x86, it is written into the **eax** register.

## System Call Implementation

### Implementing System Calls

**Design:**

* Design the system call to be as general as possible. Do not assume its use today will be the same as its use tomorrow. The purpose of the system call will remain constant but its uses may change.
* Is the system call portable? Do not make assumptions about an architecture’s word size or endianness.

The Unix motto: “**Provide mechanism, not policy.**”

### Verifying the Parameters

The kernel provides two methods for performing the requisite checks and the desired copy to and from user-space. Note that kernel code must never blindly follow a pointer into user-space! One of these two methods must always be used.

* For writing into user-space: copy_to_user()
* For reading from user-space: copy_from_user()

Both copy_to_user() and copy_from_user() may block.

A final possible check is for valid permission:

* In older versions of Linux: suser()
* The newer capabilities system enables specific access checks on specific resources: capable()

For example,

* capable(CAP_SYS_NICE) checks whether the caller has the ability to modify nice values of other processes.
* capable(CAP_SYS_BOOT) checks whether the caller has the ability to reboot the system.

## System Call Context

The kernel is in **process context** during the execution of a **system call**. **In process context**, the kernel is **capable of sleeping** (for example, if the system call blocks on a call or explicitly calls schedule()) and is **fully preemptible**.

* Capable of sleeping:
    * The capability to sleep greatly simplifies kernel programming.
    * Interrupt handlers cannot sleep and thus are much more limited in what they can do than system calls.
* Fully preemptible:
    * As in user-space, the current task may be preempted by another task.
    * Because the new task may execute the same system call, system calls must be reentrant.

### Final Steps in Binding a System Call

After the system call is written, it is trivial to register it as an official system call:

1. Add an entry to the end of the system call table. For most architectures, the table is located in entry.S.
2. For each supported architecture, define the syscall number in <asm/unistd.h>.

```
/* arch/arm64/include/asm/unistd32.h */
#define __NR_open 5
__SYSCALL(__NR_open, compat_sys_open)
#define __NR_close 6
__SYSCALL(__NR_close, sys_close)
```

3. Compile the syscall into the kernel image (as opposed to compiling it as a module). This can be as simple as putting the system call in a relevant file, such as kernel/sys.c, which is home to miscellaneous system calls.

### Accessing the System Call from User-Space

Linux provides a set of macros for wrapping access to system calls. These macros are named **_syscalln()**, where **n** is between 0 and 6. The number corresponds to the number of parameters passed into the syscall because the macro needs to know how many parameters to push into registers. Each macro:

* sets up the register contents
* issues the trap instruction

For example, consider the system call open(), defined as

```
long open(const char *filename, int flags, int mode)
```

The syscall macro to use this system call without explicit library support would be

```
#define __NR_open 5
_syscall3(long, open, const char *, filename, int, flags, int, mode)
```

Then, the application can simply call open().

### Why Not to Implement a System Call

The previous sections have shown that it is easy to implement a new system call, but that in no way should encourage you to do so.

The pros:

* System calls are simple to implement and easy to use.
* System call performance on Linux is fast.

The cons:

* You need a syscall number, which needs to be officially assigned to you.
* After the system call is in a stable series kernel, it is written in stone. The interface cannot change without breaking user-space applications.
* Each architecture needs to separately register the system call and support it.
* System calls are not easily used from scripts and cannot be accessed directly from the filesystem.
* Because you need an assigned syscall number, it is hard to maintain and use a system call outside of the master kernel tree.
* For simple exchanges of information, a system call is overkill.

The alternatives:

* Implement a device node and read() and write() to it. Use ioctl() to manipulate specific settings or retrieve specific information.
* Certain interfaces, such as semaphores, can be represented as file descriptors and manipulated as such.
* Add the information as a file to the appropriate location in sysfs.

## Conclusion

We discussed what system calls are and how they relate to library calls and the application programming interface (API). We then looked at how the Linux kernel implements system calls and the chain of events required to execute a system call. Next, we went over how to add system calls and provided a simple example of using a new system call from user-space. Finally, we wrapped up the chapter with a discussion on the pros and cons of implementing system calls and a brief list of the alternatives to adding new ones.
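# System call

Linux kernel system call table: https://filippo.io/linux-syscall-table/

A system call is the interface between a process and the OS. System calls are implemented by the Linux kernel for user programs to use: when a user program needs a service from the OS, it issues a system call.

System call flow diagram:

![](https://i.imgur.com/NAwrf6i.png)

Flow: a system call is accompanied by a trap (on Linux, int 0x80). At that point the system changes the mode bit from user mode to kernel mode (1 -> 0) and looks up the interrupt vector to find the corresponding interrupt service routine (a context switch can take place at this point, 0 -> 1). When the routine finishes, an interrupt is raised to tell the OS that it is done.

unistd.h

PATH: /usr/src/linux/include/asm/unistd.h

unistd.h is an important header file that defines the system call numbers. When a system call occurs, its number is passed to the kernel through a register (EAX).

System call numbers:

![](https://i.imgur.com/QpdSZBX.png)

unistd.h also defines the system call triggers for different parameter counts. The code below is the trigger that handles two parameters. It is a macro that is expanded wherever a system call is made. You can see that it raises interrupt 0x80 to trap into the kernel and passes the parameters down to the system call handler.

![](https://i.imgur.com/dCL0SEz.png)

When link is executed from the shell, interrupt vector 0x80 (interrupt number 128) points to the address of the system_call entry point. entry.S holds a collection of the kernel's exception handlers, and system_call is one of them.

![](https://i.imgur.com/V62IPjz.png)

Only on reaching entry.S does execution actually enter the system call handler. Here the system call number that was passed in is compared with NR_syscalls; if it is greater than or equal to NR_syscalls, the call is invalid and -ENOSYS is returned.

![](https://i.imgur.com/juuqVQ4.png)

If the number is valid, the handler calls *sys_call_table(,%eax, 4).

![](https://i.imgur.com/LxhDjms.png)

When the system call finishes, it leaves through __syscall_return.

![](https://i.imgur.com/pl4kfrh.png)

# Parameter Passing

Besides the system call number (passed in the eax register), most syscalls need one or more parameters, so user-space must hand these parameters to kernel-space at the time of the trap. How? Simply by using registers, in this order:

1. ebx
2. ecx
3. edx
4. esi
5. edi

Return value: passed back in eax.

What if there are six or more parameters?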
Do it the way ioctl does: a single register holds a pointer into user-space where the parameters are stored.

On the question of user-space touching the eax register, see https://stackoverflow.com/questions/47692516/what-is-the-use-of-eax-register-in-the-context-of-system-calls-in-linux

# Verifying the Parameters

The kernel must strictly check that the parameters passed down from user-space are valid:

* File I/O syscalls must check that the file descriptor is valid.
* Process-related functions must check that the PID is valid.

The most important check is whether a pointer supplied by user-space is valid. Imagine if a process could pass an unchecked pointer into the kernel, even a pointer to memory it has no read permission for, such as another process's data or data mapped unreadable. Before following a pointer into user-space, the kernel must ensure that:

* The pointer points to a region of memory in user-space. Processes must not be able to trick the kernel into reading data in kernel-space on their behalf.
* The pointer points to a region of memory in the process’s address space. The process must not be able to trick the kernel into reading someone else’s data.
* If reading, the memory is marked readable. If writing, the memory is marked writable. If executing, the memory is marked executable. The process must not be able to bypass memory access restrictions.

The kernel provides two methods for performing the checks and for moving data to and from user-space. Pointers supplied by user-space must never be used directly.

* copy_to_user()
    * writing to user-space
    * ![](https://i.imgur.com/zEN4W03.png)
* copy_from_user()
    * reading from user-space
    * ![](https://i.imgur.com/wCVuW85.png)

Checking that a user-space pointer is valid:

![](https://i.imgur.com/exy9eqI.png)

![](https://i.imgur.com/snkbfbl.png)

_range_ok is in effect equivalent to:

* If (addr + size) >= (current_thread_info()->addr_limit) - 1
    * return a nonzero value
* If (addr + size) < (current_thread_info()->addr_limit)
    * return zero

In short, access_ok checks whether a user-space pointer lies within the address space of the current process. A detailed explanation is available at https://blog.csdn.net/ce123/article/details/8454226