# Linux Operating System
## Introduction
* IA-32 (Intel Architecture 32-bit) = i386, x86
* IA-32e = the 64-bit extension of IA-32 (long mode)
### Real mode and Protected mode
* Real Mode:
* 16-bit registers
* while power up or reset
* addressing: `segment register × 16 + offset → physical address`
* Protected Mode:
* 16/32-bit registers
* operating system
* addressing:
```
logical address -> Segmentation Unit -> linear address -> Paging Unit -> physical address
```
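The real-mode formula is easy to verify with a few lines of standalone C (the sample segment:offset is the x86 reset vector; the function name is invented for this sketch):
```c
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Real mode: physical = segment * 16 + offset (20-bit result) */
static uint32_t real_mode_addr(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

int main(void)
{
    /* The reset vector 0xF000:0xFFF0 maps to physical 0xFFFF0 */
    printf("0x%05" PRIX32 "\n", real_mode_addr(0xF000, 0xFFF0));
    return 0;
}
```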
### Real-Mode Interrupt Vector Table (IVT)
The IVT contains 256 real-mode pointers for all of the **real-mode Interrupt Service Routines (ISRs)**.
```
each entry: 16-bit offset + 16-bit segment
0 0x0000 [[offset][segment]]
1 0x0004 [[offset][segment]]
2 0x0008 [[offset][segment]]
... ... ...
255 0x03FC [[offset][segment]]
```
### Switch to Protected Mode
1. load <font color="#f00">GDTR</font> with the pointer to the GDT-table
2. disable interrupts <font color="green">("cli")</font>
3. load <font color="#f00">IDTR</font> with the pointer to the IDT
4. set the <font color="blue">PE</font>-bit in the <font color="#f00">CR0</font> or <font color="#f00">MSW</font> register
5. make a far jump to the code to flush the PIQ
**Prefetch Input Queue (PIQ)**: pre-loading machine code from memory into this queue
6. initialize <font color="#f00">TR</font> with the selector of a valid TSS
7. optional: load <font color="#f00">LDTR</font> with the pointer to the LDT-table
### IA-32e Mode/Long Mode
compatibility mode + 64-bit mode
* Real Mode to Protected Mode: <font color="blue">PE</font> flag in <font color="#f00">CR0</font>
* Protected Mode to IA-32e Mode:
1. enabling paging
2. setting the <font color="blue">LME</font> bit **( IA32_EFER.LME[bit 8] )**
> IA32_EFER
> On systems that support IA-32e mode (i.e. long mode), the extended feature enable register (IA32_EFER) is available.

### Endian Order
* Little Endian: Intel processors
* Big Endian: Motorola/PowerPC processors (older Macs)
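A quick standalone C check of the byte order on the machine at hand:
```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t value = 0x01020304;
    uint8_t *bytes = (uint8_t *)&value;   /* inspect the raw bytes */

    /* Little endian stores the least significant byte first */
    if (bytes[0] == 0x04)
        printf("little endian\n");
    else
        printf("big endian\n");
    return 0;
}
```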
### Linux Source Code Tree
* Makefile
* Documentation/
* arch/
All the **architecture-specific code** is in this directory and in the <font color="green">include/asm-arch</font> directories
* drivers/
**code to run peripheral devices**
ex: video drivers, network card drivers, low-level SCSI
* fs/
* include/
> Part of the kernel build process creates the symbolic link from asm to asm-arch, so that #include <asm/file.h> will get the proper file for that architecture without having to hard code it into the .c file.
* init/
1. **version.c** defines the Linux version string
2. **main.c** can be thought of as the kernel "glue."
* ipc/
contains the code for shared memory, semaphores, and other forms of IPC
* kernel/
Generic kernel level code that doesn't fit anywhere else goes in here
* lib/
Routines of generic usefulness to all kernel code are put in here
ex: Common string operations, debugging routines, and command line parsing code
* mm/
High level memory management code
* net/
high-level networking code
* scripts/
### The init Process
The kernel thread for **process 1** is created by invoking the <font color="green">kernel_thread( )</font> function to execute the kernel function **kernel_init**.
In turn, this kernel thread executes the <font color="blue">/sbin/init</font> program.
### Synchronization
* Atomic Operation: a single, non-interruptible operation
> "原子操作(atomic operation)是不需要synchronized",這是多線程編程的老生常談了。所謂原子操作是指不會被線程調度機制打斷的操作;這種操作一旦開始,就一直運行到結束,中間不會有任何context switch
* Non-preemptive Kernels:
When a process executes in kernel mode, it cannot be arbitrarily **suspended** and **substituted with another process**
<font color="red">Ineffective in multiprocessor system</font>
* Interrupt Disabling:
Disabling interrupts before entering critical region and restoring the interrupts after leaving the region
<font color="red">Not suitable for multiprocessors</font>
* Semaphore
* an integer variable,
* a list of waiting processes
* two atomic methods down() and up().
<font color="red">Will block process; therefore, it is not suitable for interrupt handler</font>
* Spin Lock
* Busy-waits instead of sleeping, so it can be used where blocking is not allowed (e.g., interrupt handlers) and for short critical regions on multiprocessors
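As a kernel-style sketch (the names my_lock, shared_counter, and increment_counter are invented for illustration), a short critical region is typically protected like this:
```c
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(my_lock);
static int shared_counter;

void increment_counter(void)
{
    unsigned long flags;

    /* Disable local interrupts and take the lock: safe against
     * both other CPUs and interrupt handlers on this CPU. */
    spin_lock_irqsave(&my_lock, flags);
    shared_counter++;
    spin_unlock_irqrestore(&my_lock, flags);
}
```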
### Signal
#### Processes' Responses to Signals
1. Ignore
2. Asynchronously execute a signal handler
#### Kernel Default Actions to Signals
When a process does not define its response to a signal, the kernel applies that signal's default action
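A small user-space example of the two responses (catching SIGINT, ignoring SIGQUIT); sigaction( ) would be preferred in real code:
```c
#include <signal.h>
#include <unistd.h>

static void handle_sigint(int sig)
{
    (void)sig;
    /* Only async-signal-safe functions may be called here */
    write(STDOUT_FILENO, "caught SIGINT\n", 14);
}

int main(void)
{
    signal(SIGINT, handle_sigint);   /* response 2: run a handler */
    signal(SIGQUIT, SIG_IGN);        /* response 1: ignore        */
    /* SIGTERM keeps its default action: terminate the process */
    for (;;)
        pause();                     /* wait for signals */
}
```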
## Memory Addressing
* Logical Address: Used in machine language instructions to specify the address of an instruction or an operand
= segment + offset
* offset: the distance from the start of the segment to the actual address
* In an assembly language instruction, the segment part is stored in a segment register and is usually omitted, because most instructions use a default segment register:
```
mov es:[eax],ecx ≡ 26 89 08   ; 0x26 = ES segment-override prefix
mov [eax],ecx    ≡ 89 08
```
* Linear Address (Virtual Address): In the IA-32 architecture, it is an unsigned 32-bit integer, from 0x00000000 to 0xffffffff -> 4 GB
* 0x00000000 to 0xbfffffff: User or Kernel Mode
* 0xc0000000 to 0xffffffff: Kernel Mode only
* Physical address: Used to address memory cells in memory chips. Physical addresses are also represented by a 32-bit unsigned integer
### Address Transformation
* Segmentation Unit
* A hardware circuit
* Transform a logical address into a virtual address
* Paging Unit
* A hardware circuit
* Transform a virtual address into a physical address
### Segmentation Unit
**A logical address is decided by**
1. a 16-bit segment selector (segment identifier)
2. a 32-bit offset within the segment identified by the segment selector
#### Segment Registers
Each segment register holds a segment selector
* cs: points to a code segment
* ss: points to a stack segment
* ds: points to a data segment
* es, fs, gs: general purpose segment register may point to arbitrary data segments
#### Segment selector

* descriptor address = (gdtr or ldtr base) + index × 8
* The first entry of the GDT is always the null descriptor and is never used.
* The maximum number of segment descriptors that the GDT can have is 2^13 - 1 = 8,191.
* The cs register’s RPL also denotes the current privilege level of the CPU
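The selector layout can be illustrated with a small standalone C program (the sample value 0x73 is, to the best of my knowledge, what 32-bit Linux uses for the user-mode cs: index 14, TI = 0, RPL = 3):
```c
#include <stdio.h>
#include <stdint.h>

/* A 16-bit segment selector: | index (13 bits) | TI (1) | RPL (2) | */
int main(void)
{
    uint16_t selector = 0x0073;            /* 32-bit Linux user-mode cs  */
    unsigned rpl   = selector & 0x3;       /* Requested Privilege Level  */
    unsigned ti    = (selector >> 2) & 1;  /* Table Indicator: 0=GDT, 1=LDT */
    unsigned index = selector >> 3;        /* descriptor index in GDT/LDT */

    /* descriptor address = gdtr (or ldtr) base + index * 8 */
    printf("index=%u ti=%u rpl=%u\n", index, ti, rpl);
    return 0;
}
```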
#### CPU Privilege levels
The cs register includes a 2-bit field that specifies the **Current Privilege Level (CPL)** of the CPU.
* 0 for kernel mode
* 3 for user mode
#### Segment Descriptors
Each segment is represented by an 8-byte Segment Descriptor that describes the segment characteristics

* Base field (32): the linear address of the first byte of the segment.
* G granularity flag (1): 0 (byte); 1 (4K bytes).
* Limit field (20).
* S system flag (1): 0 (system segment); 1 (normal segment).
* Type field (4): segment type and its access rights.
* DPL (Descriptor privilege level) (2):
* Segment-present flag
* D/B flag
D/B has a different meaning depending on the segment type. If the segment is an executable code segment, the flag is called D: D = 0 means code in this segment defaults to 16-bit addresses, while D = 1 means it defaults to 32-bit addresses. If the segment is a stack or data segment, the flag is called B: B = 0 means it is a 16-bit segment with an upper bound of FFFFH, while B = 1 means it is a 32-bit segment with an upper bound of FFFFFFFFH.
* Reserved bit
* AVL flag
#### Frequently Used Segment Descriptor Types
* Code Segment Descriptor
* Data Segment Descriptor
Stack Segments are implemented by means of Data Segment Descriptors
* Task State Segment Descriptor (TSSD)
A TSSD describes a Task State Segment (TSS), which is used to store the contents of a process's registers
* Local Descriptor Table Descriptor (LDTD)
#### gdtr & ldtr

The base address specifies the linear address of byte 0 of the GDT; the table limit specifies the number of bytes in the table.
#### Translation of a Logical Address

#### check privilege level
When accessing a segment, there are actually two checks that must be performed
1. CPL <= DPL
2. RPL <= DPL

### The Paging Unit
#### Enable Paging
* setting the PG flag of the control register cr0
* When PG flag=0, a virtual address is equal to a physical address.
* The paging mechanism is used only in protected mode
#### Division of a Virtual Address

**Directory field**:
The Directory field within the virtual address determines the entry in the Page Directory that points to the proper page table
1. 2^10 entries in a Page Directory
2. Each entry is 4 bytes, so a Page Directory occupies 4 KB
**Table field**
The address’s Table field, in turn, determines the entry in the Page Table that contains the physical address of the page frame containing the page
1. 2^10 entries in a Page Table
2. Each entry is 4 bytes, so a Page Table occupies 4 KB
**offset field**
The offset field determines the relative position within the page frame
Each page frame consists of 4096 (i.e. 2^12) bytes of data
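The split can be demonstrated with a few lines of standalone C (the sample address is arbitrary):
```c
#include <stdio.h>
#include <stdint.h>

/* 32-bit x86 paging: | Directory (10 bits) | Table (10 bits) | Offset (12 bits) | */
int main(void)
{
    uint32_t vaddr  = 0x08048080;             /* arbitrary virtual address */
    uint32_t dir    = vaddr >> 22;            /* Page Directory index      */
    uint32_t table  = (vaddr >> 12) & 0x3FF;  /* Page Table index          */
    uint32_t offset = vaddr & 0xFFF;          /* offset within page frame  */

    printf("dir=%u table=%u offset=0x%03X\n",
           (unsigned)dir, (unsigned)table, (unsigned)offset);
    return 0;
}
```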


## Process
An instance of a program in execution.
### Lightweight Processes
In Linux a **thread group** is basically <font color="#f00">a set of lightweight processes</font> that
1. implement a multithreaded application
2. act as a whole with regard to some system calls, such as
* getpid()
* kill()
* _exit()
### Process State
It consists of an array of flags, each of which describes a possible process state
* TASK_RUNNING:
* executing on a CPU
* waiting to be executed
* TASK_INTERRUPTIBLE
The process is suspended (sleeping) until some **condition** becomes true.
* raising a hardware interrupt
* releasing a system resource the process is waiting for
* delivering a signal
* TASK_UNINTERRUPTIBLE
Like TASK_INTERRUPTIBLE, except that delivering a **signal** to the sleeping process leaves its state unchanged.
> TASK_INTERRUPTIBLE processes can be woken by both signals and wake_up(); when a signal arrives, the process is set runnable.
> TASK_UNINTERRUPTIBLE processes can be woken only by wake_up().
>
> Signal events come from two sources:
> Hardware source: e.g., a keystroke or some other hardware condition.
> Software source: the system functions most commonly used to send signals are kill, raise, alarm, setitimer, and sigqueue; software sources also include operations such as illegal arithmetic.
* __TASK_STOPPED
* SIGSTOP
* SIGTSTP
the **suspend keystroke** (normally ^Z) is pressed on its controlling tty while it is **running in the foreground**.
* SIGTTIN
* SIGTTOU
http://vinllen.com/hou-tai-jin-cheng-du-xie-kong-zhi-tai-hong-fa-sigttin-sigttouxin-hao-liang/
> SIGSTOP
> This signal cannot be caught or ignored; it always stops the process as soon as it's received.
> It can't ever wake itself up (because it isn't running!), so it just sits in the stopped state until it receives a SIGCONT.
* __TASK_TRACED
When a process is being monitored by another (such as when a debugger executes a **ptrace( )** system call to monitor a test program), each signal may put the process in the TASK_TRACED state.
#### Fatal signal
A signal is fatal for a given process if delivering the signal <font color="#f00">causes the kernel to kill the process</font>
Each signal whose **default action is “Terminate” and which is not caught by a process** is fatal for that process.
ex: SIGKILL
> If the signal is accepted by the process and a **signal-handler function** terminates the process, the signal is not fatal
* TASK_WAKEKILL
Used to wake the process when a fatal signal is received
```
#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)
```
* TASK_KILLABLE
Like TASK_UNINTERRUPTIBLE, but the process can additionally be woken by fatal signals
* EXIT_ZOMBIE
Process execution is terminated, but the parent process has not yet issued a **wait4( ) or waitpid( )** system call to return information about the dead process
> Before the wait( )-like call is issued, the kernel cannot discard the data contained in the dead process descriptor because the parent might need it.
* EXIT_DEAD
The final state

### state field of process
1. p->state = some state
2. set_task_state( )
3. set_current_state( )
### thread_info Kernel Mode Stack
For each process, Linux packs two different data structures in a single per-process memory area:
1. thread_info
2. kernel mode stack

```
union thread_union {
    struct thread_info thread_info;
    unsigned long stack[2048];   /* 1024 for 4KB stacks */
};
```
> The length of the structure thread_info and kernel mode stack memory area of a process is **8,192 bytes (two page frames)** after Linux 2.6.37
### Current
Because the kernel-mode stack and thread_info are packed together, the kernel can easily derive the address of thread_info from esp:
if the thread_union structure is 8 KB (2^13 bytes) long, the kernel masks out the **13 least significant bits** (12 for 4KB stacks) of esp to obtain the base address of the thread_info structure.
#### macro current(before Linux 2.6.22)
```
current_thread_info( )
movl $0xffffe000,%ecx
/*or 0xfffff000 for 4KB stacks*/
andl %esp,%ecx
movl %ecx,p
```
```
static __always_inline struct task_struct *get_current(void)
{
    return current_thread_info()->task;
}
#define current get_current()
```
#### macro current(after Linux 2.6.22)
Linux stores the task_struct address of the current process in the per-CPU variable **current_task**
> In Linux, particularly on SMP or NUMA multi-CPU systems, the per_cpu mechanism describes each CPU's private data.
> per_cpu gives every CPU its own private data section, which makes the data easy to protect and access.
```
#define this_cpu_read_stable(var) percpu_from_op("mov", var, "p" (&(var)))
/* expands roughly to: asm("movl %%fs:%P1,%0" : "=r" (pfo_ret__) : "p" (&(var))) */
static __always_inline struct task_struct *get_current(void)
{
return this_cpu_read_stable(current_task);
}
#define current get_current()
```
#### When will the Content of current_task be Set?
1. Variable Initialization
```
DEFINE_PER_CPU(struct task_struct *, current_task) = &init_task;
```
2. Context switch
When making a context switch at <font color="blue">__switch_to</font> to switch CPU to a different process (**next_p**), <font color="blue">__switch_to</font> invokes <font color="#f00">this_cpu_write</font> to store the **task_struct address of next_p** in the per-CPU **current_task** variable of related CPU
```
this_cpu_write(current_task, next_p);
```
## Linked List
### struct list_head

### Macro LIST_HEAD(name)
```
struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) \
    struct list_head name = LIST_HEAD_INIT(name)
```
### Some function and macro
* list_entry(ptr, type, member)
Given a user-defined type, plus the address and name of its member of type list_head, it yields the address of the containing object

```
/**
 * list_entry - get the struct for this entry
 * @ptr: the &struct list_head pointer.
 * @type: the type of the struct this is embedded in.
 * @member: the name of the list_struct within the struct.
 */
#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

#define container_of(ptr, type, member) ({ \
    const typeof( ((type *)0)->member ) *__mptr = (ptr); \
    (type *)( (char *)__mptr - offsetof(type, member) ); })
```
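A user-space re-implementation makes the pointer arithmetic concrete; struct my_task and its fields are invented for this demo:
```c
#include <stddef.h>
#include <stdio.h>

/* User-space copies of the kernel macros, for demonstration only */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

struct my_task {
    int pid;
    struct list_head tasks;    /* embedded list node */
};

int main(void)
{
    struct my_task t = { .pid = 42, .tasks = { NULL, NULL } };
    struct list_head *node = &t.tasks;

    /* Recover the containing struct my_task from its list_head */
    struct my_task *p = list_entry(node, struct my_task, tasks);
    printf("pid=%d\n", p->pid);
    return 0;
}
```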
### hlist_head hlist_node
The Linux kernel 3.9 supports another kind of doubly linked list, which differs from a list_head list mainly in that it is **NOT circular**
It can be used for **hash tables**

### The Process List
The process list is a **circular doubly linked list** that links the <font color="#f00">process descriptors of all existing thread group leaders</font>
Each task_struct structure includes a **tasks** field of type list_head whose prev and next fields point, respectively, to the previous and to the next task_struct element’s tasks field.
```
struct task_struct {
...
struct list_head tasks;
...
}
```
#### init_task
The head of the process list, it is the process descriptor of the so-called **process 0 or swapper**.
The tasks->prev field of init_task points to the tasks field of the process descriptor inserted **last in the list**.
> https://blog.csdn.net/fervor_heart/article/details/8197851
#### for_each_process
```
#define next_task(p) \
list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
#define for_each_process(p) \
for (p = &init_task ; (p = next_task(p)) != &init_task ; )
```
### runqueue
#### in Early Linux Version
put all runnable processes in a single runqueue
Because it would be too costly to maintain the list ordered according to process priorities, the earlier schedulers were compelled to **scan the whole list** in order to select the "best" runnable process
#### Linux Versions 2.6 ~ 2.6.23
Linux 2.6 achieves the scheduler speedup by splitting the runqueue in many lists of runnable processes, **one list per process priority**.
* Each task_struct descriptor includes a **run_list** field of type list_head.
* on a multiprocessor system, each CPU has its own runqueue
**struct prio_array**
```
struct prio_array {
    unsigned int nr_active;
    unsigned long bitmap[BITMAP_SIZE];
    struct list_head queue[MAX_PRIO];
};
```
* nr_active: the number of process descriptors linked into the lists.
* bitmap: a priority bitmap: each flag is set if and only if the corresponding priority list is not empty
* queue: the 140 heads of the priority lists.
**prio**: the dynamic priority of the process
**array**: points to the prio_array_t of its current runqueue

#### enqueue_task(p,array)
insert a process descriptor into a runqueue list
```
/* enqueue_task(p, array) is equivalent to: */
list_add_tail(&p->run_list, &array->queue[p->prio]);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
```
#### dequeue_task(p, array)
remove a process descriptor from a runqueue list
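This layout makes selecting the next task O(1); the core of the 2.6 scheduler's pick-next logic boils down to the following fragment (paraphrased, using the prio_array fields above):
```
/* Find the first set bit = highest-priority non-empty list,
 * then take the first task queued at that priority. */
int idx = sched_find_first_bit(array->bitmap);
struct list_head *queue = array->queue + idx;
struct task_struct *next =
        list_entry(queue->next, struct task_struct, run_list);
```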
### Linux version 2.6.24 ~ 3.9
Each CPU has its own run queue, and each active process appears on just one run queue.
```
struct rq {
raw_spinlock_t lock;
:
struct cfs_rq cfs;
struct rt_rq rt;
:
struct task_struct *curr, *idle, *stop;
unsigned long next_balance;
struct mm_struct *prev_mm;
:
};
```
* cfs and rt are the embedded sub-run queues for the completely fair scheduler and real-time scheduler, respectively.
* curr points to the task_struct of the process currently running.
* idle points to the task structure of the idle process called when no other runnable process is available.
#### runqueues array
All run queues of the system are held in the runqueues array, which contains an element for each CPU in the system.
```
static DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
```
## Scheduler
The <font color="blue">scheduler</font> of a Linux kernel decided which process is executed by a CPU
### Linux Scheduler Entry Point
schedule() -> __schedule() -> context_switch() -> switch_to -> __switch_to()

### Scheduling classes
Scheduling classes are used to decide which task runs next
> Scheduling classes allow for implementing different scheduling policies in a modular way: Code from one class does not need to interact with code from other classes
>
#### priority of linux scheduling classes
All existing scheduling classes in the kernel are in a list which is ordered by the priority of the scheduling class.
**stop_sched_class → rt_sched_class → fair_sched_class → idle_sched_class → NULL**

#### Stop and Idle scheduling classes
* Stop is used to schedule the per-cpu stop task which pre-empts everything and can be pre-empted by nothing.
* Idle is used to schedule the per-cpu idle task (also called swapper task) which is run if no other task is runnable.
## NameSpace
The purpose of each namespace is to wrap a particular global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource
* Mount namespaces (CLONE_NEWNS)
* UTS namespaces (CLONE_NEWUTS)
* IPC namespaces (CLONE_NEWIPC)
* PID namespaces (CLONE_NEWPID)
* Network namespaces (CLONE_NEWNET)
* User namespaces (CLONE_NEWUSER)
**namespace-related APIs**
* clone()
* unshare()
* setns()
### PID Namespace
```
child_pid = clone(childFunc, child_stack, CLONE_NEWPID | SIGCHLD, argv[1]);
```

> Only the root PID namespace has PID 0
> The init process is responsible for reaping the resources of orphaned processes (zombie processes)
> By default, the maximum PID number is PID_MAX_LIMIT-1 (32,767 or 262143).
* Global IDs: IDs as seen from the initial namespace of the kernel
* Local IDs: IDs valid only inside a specific namespace
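A runnable sketch of the clone( ) call above (simplified: no argument is passed to the child, and creating a PID namespace requires root privileges):
```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[1024 * 1024];       /* stack for the cloned child */

static int childFunc(void *arg)
{
    (void)arg;
    /* Inside the new PID namespace, this process is init: PID 1 */
    printf("child getpid() = %ld\n", (long)getpid());
    return 0;
}

int main(void)
{
    /* The stack grows down on x86, so pass the top of the buffer */
    pid_t pid = clone(childFunc, child_stack + sizeof(child_stack),
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone");                    /* typically EPERM without root */
        exit(EXIT_FAILURE);
    }
    /* The parent still sees the child by its global PID */
    printf("parent sees child pid = %ld\n", (long)pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```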
#### init process
Other processes in the namespace (even privileged processes) can send only those signals for which the init process has established a <font color="red">signal handler</font>
Note that (as for the traditional init process) the kernel can still generate signals for the PID namespace <font color="green">init</font> process in all of the usual circumstances
* hardware exceptions
* terminal-generated signals such as SIGTTOU
* expiration of a timer
#### Signals from Ancestor Namespaces
* SIGKILL
* SIGSTOP
> if a process in an ancestor PID namespace sends SIGKILL/SIGSTOP to the init process, all the processes in that PID namespace will be killed
### struct nsproxy
A nsproxy is shared by processes which share all namespaces
```
struct nsproxy {
atomic_t count; // number of processes holding a reference.
struct uts_namespace *uts_ns;
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns;
struct net *net_ns;
};
```
**Initial Global Namespace**
```
struct nsproxy init_nsproxy = {
.count = ATOMIC_INIT(1),
.uts_ns = &init_uts_ns,
#if defined(CONFIG_POSIX_MQUEUE)|| defined(CONFIG_SYSVIPC)
.ipc_ns = &init_ipc_ns,
#endif
.mnt_ns = NULL,
.pid_ns = &init_pid_ns,
#ifdef CONFIG_NET
.net_ns = &init_net,
#endif
};
```
### Thread Group
The POSIX 1003.1c standard states that **all threads of a multithreaded application** must have the same <font color="blue">PID</font>
**TGID**: the PID of the first lightweight process in the group
> For a multithreaded application, getpid() returns the TGID
### Process Group
**PGID**
in order to execute the command line:
```$ ls | sort | more```
a shell that supports process groups, such as bash, <font color="blue">creates a new group</font> for the three processes corresponding to ls, sort, and more.
Uses:
1. **for distribution of signals**
2. by terminals to arbitrate requests for their input and output
example:
**Foreground Process Groups**:
Every process in the foreground process group receives the keyboard-generated <font color="red">SIGINT (^C), SIGQUIT (^\), and SIGTSTP (^Z)</font> signals
**Background Process Groups**:
Processes in the background do not receive keyboard-generated signals; a background process that reads from (writes to) the controlling terminal is sent <font color="red">SIGTTIN (SIGTTOU)</font>
### Login Sessions
**SID**
a login session contains **all processes** that are descendants of the process that has started a working session on a specific terminal -- usually, <font color="blue">the first command shell process</font> created for the user.
> A login session may have several process groups active simultaneously
> All processes in a process group must be in the same login session
#### struct pid_namespace
```
struct pid_namespace {
struct kref kref;
struct pidmap pidmap[PIDMAP_ENTRIES];
int last_pid;
unsigned int nr_hashed;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
struct pid_namespace *parent;
:
struct user_namespace *user_ns;
struct work_struct proc_work;
kgid_t pid_gid;
int hide_pid;
int reboot; /* group exit code if this pidns was rebooted */
unsigned int proc_inum;
};
```
* **child_reaper**: point to <font color="green">init</font> process
Every PID namespace is equipped with a process that assumes the role taken by init in the global picture
* **pidmap**:
```
//looking for the first bit in the bitmap whose value is 0
static int alloc_pidmap(struct pid_namespace *pid_ns)
// "toggling" the corresponding bit from 1 to 0.
static void free_pidmap(struct upid *upid)
```
### PID Management Related Structs
#### struct task_struct
```
struct task_struct {
...
pid_t pid;
pid_t tgid;
struct task_struct *group_leader; // thread group leader
struct pid_link pids[PIDTYPE_MAX];
...
}
```
pid_link chains together all process descriptors that belong to the same group
```
struct pid_link
{
struct hlist_node node;
struct pid *pid;
};
```
If the process belongs to a group, **pid** points to the group leader's pid struct
#### enum pid_type
```
enum pid_type
{
PIDTYPE_PID,
PIDTYPE_PGID,
PIDTYPE_SID,
PIDTYPE_MAX
};
```
#### struct upid
```
struct upid {
/* Try to keep pid_chain in the same cacheline as nr for find_vpid */
int nr;
struct pid_namespace *ns;
struct hlist_node pid_chain;
};
```
#### struct pid
```
struct pid
{
atomic_t count;
unsigned int level;
/* lists of tasks that use this pid */
struct hlist_head tasks[PIDTYPE_MAX];
struct rcu_head rcu;
struct upid numbers[1]; /* grown according to level when allocated */
};
```
tasks: for a group leader, chains together the task_structs of all thread group leaders in the same group.
The first process in tasks is the one most recently added to the group; the leader stays at the tail.
#### pid_t
```
typedef int __kernel_pid_t;
typedef __kernel_pid_t pid_t;
struct task_struct {
...
pid_t pid;
pid_t tgid;
...
}
```
### Create a New pid Instance
When a new process is created, a new **pid instance** is also created
```
long do_fork(unsigned long clone_flags, unsigned long stack_start, unsigned long stack_size, int __user *parent_tidptr, int __user *child_tidptr) {
:
p = copy_process(clone_flags, stack_start, stack_size, child_tidptr, NULL, trace);
:
}
static struct task_struct *copy_process(unsigned long clone_flags, unsigned long stack_start, unsigned long stack_size, int __user *child_tidptr, struct pid *pid, int trace)
{
:
if (pid != &init_struct_pid) {
retval = -ENOMEM;
pid = alloc_pid(p->nsproxy->pid_ns);
if (!pid)
goto bad_fork_cleanup_io;
}
:
}
```
**If a process is not a thread group leader, the pids[PIDTYPE_PGID] and pids[PIDTYPE_SID] entries of its task_struct are unused, because the kernel can simply follow group_leader to the TGL and look there**
 
### Thread Group
Processes in the same thread group are **chained together through** the <font color="green">thread_group</font> field of their task_struct structures
### attach_pid()
Attach a **new struct pid** to a **task_struct**
```
void attach_pid(struct task_struct *task, enum pid_type type,
                struct pid *pid)
{
    struct pid_link *link;

    link = &task->pids[type];
    link->pid = pid;  /* link->pid points to the given pid struct */
    /* this task_struct becomes the first element of the list */
    hlist_add_head_rcu(&link->node, &pid->tasks[type]);
}
```
ex:
when a new process is created, point its task_struct's pid link to the pid struct and make struct pid->tasks[0] point back to itself;
or place a TGL into some process group
### struct pid related Helper Functions
Obtain the **pid instance** associated with the task_struct structure
task_pid: get the pid struct of this task_struct
```
static inline struct pid *task_pid(struct task_struct *task)
{ return task->pids[PIDTYPE_PID].pid; }
```
task_pgrp
```
static inline struct pid *task_pgrp(struct task_struct *task)
{ return task->group_leader->pids[PIDTYPE_PGID].pid; }
```
task_tgid
task_session
### namespace related
Find the numeric PID of a given pid struct in a specific namespace
```
pid_t pid_nr_ns(struct pid *pid, struct pid_namespace *ns)
{
    struct upid *upid;
    pid_t nr = 0;

    if (pid && ns->level <= pid->level) {
        upid = &pid->numbers[ns->level];
        /* make sure the namespace in the pid struct is the requested one */
        if (upid->ns == ns)
            nr = upid->nr;
    }
    return nr;
}
```
**pid_vnr**: return the local PID
```return pid_nr_ns(pid, task_active_pid_ns(current))```
**pid_nr**: return the global PID
```return pid->numbers[0].nr```
### getpid( )
getpid() returns the TGID
First obtain the TGL's pid struct via task_tgid, then obtain that TGL's local PID via pid_vnr
```
static inline pid_t task_tgid_vnr(struct task_struct *tsk)
{
    return pid_vnr(task_tgid(tsk));
}

SYSCALL_DEFINE0(getpid)
{
    return task_tgid_vnr(current);
}
```
### pid_hash Hash Table
Hash table pid_hash is used to find the pid instance that belongs to a **numeric PID value(nr)** in a **given namespace(ns)**
#### pidhash_init
pidhash_init computes an appropriate size and allocates the required storage
```
void __init pidhash_init(void)
{
    unsigned int i, pidhash_size;

    pid_hash = alloc_large_system_hash("PID", sizeof(*pid_hash), 0, 18,
                                       HASH_EARLY | HASH_SMALL,
                                       &pidhash_shift, NULL, 0, 4096);
    pidhash_size = 1U << pidhash_shift;
    for (i = 0; i < pidhash_size; i++)
        INIT_HLIST_HEAD(&pid_hash[i]);
}
```
#### Hash Function pid_hashfn
Computed from the nr and ns fields of struct upid
```
static unsigned int pidhash_shift = 4;
#define pid_hashfn(nr, ns) \
    hash_long((unsigned long)nr + (unsigned long)ns, pidhash_shift)
```
#### Add a new upid instance
```
struct pid *alloc_pid(struct pid_namespace *ns){
...
for ( ; upid >= pid->numbers; --upid) {
hlist_add_head_rcu(&upid->pid_chain,&pid_hash[pid_hashfn(upid->nr, upid->ns)]);
upid->ns->nr_hashed++;
}
...
}
```
1. Put the struct upid into the hash table
2. upid->ns->nr_hashed++
#### Obtain the pid Instance from a numbers[] Field
Via container_of!

#### Multiple PIDs of a Process
A struct pid has a different numeric PID in each namespace, so **alloc_pid** must allocate one for every namespace level
```
struct pid *alloc_pid(struct pid_namespace *ns)
{
...
tmp = ns;
for (i = ns->level; i >= 0; i--) {
nr = alloc_pidmap(tmp); /* looking for the first bit in the bitmap whose value is 0 */
...
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
}
pid->level = ns->level;
...
}
```
#### find_pid_ns()
```
struct pid *find_pid_ns(int nr, struct pid_namespace *ns)
{
struct upid *pnr;
hlist_for_each_entry_rcu(pnr, &pid_hash[pid_hashfn(nr, ns)], pid_chain)
if (pnr->nr == nr && pnr->ns == ns)
return container_of(pnr, struct pid, numbers[ns->level]);
return NULL;
}
```
### Relationships among Processes
* real_parent
The process that created this process; points to the descriptor of process 1 (init) if the parent process no longer exists
* parent
Basically the same as real_parent, but when the process is being traced it points to the debugger
#### children & sibling
```
struct list_head children; /* list of my children */
struct list_head sibling;  /* linkage in my parent's children list */
```

#### ptraced & ptrace_entry
p->parent->ptraced chains together the p->ptrace_entry nodes of its traced children (similar to children & sibling above)
```
struct list_head ptraced;
struct list_head ptrace_entry;
```
## Waitqueue
The runqueue lists group all processes in a TASK_RUNNING state.
Processes in a TASK_STOPPED, EXIT_ZOMBIE, or EXIT_DEAD state are not linked in specific lists; they are accessed either:
1. via PID, or
2. via the linked lists of the child processes of a particular parent
### waitqueue( a set of sleeping processes )
Wait queues implement conditional waits on events
* TASK_INTERRUPTIBLE
* TASK_UNINTERRUPTIBLE
> "wake up" means taking the process out of the waitqueue
### wait_queue_head_t and wait_queue_t
```
struct __wait_queue_head {
    spinlock_t lock;
    struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

struct __wait_queue {
    unsigned int flags;
#define WQ_FLAG_EXCLUSIVE 0x01
    void *private;             /* process descriptor address */
    /* specifies how the processes sleeping in the wait queue
       should be woken up */
    wait_queue_func_t func;
    /* links all processes sleeping on the same wait queue */
    struct list_head task_list;
};
typedef struct __wait_queue wait_queue_t;
```
> "the same wait queue" refers
> to the list of processes waiting for the same event
#### flags
When this process is to be woken, either only it is awakened (exclusive) or all nonexclusive processes are:
* WQ_FLAG_EXCLUSIVE: are selectively woken up by the kernel
* ~WQ_FLAG_EXCLUSIVE: are always woken up by the kernel when the event occurs.
**Thundering Herd**
Multiple sleeping processes are awoken only to race for a resource that can be accessed by one of them, and the result is that the remaining processes must once more be put back to sleep
Ex: suppose a resource can be used by only one process at a time, yet we wake the entire wait queue. Apart from the one process that obtains the resource, the others are first moved to the run queue; once each finally becomes the highest-priority task and goes through a process switch, the resource is still occupied, so it must be put back on the wait queue, costing yet another process switch
-> Waste of CPU time
### Declare a New Wait Queue Head
```
DECLARE_WAIT_QUEUE_HEAD(name)
```

```
static inline void init_waitqueue_entry(wait_queue_t *q, struct task_struct *p)
{
    q->flags = 0;
    q->private = p;
    q->func = default_wake_function;
}
```
> The nonexclusive process p will be awakened by default_wake_function( ), which is a simple wrapper for try_to_wake_up( ).
#### DEFINE_WAIT
```
#define DEFINE_WAIT_FUNC(name, function)                  \
    wait_queue_t name = {                                 \
        .private   = current,                             \
        .func      = function,                            \
        .task_list = LIST_HEAD_INIT((name).task_list),    \
    }
#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)
```

### Related functions
1. add_wait_queue( ):
inserts a nonexclusive process in the first position of a wait queue list
2. add_wait_queue_exclusive( ):
inserts an exclusive process in the last position of a wait queue list
3. remove_wait_queue( )
4. waitqueue_active( )
### Functions That Can Put a Process to a Wait Queue
* sleep_on( )
```
void sleep_on(wait_queue_head_t *wq)  /* wq points to the wait queue head */
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    current->state = TASK_UNINTERRUPTIBLE;
    add_wait_queue(wq, &wait);
    schedule( );
    remove_wait_queue(wq, &wait);
}
```
* interruptible_sleep_on( )
* sleep_on_timeout( )
invoke the schedule_timeout( ) function instead of schedule( )
* interruptible_sleep_on_timeout( )
* wait_event
* wait_event_interruptible
* prepare_to_wait( )
* prepare_to_wait_exclusive( )
* finish_wait()
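The usual pattern built on top of these primitives pairs wait_event with wake_up; a minimal kernel-style sketch, with my_wq, condition_flag, consumer, and producer invented for illustration:
```c
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_wq);   /* hypothetical queue head */
static int condition_flag;               /* hypothetical condition  */

void consumer(void)
{
    /* Sleeps in TASK_INTERRUPTIBLE until condition_flag becomes true;
     * returns nonzero if woken by a signal instead. */
    if (wait_event_interruptible(my_wq, condition_flag != 0))
        return;                          /* interrupted by a signal */
    /* ... the condition holds here ... */
}

void producer(void)
{
    condition_flag = 1;
    wake_up_interruptible(&my_wq);       /* wake sleepers on my_wq */
}
```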
## Process Resource Limits
The resource limits for the current process are stored in **current->signal->rlim**
```
struct rlimit rlim[RLIM_NLIMITS];
```
The field is an array of elements of type struct rlimit, one **for each resource limit**
```
struct rlimit {
//current limit on the CPU time of the running process
unsigned long rlim_cur;
//The rlim_max field is the maximum allowed value for the resource limit
unsigned long rlim_max;
};
```
1. RLIMIT_AS: The maximum size of **process address space**
* malloc( )
2. RLIMIT_CORE: The maximum **core dump file size**
3. RLIMIT_CPU: The maximum **CPU time** for the process
If the process exceeds the limit, the kernel sends it a SIGXCPU signal, and then, if the process doesn't terminate, a SIGKILL signal
4. RLIMIT_DATA: The maximum **heap size**
5. RLIMIT_FSIZE: The maximum **file size allowed**
If the process tries to enlarge a file to a size greater than this value, the kernel sends it a SIGXFSZ signal
6. RLIMIT_LOCKS: Maximum number of **file locks**
7. RLIMIT_MEMLOCK: The maximum **size of nonswappable memory**
8. RLIMIT_MSGQUEUE: Maximum number of bytes in **POSIX message queues**
9. RLIMIT_NOFILE: The maximum number of **open file descriptors**
10. RLIMIT_NPROC: The maximum number of **processes** that the user can own
11. RLIMIT_RSS: The maximum number of **page frames** (physical RAM) owned by the process
12. RLIMIT_SIGPENDING: The maximum number of **pending signals** for the process
13. RLIMIT_STACK: The maximum stack size, in bytes
The kernel checks this value before expanding the User Mode stack of the process
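From user space, the same rlim_cur/rlim_max pair is visible through getrlimit( )/setrlimit( ); a small standalone example for RLIMIT_NOFILE:
```c
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* Read the soft (rlim_cur) and hard (rlim_max) open-file limits */
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
        printf("soft=%lu hard=%lu\n",
               (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    /* Lowering the soft limit is always allowed;
     * raising rlim_max requires privilege */
    rl.rlim_cur = 256;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}
```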
## Process Switch
### switch_to
## Memory Overcommit
Memory overcommit means that the memory granted to processes exceeds what they actually need. In principle this should not happen; intuitively, each process should take only as much as it uses. Linux, however, allows overcommit: as long as a process **requests** memory, the kernel will generally grant it, in the hope that not all processes simultaneously use more than the physical RAM.
(Linux actually assigns physical memory only when the process touches it (as seen in the first project); nothing is allocated at request time.)
If usage really does exceed the physical memory, Linux relies on the OOM (Out-Of-Memory) Killer mechanism to kill processes.
> https://www.zhihu.com/question/21972130
> http://linuxperf.com/?p=102
>
###### tags: `考試複習`