Annotate and explain Quiz6 (2020 Fall)

# Annotate and explain Quiz6 (2020 Fall) contributed by <`徐郁淞`> ###### tags: `Computer Architecture` [TOC] ## Question `A` :::spoiler description Someone has designed 4 separate concurrent processes each of which prints a single character "A", "T", "C" or "G". His/Her customers place orders for sequences that satisfy certain constraints and he/she adds semaphores as appropriate to ensure the printed sequence meets the specified criteria. For each of customer orders below, add the appropriate semaphores so the running processes will produce sequences that make the customers happy – do not forget to specify the semaphores' initial values! To receive full credit, do not impose any unnecessary constraints. Assume the processes start running immediately and that there are no constraints on the order in which statements in different processes are executed except those imposed by your semaphores. Processes will run indefinitely although they may, of course, end up stuck in a `WAIT()`. ::: :::spoiler Q A(1) Request: "the sequence CAT" #### (1) Request: "the sequence CAT" ```cpp semA = __ A01 __ semT = __ A02 __ semC = __ A03 __ semG = __ A04 __ Process #1 Process #2 Process #3 Process #4 ---------- ---------- ---------- ---------- A: C: G: T: A05 A07 A09 A10 puts("A") puts("C") puts("G") puts("T") A06 A08 goto A goto C goto G goto T ``` `A01` to `A04` are the values such as `0` and `1`. `A05`-`A10` may be the operation of `wait` and `signal` such as `wait(semT)`. A01 : ==0==, A02 : ==0==, A03 : ==1==, A04 : ==0== A05 : ==wait(semA)==, A06 : ==signal(semT)== A07 : ==wait(semC)==, A08 : ==signal(semA)== A09 : ==wait(semG)== A10 : ==wait(semT)== ::: **Annotation:** We need to know the [Bascic Concept](https://pages.mtu.edu/~shene/NSF-3/e-Book/SEMA/basics.html) of semaphore first: * When Wait is executed by a thread, we have two possibilities: * **The counter of S is positive** In this case, the counter is decreased by 1 and the thread resumes its execution. * **The counter of S is zero** In this case, the thread is suspended and put into the private queue of S. * When Signal is executed by a thread, we also have two possibilities: * **The queue of S has no waiting thread** The counter of S is increased by one and the thread resumes its execution. * **The queue of S has waiting threads** In this case, the counter of S must be zero (see the discussion of Wait above). One of the waiting threads will be allowed to leave the queue and resume its execution. The thread that executes Signal also continues. **Request: “the sequence CAT”** 1. First character is `C`. The ==semA=0==,==semT=0== ==semC=1== ==semG=0==, and use the `wait` respective. The `semC` make the `Process #2` resume its execution `puts("C")`. And other semophore counter is `0`, the `Processes` will suspend. 3. Second character is `A`. The `A08` = ==signal(semA)== make the `Process #1` resume its execution `puts("A")`. 4. Then `A06` = ==signal(semT)==, the `Process #4` resume its execution `puts("T")`. --- :::spoiler Q A(2) Request "sequences have to be exactly 4 characters long" #### (2) Request "sequences have to be exactly 4 characters long" ``` semC = __ A11 __ Process #1 Process #2 Process #3 Process #4 ---------- ---------- ---------- ---------- A: C: G: T: A12 A13 A14 A15 puts("A") puts("C") puts("G") puts("T") goto A goto C goto G goto T ``` `A11` is the value such as `0`, `1`, `2` and etc. `A12`-`A15` may be the operation of `wait` and `signal` such as `wait(semT)`. A11 : ==4== A12 : ==wait(semC)== A13 : ==wait(semC)== A14 : ==wait(semC)== A15 : ==wait(semC)== ::: **Annotation:** This question ask to exactly 4 characters. We make the semopher is `4` (a.k.a ==semC=4==), and every `wait` will decrease the `semC`. If we have the four `wait`, the `semC` will decrease to `0`, then no process can execute. The sequences will equal to 4 characters long. --- ## Question `B` :::spoiler description We assume the existence of three asynchronous processes: * a Producer process, producing a stream of characters; * a Consumer process, which consumes a stream of characters; and * a Filter process spliced between the Consumer and Producer processes. The Filter process takes characters from the producer, processes them (via a translate function), and passes the result to the Consumer process. The Producer and Consumer processes each communicate directly only with the Filter process. The following is in Shared Memory (shared among Producer, Filter, and Consumer processes): ```cpp Semaphore charsA = ??, spaceA = ??, charsB = ??, spaceB = ??; char buf[100]; char in_data; int in = 0, out = 0; ``` and the following code runs in the Filter Process: ```cpp while (1) { char tmp; wait(charsA); tmp = in_data; signal(spaceA); tmp = translate(tmp); /* do the actual translation */ wait(spaceB); buf[in] = tmp; in = (in + 1) % 100; /* increment ‘in’ modulo 100 */ signal(charsB); } ``` ::: (1) What is the maximum number of characters that can be produced by the Producer process but not yet processed by the Filter process? * Maximum unprocessed characters produced: __ B01 __ > B01 : ==1== **Annotation:** We can observe the Filter Process. The `in_data` is a character `tmp = in_data` and `tmp = translate(tmp)`, which means filter process deal with one character in one time. (2) What are appropriate initial values for each of the semaphores? * initial value for charsA: __ B02 __ * initial value for spaceA: __ B03 __ * initial value for charsB: __ B04 __ * initial value for spaceB: __ B05 __ > B02 : ==0==, B03 : ==1==, B04 : ==0==, B05: ==100== **Annotation:** As the Q1 we know Maximum unprocessed is one. The Process A will `wait(spaceA)` to make sure there have space to produce the character. ==initial value for spaceA: 1== And The Process A produce character, then `signal(character)`. Which means the character have prodeced. ==initial value for charsA: 0== The processA produced characters to processB by filter process, the buf has no character at beginning. It use `wait(spaceB)` to decrease spaceB until the B has no space, and `buf[100]`, So ==initial value for spaceB: 100== It use `signal(charB)` that means process B has no character at beginning. ==initial value for charsB: 0== --- ## Question `C` :::spoiler description For the following questions, assume a processor with 64-bit virtual addresses, 40-bit physical addresses and page size of 4096 ($2^{12}$) bytes per page. The Page Table of this processor uses an LRU replacement strategy, and handles missing pages using a page fault handler. ::: :::spoiler Q C.1 (1) What is the size of the page table? Assume that each page table entry includes a dirty bit and a resident bit. Specify the number of page table entries and the width of each entry. * Number of entries in the page table: __ C01 __ * Width of each page table entry (bits): __ C02 __ > C01: ==$2^{52}$== > C02 : ==30== ::: **Annotation**: The page table needs one entry per page. A processor with 64-bit virtual addresses and a page size of 4096 ($2^{12}$), we see that the $2^{64}$ byte address space must be split into $2^{52}$ pages This means the page table must have $2^{52}$ entries. => ==C01: $2^{52}$== Each entry contains a frame number. Since there are $2^{40}$ physical addresses divided into frames of size $2^{12}$ (frame size = page size), we see there are $2^{28}$ frames, so we need ==28== bits to store the frame number. We also need some extra bits: since pages may be swapped out to disk, we need a valid bit in the page table entry to indicate whether the page is valid(i.e. currently present in memory) or not. We can also store a dirty bit so that we can avoid writing pages back to disk if we need to swap them out => ==C02:30==(28 + valid bit + dirty bit) :::spoiler Q C.2 (2) What is the maximum fraction of virtual memory that can be resident in physical memory at any given time (assuming the page table is not in physical memory)? * Max fraction of virtual memory that can be resident in physical memory: __ C03 __ > C03 : ==$\frac{1}{2^{24}}$== ::: **Annotation:** There are $2^v$ virtual pages of which $2^m$ can be resident. So the fraction of resident pages is ${2^m / \ 2^v} = {2^{28} / \ 2^{52}} = 1 / 2^{24}$ :::spoiler Q C.3 (3) The following program fragment is executed and a record is made of the inputs and outputs of the Memory Management Unit. The record is shown in the table below. ```c sw x11, 0(x10) lw x11, 4(x13) lui x12, 4 ``` | Access type | Virtual address | Physical address | |-------------|-----------------|------------------| | IF | 0x60FF8 | 0x10FF8 | | DW | 0x04600 | 0x74600 | | IF | 0x60FFC | 0x10FFC | | DR | 0x18410 | 0x169410 | | IF | 0x61000 | 0x09000 | > IF: instruction fetch, DW: data write, DR: data read Using information from the program and the table above, please deduce the contents of as many entries as possible in the page table. Assume the original address sizes of 64-bit virtual addresses, 40-bit physical addresses, and page size of 4096 ($2^{12}$) bytes per page. Assume that **pages holding instructions are read-only**. Please make an entry in the table below for each page table entry we learn about after the execution of the program fragment. Note that the table below is not an actual page table, it is just a list of entries from the page table that you can infer from this problem. For each entry provide the VPN, the dirty (`D`) and resident (`R`) bits, and the PPN if they are known. If you can not deduce the value of a field, enter a `?` for that field. | VPN | D | R | PPN | |-----|---|---|-----| | 0x60 | C04 | C05 | C06 | | 0x04 | C07 | C08 | C09 | | 0x18 | C10 | C11 | C12 | | 0x61 | C13 | C14 | C15 | > C04 : ==0==, C05 : ==1==, C06 : ==0x61== > C07 : ==1==, C08 : ==1==, C09 : ==0x74== > C10 : ==?==, C11 : ==1==, C12 : ==0x169== > C13 : ==0==, C14 : ==1==, C15 : ==0x9== ::: **Annotation:** >Setup: >page size: 4096($2^{12}$) bytes per page >virtual pages: $2^{52}$ >physical pages: $2^{28}$ >64-bit Virtual address >40-bit Physical address * VA = 52-bits VPN + 12-bits OFFSET e.g. 0x60FF8 => `0x60 VPN` and `FF8 OFFSET` * PA = 40-bits PPN + 12-bits OFFSET e.g. 0x10FF8 => `0x10 PPN` and `FF8 OFFSET` **Fill entries step by step:** 1. `IF` `VA: 0x60FF8` `PA: 0x10FF8` | VPN | D | R | PPN | |-----|---|---|-----| | 0x60 | 0 | 1 | 0x10 | | 0x04 | C07 | C08 | C09 | | 0x18 | C10 | C11 | C12 | | 0x61 | C13 | C14 | C15 | 2. `DW` `VA: 0x04600` `0x74600` | VPN | D | R | PPN | |-----|---|---|-----| | 0x60 | 0 | 1 | 0x10 | | 0x04 | 1 | 1 | 0x74 | | 0x18 | C10 | C11 | C12 | | 0x61 | C13 | C14 | C15 | 3. `IF` `VA: 0x60FFC` `0x10FFC` | VPN | D | R | PPN | |-----|---|---|-----| | 0x60 | 0 | 1 | 0x60 | | 0x04 | 1 | 1 | 0x74 | | 0x18 | C10 | C11 | C12 | | 0x61 | C13 | C14 | C15 | 5. `DR` `VA: 0x18410` `0x169410` | VPN | D | R | PPN | |-----|---|---|-----| | 0x60 | 0 | 1 | 0x60 | | 0x04 | 1 | 1 | 0x74 | | 0x18 | ? | 1 | 0x169 | | 0x61 | C13 | C14 | C15 | 6. `IF` `VA: 0x61000` `0x09000` | VPN | D | R | PPN | |-----|---|---|-----| | 0x60 | 0 | 1 | 0x60 | | 0x04 | 1 | 1 | 0x74 | | 0x18 | ? | 1 | 0x169 | | 0x61 | 0 | 1 | 0x9 | :::spoiler Q C.4 (4) At some later point in time, suppose that the contents of the page table and its corresponding fully associative TLB are as shown to the below. As previously mentioned, the page table uses an LRU replacement policy, and the LRU page (shown below) will be chosen for replacement if necessary. For each of these four accesses, compute its corresponding physical address and indicate whether the access causes a TLB miss and/or a page fault. Assume that **each access below is independent of the others and starts with the TLB and page table state shown to the below**. | VPN (tag) | V | D | PPN | Note | |-----------|---|---|-----|------| | `0x0` | 1 | 0 | `0x1337` | - | | `0x1` | 0 | 0 | `0x7` | - | | `0x6` | 1 | 1 | `0x33` | $\leftarrow$ LRU | | `0x3` | 1 | 0 | `0x534` | - | > TLB | VPN | R | D | PPN | Note | |-----|---|---|-----|------| | 0 | 1 | 0 | `0x1337` | - | | 1 | 0 | 0 | - | - | | 2 | 1 | 0 | `0x450` | - | | 3 | 1 | 0 | `0x534` | $\leftarrow$ LRU page | | 4 | 0 | 0 | - | - | | 5 | 1 | 1 | `0xAB5` | - | | 6 | 1 | 1 | `0x33` | - | > Page Table Fill in table below (possible answers contains HEX and Yes/No) | - | Virtual address | PPN (in hex) | Physical Address (in hex) | TLB Miss? | Page Fault? | |---|----------|-----|-----|-----|-----| | 1 | `0x6004` | C16 | C17 | C18 | C19 | | 2 | `0x6030` | C20 | C21 | C22 | C23 | | 3 | `0x1234` | C24 | C25 | C26 | C27 | | 4 | `0x2008` | C28 | C29 | C30 | C31 | > C16 : `0x33`, C17 : `0x33004`, C18 : ==No==, C19 : ==No== > C20 : `0x33`, C21 : `0x33030`, C22 : ==No==, C23 : ==No== > C24 : `0x534`, C25 : `0x534234`, C26 : ==Yes==, C27 : ==Yes== > C28 : `0x450`, C29 : `0x450008`, C30 : ==Yes==, C31 : ==No== ::: **Annotation:** >Setup: >page size: 4096($2^{12}$) bytes per page >virtual pages: $2^{52}$ >physical pages: $2^{28}$ >64-bit Virtual address >40-bit Physical address * VA = 52-bits VPN + 12-bits OFFSET e.g. 0x60FF8 => `0x60 VPN` and `FF8 OFFSET` * PA = 40-bits PPN + 12-bits OFFSET e.g. 0x10FF8 => `0x10 PPN` and `FF8 OFFSET` We need to find TLB first. If TLB miss occur, then we find page table. If page fault occur, then the LRU page will be chosen for replacement. Step: 1. split VA to VNP and OFFSET e.g. 0x6004 => `0x6 VPN` and `004 OFFSET` 2. Use VPN to find PPN 1. first find TLB * hit -> OK * miss -> find page table 2. find page table * hit -> OK * miss -> use LRU page for replacement 3. Concate PPN and OFFSET e.g. `0x33 PPN` and `004 OFFSET` => `0x33004 PA` **Final:** | - | Virtual address | PPN (in hex) | Physical Address (in hex) | TLB Miss? | Page Fault? | |---|----------|-----|-----|-----|-----| | 1 | `0x6004` | 0x33 | 0x33004 | No | No | | 2 | `0x6030` | 0x33 | 0x33030 | No | No | | 3 | `0x1234` | 0x534 | 0x534234 | Yes | Yes | | 4 | `0x2008` | 0x450 | 0x450008 | Yes | No | --- ## Question `D` Consider a five-stage pipelined RISC-V processor `P1` (IF, DEC, EXE, MEM, WB), where: * All branches are predicted not-taken. * Branch decisions are made in the EXE stage. * The pipeline has full bypassing. Assume that the loop shown to the beblow has been running for a while and will be re-executed after this iteration. ```cpp loop: add x3, x2, x1 slli x2, x2, 1 xor x4, x3, x4 bnez x2, loop and x11, x2, x1 or x12, x22, x7 sra x13, x23, x8 ... ``` (1) How many cycles does it take to execute one iteration of the loop on this processor `P1`? * Cycles per iteration on this processor: __ D01 __ > D01 : ==6== > ![](https://i.imgur.com/wtFSQxC.png) > **Annotation** We fill the sheet of 5-stage pipeline as above, and we observe the `add` intruction at cycle `1` and `7`. ==D01: 6== (2) Now consider a variant of the previous five-stage pipelined RISC-V processor `P2` (IF, DEC, EXE, MEM, WB), where: * All branches are predicted not-taken. * Branch decisions are made in the EXE stage. * The pipeline has no bypass paths. Assume that our example loop (same as the previous one) has been running for a while and will be re-executed after this iteration. How many cycles does it take to execute one iteration of the loop on this processor? * Cycles per iteration on P2: __ D02 __ > D02 : ==8== ![](https://i.imgur.com/fjK2B3d.png) --- ## Question `E` :::spoiler Description Consider the execution of the following code sequence on a 5-stage pipelined RISC-V processor, which is fully bypassed, predicts all branches not-taken, and annuls instructions following taken branches. Assume that branch taken/not-taken decisions are made in the EXE stage. Also, assume that the results of load operations are not available until the WB stage. The loop sums the first `100` elements of the integer array at address 0x1000 and stores the result at address `0x2000`. Assume execution halts at instruction `UNIMPL`. ```cpp= lui x11, 1 // x11 = 0x1000 (array) lui x15, 2 // x15 = 0x2000 (result) addi x12, x0, 0 // x12 holds sum addi x14, x11, 400 // address of array element 101 A: lw x13, 0(x11) // load next array element addi x11, x11, 4 // addr of next array element add x12, x12, x13 // add element value to sum bne x11, x14, A // loop until element 101 sw x12, 0(x15) // store result xor x2, x3, x4 UNIMPL ``` ::: #### (1) Are there points in the execution of the sequence when data is bypassed from the EXE stage back to the DEC stage? If so, give the instruction(s) in the DEC stage at each such point; otherwise, enter `N/A`. * Instruction(s) in DEC, or `N/A`: __ E01 __ > E01 : ==`N/A`== **Annotation** No situation need to bypass. #### (2) Are there points in the execution of the sequence when data is bypassed from the WB stage back to the DEC stage? If so, give the instruction(s) in the DEC stage at each such point; otherwise, enter `N/A`. * Instruction(s) in DEC, or `N/A`: __ E02 __ > E02 : ==`add, bne`== **Annotation:** ```cpp=5 A: lw x13, 0(x11) // load next array element addi x11, x11, 4 // addr of next array element add x12, x12, x13 // add element value to sum bne x11, x14, A // loop until element 101 ``` We neet to bypassed from the WB stage back to the DEC stage when that exhibit data dependence. Situation in which a data hazard can occur: * `Read after write (RAW):` * (i2 tries to read a source before i1 writes to it) A read after write (RAW) data hazard refers to a situation where an instruction refers to a result that has not yet been calculated or retrieved. This can occur because even though an instruction is executed after a prior instruction, the prior instruction has been processed only partly through the pipeline. * e.g. i1. **R2** <- R5 + R3 i2. R4 <- **R2** + R3 Now we see the program: :::success A: lw **x13**, 0(x11) addi x11, x11, 4 add x12, x12, **x13** ::: The **x13** has the RAW data dependency, so we use bypass to avoid Data hazard. :::success A: lw x13, 0(x11) addi **x11**, x11, 4 add x12, x12, x13 bne **x11**, x14, A ::: The **x11** has the RAW data dependency, so we use bypass to avoid Data hazard. #### (3) Are there points during the execution of the sequence when the pipeline is stalled? If so, give the instruction(s) in the DEC stage at each such point; otherwise, enter `N/A`. * Instruction(s) in DEC, or `N/A`: __ E03 __ > E03 : ==`add`== > :::spoiler > ![](https://i.imgur.com/pyHrOHm.png) > the contents of the WB stage > ::: ---