# cahw1~4 + 1. (20 points) Describe the steps that transform a program written in a high-level language such as C into a representation that is directly executed by a computer processor. $\text{Ans}:$ 先將高階語言經過編譯器變成組合語言,在將組合語言經過組譯器變成二元機器語言 + 2. Consider two different implementations (P1 and P2) of the same instruction set architecture. The instructions can be divided into four classes (A, B, C, and D) according to their CPI. P1 with a clock rate of 2.5 GHz and CPIs of 1, 2, 3, and 3, and P2 with a clock rate of 3 GHz and CPIs of 2, 2, 2, and 2. Given a program with a dynamic instruction count of $1.0 × 10^6$ instructions divided into classes as follows: 10% class A, 20% class B, 50% class C, and 20% class D. **(a) (15 points) What is the global CPI for each implementation?** $\text{CPI}_\text{P1} = \frac{1×0.1+2×0.2+3×0.5+3×0.2}{1}=2.6$ $\text{CPI}_\text{P2} = \frac{2×0.1+2×0.2+2×0.5+2×0.2}{1}=2$ **(b) (15 points) Find the clock cycles required in both cases.** $\text{CPU Clock Cycle}_\text{P1} = 2.6\times10^6$ $\text{CPU Clock Cycle}_\text{P2} = 2.0\times10^6$ + 3. Assume a program requires the execution of $50×10^6$ FP instructions, $110 ×10^6$ INT instructions, $80 ×10^6$ L/S instructions, and $16 ×10^6$ branch instructions. The CPI for each type of instruction is 1, 1, 4, and 2, respectively. Assume that the processor has a 2 GHz clock rate. **(a) (15 points) By how much must we improve the CPI of FP instructions if we want the program to run two times faster?** 設 FP 的 CPI 需要降低到 $x$ $\frac{1}{2}(50×1+110×1+80×4+16×2)×10^6=(50x+110×1+80×4+16×2)×10^6$ 解方程式得 $x=-4.12$,故無解 **(b) (15 points) By how much must we improve the CPI of L/S instructions if we want the program to run two times faster?** 設 L/S 的 CPI 需要降低到 $x$ $\frac{1}{2}(50×1+110×1+80×4+16×2)×10^6=(50×1+110×1+80x+16×2)×10^6$ 解方程式得 $x=0.8$ , $\frac{4-0.8}{4}=80\%$ 降低 L/S 80% 的 CPI 即可 **\(c\) (20 points) By how much is the execution time of the program improved if the CPI of INT and FP instructions is reduced by 40% and the CPI of L/S and Branch is reduced by 30%?** 所求 $\frac{50×1×0.4+110×1×0.4+80×4×0.3+16×2×0.3}{50×1+110×1+80×4+16×2}=33.13\%$ 會減少33.13% --- + 1. (20 points) For the C statement, ```B[8] = A[i-j];```, what is the corresponding MIPS assembly code? Assume that the variables f, g, h, i, and j are assigned to registers $s0, $s1, $s2, $s3, and $s4, respectively. Assume that the base address of the arrays A and B are in registers $s6 and $s7, respectively. ```mips= sub $s0,$s3,$s4 sll $s0,$s0,2 add $s6,$s6,$s0 lw $s1,0($s6) sw $s1,32($s7) ``` + 2. (20 points) Translate 0xabcdef12 into decimal. $2*16^0+1*16+15*16^2+14*16^3+13*16^4+12*16^5+11*16^6+10*16^7=2882400018$ + 3. (20 points) Provide the type, assembly language instruction, and binary representation (in hexadecimal format) of instruction described by the following MIPS fields: op=0, rs=3, rt=2, rd=3, shamt=0, funct=34 Ans : R type, 組合語言如下 ```mips= sub $v1,$v1,$v0 ``` Binary 表示 $(000000\ 00011\ 00010\ 00011\ 00000\ 100010)_2$ Hexadecimal 表示 $\text{0x00621822}$ + 4. (40 points) Translate the following C code to MIPS assembly code. Use a minimum number of instructions. Assume that the value of a, b, i, and j are in registers $s0, $s1, $t0, and $t1, respectively. Also, assume that register $s2 holds the base address of the array D. ```c= for(i=0;i<a;i++) for(j=0;j<b;j++) D[4*j]=i+j ; ``` ```mips= add $t0,$zero,$zero #i=0 Loopi: slt $t2,$t0,$s0 beq $t2,$zero,Exit add $t1,$zero,$zero #j=0 Loopj: slt $t2,$t1,$s1 #t2=j<b bne $t2,$zero,Loopiplus # j!<b sll $t3,$t1,4 #t3=4*j add $t4,$s2,$t3 #t4 = addD+4*j add $t5,$t0,$t1 # t5=i+j sw $t5,0($t4) #D[4*j]=i+j addi $t1,$t1,1 #j=j+1 addi $t0,$t0,1 #i=i+1 j Loopj Loopiplus: addi $t0,$t0,1 j Loopi Exit: ``` --- + 1. (25 points) Calculate 5ED4 - 07A4 when the values represent unsigned 16bit hexadecimal numbers. The result should be written in hexadecimal. Show your work. $\begin{aligned} &\text{ }\text{5ED4}\\ -&\text{ }\text{07A4}\\ \text{——}&\text{———}\\ &\text{ }\text{5730} \end{aligned}$ + 2. (25 points) Calculate the product of the hexadecimal unsigned 8bit integers A and 32 using the hardware shown in Figure 1. The size of the Multiplicand register, ALU, and Product register is modified to 6, 6, and 12 bits, respectively. Complete the following table to show the contents of each register in each step. ![圖片](https://i.imgur.com/xdfmYZD.png =70%x) |Step|Action|Multiplicand|Product/Multiplier| |:-|:-|:-|:-| |0|Initial values|110 010 (32)|000 000 001 010 (A)| |1|lsb = 0, no operation<br>Right shift product|110 010<br>110 010|000 000 001 010<br>000 000 000 101| |2|lsb = 1, Product=Product+Multiplicand<br>Right shift product |110 010<br>110 010|110 010 000 101<br>011 001 000 010| |3|lsb = 0, no operation<br>Right shift product|110 010<br>110 010|011 001 000 010<br>001 100 100 001| |4|lsb = 1, Product=Product+Multiplicand<br>Right shift product |110 010<br>110 010|111 110 100 001<br>011 111 010 000| |5|lsb = 0, no operation<br>Right shift product|110 010<br>110 010|011 111 010 000<br>001 111 101 000| |6|lsb = 0, no operation<br>Right shift product|110 010<br>110 010|001 111 101 000<br>000 111 110 100| + 3. (20 points) What decimal number does the bit pattern 0x0C000000 represent if it is a floating point number? Use the IEEE-754 standard and show your work. $\text{0x0C000000}=\text{0000 1100 0000 0000 0000 0000 0000 0000}_2$ **sign** : $0_2=0_{10}$ **exponent** : $00011000_2=24_{10}$ **fraction** : $00000000000000000000000_2=0_{10}$ $\begin{aligned} \text{ans}&=(-1)^s\times(1+\text{frac})\times2^{\text{exp}-127}\\ &=(-1)^0\times(1+0)\times 2^{24-127}\\ &=2^{-103} \end{aligned}$ + 4. Write down the hexadecimal representation of the decimal number 63.25 in the following format. Show your work. $\begin{aligned} 63.25_{10}&=111111.01_2\\ &=1.1111101_2\times 2^5\\ \text{ } \end{aligned}$ a. (15 points) IEEE-754 single precision format **sign** : $0_2$ **fraction** : $\text{111 1101 0000 0000 0000 0000}_2$ **exponent** : $(5+127)_{10}=132_{10}=\text{1000 0100}_2$ $\begin{aligned} \text{ans} &= \text{0100 0010 0111 1101 0000 0000 0000 0000}_2\\ &=\text{427D0000}_{16}\\ \text{ } \end{aligned}$ b. (15 points) IEEE-754 double precision format **sign** : $0_2$ **fraction** : $\text{1111 1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000}_2$ **exponent** : $(5+1023)_{10}=1028_{10}=\text{100 0000 0100}_2$ $\begin{aligned} \text{ans} &=\text{0100 0000 0100 1111 1010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 }_2\\ &=\text{404FA00000000000}_{16} \end{aligned}$ --- 1. Use the hardware shown in Figure 1 to execute the instruction,AND Rd, Rs, Rt. Inter-pretation of the instruction:Reg[Rd] = Reg[Rs] AND Reg[Rt]. ![](https://i.imgur.com/Qk843X9.png) + (a) (10 points) Refer to the following descriptions to determine the values of 7 controlsignals, RegWrite, MemRead, MemWrite, Branch, ALU operation, ALUMux, and RegMux. + ALUMux is the control signal that controls the Mux at the ALU second input,where 0 (Reg) selects the output of the register file, and 1 (Imm) selects the immediate from the instruction word. + RegMux is the control signal that controls the Mux at the Data input to the register file, where 0 (ALU) selects the output of the ALU, and 1 (Mem) selectsthe output of memory. + A value of X is a ”don’t care”. $\text{Ans : }$ 因為要寫入東西到暫存器,所以 RegWrite 為 1 因為沒有使用記憶體,所以MemRead 和 MemWrite 皆為 0 因為沒有 branch ,所以 Branch 為 0 因為使用的是 AND,所以 ALU operation 為 AND 因為 ALU 的第二個輸入來自暫存器,所以 ALUMux 為 0 因為要回傳到暫存器的是 ALU 的結果,所以 RegMux 為 0 RegWrite = 1 MemRead = 0 MemWrite = 0 Branch = 0 ALU operation = AND ALUMux = 0 RegMux = 0 + (b) (10 points) Which resources (blocks) perform a useful function for this instruction? $\text{Ans : }$**PC**, **First add unit**, **Instructor memory**, **Registers**, **ALU**, **ALUMux**, **RegMux** (下圖紅色框框圈的) + \(c\) (10 points) Which resources (blocks) produce outputs, but their outputs are not used for this instruction? Which resources produce no outputs for this instruction? $\text{Ans : }$ their outputs are not used : **Branch Add, Data memory**(下圖藍色框框圈的) produce no outputs for this instruction : **所有 resource 都會產生 output** ![](https://i.imgur.com/uM5fXQC.png) 2. Table 1 shows the latency of each stage of the datapath in a processor. Table 2 shows the breakdown of instructions in a program executed by the processor. ![](https://i.imgur.com/CXx051e.png) + (a) (10 points) What is the clock cycle time in a pipelined and non-pipelined processor? $\text{Ans : }$ pipelined processor : **350** non-pipelined processor : 250+350+150+300+250=**1300** + (b) (10 points) What is the total latency of an LW instruction in a pipelined and non-pipelined processor? $\text{Ans : }$ pipelined processor : 350*5=**1750** non-pipelined processor : 250+350+150+300+250=**1300** + \(c\) (10 points) If we can split one stage of the pipelined datapath into two new stages,each with half the latency of the original stage, which stage would you split andwhat is the new clock cycle time of the processor? $\text{Ans : }$ Split **ID** into two states of **175**ps the new clock cycle time of the processor is **300** + (d) (10 points) Assuming there are no stalls or hazards, what is the utilization of the data memory? $\text{Ans : }$ **40%** 因為 Data memory 只會用到 lw 和 sw 所以 utilization 為 $25\%+15\%=40\%$ + (e) (10 points) Assuming there are no stalls or hazards, what is the utilization of the write-register port of the ”Register” unit? $\text{Ans : }$ **65%** 因為 Write-register port 只會用到 alu 和 lw 所以 utilization 為 $40\%+25\%=65\%$ + (f) (20 points) Instead of a single-cycle organization, we can use a multi-cycle organization where each instruction takes multiple cycles but one instruction finishes before another is fetched. In this organization, an instruction only goes through stages it actually needs. For example, an ST instruction only takes 4 cycles because it does not need the WB stage. Compare clock cycle times and execution times with single-cycle, multi-cycle, and pipelined organization. $\text{Ans : }$ single cycle 的執行時間$=$指令的數量$\times \text{cycle time}_\text{single}$ single cycle 的 clock cycle time 為 1300 ps 當pipelines 的指令數量很大時 pipelined 的執行時間$\simeq$指令的數量$\times \text{cycle time}_\text{single}$ multi-cycle 的 clock cycle time 為 350 ps single-cycle 的 clock cycle time 和 執行時間為 pipelined 的 $x$ 倍 $x=\frac{\text{time}_\text{ single}} {\text{time}_\text{ pipe}}= \frac{\text{cycle time}_\text{ single}} {\text{cycle time}_\text{ pipe}}= \frac{1300}{350}\simeq3.71$ _ multi-cycle 的 clock cycle time 與 pipelined 的相同 執行時間為 pipelined 的 $y$ 倍 multi-cycle 的執行時間$=$ lw 的數量$\times5\times \text{cycle time}_\text{multi}+$其他指令的數量$\times4\times \text{cycle time}_\text{multi}$ 當pipelines 的指令數量很大時 pipelined 的執行時間$\simeq$指令的數量$\times \text{cycle time}_\text{single}$ $y=\frac{\text{time}_\text{ multi}} {\text{time}_\text{ pipe}}= \frac{0.25\times5\times\text{cycle time}_\text{ multi}+0.75\times4\times\text{cycle time}_\text{ multi}} {\text{cycle time}_\text{ pipe}}= \frac{1.25+3}{1}=4.25$ (以上以 Table 2 所給的數據) 結論: single cycle 的 clock cycle time 為 pipelined 的 **3.71** 倍 execution times 為 pipelined 的 **3.71** 倍 multi-cycle 的 clock cycle time 為 pipelined 的 **1** 倍 execution times 為 pipelined 的 **4.25** 倍