owned this note
owned this note
Published
Linked with GitHub
# CS:APP Homework Problems 5.13
###### tags: `linux2020` `sysprog2020`
## 5.13 題目
Suppose we wish to write a procedure that computes the inner product of two vectors u and v. An abstract version of the function has a CPE of 14–18 with x86-64 for different types of integer and floating-point data. By doing the same sort of transformations we did to transform the abstract program combine1 into the more efficient combine4, we get the following code:
``` c=
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
long i;
long length = veclength(u)
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
data_t sum = (data_t) 0;
for(i = 0; i < length; i++){
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
```
Our measurements show that this function has CPEs of 1.50 for integer data and 3.00 for floating-point data. For data type double, the x86-64 assembly code for the inner loop is as follows:
```
(Inner loop of inner4. data_t = double, OP = * udata in %rbp, vdata in %rax, sum in %xmm0 i in %rcx, limit in %rbx)
.L15: loop:
vmovsd 0(%rbp,%rcx,8), %xmm1 Get udata[i]
vmulsd (%rax,%rcx,8), %xmm1, %xmm1 Multiply by vdata[i]
vaddsd %xmm1, %xmm0, %xmm0 Add to sum
addq $1, %rcx Increment i
cmpq %rbx, %rcx Compare i:limit
jne .L15 If !=, goto loop
```
Assume that the functional units have the characteristics listed in Figure 5.12.
* A. Diagram how this instruction sequence would be decoded into operations and show how the data dependencies between them would create a critical path of operations, in the style of Figures 5.13 and 5.14.
* B. For data type double, what lower bound on the CPE is determined by the critical path?
* C. Assuming similar instruction sequences for the integer code as well, what lower bound on the CPE is determined by the critical path for integer data?
* D. Explain how the floating-point versions can have CPEs of 3.00, even though the multiplication operation requires 5 clock cycles.
## 針對 5.13 的相關說明
Figure 5.12

Figures 5.13

Figures 5.14

* 程式碼 1: combine1
``` c
/* Implementation with maximum use of data abstraction */
void combine1(vec_ptr v, data_t *dest)
{
long i;
*dest = IDENT;
for (i = 0; i < vec_length(v); i++) {
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}
```
**Figure 5.5** Initial implementation of combining operation. Using different declarations of identity element IDENT and combining operation OP, we can measure the routine for different operations.
* combine2
``` c
/* Move call to vec_length out of loop */
void combine2(vec_ptr v, data_t *dest)
{
long i;
long length = vec_length(v);
*dest = IDENT;
for (i = 0; i < length; i++) {
data_t val;
get_vec_element(v, i, &val);
*dest = *dest OP val;
}
}
```
::: success
**Figure 5.6** Improving the efficiency of the loop test. By moving the call to vec_ length out of the loop test, we eliminate the need to execute it on every iteration.
:::
* 程式碼 2: combine1
``` c
/* Accumulate result in local variable */
void combine4(vec_ptr v, data_t *dest)
{
long i;
long length = vec_length(v);
data_t *data = get_vec_start(v);
data_t acc = IDENT;
for (i = 0; i < length; i++) {
acc = acc OP data[i];
}
*dest = acc;
}
```
**Figure 5.10** Accumulating result in temporary. Holding the accumulated value in local variable acc (short for “accumulator”) eliminates the need to retrieve it from memory and write back the updated value on every loop iteration.
## code:
``` c=
/*
* 5.13.c
*/
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include "vec.h"
#define OP *
void inner1(vec_ptr u, vec_ptr v, data_t *dest) {
long i;
for (i = 0; i < vec_length(u); i++) {
data_t udata;
get_vec_element(u, i, &udata);
data_t vdata;
get_vec_element(u, i, &vdata);
*dest += udata OP vdata;
}
}
void inner2(vec_ptr u, vec_ptr v, data_t *dest) {
data_t i;
data_t length = vec_length(u);
for (i = 0; i < length; i++) {
data_t udata;
get_vec_element(u, i, &udata);
data_t vdata;
get_vec_element(u, i, &vdata);
*dest += udata OP vdata;
}
}
void inner3(vec_ptr u, vec_ptr v, data_t *dest) {
data_t i;
data_t length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
for (i = 0; i < length; i++) {
*dest += udata[i] OP vdata[i];
}
}
void inner4(vec_ptr u, vec_ptr v, data_t *dest) {
data_t i;
data_t length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
data_t sum = 0;
for (i = 0; i < length; i++) {
sum += udata[i] OP vdata[i];
}
*dest = sum;
}
/* inner product. accumulate in temporary */
void inner4_t15(vec_ptr u, vec_ptr v, data_t *dest) {
long i;
long length = vec_length(u);
data_t *udata = get_vec_start(u);
data_t *vdata = get_vec_start(v);
data_t sum = (data_t) 0;
for (i = 0; i < length; i++) {
sum = sum + udata[i] * vdata[i];
}
*dest = sum;
}
int main(int argc, char* argv[]) {
data_t length = 4;
vec_ptr u = new_vec(length);
vec_ptr v = new_vec(length);
data_t *dest = malloc(sizeof(data_t));
*dest = 0;
inner1( u, v, dest);
printf("Result of inner1 is: %ld\n", *dest);
*dest = 0;
inner2( u, v, dest);
printf("Result of inner2 is: %ld\n", *dest);
*dest = 0;
inner3( u, v, dest);
printf("Result of inner3 is: %ld\n", *dest);
*dest = 0;
inner4( u, v, dest);
printf("Result of inner4 is: %ld\n", *dest);
free(u->data);
free(v->data);
free(u);
free(v);
free(dest);
return 0;
}
```