Case Study on Performance Improvements

陳奕嘉

GitHub

Merge Sort

Merge sort is a fundamental sorting algorithm based on the Divide and Conquer paradigm. It is a stable sort with a time complexity of O(n log n).

Basic Concept

Merge Sort works by dividing the sequence to be sorted into smaller subsequences, sorting each subsequence, and finally merging them to produce the complete sorted result. For example, [7, 4, 10, -1] is split into [7, 4] and [10, -1]; each half is sorted to give [4, 7] and [-1, 10], and merging the two halves yields [-1, 4, 7, 10].

Top-Down and Bottom-Up Methods

  • Top-Down: Uses recursion to repeatedly divide the array until each subsequence has only one element, then merges these subsequences from small to large to gradually build the sorted array.

    • Disadvantage: consumes call-stack space proportional to the recursion depth.
  • Bottom-Up: Without recursion, starts from the smallest subsequences and gradually builds larger sorted subsequences until the entire array is sorted (a C sketch of this variant follows this list).

    • Advantages: avoids recursion, is suitable for non-recursive environments, and has low stack-space requirements.
    • Disadvantage: the program structure is slightly more complex, and the merging strategy requires additional attention.
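Only the Top-Down variant is implemented in C later in this write-up; for contrast, here is a minimal C sketch of the Bottom-Up variant. The names mergesort_bottom_up and merge_range are illustrative, not part of the original code, and the caller supplies the temporary buffer.

#include <stdio.h>
#include <string.h>

/* Merge arr[left..mid] and arr[mid+1..right] through a temporary buffer,
 * using the same copy-then-two-pointer scheme as the Top-Down merge. */
static void merge_range(int *arr, int *tmp, int left, int mid, int right)
{
    memcpy(tmp + left, arr + left, (right - left + 1) * sizeof(int));
    int i = left, j = mid + 1, k = left;
    while (i <= mid && j <= right)
        arr[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
    while (i <= mid)
        arr[k++] = tmp[i++];
    while (j <= right)
        arr[k++] = tmp[j++];
}

/* Bottom-Up merge sort: no recursion; the subarray width doubles
 * on every pass (1, 2, 4, ...) until the whole array is sorted. */
void mergesort_bottom_up(int *arr, int *tmp, int n)
{
    for (int width = 1; width < n; width *= 2)
        for (int left = 0; left < n - width; left += 2 * width) {
            int mid = left + width - 1;
            int right = left + 2 * width - 1;
            if (right > n - 1)
                right = n - 1;  /* clamp the last, possibly shorter, run */
            merge_range(arr, tmp, left, mid, right);
        }
}

int main(void)
{
    int arr[] = {7, -4, 10, -1, 3, 2, -6, 9};
    int tmp[8];
    mergesort_bottom_up(arr, tmp, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", arr[i]);
    printf("\n");
    return 0;
}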

Comparing the Impact of the Top-Down and Bottom-Up Methods on Cache Usage

Top-Down Approach

1. Memory Access Pattern:

  • The Top-Down approach uses recursion to divide the array into smaller subarrays, working from the middle of the array toward smaller pieces.
  • Each recursive call operates on non-contiguous subarrays, resulting in non-linear memory access patterns that can lead to higher cache misses.

2. Memory Overhead from Recursion:

  • Each recursive call allocates additional stack memory, which increases memory usage and can interfere with efficient cache utilization.
  • Deep recursion may cause data to be repeatedly loaded into and evicted from the cache, reducing its efficiency.

3. Merge Process:

  • The merging process involves copying subarrays into temporary arrays and operating across different memory regions, increasing the likelihood of cache misses.

Bottom-Up Approach

1. Memory Access Pattern:

  • The Bottom-Up approach starts by merging subarrays of size 1 and gradually doubles the size (e.g., 2, 4, 8, ...).

  • Memory access patterns are typically sequential, making it more cache-efficient and easier to leverage prefetching mechanisms.

2. No Recursion Overhead:

  • Bottom-Up uses an iterative structure, eliminating the need for recursion and its associated stack overhead.
  • This reduces memory pressure on the cache and improves overall cache utilization.

3. Merge Process:

  • During each iteration, the subarrays to be merged are generally adjacent in memory.
  • This contiguous access pattern significantly reduces cache misses compared to the Top-Down approach.

Why Bottom-Up is More Cache-Efficient

1. Sequential Memory Access:

  • Top-Down: Operates on non-contiguous memory regions, resulting in "jumpy" access patterns that lead to higher cache misses.
  • Bottom-Up: Operates on contiguous memory regions, resulting in "sequential" access patterns that are cache-friendly.

2. Cache Prefetching:

  • Modern processors utilize cache prefetching to optimize sequential memory access.
  • Top-Down's irregular access patterns hinder prefetching, whereas Bottom-Up benefits from sequential access, triggering prefetching mechanisms effectively.

3. Recursion Overhead:

  • Top-Down involves recursive stack allocation, which adds to memory and cache pressure.
  • Bottom-Up eliminates recursion, allowing the cache to focus solely on data operations.

4. Data Reuse:

  • Bottom-Up: Operates on smaller, adjacent subarrays, increasing data reuse and reducing cache misses.
  • Top-Down: Subarray operations are spread across the memory, reducing reuse opportunities.

Comparison Table

| Feature               | Top-Down Approach                            | Bottom-Up Approach                            |
|-----------------------|----------------------------------------------|-----------------------------------------------|
| Memory Access Pattern | Irregular, scattered, higher cache miss rate | Sequential, contiguous, lower cache miss rate |
| Cache Prefetching     | Inefficient due to unpredictable patterns    | Efficient due to sequential access            |
| Recursion Overhead    | Requires stack memory, increases cache load  | No recursion, reduces cache pressure          |
| Data Reuse            | Low, due to non-contiguous subarray access   | High, due to contiguous subarray access       |
| Best Use Cases        | Small arrays or scattered data               | Large arrays or high-performance requirements |

C Code: Top-Down

#include <stdio.h>
#include <stdlib.h>

void merge(int *arr, int start, int mid, int end)
{
    int total = end - start + 1;
    //int *array = (int *)malloc(sizeof(int) * total);  // heap-allocated alternative
    int array[total];  // variable-length array stored on the stack

    for (int i = 0; i < total; i++)
        array[i] = arr[i + start];

    int left_c = 0;
    int right_c = mid - start + 1;
    int l_max = mid - start;
    int r_max = end - start;
    int output_c = start;

    while (left_c <= l_max && right_c <= r_max)
    {
        if (array[left_c] <= array[right_c])
        {
            arr[output_c] = array[left_c];
            output_c++;
            left_c++;
        }
        else
        {
            arr[output_c] = array[right_c];
            output_c++;
            right_c++;
        }
    }
    // Only one side can have leftovers; copy them back
    if (left_c <= l_max)
    {
        while (left_c <= l_max)
        {
            arr[output_c] = array[left_c];
            output_c++;
            left_c++;
        }
    }
    else
    {
        while (right_c <= r_max)
        {
            arr[output_c] = array[right_c];
            output_c++;
            right_c++;
        }
    }
}

void mergesort(int *arr, int start, int end)
{
    if (start < end)
    {
        int mid = (end + start) / 2;
        mergesort(arr, start, mid);
        mergesort(arr, mid + 1, end);
        merge(arr, start, mid, end);
    }
}

int main()
{
    int arr[] = {7, -4, 10, -1, 3, 2, -6, 9};
    int size = 8;

    printf("Before Sort : ");
    for (int i = 0; i < size; i++)
        printf("%d , ", arr[i]);
    printf("\n");

    mergesort(arr, 0, size - 1);

    printf("After Sort : ");
    for (int i = 0; i < size; i++)
        printf("%d , ", arr[i]);
    printf("\n");

    return 0;
}


  • merge function:
    1. Create a temporary array array to store the elements from start to end (to avoid directly modifying the original array, making merging easier).

    2. left_c and right_c track the current positions in the left and right subarrays, respectively; l_max and r_max hold the maximum indexes of the left and right subarrays.

    3. If the element in the left subarray is smaller than or equal to the element in the right subarray, copy it back to the original array and advance left_c; otherwise, copy the element from the right subarray and advance right_c.

    4. After the comparisons finish, copy any remaining elements from the left or right subarray back to the original array.

  • mergesort function:
    1. If start < end:

    • Calculate the middle index mid = (start + end) / 2.
    • Recursively call mergesort(arr, start, mid) to sort the left subarray.
    • Recursively call mergesort(arr, mid + 1, end) to sort the right subarray.
    • Merge the sorted left and right subarrays using the merge function.
    2. If start >= end, the subarray contains only one element (which is inherently sorted), so the recursion stops. (The call trace below illustrates this on the example array.)
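For the eight-element array in main (indices 0 to 7), the recursion unfolds as follows, with each merge(start, mid, end) running only after both of its halves are sorted:

mergesort(0,7)                        # mid = 3
    mergesort(0,3)                    # mid = 1
        mergesort(0,1) -> merge(0,0,1)
        mergesort(2,3) -> merge(2,2,3)
        merge(0,1,3)
    mergesort(4,7)                    # mid = 5
        mergesort(4,5) -> merge(4,4,5)
        mergesort(6,7) -> merge(6,6,7)
        merge(4,5,7)
    merge(0,3,7)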

Mergesort Top-Down

.data
size:    .word 8
arr:     .word 7, 4, 10, -1, 3, 2, -6, 9
str1:    .string " Before Sort : "
str2:    .string " After Sort : "
newline: .string "\n"
comma:   .string " , "

.text
############## main ################
main:
    addi sp, sp, -4
    sw ra, 0(sp)

    # print str1
    la a0, str1
    addi a7, x0, 4
    ecall

    # print current array
    la a0, arr
    lw a1, size
    jal ra, printArray

    # do MergeSort(&arr, int start, int end)
    la a0, arr
    addi a1, x0, 0
    lw a2, size
    addi a2, a2, -1
    jal ra, mergeSort

    # print str2
    la a0, str2
    addi a7, x0, 4
    ecall

    # print current array
    la a0, arr
    lw a1, size
    jal ra, printArray

    lw ra, 0(sp)
    addi sp, sp, 4
    addi a7, x0, 10
    ecall
    jalr x0, ra, 0
###########################################

############ mergeSort ##############
# s0 : &addr
# s1 : start
# s2 : end
# s3 : mid = (start+end)/2
mergeSort:
    # sp    : ra
    # sp+4  : &addr
    # sp+8  : start
    # sp+12 : end
    # sp+16 : mid
    addi sp, sp, -20
    sw ra, 0(sp)

    # copy arguments to saved registers
    addi s0, a0, 0              # s0 : &addr
    addi s1, a1, 0              # s1 : start
    addi s2, a2, 0              # s2 : end

    # if(start<end), else go to label(endMergeSort)
    bge s1, s2, endMergeSort

    # use subtraction to implement divide
    # s3 <= mid = (start+end)/2
    add s3, s1, s2              # mid(s3) = start(s1) + end(s2)
    addi s4, x0, 0              # s4 + 1 each time s3 - 2, until s3 < 0
div:
    blt s3, x0, endDiv
    addi s3, s3, -2             # each time s3 - 2
    addi s4, s4, 1              # each time s4 + 1
    jal x0, div
endDiv:
    addi s3, s4, -1             # s3 = mid : floor((start+end)/2)

    # save &addr(s0), start(s1), end(s2), mid(s3) to mem
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    sw s3, 16(sp)

    # prepare arguments for left mergeSort
    # mergeSort(&arr(a0), start(a1), mid(a2))
    addi a0, s0, 0              # a0 <= &addr (s0)
    addi a1, s1, 0              # a1 <= start (s1)
    addi a2, s3, 0              # a2 <= mid (s3)
    jal ra, mergeSort

    # prepare arguments for right mergeSort
    # mergeSort(&arr(a0), mid(a1)+1, end(a2))
    lw a0, 4(sp)                # a0 <= &addr (sp+4)
    lw a1, 16(sp)               # a1 <= mid (sp+16)
    addi a1, a1, 1              # a1 <= mid + 1
    lw a2, 12(sp)               # a2 <= end (sp+12)
    jal ra, mergeSort

    # prepare arguments for merge
    # merge(&arr(a0), start(a1), mid(a2), end(a3))
    lw a0, 4(sp)                # a0 <= &addr (sp+4)
    lw a1, 8(sp)                # a1 <= start (sp+8)
    lw a2, 16(sp)               # a2 <= mid (sp+16)
    lw a3, 12(sp)               # a3 <= end (sp+12)
    jal ra, merge

endMergeSort:
    lw ra, 0(sp)
    addi sp, sp, 20
    jalr x0, ra, 0
###########################################

########### merge ################
# a0 : &addr
# a1 : start
# a2 : mid
# a3 : end
merge:
    # 0. we have a left sorted array(start..mid) & a right sorted array(mid+1..end)
    # 1. copy array[start..end] from memory to stack
    # 2. use 2 indexes (left: from 0, right: from mid-start+1) to compare left & right on the stack
    # 3. write the smaller back to memory, and advance the smaller index
    # 4. check whether either index has passed its max (left: mid-start, right: end-start)
    # 5. if not, go back to step 2; else write the remaining left/right values back to memory

    # calculate how much stack space we need
    # t0 : total count of elements => end - start + 1
    # t1 : total memory size => t0*4
    # t2 : 2's complement of t1 => -t1
    sub t0, a3, a1              # total(t0) = end(a3) - start(a1)
    addi t0, t0, 1              # total(t0) = end(a3) - start(a1) + 1
    add t1, t0, t0              # space(t1) = total(t0)*2
    add t1, t1, t1              # space(t1) = space(t1)*2 = total(t0)*4
    xori t2, t1, 0xffffffff     # 1's complement: t1 -> t2
    addi t2, t2, 1              # 2's complement(t2) = 1's complement(t2) + 1
    add sp, sp, t2

    # copy array[start..end] from memory to stack
    # t2 : current index of new array(stack)
    # t3 : current index of old array(memory)
    addi t3, a1, 0              # memory index(t3) starts from start(a1)
    addi t2, x0, 0              # stack index(t2) starts from 0
read2Stack:
    blt a3, t3, endRead         # while(mem_index(t3) <= end(a3)), else go to label(endRead)
    add t4, t3, t3              # offset(t4) = mem_index(t3)*2
    add t4, t4, t4              # offset(t4) = offset(t4)*2 = mem_index(t3)*4
    add t4, a0, t4              # mem_addr(t4) = &addr(a0) + offset(t4)
    lw t5, 0(t4)                # t5 = *mem_addr(t4)
    add t6, t2, t2              # offset(t6) = st_index(t2)*2
    add t6, t6, t6              # offset(t6) = offset(t6)*2 = st_index(t2)*4
    add t6, sp, t6              # st_addr(t6) = initial_addr(sp) + offset(t6)
    sw t5, 0(t6)                # *st_addr(t6) = t5
    addi t2, t2, 1              # st_index + 1
    addi t3, t3, 1              # mem_index + 1
    jal x0, read2Stack
endRead:
    # do merge
    # view the stack as two parts (left sorted & right sorted), compare them,
    # then store to memory from small to large
    # t2 : left part index => 0
    # t3 : right part index => mid(a2) - start(a1) + 1 => t4 + 1
    # t4 : left terminate index (done when t2 > t4) => mid(a2) - start(a1)
    # t5 : right terminate index (done when t3 > t5) => end(a3) - start(a1)
    # t6 : write-back index of old array(memory) => start(a1)
    sub t4, a2, a1              # left max(t4) = mid(a2) - start(a1)
    sub t5, a3, a1              # right max(t5) = end(a3) - start(a1)
    addi t2, x0, 0              # left index(t2) starts from 0
    addi t3, t4, 1              # right index(t3) starts from mid(a2) - start(a1) + 1
    addi t6, a1, 0              # t6 : write-back index starts from start(a1)

    # while((left index <= left terminate) && (right index <= right terminate)),
    # else go to label(endMergeLoop)
    # t0 : boolean <= (left terminate(t4) < left index(t2))
    # t1 : boolean <= (right terminate(t5) < right index(t3))
    # if t0 or t1 is true, go to label(endMergeLoop); otherwise go on:
    # load the left and right values, compare which is smaller,
    # then save the smaller into the old array(memory)
mergeLoop:
    slt t0, t4, t2              # t0 = (left terminate(t4) < left index(t2))
    slt t1, t5, t3              # t1 = (right terminate(t5) < right index(t3))
    or t0, t0, t1               # t0 = t0 || t1
    xori t0, t0, 0x1            # t0 = ~t0
    beq t0, x0, endMergeLoop    # if both indexes are in range, continue; else go to label(endMergeLoop)
    add t0, t2, t2              # offset(t0) = left_index(t2)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = left_index(t2)*4
    add t0, sp, t0              # left_addr(t0) = initial_addr(sp) + offset(t0)
    lw t0, 0(t0)                # left_value(t0) = *left_addr(t0)
    add t1, t3, t3              # offset(t1) = right_index(t3)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = right_index(t3)*4
    add t1, sp, t1              # right_addr(t1) = initial_addr(sp) + offset(t1)
    lw t1, 0(t1)                # right_value(t1) = *right_addr(t1)
    blt t1, t0, rightSmaller    # if(left_value(t0) <= right_value(t1)), else go to label(rightSmaller)
    add t1, t6, t6              # offset(t1) = mem_index(t6)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = mem_index(t6)*4
    add t1, a0, t1              # mem_addr(t1) = &addr(a0) + offset(t1)
    sw t0, 0(t1)                # *mem_addr(t1) = left_value(t0)
    addi t6, t6, 1              # mem_index + 1
    addi t2, t2, 1              # left_index + 1
    jal x0, mergeLoop
rightSmaller:
    add t0, t6, t6              # offset(t0) = mem_index(t6)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = mem_index(t6)*4
    add t0, a0, t0              # mem_addr(t0) = &addr(a0) + offset(t0)
    sw t1, 0(t0)                # *mem_addr(t0) = right_value(t1)
    addi t6, t6, 1              # mem_index + 1
    addi t3, t3, 1              # right_index + 1
    jal x0, mergeLoop
endMergeLoop:
    # if the left part still has elements, store the leftovers into the old array
    # if the right part still has elements, store the leftovers into the old array
    bge t5, t3, rightLoop       # if(right index <= right terminate) => right part still has elements, go to label(rightLoop)
leftLoop:                       # else fall through to label(leftLoop)
    add t0, t2, t2              # offset(t0) = left_index(t2)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = left_index(t2)*4
    add t0, sp, t0              # left_addr(t0) = initial_addr(sp) + offset(t0)
    lw t0, 0(t0)                # left_value(t0) = *left_addr(t0)
    add t1, t6, t6              # offset(t1) = mem_index(t6)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = mem_index(t6)*4
    add t1, a0, t1              # mem_addr(t1) = &addr(a0) + offset(t1)
    sw t0, 0(t1)                # *mem_addr(t1) = left_value(t0)
    addi t6, t6, 1              # mem_index + 1
    addi t2, t2, 1              # left_index + 1
    bge t4, t2, leftLoop        # check whether the left part still has elements
    jal x0, endMerge
rightLoop:
    add t1, t3, t3              # offset(t1) = right_index(t3)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = right_index(t3)*4
    add t1, sp, t1              # right_addr(t1) = initial_addr(sp) + offset(t1)
    lw t1, 0(t1)                # right_value(t1) = *right_addr(t1)
    add t0, t6, t6              # offset(t0) = mem_index(t6)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = mem_index(t6)*4
    add t0, a0, t0              # mem_addr(t0) = &addr(a0) + offset(t0)
    sw t1, 0(t0)                # *mem_addr(t0) = right_value(t1)
    addi t6, t6, 1              # mem_index + 1
    addi t3, t3, 1              # right_index + 1
    bge t5, t3, rightLoop       # check whether the right part still has elements
    jal x0, endMerge

    # release stack space
    # t0 : total count of elements => end - start + 1
    # t1 : total memory size => t0*4
endMerge:
    sub t0, a3, a1              # total(t0) = end(a3) - start(a1)
    addi t0, t0, 1              # total(t0) = end(a3) - start(a1) + 1
    add t1, t0, t0              # space(t1) = total(t0)*2
    add t1, t1, t1              # space(t1) = space(t1)*2 = total(t0)*4
    add sp, sp, t1
    jalr x0, ra, 0
###########################################

########### printArray ###############
printArray:
    addi t0, a0, 0
    addi t1, a1, 0
    addi t2, x0, 0
printLoop:
    add t3, t2, t2              # i*2
    add t3, t3, t3              # i*2 again ==> i*4
    add t3, t0, t3
    lw a0, 0(t3)
    addi a7, x0, 1
    ecall
    addi t2, t2, 1
    bge t2, t1, endPrint
    la a0, comma
    addi a7, x0, 4
    ecall
    jal x0, printLoop
endPrint:
    la a0, newline
    addi a7, x0, 4
    ecall
    jalr x0, ra, 0
###########################################
  • mergeSort Function:
    The mergeSort function is the recursive implementation of Merge Sort. It is responsible for repeatedly splitting the array into smaller subarrays until each subarray contains only one element. When start >= end, the recursion ends, as the subarray is already sorted. The midpoint (mid) is computed without a division instruction: the code repeatedly subtracts 2 from start + end and counts the iterations, yielding floor((start + end) / 2). After splitting, the left and right subarrays are recursively sorted, and then the merge function is called to combine the two sorted subarrays.
  • merge function:
    The merge function is responsible for merging two sorted subarrays into a single sorted array. It begins by copying the data within the start to end range into the stack for temporary storage. Then, it uses two pointers to traverse the left and right subarrays, comparing the current elements from each subarray. The smaller element is written back to the original array, and the corresponding pointer is moved forward. If one subarray is exhausted, the remaining elements of the other subarray are directly copied back into the original array. After merging, the allocated stack space is released.

Mergesort Bottom-Up

.data
size:    .word 8
arr:     .word 7, 4, 10, -1, 3, 2, -6, 9
str1:    .string " Before Sort : "
str2:    .string " After Sort : "
newline: .string "\n"
comma:   .string " , "

.text
main:
    addi sp, sp, -4
    sw ra, 0(sp)

    # print str1
    la a0, str1
    addi a7, x0, 4
    ecall

    # print current array
    la a0, arr
    lw a1, size
    jal ra, printArray

    # do iterative MergeSort
    la a0, arr
    lw a1, size
    jal ra, iterativeMergeSort

    # print str2
    la a0, str2
    addi a7, x0, 4
    ecall

    # print sorted array
    la a0, arr
    lw a1, size
    jal ra, printArray

    lw ra, 0(sp)
    addi sp, sp, 4
    addi a7, x0, 10
    ecall
    jalr x0, ra, 0

# Iterative MergeSort
# a0: array address
# a1: array size
iterativeMergeSort:
    addi sp, sp, -24
    sw ra, 0(sp)
    sw s0, 4(sp)                # array address
    sw s1, 8(sp)                # size
    sw s2, 12(sp)               # current width
    sw s3, 16(sp)               # left start
    sw s4, 20(sp)               # iteration counter
    mv s0, a0                   # save array address
    mv s1, a1                   # save size

    # Start with merging subarrays of size 1
    addi s2, x0, 1              # width starts from 1
width_loop:
    bge s2, s1, end_sort        # if width >= size, we're done

    # For each width, merge subarrays
    addi s3, x0, 0              # left = 0
left_loop:
    # Calculate right start and end
    add t0, s3, s2              # mid = left + width
    add t1, t0, s2              # right_end = mid + width

    # Ensure right_end doesn't exceed array size
    bge t1, s1, adjust_end
    j do_merge
adjust_end:
    mv t1, s1                   # right_end = size
do_merge:
    # If there are elements to merge
    bge s3, t0, next_iteration

    # Prepare arguments for merge
    mv a0, s0                   # array address
    mv a1, s3                   # left start
    addi a2, t0, -1             # mid - 1
    addi a3, t1, -1             # right_end - 1
    jal ra, merge

next_iteration:
    add s3, s3, s2              # left += width
    add s3, s3, s2              # left += width (total: left += 2*width)
    blt s3, s1, left_loop

    # Double the width for next iteration
    add s2, s2, s2              # width *= 2
    j width_loop

end_sort:
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    addi sp, sp, 24
    jalr x0, ra, 0

# Merge function (same as the Top-Down version)
merge:
    sub t0, a3, a1
    addi t0, t0, 1
    add t1, t0, t0
    add t1, t1, t1
    xori t2, t1, 0xffffffff
    addi t2, t2, 1
    add sp, sp, t2
    addi t3, a1, 0
    addi t2, x0, 0
read2Stack:
    blt a3, t3, endRead
    add t4, t3, t3
    add t4, t4, t4
    add t4, a0, t4
    lw t5, 0(t4)
    add t6, t2, t2
    add t6, t6, t6
    add t6, sp, t6
    sw t5, 0(t6)
    addi t2, t2, 1
    addi t3, t3, 1
    j read2Stack
endRead:
    sub t4, a2, a1
    sub t5, a3, a1
    addi t2, x0, 0
    addi t3, t4, 1
    addi t6, a1, 0
mergeLoop:
    slt t0, t4, t2
    slt t1, t5, t3
    or t0, t0, t1
    xori t0, t0, 0x1
    beq t0, x0, endMergeLoop
    add t0, t2, t2
    add t0, t0, t0
    add t0, sp, t0
    lw t0, 0(t0)
    add t1, t3, t3
    add t1, t1, t1
    add t1, sp, t1
    lw t1, 0(t1)
    blt t1, t0, rightSmaller
    add t1, t6, t6
    add t1, t1, t1
    add t1, a0, t1
    sw t0, 0(t1)
    addi t6, t6, 1
    addi t2, t2, 1
    j mergeLoop
rightSmaller:
    add t0, t6, t6
    add t0, t0, t0
    add t0, a0, t0
    sw t1, 0(t0)
    addi t6, t6, 1
    addi t3, t3, 1
    j mergeLoop
endMergeLoop:
    bge t5, t3, rightLoop
leftLoop:
    add t0, t2, t2
    add t0, t0, t0
    add t0, sp, t0
    lw t0, 0(t0)
    add t1, t6, t6
    add t1, t1, t1
    add t1, a0, t1
    sw t0, 0(t1)
    addi t6, t6, 1
    addi t2, t2, 1
    bge t4, t2, leftLoop
    j endMerge
rightLoop:
    add t1, t3, t3
    add t1, t1, t1
    add t1, sp, t1
    lw t1, 0(t1)
    add t0, t6, t6
    add t0, t0, t0
    add t0, a0, t0
    sw t1, 0(t0)
    addi t6, t6, 1
    addi t3, t3, 1
    bge t5, t3, rightLoop
endMerge:
    sub t0, a3, a1
    addi t0, t0, 1
    add t1, t0, t0
    add t1, t1, t1
    add sp, sp, t1
    jalr x0, ra, 0

# Print Array function (same as the Top-Down version)
printArray:
    addi t0, a0, 0
    addi t1, a1, 0
    addi t2, x0, 0
printLoop:
    add t3, t2, t2
    add t3, t3, t3
    add t3, t0, t3
    lw a0, 0(t3)
    addi a7, x0, 1
    ecall
    addi t2, t2, 1
    bge t2, t1, endPrint
    la a0, comma
    addi a7, x0, 4
    ecall
    j printLoop
endPrint:
    la a0, newline
    addi a7, x0, 4
    ecall
    jalr x0, ra, 0
  • iterativeMergeSort function:
    The core logic of iterativeMergeSort lies in simulating the recursive divide-and-conquer process with loops that gradually merge subarrays. Initially, the size of each subarray is set to 1, meaning it starts by merging single-element subarrays. As the iterations progress, the subarray size doubles, from 1 to 2, then to 4, and so on, until the entire array is sorted. In each iteration, the program calculates the left and right boundaries of the current subarrays to be processed, ensuring the boundaries do not exceed the array's size, and then calls the merge function to combine the left and right subarrays into a single sorted subarray (see the pass trace after this list).
  • merge function:
    The merge function is responsible for merging two sorted subarrays. It begins by copying the elements of the subarrays into the stack as a temporary storage area, and then uses the two-pointer method to traverse both subarrays simultaneously, writing the smaller element back into the original array one at a time. Once one of the subarrays is fully processed, the remaining elements from the other subarray are directly copied into the array. Finally, the temporary stack space is released to prevent memory wastage.
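Concretely, for the eight-element example array the passes of iterativeMergeSort issue these merge(start, mid, end) calls:

width = 1 : merge(0,0,1)  merge(2,2,3)  merge(4,4,5)  merge(6,6,7)   -> sorted runs of length 2
width = 2 : merge(0,1,3)  merge(4,5,7)                               -> sorted runs of length 4
width = 4 : merge(0,3,7)                                             -> the whole array is sorted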

Performance Analysis

Execution Info

| Metric          | Top-Down | Bottom-Up |
|-----------------|----------|-----------|
| Cycles          | 2032     | 1616      |
| Instrs. retired | 1564     | 1280      |
| CPI             | 1.3      | 1.26      |
| IPC             | 0.77     | 0.792     |
  • Starting with Cycles, the Top-Down approach takes 2032 cycles to complete execution, whereas the Bottom-Up approach requires only 1616 cycles, indicating that the Bottom-Up method is faster overall. Looking at Instructions Retired, the Top-Down approach completes 1564 instructions, while the Bottom-Up approach only needs 1280 instructions to accomplish the same task. This shows that Bottom-Up is more efficient in design, reducing the number of operations needed for execution.
  • In terms of CPI (the average number of cycles required per instruction), the Top-Down approach has a CPI of 1.3, while the Bottom-Up approach achieves a lower CPI of 1.26. This means that the Bottom-Up approach requires fewer cycles per instruction on average, indicating slightly higher efficiency in the processor's pipeline. Conversely, looking at IPC (the number of instructions executed per cycle), the Bottom-Up approach achieves 0.792, which is higher than the 0.77 achieved by the Top-Down approach. This demonstrates that the Bottom-Up method utilizes processor resources and instruction parallelism more effectively. The arithmetic behind these derived metrics is spelled out below.
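Both derived metrics follow from the raw counts, since CPI = Cycles / Instructions retired and IPC is its reciprocal:

Top-Down : CPI = 2032 / 1564 ≈ 1.30    IPC = 1564 / 2032 ≈ 0.77
Bottom-Up: CPI = 1616 / 1280 ≈ 1.26    IPC = 1280 / 1616 ≈ 0.792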

Cache configuration

| Repl. policy | LRU            |
|--------------|----------------|
| Wr. hit      | Write-back     |
| Wr. miss     | Write allocate |

Statistics

| Statistic | Top-Down | Bottom-Up |
|-----------|----------|-----------|
| Hit rate  | 0.9627   | 0.9521    |
| Hits      | 232      | 139       |
| Misses    | 9        | 7         |
  • While the Top-Down method achieves a slightly higher hit rate of 96.27% compared to Bottom-Up's 95.21%, the Bottom-Up approach significantly reduces the total number of memory accesses, with only 146 accesses (139 hits and 7 misses) compared to Top-Down's 241 accesses (232 hits and 9 misses). This indicates that Bottom-Up requires far fewer memory operations during execution, primarily due to its more structured and sequential memory access pattern. With both fewer total accesses and fewer misses (7 versus 9), the Bottom-Up approach achieves higher overall efficiency, which is reflected in its lower total cycle count. The hit rates follow directly from these counts, as verified below.
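Hit rate = Hits / (Hits + Misses):

Top-Down : 232 / (232 + 9) = 232 / 241 ≈ 0.9627   (241 total accesses)
Bottom-Up: 139 / (139 + 7) = 139 / 146 ≈ 0.9521   (146 total accesses)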

DFS

Basic Concept

Depth-First Search (DFS) is a classic algorithm for traversing a graph or tree. Its key characteristic is to explore as deeply as possible along one branch before backtracking to explore other unvisited branches. Recursive implementation is a common and natural way to implement DFS, as it leverages the function call stack to handle backtracking.

Disadvantages of Using Recursive DFS

1. Stack Overflow

  • Recursive DFS relies on the function call stack to manage the traversal. For each recursive call, a new frame is pushed onto the stack.
  • If the graph is large or has deep paths (e.g., in a tree-like structure), the recursion depth can exceed the system's stack limit, leading to a stack overflow error.

2. Inefficient Memory Usage

  • Each recursive call stores information on the call stack, such as parameters passed to the function, local variables, and return addresses.
  • As the recursion depth increases, the stack memory usage grows linearly with the depth (see the explicit-stack sketch after this list).
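To make the alternative concrete, here is a minimal sketch of DFS with an explicit stack, which replaces the call stack with a single heap-allocated array. It assumes the struct Node/struct AdjList types from the C code in the next section; DFSIter is an illustrative name, not part of the original program. Marking vertices when they are pushed bounds the stack at V entries, though the visit order can differ slightly from the recursive version.

/* Assumes the adjacency-list types defined in the C code below:
 *   struct Node { int dest; struct Node* next; };
 *   struct AdjList { struct Node* head; };
 */
void DFSIter(struct AdjList adj[], int V, int s)
{
    int visited[V];
    for (int i = 0; i < V; i++)
        visited[i] = 0;

    int *stack = malloc(V * sizeof(int));  // at most V pending vertices
    int top = 0;

    visited[s] = 1;                        // mark on push, so each vertex
    stack[top++] = s;                      // enters the stack at most once

    while (top > 0) {
        int v = stack[--top];              // pop the most recently pushed vertex
        printf("%d ", v);

        // Push all unvisited neighbors; the last one pushed is
        // explored first, giving depth-first order
        for (struct Node* cur = adj[v].head; cur != NULL; cur = cur->next) {
            if (!visited[cur->dest]) {
                visited[cur->dest] = 1;
                stack[top++] = cur->dest;
            }
        }
    }
    free(stack);
}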

DFS C code

#include <stdio.h>
#include <stdlib.h>

// Node structure for adjacency list
struct Node {
    int dest;
    struct Node* next;
};

// Structure to represent an adjacency list
struct AdjList {
    struct Node* head;
};

// Function to create a new adjacency list node
struct Node* createNode(int dest)
{
    struct Node* newNode = (struct Node*)malloc(sizeof(struct Node));
    newNode->dest = dest;
    newNode->next = NULL;
    return newNode;
}

// Recursive function for DFS traversal
void DFSRec(struct AdjList adj[], int visited[], int s)
{
    // Mark the current vertex as visited
    visited[s] = 1;

    // Print the current vertex
    printf("%d ", s);

    // Traverse all adjacent vertices that are not visited yet
    struct Node* current = adj[s].head;
    while (current != NULL) {
        int dest = current->dest;
        if (!visited[dest]) {
            DFSRec(adj, visited, dest);
        }
        current = current->next;
    }
}

// Main DFS function that initializes the visited array
// and calls DFSRec
void DFS(struct AdjList adj[], int V, int s)
{
    // Initialize visited array to false
    int visited[V];
    for (int i = 0; i < V; i++)
        visited[i] = 0;
    DFSRec(adj, visited, s);
}

// Function to add an edge to the adjacency list
void addEdge(struct AdjList adj[], int s, int t)
{
    // Add edge from s to t
    struct Node* newNode = createNode(t);
    newNode->next = adj[s].head;
    adj[s].head = newNode;

    // Add edge from t to s because the graph is undirected
    newNode = createNode(s);
    newNode->next = adj[t].head;
    adj[t].head = newNode;
}

int main()
{
    int V = 5;

    // Create an array of adjacency lists
    struct AdjList adj[V];

    // Initialize each adjacency list as empty
    for (int i = 0; i < V; i++) {
        adj[i].head = NULL;
    }

    int E = 5;

    // Define the edges of the graph
    int edges[][2] = {{1, 2}, {1, 0}, {2, 0}, {2, 3}, {2, 4}};

    // Populate the adjacency list with edges
    for (int i = 0; i < E; i++) {
        addEdge(adj, edges[i][0], edges[i][1]);
    }

    int source = 1;
    printf("DFS from source: %d\n", source);
    DFS(adj, V, source);

    return 0;
}
  • createNode function:
    The createNode function is responsible for creating a new node. It takes the destination node number dest as a parameter, allocates memory for a new Node structure, sets dest as the value of the node, and initializes its next pointer to NULL. Finally, it returns the pointer to the newly created node.

  • DFSRec function:
    The DFSRec function implements the recursive depth-first search. Starting from the given node s, it marks the node as visited and prints it. Then, it iterates through all the neighbors of node s. For each unvisited neighbor, it recursively calls DFSRec until all reachable nodes from s have been visited.

  • DFS function:
    The DFS function is the entry point for the depth-first search. It initializes a visited array visited to track which nodes have been visited, and then calls the recursive DFSRec function, starting the traversal from the specified source node s.

DFS

.data
# 5x5 adjacency matrix for the graph
adj_matrix:
    .word 0,1,1,0,0         # node 0's connections
    .word 1,0,1,0,0         # node 1's connections
    .word 1,1,0,1,1         # node 2's connections
    .word 0,0,1,0,0         # node 3's connections
    .word 0,0,1,0,0         # node 4's connections
visited: .word 0,0,0,0,0    # visited array
msg1:    .string "DFS starting from node: "
arrow:   .string " -> "
newline: .string "\n"

.text
.global main
main:
    # Print start message
    la a0, msg1
    li a7, 4
    ecall

    # Print starting node (1)
    li a0, 1
    li a7, 1
    ecall
    la a0, newline
    li a7, 4
    ecall

    # Initialize visited array
    la t0, visited
    sw zero, 0(t0)
    sw zero, 4(t0)
    sw zero, 8(t0)
    sw zero, 12(t0)
    sw zero, 16(t0)

    # Start DFS from node 1
    li a0, 1
    jal dfs

    # Print final newline
    la a0, newline
    li a7, 4
    ecall

    # Exit program
    li a7, 10
    ecall

# DFS function
dfs:
    # Save return address and s0-s2
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s2, 0(sp)

    # Save current node in s0
    mv s0, a0

    # Check if already visited
    la t0, visited
    slli t1, s0, 2
    add t1, t0, t1
    lw t2, 0(t1)
    bnez t2, dfs_return         # if already visited, return

    # Mark as visited
    li t2, 1
    sw t2, 0(t1)

    # Print current node
    mv a0, s0
    li a7, 1
    ecall

    # Set up for neighbor checking
    mv s1, zero                 # Initialize neighbor counter
    li s2, 0                    # Flag to track if we found any unvisited neighbors

check_neighbors:
    li t0, 5
    beq s1, t0, after_neighbors # If checked all neighbors, go to after_neighbors

    # Calculate address of current edge in adjacency matrix
    la t0, adj_matrix
    li t1, 20                   # 5 nodes * 4 bytes per row
    mul t1, s0, t1              # Get row offset
    add t0, t0, t1              # Get row address
    slli t1, s1, 2              # Get column offset
    add t0, t0, t1              # Get edge address
    lw t1, 0(t0)                # Load edge value

    # If no edge to this neighbor, skip
    beqz t1, next_neighbor

    # Check if neighbor is unvisited
    la t0, visited
    slli t1, s1, 2
    add t0, t0, t1
    lw t1, 0(t0)
    bnez t1, next_neighbor      # If visited, skip

    # Print arrow before visiting new node
    la a0, arrow
    li a7, 4
    ecall

    # Visit neighbor through recursive call
    mv a0, s1
    li s2, 1                    # Set flag indicating we found an unvisited neighbor
    jal dfs

next_neighbor:
    addi s1, s1, 1              # Increment neighbor counter
    j check_neighbors

after_neighbors:
    beqz s2, dfs_return         # If no unvisited neighbors were found, return without printing arrow

dfs_return:
    # Restore saved registers
    lw s2, 0(sp)
    lw s1, 4(sp)
    lw s0, 8(sp)
    lw ra, 12(sp)
    addi sp, sp, 16
    ret
  • main function:
    The main function serves as the program's entry point and is responsible for overall control flow. It starts by printing an initial message and initializing the visited array to mark all nodes as unvisited. Then, it begins the depth-first search (DFS) traversal by calling the dfs function starting from node 1. After the traversal is complete, the main function prints a newline and terminates the program using a system call.

  • dfs function:
    The dfs function implements the recursive depth-first search. Upon entering the function, it saves the current node and register states onto the stack to preserve the context during recursive calls. It then checks if the current node has already been visited. If not, it marks the node as visited and prints its value. The function proceeds to iterate over the neighbors of the current node. For any unvisited neighbor, it recursively calls dfs to continue the traversal. Once all neighbors of the current node are processed, the function restores the saved register states and returns to the previous level.

  • Neighbor Checking Logic:
    The neighbor-checking logic, embedded within the dfs function, is responsible for iterating over all neighbors of the current node. Using the adjacency matrix, the function calculates each neighbor's position and checks whether there is an edge connection (matrix value of 1) and whether the neighbor is unvisited. If both conditions are met, the program outputs an arrow symbol to indicate a connection and recursively calls dfs to traverse that neighbor. Once all neighbors are processed, the checking logic ends, and dfs starts to backtrack. (A C-level paraphrase follows this list.)
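In C-like terms, the assembly's neighbor loop corresponds to the following; this is an illustrative paraphrase of the program above (v mirrors s0, n mirrors the s1 counter), not a separate implementation from the original write-up:

#include <stdio.h>

// Same graph as the assembly: 5x5 adjacency matrix and visited array
int adj_matrix[5][5] = {
    {0,1,1,0,0},
    {1,0,1,0,0},
    {1,1,0,1,1},
    {0,0,1,0,0},
    {0,0,1,0,0},
};
int visited[5];

void dfs(int v)
{
    if (visited[v]) return;        // corresponds to the bnez t2, dfs_return check
    visited[v] = 1;                // mark as visited
    printf("%d", v);

    for (int n = 0; n < 5; n++) {  // s1: neighbor counter, columns 0..4
        if (adj_matrix[v][n] && !visited[n]) {
            printf(" -> ");        // print arrow before visiting new node
            dfs(n);
        }
    }
}

int main(void)
{
    printf("DFS starting from node: 1\n");
    dfs(1);                        // prints 1 -> 0 -> 2 -> 3 -> 4
    printf("\n");
    return 0;
}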

Optimized DFS

.data
# 5x5 adjacency matrix for the graph
adj_matrix:
    .word 0,1,1,0,0         # node 0's connections
    .word 1,0,1,0,0         # node 1's connections
    .word 1,1,0,1,1         # node 2's connections
    .word 0,0,1,0,0         # node 3's connections
    .word 0,0,1,0,0         # node 4's connections
visited: .word 0,0,0,0,0    # visited array
msg1:    .string "DFS starting from node: "
arrow:   .string " -> "
newline: .string "\n"

.text
.global main
main:
    # Print start message
    la a0, msg1
    li a7, 4
    ecall

    # Print starting node (1)
    li a0, 1
    li a7, 1
    ecall
    la a0, newline
    li a7, 4
    ecall

    # Initialize visited array
    la t0, visited
    sw zero, 0(t0)
    sw zero, 4(t0)
    sw zero, 8(t0)
    sw zero, 12(t0)
    sw zero, 16(t0)

    # Start optimized DFS from node 1
    li a0, 1
    jal optimized_dfs

    # Print final newline
    la a0, newline
    li a7, 4
    ecall

    # Exit program
    li a7, 10
    ecall

# Optimized DFS implementation
# a0: current node
optimized_dfs:
    # Save return address and s registers
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s2, 0(sp)

    # Save current node in s0
    mv s0, a0

    # Mark as visited and print current node
    la t0, visited
    slli t1, s0, 2
    add t1, t0, t1
    li t2, 1
    sw t2, 0(t1)                # Mark visited
    mv a0, s0                   # Print node
    li a7, 1
    ecall

    # Initialize neighbor counter
    li s1, 0                    # Current neighbor

check_neighbors:
    li t0, 5
    beq s1, t0, dfs_return      # If checked all neighbors, return

    # Load adjacency info (cached in register)
    la t0, adj_matrix
    li t1, 20                   # 5 nodes * 4 bytes
    mul t1, s0, t1              # Row offset
    add t0, t0, t1
    slli t1, s1, 2              # Column offset
    add t0, t0, t1
    lw t2, 0(t0)                # Load edge value

    # If no edge, skip
    beqz t2, next_neighbor

    # Check if neighbor is visited
    la t0, visited
    slli t1, s1, 2
    add t1, t0, t1
    lw t2, 0(t1)
    bnez t2, next_neighbor

    # Found unvisited neighbor - print arrow and visit
    la a0, arrow
    li a7, 4
    ecall

    # Recursively visit neighbor
    mv a0, s1
    jal optimized_dfs

next_neighbor:
    addi s1, s1, 1
    j check_neighbors

dfs_return:
    # Restore registers and return
    lw s2, 0(sp)
    lw s1, 4(sp)
    lw s0, 8(sp)
    lw ra, 12(sp)
    addi sp, sp, 16
    ret
  • main function:
    The main function serves as the entry point of the program and coordinates the execution flow. It starts by printing a message indicating the start of DFS traversal and initializes the visited array, marking all nodes as unvisited. It then begins the traversal from node 1 by invoking the optimized_dfs function. Once the traversal is complete, the function prints a newline and exits the program using a system call.

  • optimized_dfs function
    The optimized_dfs function is the recursive implementation of depth-first search. It begins by saving the return address and relevant registers onto the stack to preserve the context during recursion. The current node is then marked as visited in the visited array, and its value is printed. The function iterates through all potential neighbors of the current node by consulting the adjacency matrix. If a neighbor is connected (edge exists) and unvisited, the function prints an arrow symbol to indicate traversal and recursively calls itself to explore the neighbor. Once all neighbors are processed, the function restores the saved register values from the stack and returns to the previous level of recursion.

  • Neighbor Checking Logic:
    The neighbor-checking logic is embedded within the optimized_dfs function. It calculates the position of each neighbor in the adjacency matrix by determining the row and column offsets for the current node. For each neighbor, it checks if there is a connecting edge and whether the neighbor has already been visited. If both conditions are satisfied, the neighbor is identified as an unvisited connection, and the traversal proceeds to recursively explore it. This logic ensures that all reachable nodes are explored while avoiding revisits to already visited nodes.

Comparison of Optimized DFS and DFS

1. Efficiency of Memory Address Calculation

Opt DFS:

# Calculate the base address of the visited array and store it in t0
la t0, visited
slli t1, s0, 2      # Calculate the offset for the current node in visited
add t1, t0, t1      # Compute the access address
li t2, 1
sw t2, 0(t1)        # Mark the current node as visited
  • The base address of visited (la t0, visited) is calculated only once and stored in a register. Subsequent operations reuse this base address, reducing redundant calculations.

  • The offset for the current node (s0) is calculated once and used directly in subsequent memory access.

DFS:

# Recalculate the base address of the visited array for every neighbor
la t0, visited
slli t1, s1, 2      # Calculate the offset for the neighbor node in visited
add t0, t0, t1      # Compute the access address
lw t1, 0(t0)        # Check if it is already visited
  • The base address of visited is recalculated for every neighbor, leading to unnecessary computations and additional memory access.

2. Efficiency of Adjacency Matrix Address Calculation

Opt DFS:

# Calculate the base address of the adjacency matrix row
la t0, adj_matrix
li t1, 20           # Each row has 5 nodes, each node takes 4 bytes
mul t1, s0, t1      # Compute the row offset for the current node
add t0, t0, t1      # Compute the row base address (only calculated once)
  • The base address of the adjacency matrix row for the current node is calculated only once. The result is stored in a register and reused for each neighbor, avoiding repeated calculations.
  • For each neighbor check, only the column offset needs to be added, making it efficient.

DFS:

# Recalculate the row base address for every neighbor
la t0, adj_matrix
li t1, 20           # Each row has 5 nodes, each node takes 4 bytes
mul t1, s0, t1      # Compute the row offset for the current node
add t0, t0, t1      # Compute the row base address
slli t1, s1, 2      # Compute the column offset
add t0, t0, t1      # Compute the final address
lw t2, 0(t0)        # Read the adjacency value
  • The row base address and column offset are recalculated for every neighbor check, introducing redundant operations and memory accesses.

3. Cache Hit Rate and Access Patterns

Opt DFS:

  • The base addresses of the adjacency matrix and visited array are reused multiple times, exhibiting temporal locality in cache usage.
  • Accesses to the adjacency matrix and visited array are predictable and concentrated, resulting in a high cache hit rate.

DFS:

  • Recalculating the base addresses and offsets for every neighbor leads to more scattered memory access patterns.
  • These less predictable memory access patterns result in lower cache hit rates and increased main memory access.

4. Simplified Conditional Logic

Opt DFS:

# Efficiently skip already visited neighbors
la t0, visited
slli t1, s1, 2
add t1, t0, t1
lw t2, 0(t1)        # Check if the neighbor is already visited
bnez t2, next_neighbor
  • The conditional check for whether a neighbor is visited is separated from address calculations. It operates directly on already-computed values in registers, reducing complexity.

DFS:

# Mixed conditional checks and redundant calculations
la t0, adj_matrix
li t1, 20
mul t1, s0, t1      # Calculate row offset
add t0, t0, t1
slli t1, s1, 2      # Calculate column offset
add t0, t0, t1
lw t1, 0(t0)        # Check adjacency value
beqz t1, next_neighbor
  • Conditional checks are intertwined with redundant address calculations, leading to additional overhead during each neighbor check.

Performance Analysis

Execution Info

| Metric          | DFS   | Opt DFS |
|-----------------|-------|---------|
| Cycles          | 777   | 744     |
| Instrs. retired | 579   | 555     |
| CPI             | 1.34  | 1.34    |
| IPC             | 0.745 | 0.746   |
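As before, the derived metrics match the raw counts:

DFS    : CPI = 777 / 579 ≈ 1.34    IPC = 579 / 777 ≈ 0.745
Opt DFS: CPI = 744 / 555 ≈ 1.34    IPC = 555 / 744 ≈ 0.746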

Opt DFS consumes fewer CPU cycles because it eliminates redundant calculations and minimizes memory accesses by efficiently reusing values stored in registers. For example, the base address of the adjacency matrix for a node is computed only once and stored in a register, which is then reused for all neighbor checks. This avoids recalculating the same base address multiple times, as seen in DFS. Similarly, the base address of the visited array is calculated once and reused, reducing the overhead associated with repeated address computations.

In DFS, every neighbor check recalculates the row base address and column offset for the adjacency matrix and re-accesses the visited array's base address. These repeated operations increase the number of executed instructions, leading to more cycles. Additionally, Opt DFS's streamlined structure allows for better instruction-level parallelism, as the CPU can execute fewer dependent instructions in sequence.

By reducing redundant operations and leveraging efficient memory and register usage, Opt DFS lowers the total instruction count and improves execution efficiency, directly reducing the number of CPU cycles required.

Cache configuration

| Repl. policy | LRU            |
|--------------|----------------|
| Wr. hit      | Write-back     |
| Wr. miss     | Write allocate |

Statistics

| Statistic | DFS    | Opt DFS |
|-----------|--------|---------|
| Hit rate  | 0.8667 | 0.8588  |
| Hits      | 78     | 73      |
| Misses    | 12     | 12      |

It’s evident that while Opt DFS is optimized, its cache hit rate is slightly lower than that of DFS. This is because Opt DFS's optimization focuses on significantly reducing the total number of memory accesses, whereas DFS performs far more frequent memory accesses due to redundant calculations and repeated base-address recalculations. These additional memory accesses in DFS inflate its hit count and hit rate, but they also introduce unnecessary overhead.

On the other hand, both implementations have the same number of cache misses, which suggests that they both encounter memory blocks that cannot fully benefit from the cache. This is likely due to the adjacency matrix and visited array involving memory blocks that lack spatial locality, leading to some unavoidable cache misses.

The critical difference lies in the total number of memory accesses. Opt DFS reduces the number of accesses by reusing computed addresses stored in registers, while DFS repeatedly calculates and accesses memory for each neighbor. As a result, Opt DFS achieves fewer overall memory accesses and therefore consumes fewer CPU cycles, making it much more efficient than DFS.