Case Study on Performance Improvements

陳奕嘉

GitHub

Merge Sort

Merge sort is a fundamental sorting algorithm based on the Divide and Conquer paradigm. It is a stable sort with a time complexity of O(n log n).

Basic Concept

Merge Sort works by dividing the sequence to be sorted into smaller subsequences, sorting each subsequence, and finally merging them to produce the complete sorted result. For example, [7, 4, 10, -1] is split into [7, 4] and [10, -1]; each half is sorted to give [4, 7] and [-1, 10], and merging the two halves yields [-1, 4, 7, 10].

Top-Down and Bottom-Up Methods

  • Top-Down: Uses recursion to repeatedly divide the array until each subsequence has only one element, then merges these subsequences from small to large to gradually build the sorted array.

    • Disadvantage: consumes call-stack space proportional to the recursion depth.
  • Bottom-Up: Without recursion, starts from the smallest subsequences and gradually builds larger sorted subsequences until the entire array is sorted (a C sketch of this variant follows this list).

    • Advantages: avoids recursion, is suitable for non-recursive environments, and has low stack-space requirements.
    • Disadvantage: the program structure is slightly more complex, and the merging strategy requires additional attention.
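Only the Top-Down variant is implemented in C later in this write-up; for contrast, here is a minimal C sketch of the Bottom-Up variant. The names mergesort_bottom_up and merge_range are illustrative, not part of the original code, and the caller supplies the temporary buffer.

#include <stdio.h>
#include <string.h>

/* Merge arr[left..mid] and arr[mid+1..right] through a temporary buffer,
 * using the same copy-then-two-pointer scheme as the Top-Down merge. */
static void merge_range(int *arr, int *tmp, int left, int mid, int right)
{
    memcpy(tmp + left, arr + left, (right - left + 1) * sizeof(int));
    int i = left, j = mid + 1, k = left;
    while (i <= mid && j <= right)
        arr[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
    while (i <= mid)
        arr[k++] = tmp[i++];
    while (j <= right)
        arr[k++] = tmp[j++];
}

/* Bottom-Up merge sort: no recursion; the subarray width doubles
 * on every pass (1, 2, 4, ...) until the whole array is sorted. */
void mergesort_bottom_up(int *arr, int *tmp, int n)
{
    for (int width = 1; width < n; width *= 2)
        for (int left = 0; left < n - width; left += 2 * width) {
            int mid = left + width - 1;
            int right = left + 2 * width - 1;
            if (right > n - 1)
                right = n - 1;  /* clamp the last, possibly shorter, run */
            merge_range(arr, tmp, left, mid, right);
        }
}

int main(void)
{
    int arr[] = {7, -4, 10, -1, 3, 2, -6, 9};
    int tmp[8];
    mergesort_bottom_up(arr, tmp, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", arr[i]);
    printf("\n");
    return 0;
}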

Comparing the Impact of the Top-Down and Bottom-Up Methods on Cache Usage

Top-Down Approach

1. Memory Access Pattern:

  • The Top-Down approach uses recursion to divide the array into smaller subarrays, working from the middle of the array toward smaller pieces.
  • Each recursive call operates on non-contiguous subarrays, resulting in non-linear memory access patterns that can lead to higher cache misses.

2. Memory Overhead from Recursion:

  • Each recursive call allocates additional stack memory, which increases memory usage and can interfere with efficient cache utilization.
  • Deep recursion may cause data to be repeatedly loaded into and evicted from the cache, reducing its efficiency.

3. Merge Process:

  • The merging process involves copying subarrays into temporary arrays and operating across different memory regions, increasing the likelihood of cache misses.

Bottom-Up Approach

1. Memory Access Pattern:

  • The Bottom-Up approach starts by merging subarrays of size 1 and gradually doubles the size (e.g., 2, 4, 8, ...).

  • Memory access patterns are typically sequential, making it more cache-efficient and easier to leverage prefetching mechanisms.

2. No Recursion Overhead:

  • Bottom-Up uses an iterative structure, eliminating the need for recursion and its associated stack overhead.
  • This reduces memory pressure on the cache and improves overall cache utilization.

3. Merge Process:

  • During each iteration, the subarrays to be merged are generally adjacent in memory.
  • This contiguous access pattern significantly reduces cache misses compared to the Top-Down approach.

Why Bottom-Up is More Cache-Efficient

1. Sequential Memory Access:

  • Top-Down: Operates on non-contiguous memory regions, resulting in "jumpy" access patterns that lead to higher cache misses.
  • Bottom-Up: Operates on contiguous memory regions, resulting in "sequential" access patterns that are cache-friendly.

2. Cache Prefetching:

  • Modern processors utilize cache prefetching to optimize sequential memory access.
  • Top-Down's irregular access patterns hinder prefetching, whereas Bottom-Up benefits from sequential access, triggering prefetching mechanisms effectively.

3. Recursion Overhead:

  • Top-Down involves recursive stack allocation, which adds to memory and cache pressure.
  • Bottom-Up eliminates recursion, allowing the cache to focus solely on data operations.

4. Data Reuse:

  • Bottom-Up: Operates on smaller, adjacent subarrays, increasing data reuse and reducing cache misses.
  • Top-Down: Subarray operations are spread across the memory, reducing reuse opportunities.

Comparison Table

| Feature               | Top-Down Approach                            | Bottom-Up Approach                            |
|-----------------------|----------------------------------------------|-----------------------------------------------|
| Memory Access Pattern | Irregular, scattered, higher cache miss rate | Sequential, contiguous, lower cache miss rate |
| Cache Prefetching     | Inefficient due to unpredictable patterns    | Efficient due to sequential access            |
| Recursion Overhead    | Requires stack memory, increases cache load  | No recursion, reduces cache pressure          |
| Data Reuse            | Low, due to non-contiguous subarray access   | High, due to contiguous subarray access       |
| Best Use Cases        | Small arrays or scattered data               | Large arrays or high-performance requirements |

C Code: Top-Down

#include <stdio.h>
#include <stdlib.h>

void merge(int *arr, int start, int mid, int end)
{
    int total = end - start + 1;
    //int *array = (int *)malloc(sizeof(int) * total);  // heap-allocated alternative
    int array[total];  // variable-length array stored on the stack

    for (int i = 0; i < total; i++)
        array[i] = arr[i + start];

    int left_c = 0;
    int right_c = mid - start + 1;
    int l_max = mid - start;
    int r_max = end - start;
    int output_c = start;

    while (left_c <= l_max && right_c <= r_max)
    {
        if (array[left_c] <= array[right_c])
        {
            arr[output_c] = array[left_c];
            output_c++;
            left_c++;
        }
        else
        {
            arr[output_c] = array[right_c];
            output_c++;
            right_c++;
        }
    }
    // Only one side can have leftovers; copy them back
    if (left_c <= l_max)
    {
        while (left_c <= l_max)
        {
            arr[output_c] = array[left_c];
            output_c++;
            left_c++;
        }
    }
    else
    {
        while (right_c <= r_max)
        {
            arr[output_c] = array[right_c];
            output_c++;
            right_c++;
        }
    }
}

void mergesort(int *arr, int start, int end)
{
    if (start < end)
    {
        int mid = (end + start) / 2;
        mergesort(arr, start, mid);
        mergesort(arr, mid + 1, end);
        merge(arr, start, mid, end);
    }
}

int main()
{
    int arr[] = {7, -4, 10, -1, 3, 2, -6, 9};
    int size = 8;

    printf("Before Sort : ");
    for (int i = 0; i < size; i++)
        printf("%d , ", arr[i]);
    printf("\n");

    mergesort(arr, 0, size - 1);

    printf("After Sort : ");
    for (int i = 0; i < size; i++)
        printf("%d , ", arr[i]);
    printf("\n");

    return 0;
}


  • merge function:
    1. Create a temporary array array to store the elements from start to end (to avoid directly modifying the original array, making merging easier).

    2. left_c and right_c track the current positions in the left and right subarrays, respectively; l_max and r_max hold the maximum indexes of the left and right subarrays.

    3. If the element in the left subarray is smaller than or equal to the element in the right subarray, copy it back to the original array and advance left_c; otherwise, copy the element from the right subarray and advance right_c.

    4. After the comparisons finish, copy any remaining elements from the left or right subarray back to the original array.

  • mergesort function:
    1. If start < end:

    • Calculate the middle index mid = (start + end) / 2.
    • Recursively call mergesort(arr, start, mid) to sort the left subarray.
    • Recursively call mergesort(arr, mid + 1, end) to sort the right subarray.
    • Merge the sorted left and right subarrays using the merge function.
    2. If start >= end, the subarray contains only one element (which is inherently sorted), so the recursion stops. (The call trace below illustrates this on the example array.)
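For the eight-element array in main (indices 0 to 7), the recursion unfolds as follows, with each merge(start, mid, end) running only after both of its halves are sorted:

mergesort(0,7)                        # mid = 3
    mergesort(0,3)                    # mid = 1
        mergesort(0,1) -> merge(0,0,1)
        mergesort(2,3) -> merge(2,2,3)
        merge(0,1,3)
    mergesort(4,7)                    # mid = 5
        mergesort(4,5) -> merge(4,4,5)
        mergesort(6,7) -> merge(6,6,7)
        merge(4,5,7)
    merge(0,3,7)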

Mergesort Top-Down

.data
size:    .word 8
arr:     .word 7, 4, 10, -1, 3, 2, -6, 9
str1:    .string " Before Sort : "
str2:    .string " After Sort : "
newline: .string "\n"
comma:   .string " , "

.text
############## main ################
main:
    addi sp, sp, -4
    sw ra, 0(sp)

    # print str1
    la a0, str1
    addi a7, x0, 4
    ecall

    # print current array
    la a0, arr
    lw a1, size
    jal ra, printArray

    # do MergeSort(&arr, int start, int end)
    la a0, arr
    addi a1, x0, 0
    lw a2, size
    addi a2, a2, -1
    jal ra, mergeSort

    # print str2
    la a0, str2
    addi a7, x0, 4
    ecall

    # print current array
    la a0, arr
    lw a1, size
    jal ra, printArray

    lw ra, 0(sp)
    addi sp, sp, 4
    addi a7, x0, 10
    ecall
    jalr x0, ra, 0
###########################################

############ mergeSort ##############
# s0 : &addr
# s1 : start
# s2 : end
# s3 : mid = (start+end)/2
mergeSort:
    # sp    : ra
    # sp+4  : &addr
    # sp+8  : start
    # sp+12 : end
    # sp+16 : mid
    addi sp, sp, -20
    sw ra, 0(sp)

    # copy arguments to saved registers
    addi s0, a0, 0              # s0 : &addr
    addi s1, a1, 0              # s1 : start
    addi s2, a2, 0              # s2 : end

    # if(start<end), else go to label(endMergeSort)
    bge s1, s2, endMergeSort

    # use subtraction to implement divide
    # s3 <= mid = (start+end)/2
    add s3, s1, s2              # mid(s3) = start(s1) + end(s2)
    addi s4, x0, 0              # s4 + 1 each time s3 - 2, until s3 < 0
div:
    blt s3, x0, endDiv
    addi s3, s3, -2             # each time s3 - 2
    addi s4, s4, 1              # each time s4 + 1
    jal x0, div
endDiv:
    addi s3, s4, -1             # s3 = mid : floor((start+end)/2)

    # save &addr(s0), start(s1), end(s2), mid(s3) to mem
    sw s0, 4(sp)
    sw s1, 8(sp)
    sw s2, 12(sp)
    sw s3, 16(sp)

    # prepare arguments for left mergeSort
    # mergeSort(&arr(a0), start(a1), mid(a2))
    addi a0, s0, 0              # a0 <= &addr (s0)
    addi a1, s1, 0              # a1 <= start (s1)
    addi a2, s3, 0              # a2 <= mid (s3)
    jal ra, mergeSort

    # prepare arguments for right mergeSort
    # mergeSort(&arr(a0), mid(a1)+1, end(a2))
    lw a0, 4(sp)                # a0 <= &addr (sp+4)
    lw a1, 16(sp)               # a1 <= mid (sp+16)
    addi a1, a1, 1              # a1 <= mid + 1
    lw a2, 12(sp)               # a2 <= end (sp+12)
    jal ra, mergeSort

    # prepare arguments for merge
    # merge(&arr(a0), start(a1), mid(a2), end(a3))
    lw a0, 4(sp)                # a0 <= &addr (sp+4)
    lw a1, 8(sp)                # a1 <= start (sp+8)
    lw a2, 16(sp)               # a2 <= mid (sp+16)
    lw a3, 12(sp)               # a3 <= end (sp+12)
    jal ra, merge

endMergeSort:
    lw ra, 0(sp)
    addi sp, sp, 20
    jalr x0, ra, 0
###########################################

########### merge ################
# a0 : &addr
# a1 : start
# a2 : mid
# a3 : end
merge:
    # 0. we have a left sorted array(start..mid) & a right sorted array(mid+1..end)
    # 1. copy array[start..end] from memory to stack
    # 2. use 2 indexes (left: from 0, right: from mid-start+1) to compare left & right on the stack
    # 3. write the smaller back to memory, and advance the smaller index
    # 4. check whether either index has passed its max (left: mid-start, right: end-start)
    # 5. if not, go back to step 2; else write the remaining left/right values back to memory

    # calculate how much stack space we need
    # t0 : total count of elements => end - start + 1
    # t1 : total memory size => t0*4
    # t2 : 2's complement of t1 => -t1
    sub t0, a3, a1              # total(t0) = end(a3) - start(a1)
    addi t0, t0, 1              # total(t0) = end(a3) - start(a1) + 1
    add t1, t0, t0              # space(t1) = total(t0)*2
    add t1, t1, t1              # space(t1) = space(t1)*2 = total(t0)*4
    xori t2, t1, 0xffffffff     # 1's complement: t1 -> t2
    addi t2, t2, 1              # 2's complement(t2) = 1's complement(t2) + 1
    add sp, sp, t2

    # copy array[start..end] from memory to stack
    # t2 : current index of new array(stack)
    # t3 : current index of old array(memory)
    addi t3, a1, 0              # memory index(t3) starts from start(a1)
    addi t2, x0, 0              # stack index(t2) starts from 0
read2Stack:
    blt a3, t3, endRead         # while(mem_index(t3) <= end(a3)), else go to label(endRead)
    add t4, t3, t3              # offset(t4) = mem_index(t3)*2
    add t4, t4, t4              # offset(t4) = offset(t4)*2 = mem_index(t3)*4
    add t4, a0, t4              # mem_addr(t4) = &addr(a0) + offset(t4)
    lw t5, 0(t4)                # t5 = *mem_addr(t4)
    add t6, t2, t2              # offset(t6) = st_index(t2)*2
    add t6, t6, t6              # offset(t6) = offset(t6)*2 = st_index(t2)*4
    add t6, sp, t6              # st_addr(t6) = initial_addr(sp) + offset(t6)
    sw t5, 0(t6)                # *st_addr(t6) = t5
    addi t2, t2, 1              # st_index + 1
    addi t3, t3, 1              # mem_index + 1
    jal x0, read2Stack
endRead:
    # do merge
    # view the stack as two parts (left sorted & right sorted), compare them,
    # then store to memory from small to large
    # t2 : left part index => 0
    # t3 : right part index => mid(a2) - start(a1) + 1 => t4 + 1
    # t4 : left terminate index (done when t2 > t4) => mid(a2) - start(a1)
    # t5 : right terminate index (done when t3 > t5) => end(a3) - start(a1)
    # t6 : write-back index of old array(memory) => start(a1)
    sub t4, a2, a1              # left max(t4) = mid(a2) - start(a1)
    sub t5, a3, a1              # right max(t5) = end(a3) - start(a1)
    addi t2, x0, 0              # left index(t2) starts from 0
    addi t3, t4, 1              # right index(t3) starts from mid(a2) - start(a1) + 1
    addi t6, a1, 0              # t6 : write-back index starts from start(a1)

    # while((left index <= left terminate) && (right index <= right terminate)),
    # else go to label(endMergeLoop)
    # t0 : boolean <= (left terminate(t4) < left index(t2))
    # t1 : boolean <= (right terminate(t5) < right index(t3))
    # if t0 or t1 is true, go to label(endMergeLoop); otherwise go on:
    # load the left and right values, compare which is smaller,
    # then save the smaller into the old array(memory)
mergeLoop:
    slt t0, t4, t2              # t0 = (left terminate(t4) < left index(t2))
    slt t1, t5, t3              # t1 = (right terminate(t5) < right index(t3))
    or t0, t0, t1               # t0 = t0 || t1
    xori t0, t0, 0x1            # t0 = ~t0
    beq t0, x0, endMergeLoop    # if both indexes are in range, continue; else go to label(endMergeLoop)
    add t0, t2, t2              # offset(t0) = left_index(t2)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = left_index(t2)*4
    add t0, sp, t0              # left_addr(t0) = initial_addr(sp) + offset(t0)
    lw t0, 0(t0)                # left_value(t0) = *left_addr(t0)
    add t1, t3, t3              # offset(t1) = right_index(t3)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = right_index(t3)*4
    add t1, sp, t1              # right_addr(t1) = initial_addr(sp) + offset(t1)
    lw t1, 0(t1)                # right_value(t1) = *right_addr(t1)
    blt t1, t0, rightSmaller    # if(left_value(t0) <= right_value(t1)), else go to label(rightSmaller)
    add t1, t6, t6              # offset(t1) = mem_index(t6)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = mem_index(t6)*4
    add t1, a0, t1              # mem_addr(t1) = &addr(a0) + offset(t1)
    sw t0, 0(t1)                # *mem_addr(t1) = left_value(t0)
    addi t6, t6, 1              # mem_index + 1
    addi t2, t2, 1              # left_index + 1
    jal x0, mergeLoop
rightSmaller:
    add t0, t6, t6              # offset(t0) = mem_index(t6)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = mem_index(t6)*4
    add t0, a0, t0              # mem_addr(t0) = &addr(a0) + offset(t0)
    sw t1, 0(t0)                # *mem_addr(t0) = right_value(t1)
    addi t6, t6, 1              # mem_index + 1
    addi t3, t3, 1              # right_index + 1
    jal x0, mergeLoop
endMergeLoop:
    # if the left part still has elements, store the leftovers into the old array
    # if the right part still has elements, store the leftovers into the old array
    bge t5, t3, rightLoop       # if(right index <= right terminate) => right part still has elements, go to label(rightLoop)
leftLoop:                       # else fall through to label(leftLoop)
    add t0, t2, t2              # offset(t0) = left_index(t2)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = left_index(t2)*4
    add t0, sp, t0              # left_addr(t0) = initial_addr(sp) + offset(t0)
    lw t0, 0(t0)                # left_value(t0) = *left_addr(t0)
    add t1, t6, t6              # offset(t1) = mem_index(t6)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = mem_index(t6)*4
    add t1, a0, t1              # mem_addr(t1) = &addr(a0) + offset(t1)
    sw t0, 0(t1)                # *mem_addr(t1) = left_value(t0)
    addi t6, t6, 1              # mem_index + 1
    addi t2, t2, 1              # left_index + 1
    bge t4, t2, leftLoop        # check whether the left part still has elements
    jal x0, endMerge
rightLoop:
    add t1, t3, t3              # offset(t1) = right_index(t3)*2
    add t1, t1, t1              # offset(t1) = offset(t1)*2 = right_index(t3)*4
    add t1, sp, t1              # right_addr(t1) = initial_addr(sp) + offset(t1)
    lw t1, 0(t1)                # right_value(t1) = *right_addr(t1)
    add t0, t6, t6              # offset(t0) = mem_index(t6)*2
    add t0, t0, t0              # offset(t0) = offset(t0)*2 = mem_index(t6)*4
    add t0, a0, t0              # mem_addr(t0) = &addr(a0) + offset(t0)
    sw t1, 0(t0)                # *mem_addr(t0) = right_value(t1)
    addi t6, t6, 1              # mem_index + 1
    addi t3, t3, 1              # right_index + 1
    bge t5, t3, rightLoop       # check whether the right part still has elements
    jal x0, endMerge

    # release stack space
    # t0 : total count of elements => end - start + 1
    # t1 : total memory size => t0*4
endMerge:
    sub t0, a3, a1              # total(t0) = end(a3) - start(a1)
    addi t0, t0, 1              # total(t0) = end(a3) - start(a1) + 1
    add t1, t0, t0              # space(t1) = total(t0)*2
    add t1, t1, t1              # space(t1) = space(t1)*2 = total(t0)*4
    add sp, sp, t1
    jalr x0, ra, 0
###########################################

########### printArray ###############
printArray:
    addi t0, a0, 0
    addi t1, a1, 0
    addi t2, x0, 0
printLoop:
    add t3, t2, t2              # i*2
    add t3, t3, t3              # i*2 again ==> i*4
    add t3, t0, t3
    lw a0, 0(t3)
    addi a7, x0, 1
    ecall
    addi t2, t2, 1
    bge t2, t1, endPrint
    la a0, comma
    addi a7, x0, 4
    ecall
    jal x0, printLoop
endPrint:
    la a0, newline
    addi a7, x0, 4
    ecall
    jalr x0, ra, 0
###########################################
  • mergeSort Function:
    The mergeSort function is the recursive implementation of Merge Sort. It is responsible for repeatedly splitting the array into smaller subarrays until each subarray contains only one element. When start >= end, the recursion ends, as the subarray is already sorted. The midpoint (mid) is computed without a division instruction: the code repeatedly subtracts 2 from start + end and counts the iterations, yielding floor((start + end) / 2). After splitting, the left and right subarrays are recursively sorted, and then the merge function is called to combine the two sorted subarrays.
  • merge function:
    The merge function is responsible for merging two sorted subarrays into a single sorted array. It begins by copying the data within the start to end range into the stack for temporary storage. Then, it uses two pointers to traverse the left and right subarrays, comparing the current elements from each subarray. The smaller element is written back to the original array, and the corresponding pointer is moved forward. If one subarray is exhausted, the remaining elements of the other subarray are directly copied back into the original array. After merging, the allocated stack space is released.

Mergesort Bottom-Up

.data
size:    .word 8
arr:     .word 7, 4, 10, -1, 3, 2, -6, 9
str1:    .string " Before Sort : "
str2:    .string " After Sort : "
newline: .string "\n"
comma:   .string " , "

.text
main:
    addi sp, sp, -4
    sw ra, 0(sp)

    # print str1
    la a0, str1
    addi a7, x0, 4
    ecall

    # print current array
    la a0, arr
    lw a1, size
    jal ra, printArray

    # do iterative MergeSort
    la a0, arr
    lw a1, size
    jal ra, iterativeMergeSort

    # print str2
    la a0, str2
    addi a7, x0, 4
    ecall

    # print sorted array
    la a0, arr
    lw a1, size
    jal ra, printArray

    lw ra, 0(sp)
    addi sp, sp, 4
    addi a7, x0, 10
    ecall
    jalr x0, ra, 0

# Iterative MergeSort
# a0: array address
# a1: array size
iterativeMergeSort:
    addi sp, sp, -24
    sw ra, 0(sp)
    sw s0, 4(sp)                # array address
    sw s1, 8(sp)                # size
    sw s2, 12(sp)               # current width
    sw s3, 16(sp)               # left start
    sw s4, 20(sp)               # iteration counter
    mv s0, a0                   # save array address
    mv s1, a1                   # save size

    # Start with merging subarrays of size 1
    addi s2, x0, 1              # width starts from 1
width_loop:
    bge s2, s1, end_sort        # if width >= size, we're done

    # For each width, merge subarrays
    addi s3, x0, 0              # left = 0
left_loop:
    # Calculate right start and end
    add t0, s3, s2              # mid = left + width
    add t1, t0, s2              # right_end = mid + width

    # Ensure right_end doesn't exceed array size
    bge t1, s1, adjust_end
    j do_merge
adjust_end:
    mv t1, s1                   # right_end = size
do_merge:
    # If there are elements to merge
    bge s3, t0, next_iteration

    # Prepare arguments for merge
    mv a0, s0                   # array address
    mv a1, s3                   # left start
    addi a2, t0, -1             # mid - 1
    addi a3, t1, -1             # right_end - 1
    jal ra, merge

next_iteration:
    add s3, s3, s2              # left += width
    add s3, s3, s2              # left += width (total: left += 2*width)
    blt s3, s1, left_loop

    # Double the width for next iteration
    add s2, s2, s2              # width *= 2
    j width_loop

end_sort:
    lw ra, 0(sp)
    lw s0, 4(sp)
    lw s1, 8(sp)
    lw s2, 12(sp)
    lw s3, 16(sp)
    lw s4, 20(sp)
    addi sp, sp, 24
    jalr x0, ra, 0

# Merge function (same as the Top-Down version)
merge:
    sub t0, a3, a1
    addi t0, t0, 1
    add t1, t0, t0
    add t1, t1, t1
    xori t2, t1, 0xffffffff
    addi t2, t2, 1
    add sp, sp, t2
    addi t3, a1, 0
    addi t2, x0, 0
read2Stack:
    blt a3, t3, endRead
    add t4, t3, t3
    add t4, t4, t4
    add t4, a0, t4
    lw t5, 0(t4)
    add t6, t2, t2
    add t6, t6, t6
    add t6, sp, t6
    sw t5, 0(t6)
    addi t2, t2, 1
    addi t3, t3, 1
    j read2Stack
endRead:
    sub t4, a2, a1
    sub t5, a3, a1
    addi t2, x0, 0
    addi t3, t4, 1
    addi t6, a1, 0
mergeLoop:
    slt t0, t4, t2
    slt t1, t5, t3
    or t0, t0, t1
    xori t0, t0, 0x1
    beq t0, x0, endMergeLoop
    add t0, t2, t2
    add t0, t0, t0
    add t0, sp, t0
    lw t0, 0(t0)
    add t1, t3, t3
    add t1, t1, t1
    add t1, sp, t1
    lw t1, 0(t1)
    blt t1, t0, rightSmaller
    add t1, t6, t6
    add t1, t1, t1
    add t1, a0, t1
    sw t0, 0(t1)
    addi t6, t6, 1
    addi t2, t2, 1
    j mergeLoop
rightSmaller:
    add t0, t6, t6
    add t0, t0, t0
    add t0, a0, t0
    sw t1, 0(t0)
    addi t6, t6, 1
    addi t3, t3, 1
    j mergeLoop
endMergeLoop:
    bge t5, t3, rightLoop
leftLoop:
    add t0, t2, t2
    add t0, t0, t0
    add t0, sp, t0
    lw t0, 0(t0)
    add t1, t6, t6
    add t1, t1, t1
    add t1, a0, t1
    sw t0, 0(t1)
    addi t6, t6, 1
    addi t2, t2, 1
    bge t4, t2, leftLoop
    j endMerge
rightLoop:
    add t1, t3, t3
    add t1, t1, t1
    add t1, sp, t1
    lw t1, 0(t1)
    add t0, t6, t6
    add t0, t0, t0
    add t0, a0, t0
    sw t1, 0(t0)
    addi t6, t6, 1
    addi t3, t3, 1
    bge t5, t3, rightLoop
endMerge:
    sub t0, a3, a1
    addi t0, t0, 1
    add t1, t0, t0
    add t1, t1, t1
    add sp, sp, t1
    jalr x0, ra, 0

# Print Array function (same as the Top-Down version)
printArray:
    addi t0, a0, 0
    addi t1, a1, 0
    addi t2, x0, 0
printLoop:
    add t3, t2, t2
    add t3, t3, t3
    add t3, t0, t3
    lw a0, 0(t3)
    addi a7, x0, 1
    ecall
    addi t2, t2, 1
    bge t2, t1, endPrint
    la a0, comma
    addi a7, x0, 4
    ecall
    j printLoop
endPrint:
    la a0, newline
    addi a7, x0, 4
    ecall
    jalr x0, ra, 0
  • iterativeMergeSort function:
    The core logic of iterativeMergeSort lies in simulating the recursive divide-and-conquer process with loops that gradually merge subarrays. Initially, the size of each subarray is set to 1, meaning it starts by merging single-element subarrays. As the iterations progress, the subarray size doubles, from 1 to 2, then to 4, and so on, until the entire array is sorted. In each iteration, the program calculates the left and right boundaries of the current subarrays to be processed, ensuring the boundaries do not exceed the array's size, and then calls the merge function to combine the left and right subarrays into a single sorted subarray (see the pass trace after this list).
  • merge function:
    The merge function is responsible for merging two sorted subarrays. It begins by copying the elements of the subarrays into the stack as a temporary storage area, and then uses the two-pointer method to traverse both subarrays simultaneously, writing the smaller element back into the original array one at a time. Once one of the subarrays is fully processed, the remaining elements from the other subarray are directly copied into the array. Finally, the temporary stack space is released to prevent memory wastage.
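Concretely, for the eight-element example array the passes of iterativeMergeSort issue these merge(start, mid, end) calls:

width = 1 : merge(0,0,1)  merge(2,2,3)  merge(4,4,5)  merge(6,6,7)   -> sorted runs of length 2
width = 2 : merge(0,1,3)  merge(4,5,7)                               -> sorted runs of length 4
width = 4 : merge(0,3,7)                                             -> the whole array is sorted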

Performance Analysis

Execution Info

| Metric          | Top-Down | Bottom-Up |
|-----------------|----------|-----------|
| Cycles          | 2032     | 1616      |
| Instrs. retired | 1564     | 1280      |
| CPI             | 1.3      | 1.26      |
| IPC             | 0.77     | 0.792     |
  • Starting with Cycles, the Top-Down approach takes 2032 cycles to complete execution, whereas the Bottom-Up approach requires only 1616 cycles, indicating that the Bottom-Up method is faster overall. Looking at Instructions Retired, the Top-Down approach completes 1564 instructions, while the Bottom-Up approach only needs 1280 instructions to accomplish the same task. This shows that Bottom-Up is more efficient in design, reducing the number of operations needed for execution.
  • In terms of CPI (the average number of cycles required per instruction), the Top-Down approach has a CPI of 1.3, while the Bottom-Up approach achieves a lower CPI of 1.26. This means that the Bottom-Up approach requires fewer cycles per instruction on average, indicating slightly higher efficiency in the processor's pipeline. Conversely, looking at IPC (the number of instructions executed per cycle), the Bottom-Up approach achieves 0.792, which is higher than the 0.77 achieved by the Top-Down approach. This demonstrates that the Bottom-Up method utilizes processor resources and instruction parallelism more effectively. The arithmetic behind these derived metrics is spelled out below.
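Both derived metrics follow from the raw counts, since CPI = Cycles / Instructions retired and IPC is its reciprocal:

Top-Down : CPI = 2032 / 1564 ≈ 1.30    IPC = 1564 / 2032 ≈ 0.77
Bottom-Up: CPI = 1616 / 1280 ≈ 1.26    IPC = 1280 / 1616 ≈ 0.792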

Cache configuration

| Repl. policy | LRU            |
|--------------|----------------|
| Wr. hit      | Write-back     |
| Wr. miss     | Write allocate |

Statistics

| Statistic | Top-Down | Bottom-Up |
|-----------|----------|-----------|
| Hit rate  | 0.9627   | 0.9521    |
| Hits      | 232      | 139       |
| Misses    | 9        | 7         |
  • While the Top-Down method achieves a slightly higher hit rate of 96.27% compared to Bottom-Up's 95.21%, the Bottom-Up approach significantly reduces the total number of memory accesses, with only 146 accesses (139 hits and 7 misses) compared to Top-Down's 241 accesses (232 hits and 9 misses). This indicates that Bottom-Up requires far fewer memory operations during execution, primarily due to its more structured and sequential memory access pattern. With both fewer total accesses and fewer misses (7 versus 9), the Bottom-Up approach achieves higher overall efficiency, which is reflected in its lower total cycle count. The hit rates follow directly from these counts, as verified below.
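Hit rate = Hits / (Hits + Misses):

Top-Down : 232 / (232 + 9) = 232 / 241 ≈ 0.9627   (241 total accesses)
Bottom-Up: 139 / (139 + 7) = 139 / 146 ≈ 0.9521   (146 total accesses)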

DFS

Basic Concept

Depth-First Search (DFS) is a classic algorithm for traversing a graph or tree. Its key characteristic is to explore as deeply as possible along one branch before backtracking to explore other unvisited branches. Recursive implementation is a common and natural way to implement DFS, as it leverages the function call stack to handle backtracking.

Disadvantages of Using Recursive DFS

1. Stack Overflow

  • Recursive DFS relies on the function call stack to manage the traversal. For each recursive call, a new frame is pushed onto the stack.
  • If the graph is large or has deep paths (e.g., in a tree-like structure), the recursion depth can exceed the system's stack limit, leading to a stack overflow error.

2. Inefficient Memory Usage

  • Each recursive call stores information on the call stack, such as parameters passed to the function, local variables, and return addresses.
  • As the recursion depth increases, the stack memory usage grows linearly with the depth (see the explicit-stack sketch after this list).
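To make the alternative concrete, here is a minimal sketch of DFS with an explicit stack, which replaces the call stack with a single heap-allocated array. It assumes the struct Node/struct AdjList types from the C code in the next section; DFSIter is an illustrative name, not part of the original program. Marking vertices when they are pushed bounds the stack at V entries, though the visit order can differ slightly from the recursive version.

/* Assumes the adjacency-list types defined in the C code below:
 *   struct Node { int dest; struct Node* next; };
 *   struct AdjList { struct Node* head; };
 */
void DFSIter(struct AdjList adj[], int V, int s)
{
    int visited[V];
    for (int i = 0; i < V; i++)
        visited[i] = 0;

    int *stack = malloc(V * sizeof(int));  // at most V pending vertices
    int top = 0;

    visited[s] = 1;                        // mark on push, so each vertex
    stack[top++] = s;                      // enters the stack at most once

    while (top > 0) {
        int v = stack[--top];              // pop the most recently pushed vertex
        printf("%d ", v);

        // Push all unvisited neighbors; the last one pushed is
        // explored first, giving depth-first order
        for (struct Node* cur = adj[v].head; cur != NULL; cur = cur->next) {
            if (!visited[cur->dest]) {
                visited[cur->dest] = 1;
                stack[top++] = cur->dest;
            }
        }
    }
    free(stack);
}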

DFS C code

#include <stdio.h>
#include <stdlib.h>

// Node structure for adjacency list
struct Node {
    int dest;
    struct Node* next;
};

// Structure to represent an adjacency list
struct AdjList {
    struct Node* head;
};

// Function to create a new adjacency list node
struct Node* createNode(int dest)
{
    struct Node* newNode = (struct Node*)malloc(sizeof(struct Node));
    newNode->dest = dest;
    newNode->next = NULL;
    return newNode;
}

// Recursive function for DFS traversal
void DFSRec(struct AdjList adj[], int visited[], int s)
{
    // Mark the current vertex as visited
    visited[s] = 1;

    // Print the current vertex
    printf("%d ", s);

    // Traverse all adjacent vertices that are not visited yet
    struct Node* current = adj[s].head;
    while (current != NULL) {
        int dest = current->dest;
        if (!visited[dest]) {
            DFSRec(adj, visited, dest);
        }
        current = current->next;
    }
}

// Main DFS function that initializes the visited array
// and calls DFSRec
void DFS(struct AdjList adj[], int V, int s)
{
    // Initialize visited array to false
    int visited[V];
    for (int i = 0; i < V; i++)
        visited[i] = 0;
    DFSRec(adj, visited, s);
}

// Function to add an edge to the adjacency list
void addEdge(struct AdjList adj[], int s, int t)
{
    // Add edge from s to t
    struct Node* newNode = createNode(t);
    newNode->next = adj[s].head;
    adj[s].head = newNode;

    // Add edge from t to s because the graph is undirected
    newNode = createNode(s);
    newNode->next = adj[t].head;
    adj[t].head = newNode;
}

int main()
{
    int V = 5;

    // Create an array of adjacency lists
    struct AdjList adj[V];

    // Initialize each adjacency list as empty
    for (int i = 0; i < V; i++) {
        adj[i].head = NULL;
    }

    int E = 5;

    // Define the edges of the graph
    int edges[][2] = {{1, 2}, {1, 0}, {2, 0}, {2, 3}, {2, 4}};

    // Populate the adjacency list with edges
    for (int i = 0; i < E; i++) {
        addEdge(adj, edges[i][0], edges[i][1]);
    }

    int source = 1;
    printf("DFS from source: %d\n", source);
    DFS(adj, V, source);

    return 0;
}
  • createNode function:
    The createNode function is responsible for creating a new node. It takes the destination node number dest as a parameter, allocates memory for a new Node structure, sets dest as the value of the node, and initializes its next pointer to NULL. Finally, it returns the pointer to the newly created node.

  • DFSRec function:
    The DFSRec function implements the recursive depth-first search. Starting from the given node s, it marks the node as visited and prints it. Then, it iterates through all the neighbors of node s. For each unvisited neighbor, it recursively calls DFSRec until all reachable nodes from s have been visited.

  • DFS function:
    The DFS function is the entry point for the depth-first search. It initializes a visited array visited to track which nodes have been visited, and then calls the recursive DFSRec function, starting the traversal from the specified source node s.

DFS

.data
# 5x5 adjacency matrix for the graph
adj_matrix:
    .word 0,1,1,0,0         # node 0's connections
    .word 1,0,1,0,0         # node 1's connections
    .word 1,1,0,1,1         # node 2's connections
    .word 0,0,1,0,0         # node 3's connections
    .word 0,0,1,0,0         # node 4's connections
visited: .word 0,0,0,0,0    # visited array
msg1:    .string "DFS starting from node: "
arrow:   .string " -> "
newline: .string "\n"

.text
.global main
main:
    # Print start message
    la a0, msg1
    li a7, 4
    ecall

    # Print starting node (1)
    li a0, 1
    li a7, 1
    ecall
    la a0, newline
    li a7, 4
    ecall

    # Initialize visited array
    la t0, visited
    sw zero, 0(t0)
    sw zero, 4(t0)
    sw zero, 8(t0)
    sw zero, 12(t0)
    sw zero, 16(t0)

    # Start DFS from node 1
    li a0, 1
    jal dfs

    # Print final newline
    la a0, newline
    li a7, 4
    ecall

    # Exit program
    li a7, 10
    ecall

# DFS function
dfs:
    # Save return address and s0-s2
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s2, 0(sp)

    # Save current node in s0
    mv s0, a0

    # Check if already visited
    la t0, visited
    slli t1, s0, 2
    add t1, t0, t1
    lw t2, 0(t1)
    bnez t2, dfs_return         # if already visited, return

    # Mark as visited
    li t2, 1
    sw t2, 0(t1)

    # Print current node
    mv a0, s0
    li a7, 1
    ecall

    # Set up for neighbor checking
    mv s1, zero                 # Initialize neighbor counter
    li s2, 0                    # Flag to track if we found any unvisited neighbors

check_neighbors:
    li t0, 5
    beq s1, t0, after_neighbors # If checked all neighbors, go to after_neighbors

    # Calculate address of current edge in adjacency matrix
    la t0, adj_matrix
    li t1, 20                   # 5 nodes * 4 bytes per row
    mul t1, s0, t1              # Get row offset
    add t0, t0, t1              # Get row address
    slli t1, s1, 2              # Get column offset
    add t0, t0, t1              # Get edge address
    lw t1, 0(t0)                # Load edge value

    # If no edge to this neighbor, skip
    beqz t1, next_neighbor

    # Check if neighbor is unvisited
    la t0, visited
    slli t1, s1, 2
    add t0, t0, t1
    lw t1, 0(t0)
    bnez t1, next_neighbor      # If visited, skip

    # Print arrow before visiting new node
    la a0, arrow
    li a7, 4
    ecall

    # Visit neighbor through recursive call
    mv a0, s1
    li s2, 1                    # Set flag indicating we found an unvisited neighbor
    jal dfs

next_neighbor:
    addi s1, s1, 1              # Increment neighbor counter
    j check_neighbors

after_neighbors:
    beqz s2, dfs_return         # If no unvisited neighbors were found, return without printing arrow

dfs_return:
    # Restore saved registers
    lw s2, 0(sp)
    lw s1, 4(sp)
    lw s0, 8(sp)
    lw ra, 12(sp)
    addi sp, sp, 16
    ret
  • main function:
    The main function serves as the program's entry point and is responsible for overall control flow. It starts by printing an initial message and initializing the visited array to mark all nodes as unvisited. Then, it begins the depth-first search (DFS) traversal by calling the dfs function starting from node 1. After the traversal is complete, the main function prints a newline and terminates the program using a system call.

  • dfs function:
    The dfs function implements the recursive depth-first search. Upon entering the function, it saves the current node and register states onto the stack to preserve the context during recursive calls. It then checks if the current node has already been visited. If not, it marks the node as visited and prints its value. The function proceeds to iterate over the neighbors of the current node. For any unvisited neighbor, it recursively calls dfs to continue the traversal. Once all neighbors of the current node are processed, the function restores the saved register states and returns to the previous level.

  • Neighbor Checking Logic:
    The neighbor-checking logic, embedded within the dfs function, is responsible for iterating over all neighbors of the current node. Using the adjacency matrix, the function calculates each neighbor's position and checks whether there is an edge connection (matrix value of 1) and whether the neighbor is unvisited. If both conditions are met, the program outputs an arrow symbol to indicate a connection and recursively calls dfs to traverse that neighbor. Once all neighbors are processed, the checking logic ends, and dfs starts to backtrack. (A C-level paraphrase follows this list.)
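In C-like terms, the assembly's neighbor loop corresponds to the following; this is an illustrative paraphrase of the program above (v mirrors s0, n mirrors the s1 counter), not a separate implementation from the original write-up:

#include <stdio.h>

// Same graph as the assembly: 5x5 adjacency matrix and visited array
int adj_matrix[5][5] = {
    {0,1,1,0,0},
    {1,0,1,0,0},
    {1,1,0,1,1},
    {0,0,1,0,0},
    {0,0,1,0,0},
};
int visited[5];

void dfs(int v)
{
    if (visited[v]) return;        // corresponds to the bnez t2, dfs_return check
    visited[v] = 1;                // mark as visited
    printf("%d", v);

    for (int n = 0; n < 5; n++) {  // s1: neighbor counter, columns 0..4
        if (adj_matrix[v][n] && !visited[n]) {
            printf(" -> ");        // print arrow before visiting new node
            dfs(n);
        }
    }
}

int main(void)
{
    printf("DFS starting from node: 1\n");
    dfs(1);                        // prints 1 -> 0 -> 2 -> 3 -> 4
    printf("\n");
    return 0;
}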

Optimized DFS

.data
# 5x5 adjacency matrix for the graph
adj_matrix:
    .word 0,1,1,0,0         # node 0's connections
    .word 1,0,1,0,0         # node 1's connections
    .word 1,1,0,1,1         # node 2's connections
    .word 0,0,1,0,0         # node 3's connections
    .word 0,0,1,0,0         # node 4's connections
visited: .word 0,0,0,0,0    # visited array
msg1:    .string "DFS starting from node: "
arrow:   .string " -> "
newline: .string "\n"

.text
.global main
main:
    # Print start message
    la a0, msg1
    li a7, 4
    ecall

    # Print starting node (1)
    li a0, 1
    li a7, 1
    ecall
    la a0, newline
    li a7, 4
    ecall

    # Initialize visited array
    la t0, visited
    sw zero, 0(t0)
    sw zero, 4(t0)
    sw zero, 8(t0)
    sw zero, 12(t0)
    sw zero, 16(t0)

    # Start optimized DFS from node 1
    li a0, 1
    jal optimized_dfs

    # Print final newline
    la a0, newline
    li a7, 4
    ecall

    # Exit program
    li a7, 10
    ecall

# Optimized DFS implementation
# a0: current node
optimized_dfs:
    # Save return address and s registers
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    sw s1, 4(sp)
    sw s2, 0(sp)

    # Save current node in s0
    mv s0, a0

    # Mark as visited and print current node
    la t0, visited
    slli t1, s0, 2
    add t1, t0, t1
    li t2, 1
    sw t2, 0(t1)                # Mark visited
    mv a0, s0                   # Print node
    li a7, 1
    ecall

    # Initialize neighbor counter
    li s1, 0                    # Current neighbor

check_neighbors:
    li t0, 5
    beq s1, t0, dfs_return      # If checked all neighbors, return

    # Load adjacency info (cached in register)
    la t0, adj_matrix
    li t1, 20                   # 5 nodes * 4 bytes
    mul t1, s0, t1              # Row offset
    add t0, t0, t1
    slli t1, s1, 2              # Column offset
    add t0, t0, t1
    lw t2, 0(t0)                # Load edge value

    # If no edge, skip
    beqz t2, next_neighbor

    # Check if neighbor is visited
    la t0, visited
    slli t1, s1, 2
    add t1, t0, t1
    lw t2, 0(t1)
    bnez t2, next_neighbor

    # Found unvisited neighbor - print arrow and visit
    la a0, arrow
    li a7, 4
    ecall

    # Recursively visit neighbor
    mv a0, s1
    jal optimized_dfs

next_neighbor:
    addi s1, s1, 1
    j check_neighbors

dfs_return:
    # Restore registers and return
    lw s2, 0(sp)
    lw s1, 4(sp)
    lw s0, 8(sp)
    lw ra, 12(sp)
    addi sp, sp, 16
    ret
  • main function:
    The main function serves as the entry point of the program and coordinates the execution flow. It starts by printing a message indicating the start of DFS traversal and initializes the visited array, marking all nodes as unvisited. It then begins the traversal from node 1 by invoking the optimized_dfs function. Once the traversal is complete, the function prints a newline and exits the program using a system call.

  • optimized_dfs function
    The optimized_dfs function is the recursive implementation of depth-first search. It begins by saving the return address and relevant registers onto the stack to preserve the context during recursion. The current node is then marked as visited in the visited array, and its value is printed. The function iterates through all potential neighbors of the current node by consulting the adjacency matrix. If a neighbor is connected (edge exists) and unvisited, the function prints an arrow symbol to indicate traversal and recursively calls itself to explore the neighbor. Once all neighbors are processed, the function restores the saved register values from the stack and returns to the previous level of recursion.

  • Neighbor Checking Logic:
    The neighbor-checking logic is embedded within the optimized_dfs function. It calculates the position of each neighbor in the adjacency matrix by determining the row and column offsets for the current node. For each neighbor, it checks if there is a connecting edge and whether the neighbor has already been visited. If both conditions are satisfied, the neighbor is identified as an unvisited connection, and the traversal proceeds to recursively explore it. This logic ensures that all reachable nodes are explored while avoiding revisits to already visited nodes.

Comparison of Optimized DFS and DFS

1. Efficiency of Memory Address Calculation

Opt DFS:

# Calculate the base address of the visited array and store it in t0
la t0, visited
slli t1, s0, 2      # Calculate the offset for the current node in visited
add t1, t0, t1      # Compute the access address
li t2, 1
sw t2, 0(t1)        # Mark the current node as visited
  • The base address of visited (la t0, visited) is calculated only once and stored in a register. Subsequent operations reuse this base address, reducing redundant calculations.

  • The offset for the current node (s0) is calculated once and used directly in subsequent memory access.

DFS:

# Recalculate the base address of the visited array for every neighbor
la t0, visited
slli t1, s1, 2      # Calculate the offset for the neighbor node in visited
add t0, t0, t1      # Compute the access address
lw t1, 0(t0)        # Check if it is already visited
  • The base address of visited is recalculated for every neighbor, leading to unnecessary computations and additional memory access.

2. Efficiency of Adjacency Matrix Address Calculation

Opt DFS:

# Calculate the base address of the adjacency matrix row
la t0, adj_matrix
li t1, 20           # Each row has 5 nodes, each node takes 4 bytes
mul t1, s0, t1      # Compute the row offset for the current node
add t0, t0, t1      # Compute the row base address (only calculated once)
  • The base address of the adjacency matrix row for the current node is calculated only once. The result is stored in a register and reused for each neighbor, avoiding repeated calculations.
  • For each neighbor check, only the column offset needs to be added, making it efficient.

DFS:

# Recalculate the row base address for every neighbor
la t0, adj_matrix
li t1, 20           # Each row has 5 nodes, each node takes 4 bytes
mul t1, s0, t1      # Compute the row offset for the current node
add t0, t0, t1      # Compute the row base address
slli t1, s1, 2      # Compute the column offset
add t0, t0, t1      # Compute the final address
lw t2, 0(t0)        # Read the adjacency value
  • The row base address and column offset are recalculated for every neighbor check, introducing redundant operations and memory accesses.

3. Cache Hit Rate and Access Patterns

Opt DFS:

  • The base addresses of the adjacency matrix and visited array are reused multiple times, exhibiting temporal locality in cache usage.
  • Accesses to the adjacency matrix and visited array are predictable and concentrated, resulting in a high cache hit rate.

DFS:

  • Recalculating the base addresses and offsets for every neighbor leads to more scattered memory access patterns.
  • These less predictable memory access patterns result in lower cache hit rates and increased main memory access.

4. Simplified Conditional Logic

Opt DFS:

# Efficiently skip already visited neighbors
la t0, visited
slli t1, s1, 2
add t1, t0, t1
lw t2, 0(t1)        # Check if the neighbor is already visited
bnez t2, next_neighbor
  • The conditional check for whether a neighbor is visited is separated from address calculations. It operates directly on already-computed values in registers, reducing complexity.

DFS:

# Mixed conditional checks and redundant calculations
la t0, adj_matrix
li t1, 20
mul t1, s0, t1      # Calculate row offset
add t0, t0, t1
slli t1, s1, 2      # Calculate column offset
add t0, t0, t1
lw t1, 0(t0)        # Check adjacency value
beqz t1, next_neighbor
  • Conditional checks are intertwined with redundant address calculations, leading to additional overhead during each neighbor check.

Performance Analysis

Execution Info

| Metric          | DFS   | Opt DFS |
|-----------------|-------|---------|
| Cycles          | 777   | 744     |
| Instrs. retired | 579   | 555     |
| CPI             | 1.34  | 1.34    |
| IPC             | 0.745 | 0.746   |
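As before, the derived metrics match the raw counts:

DFS    : CPI = 777 / 579 ≈ 1.34    IPC = 579 / 777 ≈ 0.745
Opt DFS: CPI = 744 / 555 ≈ 1.34    IPC = 555 / 744 ≈ 0.746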

Opt DFS consumes fewer CPU cycles because it eliminates redundant calculations and minimizes memory accesses by efficiently reusing values stored in registers. For example, the base address of the adjacency matrix for a node is computed only once and stored in a register, which is then reused for all neighbor checks. This avoids recalculating the same base address multiple times, as seen in DFS. Similarly, the base address of the visited array is calculated once and reused, reducing the overhead associated with repeated address computations.

In DFS, every neighbor check recalculates the row base address and column offset for the adjacency matrix and re-accesses the visited array's base address. These repeated operations increase the number of executed instructions, leading to more cycles. Additionally, Opt DFS's streamlined structure allows for better instruction-level parallelism, as the CPU can execute fewer dependent instructions in sequence.

By reducing redundant operations and leveraging efficient memory and register usage, Opt DFS lowers the total instruction count and improves execution efficiency, directly reducing the number of CPU cycles required.

Cache configuration

| Repl. policy | LRU            |
|--------------|----------------|
| Wr. hit      | Write-back     |
| Wr. miss     | Write allocate |

Statistics

| Statistic | DFS    | Opt DFS |
|-----------|--------|---------|
| Hit rate  | 0.8667 | 0.8588  |
| Hits      | 78     | 73      |
| Misses    | 12     | 12      |

It’s evident that while Opt DFS is optimized, its cache hit rate is slightly lower than that of DFS. This is because Opt DFS's optimization focuses on significantly reducing the total number of memory accesses, whereas DFS performs far more frequent memory accesses due to redundant calculations and repeated base-address recalculations. These additional memory accesses in DFS inflate its hit count and hit rate, but they also introduce unnecessary overhead.

On the other hand, both implementations have the same number of cache misses, which suggests that they both encounter memory blocks that cannot fully benefit from the cache. This is likely due to the adjacency matrix and visited array involving memory blocks that lack spatial locality, leading to some unavoidable cache misses.

The critical difference lies in the total number of memory accesses. Opt DFS reduces the number of accesses by reusing computed addresses stored in registers, while DFS repeatedly calculates and accesses memory for each neighbor. As a result, Opt DFS achieves fewer overall memory accesses and therefore consumes fewer CPU cycles, making it much more efficient than DFS.