陳奕嘉
GitHub
Merge Sort
Merge sort is a classic sorting algorithm based on the divide-and-conquer strategy. It is a stable sort with a time complexity of O(n log n).
Basic Concept
Merge Sort works by dividing the sequence to be sorted into smaller subsequences, sorting each subsequence, and finally merging them to generate a complete sorting result.
Top-Down and Bottom-Up Methods
- Top-Down: Uses recursion to repeatedly divide the array until each subsequence contains only one element, then merges these subsequences from small to large to gradually build the sorted array.
  - Disadvantage: recursion consumes call-stack space proportional to the recursion depth.
- Bottom-Up: Uses no recursion; starts by merging the smallest subsequences and gradually builds larger sorted subsequences until the entire array is sorted.
  - Advantages: avoids recursion, suits environments where recursion is impractical, and has low stack-space requirements.
  - Disadvantage: the program structure is slightly more complex, and the merging boundaries require extra attention.
Compare the impact of Top-Down and Bottom-Up methods on Cache usage
Top-Down Approach
1. Memory Access Pattern:
- The Top-Down approach uses recursion to divide the array into smaller subarrays, working from the middle of the array toward smaller pieces.
- Each recursive call operates on non-contiguous subarrays, resulting in non-linear memory access patterns that can lead to higher cache misses.
2. Memory Overhead from Recursion:
- Each recursive call allocates additional stack memory, which increases memory usage and can interfere with efficient cache utilization.
- Deep recursion may cause data to be repeatedly loaded into and evicted from the cache, reducing its efficiency.
3. Merge Process:
- The merging process involves copying subarrays into temporary arrays and operating across different memory regions, increasing the likelihood of cache misses.
Bottom-Up Approach
1. Memory Access Pattern:
- The Bottom-Up approach starts by merging subarrays of size 1 and gradually increases the size (e.g., 2, 4, 8, …).
- Memory access patterns are typically sequential, making it more cache-efficient and easier to leverage prefetching mechanisms.
2. No Recursion Overhead:
- Bottom-Up uses an iterative structure, eliminating the need for recursion and its associated stack overhead.
- This reduces the memory pressure on the cache and improves overall cache utilization.
3. Merge Process:
- During each iteration, the subarrays to be merged are generally adjacent in memory.
- This contiguous access pattern significantly reduces cache misses compared to the Top-Down approach.
Why Bottom-Up is More Cache-Efficient
1. Sequential Memory Access:
- Top-Down: Operates on non-contiguous memory regions, resulting in "jumpy" access patterns that lead to higher cache misses.
- Bottom-Up: Operates on contiguous memory regions, resulting in "sequential" access patterns that are cache-friendly.
2. Cache Prefetching:
- Modern processors utilize cache prefetching to optimize sequential memory access.
- Top-Down's irregular access patterns hinder prefetching, whereas Bottom-Up benefits from sequential access, triggering prefetching mechanisms effectively.
3. Recursion Overhead:
- Top-Down involves recursive stack allocation, which adds to memory and cache pressure.
- Bottom-Up eliminates recursion, allowing the cache to focus solely on data operations.
4. Data Reuse:
- Bottom-Up: Operates on smaller, adjacent subarrays, increasing data reuse and reducing cache misses.
- Top-Down: Subarray operations are spread across the memory, reducing reuse opportunities.
Comparison Table
| Feature | Top-Down Approach | Bottom-Up Approach |
| --- | --- | --- |
| Memory Access Pattern | Irregular, scattered, higher cache miss rate | Sequential, contiguous, lower cache miss rate |
| Cache Prefetching | Inefficient due to unpredictable patterns | Efficient due to sequential access |
| Recursion Overhead | Requires stack memory, increases cache load | No recursion, reduces cache pressure |
| Data Reuse | Low, due to non-contiguous subarray access | High, due to contiguous subarray access |
| Best Use Cases | Small arrays or scattered data | Large arrays or high-performance requirements |
C Code: Top-Down
- `merge` function:
1. Create a temporary array to store the elements from `start` to `end` (to avoid directly modifying the original array, which makes merging easier).
2. Use `left_c` and `right_c` to track the current positions in the left and right subarrays, and `l_max` and `r_max` to mark the maximum indexes of the left and right subarrays.
3. If the element in the left subarray is smaller than or equal to the element in the right subarray, copy it back to the original array and advance `left_c`; otherwise, copy the element from the right subarray and advance `right_c`.
4. After the comparisons finish, copy any remaining elements from the left or right subarray back to the original array.
- `mergesort` function:
1. If `start < end`:
   - Calculate the middle index `mid = (start + end) / 2`.
   - Recursively call `mergesort(arr, start, mid)` to sort the left subarray.
   - Recursively call `mergesort(arr, mid + 1, end)` to sort the right subarray.
   - Merge the sorted left and right subarrays using the `merge` function.
2. If `start >= end`, the subarray contains only one element (which is inherently sorted), so the recursion stops.
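Since the original C listing is not reproduced here, the following is a minimal sketch of the two functions as described above. The names `left_c`, `right_c`, `l_max`, and `r_max` come from the description; the rest (a heap-allocated temporary buffer instead of a fixed array) is an assumption.

```c
#include <stdlib.h>

/* Merge arr[start..mid] and arr[mid+1..end], both already sorted. */
void merge(int arr[], int start, int mid, int end) {
    int n = end - start + 1;
    int *tmp = malloc(n * sizeof(int));   /* temporary copy of arr[start..end] */
    for (int i = 0; i < n; i++)
        tmp[i] = arr[start + i];

    int left_c = 0;                       /* cursor into the left half          */
    int right_c = mid - start + 1;        /* cursor into the right half         */
    int l_max = mid - start;              /* last index of the left half in tmp */
    int r_max = end - start;              /* last index of the right half       */
    int k = start;

    while (left_c <= l_max && right_c <= r_max) {
        if (tmp[left_c] <= tmp[right_c])  /* <= keeps the sort stable */
            arr[k++] = tmp[left_c++];
        else
            arr[k++] = tmp[right_c++];
    }
    while (left_c <= l_max)  arr[k++] = tmp[left_c++];   /* leftover left  */
    while (right_c <= r_max) arr[k++] = tmp[right_c++];  /* leftover right */
    free(tmp);
}

/* Top-down: recursively split, then merge on the way back up. */
void mergesort(int arr[], int start, int end) {
    if (start >= end) return;             /* one element: already sorted */
    int mid = (start + end) / 2;
    mergesort(arr, start, mid);
    mergesort(arr, mid + 1, end);
    merge(arr, start, mid, end);
}
```

Calling `mergesort(a, 0, 7)` on the array used in the assembly section, `{7, 4, 10, -1, 3, 2, -6, 9}`, produces `{-6, -1, 2, 3, 4, 7, 9, 10}`.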
Mergesort Top-down
.data
size: .word 8
arr: .word 7, 4, 10, -1, 3, 2, -6, 9
str1: .string " Before Sort : "
str2: .string " After Sort : "
newline: .string "\n"
comma: .string " , "
.text
main:
addi sp, sp, -4
sw ra, 0(sp)
la a0, str1
addi a7, x0, 4
ecall
la a0, arr
lw a1, size
jal ra, printArray
la a0, arr
addi a1, x0, 0
lw a2, size
addi a2, a2, -1
jal ra, mergeSort
la a0, str2
addi a7, x0, 4
ecall
la a0, arr
lw a1, size
jal ra, printArray
lw ra, 0(sp)
addi sp, sp, 4
addi a7, x0, 10
ecall
jalr x0, ra, 0
mergeSort:
addi sp, sp, -20
sw ra, 0(sp)
addi s0, a0, 0
addi s1, a1, 0
addi s2, a2, 0
bge s1, s2, endMergeSort
add s3, s1, s2
addi s4, x0, 0
div:
blt s3, x0, endDiv
addi s3, s3, -2
addi s4, s4, 1
jal x0, div
endDiv:
addi s3, s4, -1
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
addi a0, s0, 0
addi a1, s1, 0
addi a2, s3, 0
jal ra, mergeSort
lw a0, 4(sp)
lw a1, 16(sp)
addi a1, a1, 1
lw a2, 12(sp)
jal ra, mergeSort
lw a0, 4(sp)
lw a1, 8(sp)
lw a2, 16(sp)
lw a3, 12(sp)
jal ra, merge
endMergeSort:
lw ra, 0(sp)
addi sp, sp, 20
jalr x0, ra, 0
merge:
sub t0, a3, a1
addi t0, t0, 1
add t1, t0, t0
add t1, t1, t1
xori t2, t1, 0xffffffff
addi t2, t2, 1
add sp, sp, t2
addi t3, a1, 0
addi t2, x0, 0
read2Stack:
blt a3, t3, endRead
add t4, t3, t3
add t4, t4, t4
add t4, a0, t4
lw t5, 0(t4)
add t6, t2, t2
add t6, t6, t6
add t6, sp ,t6
sw t5, 0(t6)
addi t2, t2, 1
addi t3, t3, 1
jal x0, read2Stack
endRead:
sub t4, a2, a1
sub t5, a3, a1
addi t2, x0, 0
addi t3, t4, 1
addi t6, a1, 0
mergeLoop:
slt t0, t4, t2
slt t1, t5, t3
or t0, t0, t1
xori t0, t0, 0x1
beq t0, x0, endMergeLoop
add t0, t2, t2
add t0, t0, t0
add t0, sp, t0
lw t0, 0(t0)
add t1, t3, t3
add t1, t1, t1
add t1, sp, t1
lw t1, 0(t1)
blt t1, t0, rightSmaller
add t1, t6, t6
add t1, t1, t1
add t1, a0, t1
sw t0, 0(t1)
addi t6, t6, 1
addi t2, t2, 1
jal x0, mergeLoop
rightSmaller:
add t0, t6, t6
add t0, t0, t0
add t0, a0, t0
sw t1, 0(t0)
addi t6, t6, 1
addi t3, t3, 1
jal x0, mergeLoop
endMergeLoop:
bge t5, t3, rightLoop
leftLoop:
add t0, t2, t2
add t0, t0, t0
add t0, sp, t0
lw t0, 0(t0)
add t1, t6, t6
add t1, t1, t1
add t1, a0, t1
sw t0, 0(t1)
addi t6, t6, 1
addi t2, t2, 1
bge t4, t2, leftLoop
jal x0, endMerge
rightLoop:
add t1, t3, t3
add t1, t1, t1
add t1, sp, t1
lw t1, 0(t1)
add t0, t6, t6
add t0, t0, t0
add t0, a0, t0
sw t1, 0(t0)
addi t6, t6, 1
addi t3, t3, 1
bge t5, t3, rightLoop
jal x0, endMerge
endMerge:
sub t0, a3, a1
addi t0, t0, 1
add t1, t0, t0
add t1, t1, t1
add sp, sp, t1
jalr x0, ra, 0
printArray:
addi t0, a0, 0
addi t1, a1, 0
addi t2, x0, 0
printLoop:
add t3, t2, t2
add t3, t3, t3
add t3, t0, t3
lw a0, 0(t3)
addi a7, x0, 1
ecall
addi t2, t2, 1
bge t2, t1, endPrint
la a0, comma
addi a7, x0, 4
ecall
jal x0, printLoop
endPrint:
la a0, newline
addi a7, x0, 4
ecall
jalr x0, ra, 0
- `mergeSort` function:
The `mergeSort` function is the recursive implementation of Merge Sort. It is responsible for repeatedly splitting the array into smaller subarrays until each subarray contains only one element. When `start >= end`, the recursion ends, as the subarray is already sorted. The midpoint (`mid`) is calculated using a simulated division operation. After splitting, the left and right subarrays are recursively sorted, and then the `merge` function is called to combine the two sorted subarrays.
- `merge` function:
The `merge` function merges two sorted subarrays into a single sorted array. It begins by copying the data in the `start` to `end` range onto the stack for temporary storage. Then it uses two pointers to traverse the left and right subarrays, comparing the current element from each; the smaller element is written back to the original array and the corresponding pointer advances. If one subarray is exhausted, the remaining elements of the other are copied directly back into the original array. After merging, the allocated stack space is released.
Mergesort Bottom-up
.data
size: .word 8
arr: .word 7, 4, 10, -1, 3, 2, -6, 9
str1: .string " Before Sort : "
str2: .string " After Sort : "
newline: .string "\n"
comma: .string " , "
.text
main:
addi sp, sp, -4
sw ra, 0(sp)
la a0, str1
addi a7, x0, 4
ecall
la a0, arr
lw a1, size
jal ra, printArray
la a0, arr
lw a1, size
jal ra, iterativeMergeSort
la a0, str2
addi a7, x0, 4
ecall
la a0, arr
lw a1, size
jal ra, printArray
lw ra, 0(sp)
addi sp, sp, 4
addi a7, x0, 10
ecall
jalr x0, ra, 0
iterativeMergeSort:
addi sp, sp, -24
sw ra, 0(sp)
sw s0, 4(sp)
sw s1, 8(sp)
sw s2, 12(sp)
sw s3, 16(sp)
sw s4, 20(sp)
mv s0, a0
mv s1, a1
addi s2, x0, 1
width_loop:
bge s2, s1, end_sort
addi s3, x0, 0
left_loop:
add t0, s3, s2
add t1, t0, s2
bge t1, s1, adjust_end
j do_merge
adjust_end:
mv t1, s1
do_merge:
bge s3, t0, next_iteration
mv a0, s0
mv a1, s3
addi a2, t0, -1
addi a3, t1, -1
jal ra, merge
next_iteration:
add s3, s3, s2
add s3, s3, s2
blt s3, s1, left_loop
add s2, s2, s2
j width_loop
end_sort:
lw ra, 0(sp)
lw s0, 4(sp)
lw s1, 8(sp)
lw s2, 12(sp)
lw s3, 16(sp)
lw s4, 20(sp)
addi sp, sp, 24
jalr x0, ra, 0
merge:
sub t0, a3, a1
addi t0, t0, 1
add t1, t0, t0
add t1, t1, t1
xori t2, t1, 0xffffffff
addi t2, t2, 1
add sp, sp, t2
addi t3, a1, 0
addi t2, x0, 0
read2Stack:
blt a3, t3, endRead
add t4, t3, t3
add t4, t4, t4
add t4, a0, t4
lw t5, 0(t4)
add t6, t2, t2
add t6, t6, t6
add t6, sp, t6
sw t5, 0(t6)
addi t2, t2, 1
addi t3, t3, 1
j read2Stack
endRead:
sub t4, a2, a1
sub t5, a3, a1
addi t2, x0, 0
addi t3, t4, 1
addi t6, a1, 0
mergeLoop:
slt t0, t4, t2
slt t1, t5, t3
or t0, t0, t1
xori t0, t0, 0x1
beq t0, x0, endMergeLoop
add t0, t2, t2
add t0, t0, t0
add t0, sp, t0
lw t0, 0(t0)
add t1, t3, t3
add t1, t1, t1
add t1, sp, t1
lw t1, 0(t1)
blt t1, t0, rightSmaller
add t1, t6, t6
add t1, t1, t1
add t1, a0, t1
sw t0, 0(t1)
addi t6, t6, 1
addi t2, t2, 1
j mergeLoop
rightSmaller:
add t0, t6, t6
add t0, t0, t0
add t0, a0, t0
sw t1, 0(t0)
addi t6, t6, 1
addi t3, t3, 1
j mergeLoop
endMergeLoop:
bge t5, t3, rightLoop
leftLoop:
add t0, t2, t2
add t0, t0, t0
add t0, sp, t0
lw t0, 0(t0)
add t1, t6, t6
add t1, t1, t1
add t1, a0, t1
sw t0, 0(t1)
addi t6, t6, 1
addi t2, t2, 1
bge t4, t2, leftLoop
j endMerge
rightLoop:
add t1, t3, t3
add t1, t1, t1
add t1, sp, t1
lw t1, 0(t1)
add t0, t6, t6
add t0, t0, t0
add t0, a0, t0
sw t1, 0(t0)
addi t6, t6, 1
addi t3, t3, 1
bge t5, t3, rightLoop
endMerge:
sub t0, a3, a1
addi t0, t0, 1
add t1, t0, t0
add t1, t1, t1
add sp, sp, t1
jalr x0, ra, 0
printArray:
addi t0, a0, 0
addi t1, a1, 0
addi t2, x0, 0
printLoop:
add t3, t2, t2
add t3, t3, t3
add t3, t0, t3
lw a0, 0(t3)
addi a7, x0, 1
ecall
addi t2, t2, 1
bge t2, t1, endPrint
la a0, comma
addi a7, x0, 4
ecall
j printLoop
endPrint:
la a0, newline
addi a7, x0, 4
ecall
jalr x0, ra, 0
- `iterativeMergeSort` function:
The core logic of `iterativeMergeSort` is to simulate the recursive divide-and-conquer process with loops that gradually merge subarrays. Initially the subarray size is 1, so it starts by merging single-element subarrays. As the iterations progress, the subarray size doubles (1, 2, 4, …) until the entire array is sorted. In each iteration, the program calculates the left and right boundaries of the subarrays to be merged, clamping them so they do not exceed the array's size, and then calls the `merge` function to combine the left and right subarrays into one sorted subarray.
- `merge` function:
The `merge` function merges two sorted subarrays. It first copies the elements of the subarrays onto the stack as temporary storage, then uses the two-pointer method to traverse both subarrays simultaneously, writing the smaller element back into the original array one at a time. Once one subarray is fully processed, the remaining elements of the other are copied directly into the array. Finally, the temporary stack space is released to avoid wasting memory.
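Expressed in C, the width-doubling structure described above might look like the following sketch (the clamping of the right boundary mirrors the `adjust_end` label in the assembly; the `merge` helper is the same routine as in the top-down version):

```c
#include <stdlib.h>

/* Merge arr[start..mid] with arr[mid+1..end], both already sorted. */
void merge(int arr[], int start, int mid, int end) {
    int n = end - start + 1;
    int *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) tmp[i] = arr[start + i];
    int l = 0, r = mid - start + 1, k = start;
    while (l <= mid - start && r <= end - start)
        arr[k++] = (tmp[l] <= tmp[r]) ? tmp[l++] : tmp[r++];
    while (l <= mid - start) arr[k++] = tmp[l++];
    while (r <= end - start) arr[k++] = tmp[r++];
    free(tmp);
}

/* Bottom-up: merge runs of width 1, 2, 4, ... until one sorted run remains. */
void iterative_mergesort(int arr[], int n) {
    for (int width = 1; width < n; width *= 2) {
        /* left < n - width guarantees a non-empty right run exists. */
        for (int left = 0; left < n - width; left += 2 * width) {
            int mid = left + width - 1;        /* end of the left run  */
            int end = left + 2 * width - 1;    /* end of the right run */
            if (end > n - 1) end = n - 1;      /* clamp to the array   */
            merge(arr, left, mid, end);
        }
    }
}
```

Because the runs merged at each width are adjacent in memory, the access pattern stays sequential, which is exactly the cache-friendliness argued for above.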
Execution Info
|  | Top-Down | Bottom-Up |
| --- | --- | --- |
| Cycles | 2032 | 1616 |
| Instrs. retired | 1564 | 1280 |
| CPI | 1.3 | 1.26 |
| IPC | 0.77 | 0.792 |
- Starting with Cycles, the Top-Down approach takes 2032 cycles to complete execution, whereas the Bottom-Up approach requires only 1616 cycles, indicating that the Bottom-Up method is faster overall. Looking at Instructions Retired, the Top-Down approach completes 1564 instructions, while the Bottom-Up approach only needs 1280 instructions to accomplish the same task. This shows that Bottom-Up is more efficient in design, reducing the number of operations needed for execution.
- In terms of CPI (the average number of cycles required per instruction), the Top-Down approach has a CPI of 1.3, while the Bottom-Up approach achieves a lower CPI of 1.26. This means that the Bottom-Up approach requires fewer cycles per instruction on average, indicating slightly higher efficiency in the processor's pipeline. On the other hand, looking at IPC (the number of instructions executed per cycle), the Bottom-Up approach achieves 0.792, which is higher than the 0.77 achieved by the Top-Down approach. This demonstrates that the Bottom-Up method utilizes processor resources and instruction parallelism more effectively.
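The CPI and IPC rows follow directly from the Cycles and Instrs. retired rows (CPI = cycles / instructions, IPC = its reciprocal). As a quick sanity check of the table, a trivial helper (not part of the measured programs):

```c
#include <math.h>

/* CPI: average cycles per retired instruction. */
double cpi(double cycles, double instrs) { return cycles / instrs; }

/* IPC: average instructions retired per cycle (reciprocal of CPI). */
double ipc(double cycles, double instrs) { return instrs / cycles; }
```

For example, 2032 / 1564 ≈ 1.30 and 1616 / 1280 ≈ 1.26, matching the table.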
Cache configuration
| Setting | Value |
| --- | --- |
| Repl. policy | LRU |
| Wr. hit | Write-back |
| Wr. miss | Write allocate |
Statistics
|  | Top-Down | Bottom-Up |
| --- | --- | --- |
| Hit rate | 0.9627 | 0.9521 |
| Hits | 232 | 139 |
| Misses | 9 | 7 |
- While the Top-Down method achieves a slightly higher hit rate (96.27% vs. 95.21%), the Bottom-Up approach performs far fewer memory accesses overall: 146 (139 hits and 7 misses) versus Top-Down's 241 (232 hits and 9 misses). Bottom-Up therefore needs far fewer memory operations during execution, primarily thanks to its more structured, contiguous memory-access pattern. With fewer accesses and fewer misses, the Bottom-Up approach achieves higher overall efficiency, which is reflected in its lower total cycle count.
DFS
Basic concept
Depth-First Search (DFS) is a classic algorithm for traversing a graph or tree. Its key characteristic is to explore as deeply as possible along one branch before backtracking to explore other unvisited branches. Recursive implementation is a common and natural way to implement DFS, as it leverages the function call stack to handle backtracking.
Disadvantage of using Recursive DFS
1. Stack Overflow
- Recursive DFS relies on the function call stack to manage the traversal. For each recursive call, a new frame is pushed onto the stack.
- If the graph is large or has deep paths (e.g., in a tree-like structure), the recursion depth can exceed the system's stack limit, leading to a stack overflow error.
2. Inefficient Memory Usage
- Each recursive call stores information on the call stack, such as parameters passed to the function, local variables, and return addresses.
- As the recursion depth increases, the stack memory usage grows linearly with the depth.
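A common remedy for these drawbacks is to manage an explicit stack instead of relying on recursion. The sketch below illustrates the idea using the 5-node adjacency matrix that appears later in this document; the function name and the `order` output array are mine, not from the original code.

```c
#define N 5

/* Iterative DFS over an adjacency matrix. An explicit stack replaces the
 * call stack, so memory use is a fixed array instead of recursion frames,
 * removing the stack-overflow risk on deep graphs. Records the visit
 * order in `order` and returns the number of visited nodes. */
int dfs_iterative(const int adj[N][N], int start, int order[N]) {
    int visited[N] = {0};
    int stack[N * N];               /* worst case: every edge pushes once */
    int top = 0, count = 0;
    stack[top++] = start;
    while (top > 0) {
        int v = stack[--top];
        if (visited[v]) continue;   /* a node may be pushed more than once */
        visited[v] = 1;
        order[count++] = v;
        /* Push neighbors in reverse so lower-numbered nodes pop first,
         * matching the visiting order of the recursive version. */
        for (int u = N - 1; u >= 0; u--)
            if (adj[v][u] && !visited[u])
                stack[top++] = u;
    }
    return count;
}
```

On the graph used in the assembly listings, starting from node 1, this visits the nodes in the order 1, 0, 2, 3, 4.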
DFS C code
- `createNode` function:
The `createNode` function creates a new node. It takes the destination node number `dest` as a parameter, allocates memory for a new `Node` structure, sets `dest` as the value of the node, and initializes its `next` pointer to `NULL`. Finally, it returns the pointer to the newly created node.
- `DFSRec` function:
The `DFSRec` function implements the recursive depth-first search. Starting from the given node `s`, it marks the node as visited and prints it. It then iterates through all the neighbors of node `s`; for each unvisited neighbor, it recursively calls `DFSRec` until all nodes reachable from `s` have been visited.
- `DFS` function:
The `DFS` function is the entry point for the depth-first search. It initializes a `visited` array to track which nodes have been visited, and then calls the recursive `DFSRec` function, starting the traversal from the specified source node `s`.
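The C listing itself does not appear above, so here is a hedged reconstruction of the three functions exactly as described; the adjacency-list layout (`Node *adj[]`) is an assumption.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    int dest;              /* destination node number           */
    struct Node *next;     /* next entry in the adjacency list  */
} Node;

/* Allocate a list node pointing at `dest`, with next set to NULL. */
Node *createNode(int dest) {
    Node *n = malloc(sizeof(Node));
    n->dest = dest;
    n->next = NULL;
    return n;
}

/* Recursive DFS: mark s as visited, print it, then recurse into every
 * unvisited neighbor on s's adjacency list. */
void DFSRec(Node *adj[], int visited[], int s) {
    visited[s] = 1;
    printf("%d ", s);
    for (Node *p = adj[s]; p != NULL; p = p->next)
        if (!visited[p->dest])
            DFSRec(adj, visited, p->dest);
}

/* Entry point: zero the visited array, then start the recursion at s. */
void DFS(Node *adj[], int n, int s) {
    int *visited = calloc(n, sizeof(int));
    DFSRec(adj, visited, s);
    free(visited);
}
```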
DFS
.data
adj_matrix:
.word 0,1,1,0,0 # node 0's connections
.word 1,0,1,0,0 # node 1's connections
.word 1,1,0,1,1 # node 2's connections
.word 0,0,1,0,0 # node 3's connections
.word 0,0,1,0,0 # node 4's connections
visited: .word 0,0,0,0,0 # visited array
msg1: .string "DFS starting from node: "
arrow: .string " -> "
newline: .string "\n"
.text
.global main
main:
la a0, msg1
li a7, 4
ecall
li a0, 1
li a7, 1
ecall
la a0, newline
li a7, 4
ecall
la t0, visited
sw zero, 0(t0)
sw zero, 4(t0)
sw zero, 8(t0)
sw zero, 12(t0)
sw zero, 16(t0)
li a0, 1
jal dfs
la a0, newline
li a7, 4
ecall
li a7, 10
ecall
dfs:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw s1, 4(sp)
sw s2, 0(sp)
mv s0, a0
la t0, visited
slli t1, s0, 2
add t1, t0, t1
lw t2, 0(t1)
bnez t2, dfs_return # if already visited, return
li t2, 1
sw t2, 0(t1)
mv a0, s0
li a7, 1
ecall
mv s1, zero # Initialize neighbor counter
li s2, 0 # Flag to track if we found any unvisited neighbors
check_neighbors:
li t0, 5
beq s1, t0, after_neighbors # If checked all neighbors, go to after_neighbors
la t0, adj_matrix
li t1, 20 # 5 nodes * 4 bytes per row
mul t1, s0, t1 # Get row offset
add t0, t0, t1 # Get row address
slli t1, s1, 2 # Get column offset
add t0, t0, t1 # Get edge address
lw t1, 0(t0) # Load edge value
beqz t1, next_neighbor
la t0, visited
slli t1, s1, 2
add t0, t0, t1
lw t1, 0(t0)
bnez t1, next_neighbor # If visited, skip
la a0, arrow
li a7, 4
ecall
mv a0, s1
li s2, 1 # Set flag indicating we found an unvisited neighbor
jal dfs
next_neighbor:
addi s1, s1, 1 # Increment neighbor counter
j check_neighbors
after_neighbors:
beqz s2, dfs_return # If no unvisited neighbors were found, return without printing arrow
dfs_return:
lw s2, 0(sp)
lw s1, 4(sp)
lw s0, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16
ret
- `main` function:
The `main` function serves as the program's entry point and is responsible for overall control flow. It starts by printing an initial message and initializing the `visited` array to mark all nodes as unvisited. It then begins the depth-first search by calling the `dfs` function starting from node 1. After the traversal is complete, `main` prints a newline and terminates the program with a system call.
- `dfs` function:
The `dfs` function implements the recursive depth-first search. Upon entry, it saves the current node and register state on the stack to preserve context across recursive calls. It then checks whether the current node has already been visited; if not, it marks the node as visited and prints its value. The function then iterates over the neighbors of the current node, recursively calling `dfs` for any unvisited neighbor. Once all neighbors are processed, it restores the saved registers and returns to the previous level.
- Neighbor-checking logic:
The neighbor-checking logic, embedded within the `dfs` function, iterates over all neighbors of the current node. Using the adjacency matrix, it computes each neighbor's position and checks whether an edge exists (matrix value of 1) and whether the neighbor is unvisited. If both conditions hold, the program prints an arrow to indicate the connection and recursively calls `dfs` on that neighbor. Once all neighbors have been checked, `dfs` starts to backtrack.
Optimized DFS
.data
adj_matrix:
.word 0,1,1,0,0
.word 1,0,1,0,0
.word 1,1,0,1,1
.word 0,0,1,0,0
.word 0,0,1,0,0
visited: .word 0,0,0,0,0
msg1: .string "DFS starting from node: "
arrow: .string " -> "
newline: .string "\n"
.text
.global main
main:
la a0, msg1
li a7, 4
ecall
li a0, 1
li a7, 1
ecall
la a0, newline
li a7, 4
ecall
la t0, visited
sw zero, 0(t0)
sw zero, 4(t0)
sw zero, 8(t0)
sw zero, 12(t0)
sw zero, 16(t0)
li a0, 1
jal optimized_dfs
la a0, newline
li a7, 4
ecall
li a7, 10
ecall
optimized_dfs:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
sw s1, 4(sp)
sw s2, 0(sp)
mv s0, a0
la t0, visited
slli t1, s0, 2
add t1, t0, t1
li t2, 1
sw t2, 0(t1)
mv a0, s0
li a7, 1
ecall
li s1, 0
check_neighbors:
li t0, 5
beq s1, t0, dfs_return
la t0, adj_matrix
li t1, 20
mul t1, s0, t1
add t0, t0, t1
slli t1, s1, 2
add t0, t0, t1
lw t2, 0(t0)
beqz t2, next_neighbor
la t0, visited
slli t1, s1, 2
add t1, t0, t1
lw t2, 0(t1)
bnez t2, next_neighbor
la a0, arrow
li a7, 4
ecall
mv a0, s1
jal optimized_dfs
next_neighbor:
addi s1, s1, 1
j check_neighbors
dfs_return:
lw s2, 0(sp)
lw s1, 4(sp)
lw s0, 8(sp)
lw ra, 12(sp)
addi sp, sp, 16
ret
- `main` function:
The `main` function serves as the entry point of the program and coordinates the execution flow. It starts by printing a message indicating the start of the DFS traversal and initializes the `visited` array, marking all nodes as unvisited. It then begins the traversal from node 1 by invoking the `optimized_dfs` function. Once the traversal is complete, it prints a newline and exits via a system call.
- `optimized_dfs` function:
The `optimized_dfs` function is the recursive implementation of depth-first search. It begins by saving the return address and relevant registers on the stack to preserve context during recursion. The current node is then marked as visited in the `visited` array, and its value is printed. The function iterates through all potential neighbors of the current node by consulting the adjacency matrix. If a neighbor is connected and unvisited, the function prints an arrow symbol and recursively calls itself to explore that neighbor. Once all neighbors are processed, it restores the saved registers and returns to the previous level of recursion.
- Neighbor-checking logic:
The neighbor-checking logic is embedded within the `optimized_dfs` function. It computes each neighbor's position in the adjacency matrix from the current node's row and the neighbor's column offset. For each neighbor, it checks whether a connecting edge exists and whether the neighbor has already been visited; only unvisited, connected neighbors are explored recursively. This ensures that all reachable nodes are visited while avoiding revisits.
Comparison of optimized DFS and DFS
1. Efficiency of Memory Address Calculation
`Opt DFS`:
- The base address of `visited` (`la t0, visited`) is loaded only once and kept in a register; subsequent operations reuse it, avoiding redundant calculations.
- The offset for the current node (`s0`) is calculated once and used directly in subsequent memory accesses.
`DFS`:
- The base address of `visited` is recalculated for every neighbor, leading to unnecessary computations and additional memory accesses.
2. Efficiency of Adjacency Matrix Address Calculation
`Opt DFS`:
- The base address of the adjacency-matrix row for the current node is calculated only once, stored in a register, and reused for each neighbor, avoiding repeated calculations.
- Each neighbor check then only needs to add the column offset, which is efficient.
`DFS`:
- The row base address and column offset are recalculated for every neighbor check, introducing redundant operations and memory accesses.
3. Cache Hit Rate and Access Patterns
`Opt DFS`:
- The base addresses of the adjacency matrix and the `visited` array are reused many times, exhibiting temporal locality in cache usage.
- Accesses to the adjacency matrix and `visited` array are predictable and concentrated, resulting in a high cache hit rate.
`DFS`:
- Recalculating the base addresses and offsets for every neighbor leads to more scattered memory access patterns.
- These less predictable access patterns result in lower cache hit rates and more main-memory accesses.
4. Simplified Conditional Logic
`Opt DFS`:
- The check for whether a neighbor has been visited is separated from address calculation; it operates directly on values already in registers, reducing complexity.
`DFS`:
- Conditional checks are intertwined with redundant address calculations, adding overhead to every neighbor check.
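Points 1 and 2 are the assembly-level analogue of hoisting loop-invariant address computation out of a loop. A small C illustration of the two patterns (a compiler usually performs this transformation automatically, but in hand-written assembly it must be done explicitly; the function names are mine):

```c
#define N 5

/* Unoptimized pattern: the row base is recomputed on every neighbor
 * check, mirroring the repeated la/mul inside the DFS loop. */
int count_edges_naive(const int adj[N][N], int node) {
    int count = 0;
    for (int u = 0; u < N; u++) {
        const int *row = adj[node];   /* recomputed each iteration */
        if (row[u]) count++;
    }
    return count;
}

/* Optimized pattern: compute the row base once before the loop and
 * reuse it, as optimized_dfs keeps the base address in a register. */
int count_edges_hoisted(const int adj[N][N], int node) {
    const int *row = adj[node];       /* hoisted loop-invariant */
    int count = 0;
    for (int u = 0; u < N; u++)
        if (row[u]) count++;
    return count;
}
```

Both functions return the same result; the difference is only in how much redundant address arithmetic sits inside the loop body.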
Execution Info
|  | DFS | Opt DFS |
| --- | --- | --- |
| Cycles | 777 | 744 |
| Instrs. retired | 579 | 555 |
| CPI | 1.34 | 1.34 |
| IPC | 0.745 | 0.746 |
`Opt DFS` consumes fewer CPU cycles because it eliminates redundant calculations and minimizes memory accesses by reusing values kept in registers. For example, the base address of the adjacency-matrix row for a node is computed only once and stored in a register, which is then reused for all neighbor checks; this avoids recalculating the same base address multiple times, as happens in `DFS`. Similarly, the base address of the `visited` array is calculated once and reused, reducing the overhead of repeated address computations.
In `DFS`, every neighbor check recalculates the row base address and column offset for the adjacency matrix and reloads the `visited` array's base address. These repeated operations increase the number of executed instructions, and therefore the cycle count. Additionally, `Opt DFS`'s streamlined structure allows slightly better instruction-level parallelism, since fewer dependent instructions must execute in sequence.
By reducing redundant operations and using memory and registers efficiently, `Opt DFS` lowers the total instruction count and improves execution efficiency, directly reducing the number of CPU cycles required.
Cache configuration
| Setting | Value |
| --- | --- |
| Repl. policy | LRU |
| Wr. hit | Write-back |
| Wr. miss | Write allocate |
Statistics
|  | DFS | Opt DFS |
| --- | --- | --- |
| Hit rate | 0.8667 | 0.8588 |
| Hits | 78 | 73 |
| Misses | 12 | 12 |
It's evident that although `Opt DFS` is optimized, its cache hit rate is slightly lower than that of `DFS`. This is because `Opt DFS`'s optimization focuses on significantly reducing the total number of memory accesses, whereas `DFS` performs far more memory accesses due to redundant calculations and repeated base-address recalculations. These extra accesses in `DFS` inflate its hit count and hit rate, but they also introduce unnecessary overhead.
On the other hand, both implementations have the same number of cache misses, which suggests that both encounter memory blocks that cannot fully benefit from the cache. This is likely because some accesses to the adjacency matrix and `visited` array lack spatial locality, leading to unavoidable misses.
The critical difference lies in the total number of memory accesses. `Opt DFS` reduces accesses by reusing computed addresses stored in registers, while `DFS` repeatedly recalculates and re-accesses memory for each neighbor. As a result, `Opt DFS` performs fewer overall memory accesses and therefore consumes fewer CPU cycles, making it more efficient than `DFS`.