CS3210 README
===
### A brief description of your program’s design and implementation assumptions, if any.
Implementation (tuning): see the factors marked (*) in the performance discussion below.
### A brief explanation of the parallel strategy you used in your OpenMP implementation, e.g. synchronisation, work distribution, etc.
If you used multiple methods to parallelize, briefly explain the reasons for choosing a specific method.
We decided on the parallelization method by applying Foster's design methodology.
#### Foster's Design Methodology
1. Partition the problem into smaller tasks
    - Smallest unit of task: one square (cell) on the grid
    - Evaluating a single square requires input from all 8 of its neighbors
2. Check how tasks communicate with each other (coordination)
    - Shared memory (OpenMP does not support message passing)
    - Synchronization primitives may differ (barriers, semaphores, mutexes); see the pthreads sketch after this list
    - The synchronization algorithm can also be implemented differently (local communication vs. global communication)
3. Agglomerate small tasks into larger (coarse-grained) tasks
    - Aim: reduce communication and thread overhead
    - Scalability: the number of tasks should still grow as the problem size increases
    - Combining units into a larger chunk of squares reduces communication
    - Factor to check: chunk shape (rows vs. squares)
    - Each chunk of squares communicates locally with its neighbouring chunks
4. Map tasks to run on processors or cores with the goal of reducing execution time
    - pthreads implementation: we map via a shared task pool
    - OpenMP: the compiler/runtime handles the mapping
- Additional requirements: input sizes are bounded by
    - 0 ≤ N_GENERATIONS ≤ 1,000,000
    - 1 ≤ N_ROWS ≤ 1,000,000
    - 1 ≤ M_COLS ≤ 1,000,000
    - 1 ≤ N_INVASION ≤ 1,000
- We test with a timeout of 15 minutes?
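To make the coordination step concrete, here is a minimal pthreads sketch of barrier-synchronised generations over a double-buffered grid. This is a sketch under assumptions, not our final implementation: `step_block()`, `n_generations`, and `NUM_THREADS` are hypothetical placeholders.

```c
#include <pthread.h>

#define NUM_THREADS 8   /* placeholder thread count */

/* Hypothetical helper: compute one generation for thread `id`'s block,
 * reading freely from `in` but writing only that thread's block of `out`. */
extern void step_block(const unsigned char *in, unsigned char *out, long id);

static pthread_barrier_t gen_barrier;
static unsigned char *before, *after;   /* double buffer: read vs. write state */
static int n_generations;

static void *worker(void *arg) {
    long id = (long)arg;
    for (int gen = 0; gen < n_generations; gen++) {
        step_block(before, after, id);        /* shared reads, exclusive writes */
        pthread_barrier_wait(&gen_barrier);   /* all blocks written for this gen */
        if (id == 0) {                        /* exactly one thread swaps */
            unsigned char *tmp = before;
            before = after;
            after = tmp;
        }
        pthread_barrier_wait(&gen_barrier);   /* swap visible before next gen */
    }
    return NULL;
}

/* Spawn side (allocation and joining omitted):
 *   pthread_barrier_init(&gen_barrier, NULL, NUM_THREADS);
 *   for (long i = 0; i < NUM_THREADS; i++)
 *       pthread_create(&tids[i], NULL, worker, (void *)i);
 */
```

The two barriers per generation are deliberate: the first guarantees every block has been written, the second guarantees the pointer swap is visible to all threads before the next generation starts reading.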
#### Impl plan
1. Declare X, N as magic constants or read them from command-line arguments
1. Test which shape is better
- Squares:
- X (length) = {1, 3, 10, 30}
- N = {1, 3, 5}
- Rows:
- X (rows) = {1, 3, 10, 30, 100}
- N = {1, 3, 5}
1. Test whether n matters, or whether n = 1 is best
1. Pthreads Impl 1 (double buffering; see the barrier sketch after the methodology list above):
    1. Create a pthread for each block inside the for loop.
    2. Store two grid states: (before, after).
    3. All blocks share read access to the before state.
    4. Each thread has exclusive write access to its own block of the after state (not enforced, but other threads never access that index range).
    5. After each generation, swap the (before, after) pointers and reuse the old before as the new after.
1. Pthreads Impl 2:
    1. Try a recursive split for dividing blocks and merging n_death_toll
1. Pthreads Impl 3:
1. Worker pool + shared variable n_death_toll
2. OpenMP Impl 1:
    1. Pragma directive directly on the double for loop (see the sketch after this list)
    2. Result: >2x speedup
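A minimal sketch of this OpenMP variant, assuming a flattened row-major grid; `next_state()` and the exact death-counting rule are hypothetical placeholders:

```c
#include <omp.h>

/* Hypothetical helper: next value of cell (i, j), reading its 8 neighbours
 * from the flattened row-major `before` grid. */
extern int next_state(const int *before, int i, int j, int n_rows, int m_cols);

long step_generation(const int *before, int *after, int n_rows, int m_cols) {
    long n_death_toll = 0;
    /* collapse(2) merges both loops into one iteration space for the runtime
     * to distribute; reduction(+:...) gives each thread a private counter. */
    #pragma omp parallel for collapse(2) reduction(+:n_death_toll)
    for (int i = 0; i < n_rows; i++) {
        for (int j = 0; j < m_cols; j++) {
            int next = next_state(before, i, j, n_rows, m_cols);
            if (before[i * m_cols + j] != 0 && next == 0)
                n_death_toll++;               /* a live cell died this generation */
            after[i * m_cols + j] = next;
        }
    }
    return n_death_toll;
}
```

The reduction clause avoids contention on a shared counter; the (before, after) buffers would still be swapped between generations, as in the pthreads sketch.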
#### Any special consideration or implementation detail that you consider non-trivial.
- In your report, present and explain your insights for at least the following things:
- graphs showing how execution time (y-axis) varies with input size, and with the number of threads (x-axis) at a fixed input size (threads in the range 1-64)
- Other variables that contribute to performance
- Density of input examples:
1. Density of the initial grid (tested on the sequential algorithm):
    - (randomly generated grids with 20%, 40%, 60%, 80% density)
    - Since there is no lazy evaluation and every grid cell is checked, there is no significant runtime difference between a sparse (many empty cells) and a dense grid.
    - In our future benchmarks, we will treat density as an unimportant factor.
2. Density of invasion plan
- Invasion is a write operation on certain cells
- (1) Total number of invasions
- (2) Density of replaced cells per invasion
3. (*) Shape of Combined units
- Rows vs squares
- [Square superior in reducing communication cost](https://tcpp.cs.gsu.edu/curriculum/?q=system/files/ch10.pdf)
- [Row superior because accessing row elements is faster](http://www.shodor.org/media/content/petascale/materials/UPModules/GameOfLife/Life_Module_Document_pdf.pdf)
- Check both claims against our own benchmarks
4. (*) Varying the size of input for each thread
    - Number of generations each task computes ahead (n)
    - Size of rows or squares (x)
    - Squares: each thread reads an (x+2n)² input block (a halo of n cells on every side) and writes an x² output block
    - Rows: each thread reads (x+2n)·ROW_SIZE input cells and writes x·ROW_SIZE output cells
    - Rows: make each row chunk a fixed width to enable scaling
    - See the worked example after this list
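Worked example for the formulas above (assuming the halo-of-n interpretation): with x = 10 and n = 3, a square task reads (10 + 2·3)² = 256 cells to produce 10² = 100 cells, an input/output ratio of 2.56; a row task on a grid with ROW_SIZE = 1,000 reads (10 + 2·3)·1,000 = 16,000 cells to produce 10·1,000 = 10,000, a ratio of 1.6. For the same x and n, rows re-read proportionally fewer halo cells, which is exactly the rows-vs-squares trade-off item 3 sets out to test.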
- Details on how to reproduce your results, e.g. inputs, execution time measurement, etc. (see the timing sketch below)
- Execution time and speedup measurements of your OpenMP implementation running one thread vs. your parallel implementations. Vary input size and the number of threads, etc.
- State your assumptions and mention the lab machines (hostnames) that you have used in your experiments.
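For execution time measurement, a minimal sketch assuming the OpenMP build; `run_simulation()` is a hypothetical placeholder for one full run (parse input, simulate, print results):

```c
#include <stdio.h>
#include <omp.h>

extern void run_simulation(void);   /* hypothetical: one complete run */

int main(void) {
    double t0 = omp_get_wtime();    /* wall-clock time, valid across threads */
    run_simulation();
    double t1 = omp_get_wtime();
    fprintf(stderr, "elapsed: %.3f s, max threads: %d\n",
            t1 - t0, omp_get_max_threads());
    return 0;
}
```

The 1-64 thread sweep can then be driven externally, e.g. `OMP_NUM_THREADS=32 ./life sample.in` (binary and input names illustrative).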