pthread_join: a thread suspends execution until another thread terminates
Mutex: at any one time only a single thread may hold the lock and access the data it protects (minimal usage sketch after this list)
PTHREAD_MUTEX_INITIALIZER
pthread_mutex_init()
pthread_mutex_lock()
pthread_mutex_unlock()
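A minimal sketch of mutex usage with the API above, assuming two threads increment a shared counter; the names counter and worker are illustrative, not from the notes:

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* static initialization */
static long counter = 0;                                   /* data protected by the lock */

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* only one thread may hold the lock at a time */
        counter++;                     /* safe access to the shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);            /* main thread blocks until t1 terminates */
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);
    return 0;
}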
Condition variable: materializes an event and provides functions to wake up threads waiting for that event (sketch after this list)
PTHREAD_COND_INITIALIZER
pthread_cond_init()
pthread_cond_wait()
pthread_cond_timedwait()
pthread_cond_signal()
pthread_cond_broadcast()
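A minimal sketch of waiting on and signalling a condition variable; the ready flag and the function names are illustrative. The while loop re-checks the predicate because wakeups may be spurious:

#include <pthread.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static int ready = 0;                    /* the "event", materialized as shared state */

void wait_for_event(void)                /* waiting side */
{
    pthread_mutex_lock(&m);
    while (!ready)                       /* re-check: wakeups may be spurious */
        pthread_cond_wait(&cv, &m);      /* atomically unlocks m and sleeps */
    pthread_mutex_unlock(&m);
}

void fire_event(void)                    /* signalling side */
{
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);            /* wake one waiter; broadcast wakes all */
    pthread_mutex_unlock(&m);
}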
pthread_once: guarantees that an initialization function executes only once even when called from many threads (sketch below)
PTHREAD_ONCE_INIT
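A small sketch of pthread_once; init_shared_state is a hypothetical initialization function:

#include <pthread.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;

static void init_shared_state(void)           /* runs exactly once, no matter how many callers */
{
    /* ... one-time initialization ... */
}

void *thread_fn(void *arg)
{
    pthread_once(&once, init_shared_state);   /* safe to call from every thread */
    /* ... use the initialized state ... */
    return NULL;
}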
Thread-specific data key: a kind of pointer that associates data with a particular thread
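A hedged sketch of thread-specific data using pthread_key_create / pthread_setspecific / pthread_getspecific; the key and the stored integer are illustrative:

#include <pthread.h>
#include <stdlib.h>

static pthread_key_t key;                     /* the "pointer" associating data with a thread */
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

static void make_key(void)
{
    pthread_key_create(&key, free);           /* free() runs on each thread's value at exit */
}

void *thread_fn(void *arg)
{
    pthread_once(&key_once, make_key);
    int *p = malloc(sizeof *p);               /* per-thread private copy */
    *p = 42;
    pthread_setspecific(key, p);
    int *mine = pthread_getspecific(key);     /* always returns this thread's own value */
    (void)mine;
    return NULL;
}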
PTHREAD_CANCEL_DISABLE
PTHREAD_CANCEL_ENABLE
PTHREAD_CANCEL_ASYNCHRONOUS: cancel immediately
PTHREAD_CANCEL_DEFERRED: cancellation takes effect only when a cancellation point is reached, e.g. pthread_cond_wait(), pthread_cond_timedwait(), pthread_join()
pthread_testcancel()
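A sketch of how the cancellation state/type constants and pthread_testcancel() fit together; the worker and critical-section functions are hypothetical:

#include <pthread.h>

void *cancellable_worker(void *arg)
{
    int old;
    /* deferred cancellation (the default): honored only at cancellation points */
    pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, &old);
    for (;;) {
        /* ... long computation containing no cancellation points ... */
        pthread_testcancel();                /* insert an explicit cancellation point */
    }
    /* not reached: the thread exits when a pending cancel is acted upon */
}

void critical_section(void)
{
    int old;
    pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &old);   /* must not be cancelled here */
    /* ... update shared state consistently ... */
    pthread_setcancelstate(old, NULL);                       /* restore (PTHREAD_CANCEL_ENABLE) */
}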
One instruction is executing, the next instruction is being decoded, and the one after that is being fetched…
At the beginning of each clock cycle, the data and control information for a partially processed instruction is held in a pipeline latch; during the cycle the signals propagate through the logic of that stage, producing an output just in time to be captured by the next pipeline latch at the end of the clock cycle.
Since the result is available after the execute stage, the next instruction ought to be able to use that value immediately. To allow this, forwarding lines called bypasses are added
The logic gates that make up each stage can be subdivided, especially the longer ones, converting the pipeline into a deeper super-pipeline with a larger number of shorter stages.
The execute stage of the pipeline consists of different functional units (each doing its own task)
We can execute multiple instructions in parallel with the fetch and decode/dispatch stages enhanced to decode multiple instructions in parallel and send them out to the "execution resources."
There are independent pipelines for each functional unit
The issue width is typically less than the number of functional units.
The number of instructions able to be issued, executed or completed per cycle is called a processor's width.
Superpipeline + Superscalar (just called superscalar for short)
VLIW: the instructions are explicit groups of little sub-instructions
The compiler needs to insert the appropriate number of cycles between dependent instructions if necessary.
The number of cycles between when an instruction reaches the execute stage and when its result is available to be used is called the instruction's latency
The deeper pipeline could easily get filled up with bubbles due to instructions depending on each other
Latencies for memory loads are particularly troublesome
When the processor encounters a conditional branch, it must guess the outcome to avoid losing the performance gained from pipelining.
Those instructions will not be committed until the outcome of the branch is known.
How the processor makes the guess
Deep pipelines suffer from diminishing returns
The deeper the pipeline, the further into the future you must predict, the more likely you'll be wrong, and the greater the mispredict penalty.
cmp a, 7 ; a > 7 ?
ble L1
mov c, b ; b = c
br L2
L1: mov d, b ; b = d
L2: ...
Simplified with a predicated instruction:
Predicated instruction: executes as normal, but only commits its result if the condition is true
cmp a, 7 ; a > 7 ?
mov c, b ; b = c
cmovle d, b ; if le, then b = d
Always doing the first mov then overwriting it if necessary.
If the blocks of code in the if and else cases were longer, using predication would mean executing more instructions than using a branch
Find a couple of other instructions from further down in the program to fill the bubbles caused by branches and long-latency instructions in the pipeline(s)
Two ways to do that:
1. Reorder in hardware at runtime (out-of-order, OOO, execution): the dispatch logic must be enhanced to look at groups of instructions and dispatch them out of order.
Register renaming:
A larger set of real registers lets the hardware extract even more parallelism out of the code
2. The compiler optimizes the code by rearranging the instructions, called static, or compile-time, instruction scheduling.
Without OOO hardware, the pipeline will stall when the compiler fails to predict something like a cache miss
brainiac vs speed-demon debate:
Whether the costly out-of-order logic is really warranted, or whether compilers can do well enough without it
Increasing the clock speed of a processor will typically increase its power usage even more
Power increases linearly with clock frequency and as the square of the supply voltage (dynamic power ≈ C·V²·f); since higher clock speeds usually also require higher voltage, power grows faster than clock speed
Power wall: it is not possible to supply and dissipate that much power on a silicon chip in any practical fashion
ILP wall: normal programs don't have a lot of fine-grained parallelism
Problem: the complex and messy x86 instruction set.
Solution: dynamically decode the x86 instructions into RISC-like micro-instructions (μops), which are then executed by a RISC-style register-renaming OOO superscalar core.
Improvement:
Pipeline depth
If additional independent instructions aren't available, there is another potential source of independent instructions – other running programs, or other threads within the same program
-> To fill those empty bubbles in the pipelines
Simultaneous multi-threading (SMT): one physical processor core to present two or more logical processors to the system
Should not be confused with multi-processor or multi-core processor, but there's nothing preventing a multi-core implementation where each core is an SMT design.
The Pentium 4 was the first processor to use SMT, which Intel calls hyper-threading.
The complex multiple-issue dispatch logic scales up as roughly the square of the issue width (n candidate instructions compared against every other candidate)
For applications with lots of active but memory-latency-limited threads, a larger number of simpler cores would be better
For most applications, there simply are not enough threads active, and the performance of just a single thread is much more important
Rather than looking for ways to execute groups of instructions in parallel, SIMD makes one instruction apply to a group of data values in parallel.
More often, it's called vector processing.
e.g. a packed addition of four 8-bit values is the same operation as a 32-bit addition, except that the carry at every 8-bit boundary is not propagated.
It is possible to define entirely new registers -> more data to be processed in parallel
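A sketch of the idea using x86 SSE2 intrinsics (assuming an x86 target with SSE2); _mm_add_epi8 adds sixteen 8-bit values at once, and carries do not cross the byte boundaries. The function name add_bytes is illustrative:

#include <emmintrin.h>   /* SSE2 intrinsics */

void add_bytes(unsigned char *dst, const unsigned char *a,
               const unsigned char *b, int n)
{
    int i = 0;
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* one instruction operates on 16 data values in parallel */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi8(va, vb));
    }
    for (; i < n; i++)           /* scalar tail for the leftover elements */
        dst[i] = a[i] + b[i];
}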
Loads tend to occur near the beginning of code sequences (basic blocks), with most of the other instructions depending on the data being loaded -> hard to achieve ILP
The facts of nature hinder building a fast memory system
Memory wall: the growing gap between processor speed and main-memory speed
A cache is a small but fast type of memory located on or near the processor chip, used to solve the problem of the memory wall
Memory hierarchy: The combination of the on-chip caches, off-chip external cache and main memory… (lower level -> larger but slower)
Caches achieve amazing hit rates because most programs exhibit locality
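A small C sketch of locality: both functions sum the same array, but the row-major walk exploits spatial locality (consecutive accesses fall in the same cache line) while the column-major walk strides across lines and misses far more often; the array size is arbitrary:

#define N 1024
double a[N][N];

double sum_rows(void)                    /* good spatial locality */
{
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];                /* consecutive addresses, mostly cache hits */
    return s;
}

double sum_cols(void)                    /* poor locality */
{
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];                /* stride of N*8 bytes, new cache line almost every access */
    return s;
}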
A cache works like a two-column table: the memory address (tag) and the data from that address
Cache Lookup
A cache usually only allows data from any particular address in memory to occupy one, or at most a handful, of locations within the cache
Cache conflict: two memory locations that map to the same cache location are wanted at the same time
Thrashing: the program repeatedly accesses two memory locations that happen to map to the same cache line, so the cache must keep evicting and reloading them from main memory
Associativity: The number of places a piece of data can be stored in a cache
Map
The transfer rate of a memory system is called its bandwidth