
Parallel Computing & Algorithms



"xxx."
-- xxx (yyyy-yyyy)

Goal

Multicore design is one of the defining features of modern computer systems. You can easily find at least 4 cores in the commercial CPUs of desktop computers, laptops, and even smartphones (for example, the Apple A16 has a 6-core CPU). However, programming beginners rarely have the opportunity to exploit the performance benefits of multicore hardware, let alone gain a deep understanding of contemporary computer architecture. In recent years, Artificial Intelligence (AI) has become a prominent field that relies heavily on graphics processing units (GPUs) or accelerators for machine learning (ML). For example, the NVIDIA RTX 4090 graphics card has 16384 CUDA cores, so utilizing such an immense number of computational units has become a crucial challenge. This course shows you how to use these computing units (CPUs and GPUs) to unleash their computational performance. We will cover the following topics in this course:

  1. Modern computer architecture and job scheduling in operating systems
  2. Multiprocessing & multithreading programming in Java, C#, and Python
  3. Parallel algorithms
  4. CUDA in C++

Multicore central processing units (CPUs) are everywhere today, in desktop computers, laptops, and even smartphones. Core counts start at 4 and can reach as high as 96 (for example, the AMD EPYC 9654), yet programming beginners rarely have the opportunity to exploit the performance of multicore hardware, let alone gain a deep understanding of contemporary computer architecture. In recent years, Artificial Intelligence (AI) has become a prominent field that relies heavily on graphics processing units (GPUs) or accelerators for machine learning (ML); for example, the NVIDIA RTX 4090 graphics card has 16384 CUDA cores, and how to use such an immense number of computing units has become an important topic. This course introduces parallel programming and parallel algorithms; students are expected to learn the following:

  1. Modern hardware architecture and operating systems (using Linux as the example)
  2. Multiprocessing and multithreading: architecture and implementation (in Java, C#, and Python)
  3. Parallel algorithms
  4. CUDA (in C++)

Instructor Information

Textbook

TBA

Programming Languages

  • You can use any of the programming languages mentioned in this course: C, C++, C#, Java, and Python.
  • You can find installation instructions here.

Syllabus

Background Knowledge

Types of Parallelism

  • Types of tasks
    • CPU-bound tasks
    • I/O-bound tasks
      • For example, database operations and web crawling.
  • Parallelism in hardware
    • Parallelism in a uniprocessor
    • SIMD instructions, vector processors, GPUs
    • Multiprocessors
      • Symmetric shared-memory multiprocessors
      • Distributed-memory multiprocessors
      • Chip-multiprocessors a.k.a. multicores
    • Multicomputers a.k.a. clusters
  • Parallelism in software
    • Instruction-level parallelism
    • Task-level parallelism
    • Data-level parallelism
    • Transaction-level parallelism
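To make the software-level categories concrete, here is a minimal Python sketch (all names are illustrative) contrasting data-level and task-level parallelism with `concurrent.futures`. A thread pool is used only to show the structure; for CPU-bound work in Python, a `ProcessPoolExecutor` avoids the GIL serializing the workers:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Data-level parallelism: the same operation applied to many data items.
    return x * x

def word_count(text):
    # Task-level parallelism: independent tasks that can run side by side.
    return len(text.split())

def demo():
    # A thread pool shows the structure; for CPU-bound work in Python,
    # prefer ProcessPoolExecutor so the GIL does not serialize the workers.
    with ThreadPoolExecutor(max_workers=4) as pool:
        squares = list(pool.map(square, range(8)))      # map over data
        counts = [pool.submit(word_count, t).result()   # submit tasks
                  for t in ("hello world", "a b c")]
    return squares, counts

print(demo())  # ([0, 1, 4, 9, 16, 25, 36, 49], [2, 3])
```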

Preliminary Knowledge about Hardware

Preliminary Knowledge about Software: Operating System

  • Program vs. Process


    https://techdifferences.com/difference-between-program-and-process.html

  • Job scheduling

  • Thread


  • Parallelism and concurrency

    • Write to database in parallel or concurrently?
      • See atomicity of the ACID properties of database transactions.
  • Race condition

    int x = 0;
    x++; // thread-safe?
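A runnable Python version of the same race, with a lock as the fix (a minimal sketch; `N` and the worker function are illustrative). Without the lock, two threads interleave the read-modify-write steps of `counter += 1` and updates can be lost:

```python
import threading

N = 100_000
counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(N):
        with lock:        # remove this lock and the final count may fall short
            counter += 1  # read-modify-write: not atomic without the lock

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000: with the lock, no update is lost
```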
  • Deadlock

  • Amdahl's Law

    • Let $\alpha\%$ be the parallelizable portion of an algorithm (in percent) and $p$ be the number of processors (or computing units of any type).
    • For convenience, let the time cost before parallelizing the algorithm be $T = 100$.
    • Then the time cost of the parallel version is $T' = \frac{\alpha}{p} + (100 - \alpha)$.
    • Amdahl's law states that the asymptotic limit of the speedup $T / T'$ is $\lim_{p \to \infty} \frac{T}{T'} = \frac{100}{100 - \alpha} < \infty$.
    • This implies that no matter how many processors you add, the execution time cannot drop below the non-parallelizable portion $100 - \alpha$.
    • In conclusion,
      • Parallelism reduces execution time until the marginal improvement degrades as the non-parallelizable portion gradually dominates: an example of the law of diminishing returns.
      • Instead of buying more machines, it can be better to hire programmers who are good at parallel algorithms.
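Amdahl's formula is easy to check numerically; a small sketch (the function name is ours):

```python
def speedup(alpha, p):
    """Amdahl's law: alpha = parallelizable portion in percent, p = processor count."""
    return 100 / (alpha / p + (100 - alpha))

# With alpha = 95, the speedup saturates at 100 / (100 - 95) = 20:
for p in (1, 4, 48, 10**6):
    print(p, round(speedup(95, p), 2))  # 1.0, 3.48, 14.33, then approaching 20.0
```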


Parallel Modules/Packages in Programming Languages

Parallel Algorithms

Practical Examples

Monte Carlo Simulation

  • It is an example of an embarrassingly parallel (also called ideally parallel) algorithm.

need quantitative benchmark
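As a baseline for such a benchmark, here is a minimal Monte Carlo estimate of π with a process pool (function names and sample counts are illustrative). Each worker samples independently, which is exactly what makes the problem embarrassingly parallel:

```python
import random
from multiprocessing import Pool

def count_hits(n):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random()  # each process gets its own generator
    hits = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(total=1_000_000, workers=4):
    chunk = total // workers
    with Pool(workers) as pool:  # separate processes, so no GIL contention
        hits = sum(pool.map(count_hits, [chunk] * workers))
    return 4 * hits / (chunk * workers)

if __name__ == "__main__":  # guard required by multiprocessing on some platforms
    print(estimate_pi())    # roughly 3.14, improving with more samples
```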

Compression & Decompression

  • Scenarios
    • Case 1: Decompressing 266 zip files of TX futures data. The total size of all CSV files is 5.6 GB.
    • Case 2: Decompressing 254 zip files for all stocks in the Taiwan equity market. The total size of all CSV files is 11 GB.
    • Case 3: Compressing nnnn CSV files for all stocks in the Taiwan equity market.

      | Scenario | Sequential (1 thread) | Parallel (48 threads) |
      | -------- | --------------------- | --------------------- |
      | Case 1   | 36.915 s              | 1.539 s               |
      | Case 2   | 1 m 13.223 s          | 3.597 s               |
      | Case 3   | TBA                   | TBA                   |
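A sketch of the parallel decompression (paths and the worker count are illustrative). Each archive is independent, and `zlib` releases the GIL while inflating, so even a thread pool gives a real speedup here:

```python
import zipfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def extract(zip_path, out_root="extracted"):
    """Extract one archive into its own subdirectory; calls are independent."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(Path(out_root) / zip_path.stem)
    return zip_path.name

def extract_all(zip_paths, workers=48):
    # zlib releases the GIL during decompression, so threads run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, zip_paths))

# Example: extract_all(sorted(Path("tx_futures").glob("*.zip")))
```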

Web Crawling

need quantitative benchmark
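A thread-pool crawler sketch to benchmark against a sequential loop (URLs and the worker count are illustrative). Fetching is I/O-bound, so threads overlap the network waits; the `fetch` argument is injectable so the download step can be swapped or mocked:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Download one page; the thread releases the GIL while blocked on I/O."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()

def fetch_all(urls, fetch=fetch, workers=16):
    # One thread per in-flight request, up to `workers`; results keep input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

# Example: pages = fetch_all(["https://example.com/a", "https://example.com/b"])
```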

Complexity Analysis

Remarks

  • Can parallelism reduce the time complexity of a problem?
    • No, unless the number of processors/workers grows in proportion to the data size; with a fixed number of processors, parallelism improves the running time only by a constant factor.
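For intuition, here is a simulated parallel reduction (a sketch; nothing actually runs in parallel). With about n/2 workers available each round, summing n numbers takes O(log n) rounds instead of O(n) sequential steps:

```python
def tree_sum(xs):
    """Pairwise (tree) reduction: each round halves the data, so the number
    of rounds -- the parallel 'span' -- is ceil(log2(n)) for n items."""
    xs = list(xs)
    rounds = 0
    while len(xs) > 1:
        # every pair below could be added simultaneously by a separate worker
        pairs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)]
        if len(xs) % 2:  # odd element carries over to the next round
            pairs.append(xs[-1])
        xs = pairs
        rounds += 1
    return xs[0], rounds

print(tree_sum(range(8)))  # (28, 3): 8 items summed in 3 parallel rounds
```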

Distributed Computing

References

Books

Operating Systems (OS)

Hardware Architectures

Parallel Algorithms

CUDA

Distributed Computing

Courses

YouTube

Game

TODO