contributed by < BigMickey69 >
目標:
第一件事是將程式碼整合成一 .c 檔,再將空格 AAAA 到 EEEE 全數填完後,再補上缺的 c 標頭檔。幾週前的小考因此容易忘記空格答案,執行後有錯就更動,反覆操作直到 debug 完成。
教授幽默人又好,提供的程式碼很貼心的已經附帶每個函式的 testcase ,若跑不過就能知道有問題
Image Not Showing Possible ReasonsLearn More →
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
通過截圖
為什麼要使用 mpi ?
c 內建的預設整數 int 只有32位元,因此只能表示2147483647 ~ –2147483648 的數字,即便用 unsigned 最大也只能表示 4294967295 。
若希望表示更大的整數則可使用 long long ,一共有64個位元可運用,將可表示整數的範圍擴充到 -2^31 to 2^31 – 1,但這樣還是會不夠用,而且如果每個整數都用 long long 來表示,則會造成很多不必要的記憶體空間浪費
mpi 全名 multi-precision integers,自訂的整數表示型別,利用 uint32_t 將記憶體分割成32位元的等分,而每格只使用31位元儲存資訊,需要時擴充、不必要時縮小空間,可有效率且高兼容的做整數運算
不過最大的缺點是速度比上內建整數運算慢了好幾倍…
定義一新 struct 叫 mpi_t,裡面有一 uint32_t 指標來儲存資料、一 size_t 來紀錄大小
(我自己會把 size_t 改用 short int 來進一步節省記憶體,不過這樣或許會慢一些,每次進行操作時處理器得額外花時間將 short int 擴充至32位元)
typedef struct {
uint32_t *data;
size_t capacity;
} mpi_t[1];
寫 [1] 的好處是寫程式時不必在每次傳入 mpi_t 時使用 & 符號,矩陣本身就是個指標,而在創建此物件時編譯器會自動分配一點記憶體過去,省下 malloc 指標的必要,為寫程式的一種小技巧
雖然 data 指標中每格為32位元,但為了避免 overflow 的發生而書改到違法的記憶體,32位元中只會使用31為原來儲存資料
mpi_init 初始化元素,mpi_clear 釋放記憶體。
mpi_enlarge 使用 realloc 將擴充元中 data 指標配置的記憶體量到指定大小,並將新的記憶體設為 0
mpi_compact 從 capacity 由後往前看 data 的每一格,若為 0 則釋放並紀錄新大小
mpi_set_u32 將 uint32_t 的整數存為 mpi_t 元素。如前面所述,因 uint32_t 為32位元而 data 中每格只能儲存 31 位元,因此還是得用 2 格來儲存資料。先存下31位元後再將數字向右移動 >>= 32 位元
mpi_set_u64 與 mpi_set_u32 同理但 data 會使用到 3 格,因 ceil_div(64,31) = 3
mpi_get_u64 將 mpi_t 轉為 uint64_t ,若大小超出 64 位元則無條件捨去掉,每次儲存31位元進 uint64_t 後再往左 bitshift <<= 31 位元繼續儲存
mpi_get_u32 與 64 同理不過會在 32 位元時就無條件捨去
#define INTMAX 0x7fffffff
// 只有 MSB 為1、其餘為0,為 mpi_t 中31位元的一格能表示的最大整數
mpi_add 實現兩 mpi_t 之間的加法運算,capacity 取兩數較大的(假設 n )。從n開始,取數字1與數字2中那個的31位元出來,儲存在 uint32_t 之中,兩數與變數 c 做家法,並紀錄溢出的部份於變數 c ,最後再將答案 AND INTMAX 來確保 MSB 為0,操作至第一格
mpi_sub 與 mpi_add 同理不過改為減法,一樣有變數 c來紀錄 carry
mpi_add_u64 將 mpi_t 元素加上 uint64_t,其實邏輯上與 mpi_add 一樣,只是會用 INTMAX 來 mask 前31位元以外的所有位元(當作是一格 data 操作),一次加法結束後再向右平移31位元,反覆操作
mpi_sub_u32 與 mpi_add_X 雷同,不過改用減法(注意:不支援負數!)
mpi_mul_naive 實現兩 mpi_t 之間的乘法運算,至於為什麼是 naive (天真),是因為邏輯上就跟我們人在紙上用的直式一樣,隨著數字越來越大,計算量會迅速膨脹(時間複雜度:O(n^2))。
mpi_mul_karatsuba 同樣計算兩 mpi_t 之間的乘法,但時間複雜度只有 ~O(n^1.585) 。卡拉楚巴演算法利用divide‑and‑conquer(分治)的方法,它比傳統的 O(n²)「長乘法」更快,運算複雜度降到約 O(n^1.585)。
Image Not Showing Possible ReasonsLearn More →
- The image was uploaded to a note which you don't have access to
- The note which the image was originally uploaded to has been deleted
非常糟糕的手繪示意圖
下半部 (x0, y0):數字的下 31×m bits
上半部 (x1, y1):數字的上31×m bits
再用兩個 helper 函式協助, mpi_fdiv_r_2exp(…, 31*m) 取餘數,mpi_fdiv_q_2exp(…, 31*m) → 取商
名稱 | 運算 | 含義 |
---|---|---|
z2 | x1 × y1 | 上半部×上半部 |
z0 | x0 × y0 | 下半部×下半部 |
z1 | (x1+x0) × (y1+y0) | 合併後乘積 |
z1 扣掉 z2 和 z0 為中間項
mpi_cmp 比較兩 mpi_t 元素的大小,若 capacity 不相同則很好比較,相同的話再將最左邊的31位元抓出來存進32位元整數再進行比較
mpi_setbit 利用mask主動顛倒指定位置的位元,1U 往左 <<= 指定位元數創建 mask 再用 OR 操作
最後,mpi_gcd 為利用 gcd 中的輾轉相除法與遞迴找出最大公因數
使用遞迴的函式只有 mpi_mul_karatsuba 與 mpi_gcd 兩個,若要mpi_mul_karatsuba 不使用遞迴會非常的複雜,得用好幾個堆疊 stack 去紀錄每個代做操作,要碼拆分要碼合併,紀錄每個要操作的元素與要操作的 data 部份、做 bit shift 並且記著原先的數值等等等
不管是速度上還是記憶體,感覺都不會勝於遞迴的版本,因此這邊專注於 gcd 函式的優化
mpi_gcd2:
void mpi_gcd2(mpi_t rop, const mpi_t op1, const mpi_t op2)
{
if (op2->capacity != 0 && mpi_cmp_u32(op2, 0) == 0) {
mpi_set(rop, op1);
return;
}
mpi_t dividend, divisor, r;
mpi_init(dividend);
mpi_init(divisor);
mpi_init(r);
mpi_set(dividend, op1);
mpi_set(divisor, op2);
while(divisor->capacity != 0 && mpi_cmp_u32(divisor, 0) != 0){
mpi_fdiv_r(r, dividend, divisor);
//printf("%d / %d = eh ... %d\n",
//dividend->data[0], divisor->data[0], r->data[0]);
mpi_set(dividend, divisor);
mpi_set(divisor, r);
//printf("new %d / %d\n", dividend->data[0], divisor->data[0]);
}
mpi_set(rop, dividend);
//printf("final: %d\n", rop->data[0]);
mpi_clear(dividend);
mpi_clear(divisor);
mpi_clear(r);
}
邏輯簡單,divisor 等於0之前不斷的輾轉相除。新增了一新 helper 函式 mpi_fdiv_r() ,邏輯上與 mpi_fdiv_qr() 一樣不過不會設定商(q)
遇到問題:
while(mpi_cmp_u32(divisor, 0) != 0) { ...
mpi_fdiv_r(r, dividend, divisor); 似乎會在 r 為 0時將它設為 NULL ,這樣確實能省下一些記憶體,但就是有點怪,尤其是因為在原 gcd 函式中也有用 mpi_cmp_u32 來與 0 比對
多一部分偵測 capacity 是否為0就解決了
從現在開始我會用英文撰寫共筆
Goal:
下一步:貢獻程式碼到 Linux 核心
Phoronix 報導
Very interesting presenation on SWAR & SIMD
(C++ on Sea conference)
First we must understand what exacty IS SIMD and SWAR
SIMD (Single Instruction Multiple Data),as opposed to SISD (Single Instruction Single Data) is when we process multiple data just with a singular instruction
Simply, SIMD is where you perform the same operation on multiple data in parallel.
Let's take taking the square of elements in an array as an example:
// Serial
void square(int *arr, size_t n) {
for (size_t i = 0; i < n; ++i)
arr[i] = arr[i] * arr[i];
}
// SIMD
#include <immintrin.h>
void simd_square_int(int *arr, size_t n) {
size_t i = 0;
// using AVX2 to process eight 32‑bit ints per __m256i
const size_t stride = 8;
for (; i + stride <= n; i += stride) {
__m256i vec = _mm256_loadu_si256((__m256i*)(input + i));
__m256i squared = _mm256_mullo_epi32(vec, vec);
_mm256_storeu_si256((__m256i*)(output + i), squared);
}
// Scalar fallback for remaining elements
for (; i < n; i++) {
arr[i] = arr[i] * arr[i];
}
}
AVX2:
AVX2 (Advanced Vector Extensions 2) is an x86 CPU instruction‑set extension that provides 256‑bit SIMD support. Supporting gather/scatter memory operations and vector arithmetics, logical, and comparison instructions. Allowing a single instruction to process up to eight 32‑bit values in parallel, reducing instruction count and boosting throughput for data‑parallel tasks.
:::
A very basic concept is using a single-lane as multi-lanes. ex, using 16 bits as four 4 bit lanes, let's say we wanted to check if two sets of four 4 bit intergers/char are equal:
(SWAR example)
A: 0001 | 0010 | 0011 | 0100
B: 0100 | 0010 | 1011 | 0111
A -> 0101 | 0000 | 1000 | 0011
Lanes that are all 0 indicate a match
B -> 1000 | 1000 | 1000 | 1000
A -> 1101 | 1000 | 1000 | 1011
A: 1101 | 1000 | 1000 | 1011
-0001 | 0001 | 0001 | 0001
= 1??? | 0111 | 0??? | 1???
Possible misjudge on lane 3
RESULT: if MSB of said lane is 1, then it's a match
void *memchr(const void *s, int c, size_t n)
{
const unsigned char *p = s;
while (n-- != 0) {
if ((unsigned char)c == *p++) {
return (void *)(p - 1);
}
}
return NULL;
}
This code does the job in a more tradional way, it loops through the array and checks each 4 bits whether it's equal to varaible c past in the function. Though it is clever with how it returns the address of the match.
// SWAR approach
void *memchr_swar(const void *s, int c, size_t n) {
const unsigned char *p = s;
unsigned char uc = (unsigned char)c;
// SWAR processes are slow if not byte aligned, so we'll check the traditional way
while (n > 0 && ((uintptr_t)p % sizeof(size_t)) != 0) {
if (*p == uc)
return (void *)p;
p++;
n--;
}
if (n >= sizeof(size_t)) {
size_t repeated = uc;
repeated |= repeated << 8;
repeated |= repeated << 16;
if(SIZE_MAX == UINT64_MAX)
repeated |= repeated << 32;
// Magic constants!
size_t ones, highs;
if (sizeof(size_t) == 8) {
ones = 0x0101010101010101ULL;
highs = 0x8080808080808080ULL;
} else {
ones = 0x01010101U;
highs = 0x80808080U;
}
while (n >= sizeof(size_t)) {
size_t word = *(const size_t *)p;
// XOR to find match, 0000 if a match
size_t diff = word ^ repeated;
// SWAR zero-detection
if (((diff - ones) & ~diff & highs) != 0) {
// find actual location if there's a match
for (size_t i = 0; i < sizeof(size_t); i++) {
if (p[i] == uc)
return (void *)(p + i);
}
}
p += sizeof(size_t);
n -= sizeof(size_t);
}
}
// Process any remaining bytes.
while (n--) {
if (*p == uc)
return (void *)p;
p++;
}
return NULL;
}
}
We check in 32 bit chunks(compiling in 32-bit) for matching unsigned char (8 bits).
After making sure that bytes are aligned (impacts efficiency greatly if not), we use logic from earlier but streamlined to check whether a match exists within 32 bits.
If there are any remaining bytes after SWAR, we make sure to also check them through the tried and true way.
We'll go through 100,000 iterations, with each pointer's minimum size at 1024 (1KiB), maximum at 102410248 (8MiB), slowly getting larger each iteration.
Randomly fill said arrays using rand(), with each iterations set the target at the end.
e.g. int target = buffer[size - 1]
Finally, we write all the data into .csv form and graph said data using matplotlib:
We can see that memchr_SWAR is a on average much more stable and faster than memchr, but we can't really see a steady increase in time or performance with the increase in buffer size. There may be a bit of noise in these results, though I will point out that both functions are ran in the same environment, so it's still a fair compaison giving us an idea of the performance difference.
size, memchr, memchr_SWAR
...
1795072,0.007167,0.002484
1803264,0.080287,0.017096
1811456,0.079536,0.017788
1819648,0.069977,0.012703
...
Looking at the .csv file manually, you'd find that almost every
record of memchar_SWAR is ~3x to 5x faster than memchr
Average memchr: 0.057593 seconds
Average memchr_SWAR: 0.011527 seconds
補充: 寫完才發現跟 quiz 中的程式碼幾乎一樣…,但確實是自己拼湊 + 查資料出來的結果
The first step of any assignment is too comprehend the code at hand, but too often I forget to log the steps I took to try and understand the code.
Since there's no additional assignments for this quiz, I'll walk through my process through enlightenment:
Env is the environment, storing info regarding the current thread. While ucontext_t contains execution context.
the line
enum { ENV_UNUSED, ENV_RUNNABLE, ENV_WAITING };
declares 3 anonymous enumerations.
My question upon seeing this is "What's the point?, Why not just use macros?"
My guess of why: Declaration can be done in just 1 line, automatic numbering, Scoped declaration
static int curenv;
The variable "curenv" is a global variable that keeps track of what Env is currently running.
static ucontext_t exiter = {0};
This declares a new ucontext_t object while quickly setting numbers to 0, pointers to NULL. (since NULL and 0 are the same thing in assembly)
const struct itimerspec ts = {
{0, 0},
{0, 10000000},
};
It's structure is defined as so:
struct timespec {
time_t tv_sec; // tv stands for "time value"
long tv_nsec;
};
// it stands for "interval timing"
struct itimerspec {
struct timespec it_interval; // interval between firing
struct timespec it_value; // time before first fire
};
So this means that ts doesn't repeat, fires after 10 ms
Stands for memory map. It's a system call that maps files or anonymous memory into a process's virtual address space, giving direct access to processes in user space.
...
// AAAA:
envs[env].status = ENV_RUNNABLE;
...
...
// BBBB
int candidate = (curenv + 1) % NENV;
...
...
// CCCC
if (envs[curenv].state_reentered++ == 0) {
/* Context successfully saved; schedule the next user-level thread to
* run.
*/
coro_schedule();
}
...
...
// DDDD
static void preempt(int signum UNUSED,
siginfo_t *si UNUSED,
void *context UNUSED)
{
coro_yield();
}
...
struct sigaction {
void (*sa_handler)(int);
void (*sa_sigaction)(int, siginfo_t *, void *);
sigset_t sa_mask;
int sa_flags;
void (*sa_restorer)(void);
};
Explained by the prof in the section before the quiz.
CRC or Cyclic Redundancy Check is a common way to check if data is intact, most commonly seen in web/server data applications.
First time seeing it I had no idea what I was looking at, the code and the logic just didn't make any sense in my head. Upon revisiting the concept for week 4's assignment however, I already have a better grasp.
// naive implementaion
uint32_t crc32_naive(uint32_t crc, uint8_t v)
{
crc ^= v;
for (int bit = 0; bit < 8; bit++) {
if (crc & 1)
crc = (crc >> 1) ^ UINT32_C(0x82f63b78);
else
crc = (crc >> 1);
}
return crc;
}
The first thing to keep in mind is that the bits are reversed in this application. There are a few benefits that comes with this:
Let's say we're processing 8 bits of data like the example above, xor happens and bit shifts happen only 8 times. Thus really, only the first 8 bits of CRC actually influenced the result.
This 2^8 = 256, thus we can use a lookup table to negate the need for calculating all together.
static const uint32_t crc32c_table[256] = {
0x00000000L, 0xF26B8303L, 0xE13B70F7L, 0x1350F3F4L,
...
0xBE2DA0A5L, 0x4C4623A6L, 0x5F16D052L, 0xAD7D5351L
};
uint32_t crc32c(uint32_t crc, const uint8_t *data, unsigned int length)
{
while (length--) {
crc = crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
}
return crc ^ 0xffffffff;
}
crc32c does crc on every byte of data, while examples earlier only focused on 1 byte.
Since the bits are reversed, we XOR with 0xFFL at the end to reverse the result to grab the index for the lookup table.
uint32_t crc32_u8(uint32_t crc, uint8_t v)
{
crc ^= v;
static const uint32_t crc32_table[] = {
0x00000000, 0x105ec76f, 0x20bd8ede, 0x30e349b1,
0x417b1dbc, 0x5125dad3, 0x61c69362, 0x7198540d,
0x82f63b78, 0x92a8fc17, 0xa24bb5a6, 0xb21572c9,
0xC38D26C4, 0xD3D3E1AB, 0xE330A81A, 0xF36E6F75
};
crc = (crc >> 4) ^ crc32_table[crc & 0x0F];
crc = (crc >> 4) ^ crc32_table[crc & 0x0F];
return crc;
}
For a half-byte lookup table, we just change the implementation from checking 8 bits at once -> checking 4 bits, twice.
Goal:
Q. How do we find exactly where crc32 is used in the Linux kernal?
A. Straight up download the kernal and search it through!
Steps took:
bigmickey@Linuxer:~$ sudo apt-get install linux-source
bigmickey@Linuxer:~$ cd /usr/src
bigmickey@Linuxer:~$ sudo tar -xf linux-source-6.8.0.tar.bz2
-> When we downloaded "linux-source", it downloaded a compressed file that contains everything. Thus we must use a tool named "tar" to unpack everything
bigmickey@Linuxer:~$ cd linux-source-6.8.0
Finally, the we seach through the files using the command grep (global regular expression print) for "crc32"
bigmickey@Linuxer:/usr/src/linux-source-6.8.0$ grep -Rn "crc32" kernel/
Output screen(partial):
With this, we're able to find the exact files and places where crc32 is used.
(The name playstation caught my attention)
What is it?:
So how is CRC32 used?
CRC32 is used mainly to verify data integrity
of Bluetooth for the controller
struct dualshock4_input_report_bt {
uint8_t report_id; /* 0x11 */
uint8_t reserved[2];
struct dualshock4_input_report_common common;
uint8_t num_touch_reports;
struct dualshock4_touch_report touch_reports[4]; /* BT has 4 compared to 3 for USB */
uint8_t reserved2[2];
__le32 crc32;
} __packed;
Many structures like this one ARE the structures of data that's transmitted wirelessly through bluetooth, each one containing CRC32. The function ps_check_crc32() checks if CRC is correct, if not the input is ignored. (return -EILSEQ)
What is it?:
CRC32 is used to verify data integrity, detecting corruption in Fastmap data when it is read from flash.
Since flash devices (e.g. 隨身碟) are prone to bit flips, partial writes, or wear and tear corruption
...
crc = crc32(UBI_CRC32_INIT, ubi->fm_buf, fm_size);
if (crc != tmp_crc) {
ubi_err(ubi, "fastmap data CRC is invalid");
ubi_err(ubi, "CRC should be: 0x%x, calc: 0x%x",
tmp_crc, crc);
ret = UBI_BAD_FASTMAP;
goto free_hdr;
}
...
What is it?
Q. If CRC32 is used on litterally every block, wouldn't that takle too much computing and storage?
A. Yes, for the sake of striking a balance with efficiency, only metadata blocks are checked.
// One of the functions
int ocfs2_block_check_validate(void *data, size_t blocksize,
struct ocfs2_block_check *bc,
struct ocfs2_blockcheck_stats *stats)
{
int rc = 0;
u32 bc_crc32e;
u16 bc_ecc;
u32 crc, ecc;
ocfs2_blockcheck_inc_check(stats);
bc_crc32e = le32_to_cpu(bc->bc_crc32e);
bc_ecc = le16_to_cpu(bc->bc_ecc);
memset(bc, 0, sizeof(struct ocfs2_block_check));
/* Fast path - if the crc32 validates, we're good to go */
crc = crc32_le(~0, data, blocksize);
if (crc == bc_crc32e)
goto out;
ocfs2_blockcheck_inc_failure(stats);
mlog(ML_ERROR,
"CRC32 failed: stored: 0x%x, computed 0x%x. Applying ECC.\n",
(unsigned int)bc_crc32e, (unsigned int)crc);
/* Ok, try ECC fixups */
ecc = ocfs2_hamming_encode_block(data, blocksize);
ocfs2_hamming_fix_block(data, blocksize, ecc ^ bc_ecc);
/* And check the crc32 again */
crc = crc32_le(~0, data, blocksize);
if (crc == bc_crc32e) {
ocfs2_blockcheck_inc_recover(stats);
goto out;
}
mlog(ML_ERROR, "Fixed CRC32 failed: stored: 0x%x, computed 0x%x\n",
(unsigned int)bc_crc32e, (unsigned int)crc);
rc = -EIO;
out:
bc->bc_crc32e = cpu_to_le32(bc_crc32e);
bc->bc_ecc = cpu_to_le16(bc_ecc);
return rc;
}
ECC -> Error-Correcting Code.
Simplified steps for the function:
for (unsigned int j = 0; j < 8; j++)
crc = (crc >> 1) ^ (-(int)(crc & 1) & Polynomial);
By smartly utilizing bit ops, we're able to negate the need of an "if" branch OR "tenary" operators.
Logic:
This is a very fast way to calculate CRC without using lookup tables
Before getting into data, I want to break down one of the algorithms provided that caught my eye:
const uint32_t Polynomial = 0xEDB88320;
// -----------------------------------------------
// standard initialization
// approx. 1.5 nanoseconds on my PC / GCC9-x64
for (unsigned int i = 0; i <= 0xFF; i++)
{
uint32_t crc = i;
for (unsigned int j = 0; j < 8; j++)
crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
Crc32Lookup[0][i] = crc;
}
// -----------------------------------------------
// code by a user named "Wolf"
// approx. 0.22 nanoseconds on my PC / GCC9-x64
Crc32Lookup[0][0] = 0;
// start with powers of two
uint32_t crc = Crc32Lookup[0][0x80] = Polynomial;
for (unsigned int step = 0x40; step != 0; step >>= 1)
{
// each power of two has only a single bit set
// so we need just one iteration
crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
Crc32Lookup[0][step] = crc;
// and compute all not-yet-processed multiples
for (unsigned int i = step * 3; i <= 0xFF; i += step << 1)
Crc32Lookup[0][i] = crc ^ Crc32Lookup[0][i - step];
// note: step*2 was already filled, therefore start at step*3
// and compute all odd multiples
}
// -----------------------------------------------
// my improvement of Wolf's code
// approx. 0.18 nanoseconds on my PC / GCC9-x64
Crc32Lookup[0][0] = 0;
// compute each power of two (all numbers with exactly one bit set)
uint32_t crc = Crc32Lookup[0][0x80] = Polynomial;
for (unsigned int next = 0x40; next != 0; next >>= 1)
{
crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
Crc32Lookup[0][next] = crc;
}
// the main idea is:
// table[a ^ b] = table[a] ^ table[b];
// therefore if we know all values up to a power of two called x
// we can compute table[x + y] where x > y:
// table[x + y] = table[x ^ y] = table[x] ^ table[y]
// example:
// the previous loop computed table[1] and table[2],
// so that x = 2 (a power of two) and y = 1 (fulfilling x > y)
// now x + y = x ^ y = 2 + 1 = 2 ^ 1
// and table[x + y] = table[x ^ y] = table[x] ^ table[y]
// which means table[3] = table[2] ^ table[1]
// compute all values between two powers of two
// i.e. 3, 5,6,7, 9,10,11,12,13,14,15, 17,...
for (unsigned int powerOfTwo = 2; powerOfTwo <= 0x80; powerOfTwo <<= 1)
{
uint32_t crcExtraBit = Crc32Lookup[0][powerOfTwo];
for (unsigned int i = 1; i < powerOfTwo; i++)
Crc32Lookup[0][i + powerOfTwo] = Crc32Lookup[0][i] ^ crcExtraBit;
}
Breaking it down step by step:
// Normal way:
for (unsigned int i = 0; i <= 0xFF; i++) {
uint32_t crc = i;
for (unsigned int j = 0; j < 8; j++)
crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
Crc32Lookup[0][i] = crc;
}
// Wolf's way
Crc32Lookup[0][0] = 0;
uint32_t crc = Crc32Lookup[0][0x80] = Polynomial;
for (unsigned int step = 0x40; step != 0; step >>= 1) {
crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
Crc32Lookup[0][step] = crc;
for (unsigned int i = step * 3; i <= 0xFF; i += step << 1)
Crc32Lookup[0][i] = crc ^ Crc32Lookup[0][i - step];
}
By knowing CRCs values of some bit set, you're able to build the CRC of other values by mixing and matching sets.
- e.g. CRC[7] = CRC[4 ^ 2 ^ 1] = CRC[4] ^ CRC[2] ^ CRC[1]
!!!NOTE!!! : Mixing and matching only works with powers of 2!
Why? Because powers of 2 only have 1 bit set, thus we're able xor them together to create any number from 1 to 255 by "turning on" those bit positions.
step = 0x40(64) -> i = 192
step = 0x20(32) -> i = 96, 160, 224
step = 0x10(16) -> i = 48, 80, 112, 144, 176, 208, 240
…
eventaully filling out every slot!
Experiment:
I chose 4 functions to test and compare. The Original, Standard-table, Improved-branchless and Slicing-by-4
We run each function 5 times in varying sizes, from 1 Byte, 1 KB, 1 MB, 8 MB, 32 MB.
Very simple function to get the time difference:
double time_diff_in_sec(struct timespec start, struct timespec end) {
return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}
Results:
The original implementation is by far the slowest, though it may not seem very obvious due to how spread out the graph is. Since the orignial has to calculate every bit then shift through, it hogs too much time resulting in slow speeds. Branch-free gets rid of the if statement and used (int) type casting for clever gains. But these are results that are expected.
What's far more interseting is the performance difference between Standard table v.s. Slicing-by-4
// Standard-table generation (optimized)
void init_crc32_table() {
for (int i = 0; i < 256; i++) {
uint32_t crc = i;
for (int j = 0; j < 8; j++) {
if (crc & 1)
crc = (crc >> 1) ^ POLYNOMIAL;
else
crc = crc >> 1;
}
crc32_table[i] = crc;
}
}
// Slicing-4-table generation (optimized)
void init_crc32_slicing4_table() {
// First row: same as standard table.
for (int i = 0; i < 256; i++) {
crc32_slicing4_table[0][i] = crc32_table[i];
}
// Build rows 1..3.
for (int row = 1; row < 4; row++) {
for (int i = 0; i < 256; i++) {
uint32_t crc = crc32_slicing4_table[row - 1][i];
crc32_slicing4_table[row][i] = (crc >> 8) ^ crc32_table[crc & 0xFF];
}
}
}
Standard table is a table of size 256, while slicing4_table has of a whopping size of 4 * 256
While a bigger table eats up more memory, it also means less calculations are needed to get results. 4 bytes are processed simultaneously
Still, it pales in comparison to hardware accelerated CRC32 calculating. (Basically using lookup tables but using just hardware, achieving results in just 1 or 2 instructions.
How is hardware-accelerated CRC32 done on ARM64?
The circuits for CRC calculation are already built on Arm64 cpus, it's implemented uses a special shift register called LFSR (Linear Feedback Shift Register).
🧱 Circuit components:
Shift Register: Holds the current CRC value.
XOR Gates: Used to "divide" the data polynomial.
Couple wires.
So with dediated hardware, despite manually calculating CRC values, it is still MUCH faster than lookup tables in software.
Is it possible to use lookup tables with hardware?
Yes, but that's a bit hard to test, especially without a FPGA board. Maybe it's something I can return to in the future.
A basic Synthesizer in C.
code with blanks filled in
First, understanding each structure is in order:
typedef struct {
/* Holds node-specific state (e.g., phase or envelope state).
This is reset when a note is triggered (gate event). */
int32_t state;
q15_t *gain; /* Pointer to the gain input value. */
q15_t output; /* Output value of the node. */
synth_node_type_t type;
uint8_t param1; /* Optional: potentially used for controlling gain range. */
union {
synth_oscillator_t osc;
synth_envelope_t env;
synth_filter_t filter;
synth_mixer_t mixer;
};
} synth_node_t;
q15_t & q7_t:
q15_t synth_process()
{
int32_t main_output = 0;
for (int vi = 0; vi < SYNTH_VOICES; vi++) {
synth_voice_t *voice = &synth_voices[vi];
synth_node_t *nodes = voice->nodes;
/* First iteration: compute each node's output based on its current
* state. Temporarily store these calculated outputs before updating the
* node state.
*/
int32_t outputs[SYNTH_NODES];
for (int i = 0; i < SYNTH_NODES && nodes[i].type != SYNTH_NODE_NONE;
i++) {
synth_node_t *node = &nodes[i];
q15_t tmp;
switch (node->type) {
case SYNTH_NODE_OSCILLATOR:
tmp = node->state & 0x7FFF; /* Limit to positive q15 range */
outputs[i] = node->osc.wavegen(tmp);
break;
case SYNTH_NODE_ENVELOPE:
outputs[i] = (node->state & 0x7FFFFF) >> 4;
/* Apply squaring to the envelope */
outputs[i] = (outputs[i] * outputs[i]) >> 15;
/* If the sustain value is negative, invert the envelope output
*/
if (node->env.sustain < 0)
outputs[i] = -outputs[i];
break;
case SYNTH_NODE_FILTER_LP:
/* Compute low-pass filter output by scaling the accumulator
* with a filter factor. The accumulator holds the previous
* state, and this equation applies the filter's smoothing
* effect.
*/
outputs[i] = (node->filter.accum * node->filter.factor) >> 15;
break;
case SYNTH_NODE_FILTER_HP:
/* Compute as with low-pass filter then subtract from the input
* signal.
*/
outputs[i] = (node->filter.accum * node->filter.factor) >> 15;
outputs[i] = *node->filter.input - outputs[i];
break;
case SYNTH_NODE_MIXER: {
int32_t sum = 0;
for (int j = 0; j < 3; j++) {
if (node->mixer.inputs[j])
sum += *node->mixer.inputs[j];
}
outputs[i] = sum;
break;
}
default:
break;
}
if (node->gain) {
/* For now, apply a simple linear gain.
* Positive gain amplifies while negative gain attenuates the
* signal, ensuring compatibility with envelope generators.
*/
outputs[i] = (outputs[i] * *node->gain) >> 15;
}
}
/* Second iteration: update each node's state based on the computed
* outputs.
*/
for (int i = 0; i < SYNTH_NODES && nodes[i].type != SYNTH_NODE_NONE;
i++) {
synth_node_t *node = &nodes[i];
node->output = outputs[i];
switch (node->type) {
case SYNTH_NODE_OSCILLATOR:
/* Increment the phase by the base value and adjust for any
* detuning.
*/
node->state += *node->osc.phase_incr;
if (node->osc.detune)
node->state += *node->osc.detune;
node->state &= 0x7ffff; /* wrap around q15 */
break;
case SYNTH_NODE_ENVELOPE:
/* The highest bit in the state indicates decay mode
* (post-attack).
*/
if (voice->gate) {
/* When the gate is active, process attack followed by decay
*/
int mode_bit = node->state & 0x80000000;
int32_t value = node->state & 0x7fffffff;
if (mode_bit) {
/* Decay phase: decrement the state by the decay rate */
value -= node->env.decay;
const q15_t sustainAbs = node->env.sustain < 0
? -node->env.sustain
: node->env.sustain;
if (value < sustainAbs << 4)
value = sustainAbs << 4;
} else {
/* Attack phase: increment the state by the attack rate
*/
value += node->env.attack;
if (value > (int32_t) Q15_MAX << 4) {
value = (int32_t) Q15_MAX << 4;
/* Switch to decay mode once the peak is reached */
mode_bit = 0x80000000;
}
}
node->state = value | mode_bit;
} else {
/* When the gate is inactive, perform the release phase */
/* Clear the decay mode bit so that the next gate triggers a
* fresh attack.
*/
node->state &= 0x7fffffff;
node->state -= node->env.release;
if (node->state < 0)
node->state = 0;
}
break;
case SYNTH_NODE_FILTER_LP:
/* Update the accumulator for the low-pass filter based on the
* input difference.
*/
node->filter.accum += (*node->filter.input - node->output);
break;
case SYNTH_NODE_FILTER_HP:
/* Update the accumulator for the high-pass filter in a similar
* fashion.
*/
node->filter.accum += (*node->filter.input - node->output);
break;
case SYNTH_NODE_MIXER:
/* Mixer nodes do not require state updates in this pass */
break;
default:
break;
}
}
/* Accumulate the output from the primary node of the voice into the
* main output.
*/
main_output += voice->nodes[0].output;
}
const q15_t main_mixer_gain = Q15_MAX / SYNTH_VOICES;
return (main_output * main_mixer_gain) >> 15;
}
It's the heart of this synthesizer in c. It creates an audio sample running through every voice and every node.
For our song "Twinkle twinkle little stars", there're only 2 voices, defined by SYNTH_VOICES.
What's done each pass?
Synthesize Oscillator:
Synthesize Envelope:
Low‑Pass Filter: (smooths rapid changes)
Oscillator's phase is increased by its base pitch (phase_incr) plus any detune, then wrap (& 0x7FFF) so it loops back to zero at the end of a cycle.
Envelope:
Both Filters update the accumulator
Goals:
q15 and q7 are the formats for fixed point in this project. Just like mentioned above they represent 1.0 ~ -1.0
q15 has a resolution of 1/32768 or 3.05×10^−5
while 14 is only 1/128 or 7.8×10^−3
Why not use floats?
Interger operations are much quicker to do, bit shifts and such are also disallowed on floats. Also floats have a hard time representing integer numbers
But this also means that the smallest number that can be displayed IS the resolution of the fixed point structure. Factors like overflowing and underflowing must also be kept in mind
I didn't like how notes sounded, personally I prefer a crisper notes, kinda like a piano.
There are 2 voices
Voice 0:
synth_init_envelope_node(&voice->nodes[1], NULL, /* gain */
300, /* attack */
10, /* decay */
Q15_MAX * 0.2, /* sustain */
50 /* release */
);
...
synth_init_filter_lp_node(&voice->nodes[0], NULL, /* gain */
&voice->nodes[3].output, /* input */
10000 /* factor */
);
Voice 1:
synth_init_envelope_node(&voice->nodes[1], NULL, /* gain */
1000, /* attack */
200, /* decay */
Q15_MAX * 0.4, /* sustain */
15 /* release */
);
...
synth_init_filter_lp_node(&voice->nodes[0], NULL, /* gain */
&voice->nodes[2].output, /* input */
8000 /* factor */
);
For voice 0, I bumped the attack up and made the decay and release very quick. Voice 1's attack was bumped WAY up with an even smaller release.
Low pass filter for voice 0 was also increase to 10000.
...
if (note) {
/* Voice 0 plays the note as given, and Voice 1 plays two
* octaves lower.
*/
synth_voice_note_on(&synth_voices[0], note);
synth_voice_note_on(&synth_voices[1], note - 12);
}
...
Finally, instead of Voice 1 playing two octaves lower, it's only one.
const uint8_t twinkleTwinkleBeats[] = {
4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5,
4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5,
};
The length of notes was also tweaked a little bit, the original had this awkward pause between each line. This version remedies this.
Goal:
This is a simple chat program using socket and the "select()"" system call.
The code for this chat system is very self explanatory, so I'll just focus on parts I didn't initially understand.
struct sockaddr_in sin = {
.sin_family = AF_INET,
.sin_port = htons(port),
.sin_addr = {.s_addr = htonl(INADDR_ANY)},
};
This is the stucture storing an instance of a socket. SIN stands for Socket-INternet.
fd_set rfds;
FD_ZERO(&rfds);
FD_SET(server_fd, &rfds);
int max_fd = server_fd + 1;
while (select(max_fd, &rfds, NULL, NULL, NULL) >= 0) {
/* Check if the server socket has become readable, which indicates an
* incoming connection */
if (FD_ISSET(server_fd, &rfds)) {
...
fd_set is a data type defined in <sys/select.h>, it represents a set of file descriptors. Which select() can then use to check if any are ready for I/O operations
Takes in 5 parameters: nfds, readfds, writefds, exceptfds, timeout
nfds:
readfds:
writefds:
exceptfds:
timeout:
At NULL select() will wait indefinitely until a descriptor becomes ready
Behavior:
After returning, fd_set is updated too
When to use select()?
When handling multiple Connections, so instead of dedicating a thread or process, we can just use select(). This is especially useful for simple servers or applications where multithreading might add unnecessary complexity.
Efficiency:
select() has limited scalability, but it works well for small to moderate numbers of connections and is widely supported across different operating systems
Looking at the tic-tac-toe program example, we can see an interesting output:
Despite just being a part of just a terminal, there's a background and colored characters, effectively simulating UI elements.
It is indeed magical! But it isn't magic.
'\033' // Escape character
This character tells the following characters are commands and not just plain text, this happens until reaching the character 'm'.
'[' // Control Sequence Introducer
This character then indicates that a Control Sequence command is coming
So for example:
"[47" sets the background to white,
"[31" sets the foreground text to red.
"[0" will reset the formatting si subsequent text aren't affected.
30: Black
31: Red
32: Green
33: Yellow
34: Blue
35: Magenta
36: Cyan
37: White
printf("\033[47m\033[31mX\033[0m");
So all together a line like this will print out:
(A red X on a white background)
$ gcc chat.c -o chat -Wall -Wextra
Adding -Wall and -Wextra when compiling enables some common checks that check for potential issues that may be present in your code. Alerting you about potential issues.
$ ./chat 8080 // choose your own port
! ! Choose a port that isn't being used by other applications on the same network ! !
Opening up another terminal and connecting with:
nc localhost 8080
Since both terminals are running on litterally the same device, localhost will get the correct IP.
Goal:
Each client should have a distict color on a white background.
int client_colors[FD_SETSIZE] = {0};
int user_cnt = 0;
// Using an array to keep track of each client's color code
...
conns[new_fd] = 1;
// assign a color to this user
client_colors[new_fd] = 30 + user_cnt++ % 7;
After recieving a valid connection, a color is assigned to the user.
...
else if (nread > 0) {
// set the last character to '\0' instead of '\n'
buf[nread-1] = '\0';
char colored_msg[1024 + 20]; // extra space for escape characters
// snprintf returns the len of new string, negating the need for strlen() later on
int colored_len = snprintf(colored_msg, sizeof(colored_msg), "\033[47m\033[%dm%.*s\033[0m", client_colors[fd], nread, buf);
printf("[%d] read: %.*s\n", fd, colored_len, colored_msg);
/* loop over all our connections, and send stuff onto them!*/
for (int dest_fd = 0; dest_fd < FD_SETSIZE; dest_fd++) {
/* take active connections, but not ourselves */
if (conns[dest_fd] && dest_fd != fd) {
/* write to them */
if (write(dest_fd, colored_msg, colored_len) < 0) {
/* disconnect if it fails; they might have
* legitimately gone away without telling us */
fprintf(stderr, "write(%d): %s\n", dest_fd,
strerror(errno));
close(dest_fd);
conns[dest_fd] = 0;
}
}
}
...
Connection accross systems:
Windows ⬇️:
Ubuntu ⬇️:
Testing different connection colors:
File chat.c remains largely unchanged, most new features are added by new files "login.c" and "login.h"
if (new_fd < 0) {
perror("accept");
} else {
printf("[%d] connect from %s:%d\n", new_fd,
inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
// Welcome message
client_printer(new_fd);
// Show home menu
int user_index = client_Homemenu(new_fd);
if (user_index == -1) {
close(new_fd);
continue;
}
int onoff = 1;
if (ioctl(new_fd, FIONBIO, &onoff) < 0) {
printf("fcntl(%d): %s\n", new_fd, strerror(errno));
close(new_fd);
continue;
}
conns[new_fd] = 1;
// [CHANGE]: Color is now determined by the user’s account index.
client_colors[new_fd] = 30 + (user_index % 7);
}
client_printer sends the client a welcome message, client_Homemenu shows basically the homescreen.
#ifndef LOGIN_H
...
#endif
This "if" definition makes sure that inclusions of a header can only happen ONCE
The logic of this user system and "database"(it's quite a stretch to call this a database) is quite simple. I tried to keep it simple as the scale of this chatroom is quite small, it only needs to do it's job.
So fancy features like encryption for passwords on database, or hashmaps. Especially no red-black trees or B+ trees or sql. BUT I will do it if the goal was to implement a login chatroom that supports tens of thousands of users.
static void client_send(int client_fd, const char *msg) {
write(client_fd, msg, strlen(msg));
}
static int client_readline(int client_fd, char *buffer, int maxlen) {
int total = 0;
while (total < maxlen - 1) {
char c;
int n = read(client_fd, &c, 1);
if (n <= 0) {
return n; // error or disconnect
}
if (c == '\n')
break;
buffer[total++] = c;
}
buffer[total] = '\0';
return total;
}
client_send handles sending messenges to the client, while client_readline reads their input.
client_readline reads characters one by one, while may be a bit slow, it does prevent injections from overflowing strings.
int size = 0;
int db_max = 10;
char** user_db;
char** pswd_db;
void init_db(){
user_db = (char**)malloc(sizeof(char*) * db_max);
pswd_db = (char**)malloc(sizeof(char*) * db_max);
for (int i = 0; i < db_max; i++){
user_db[i] = (char*)malloc(MAX_USERNAME_LENGTH);
pswd_db[i] = (char*)malloc(MAX_PASSWORD_LENGTH);
}
}
void expand_db(){
int new_size = db_max * 2;
user_db = (char**)realloc(user_db, sizeof(char*) * new_size);
pswd_db = (char**)realloc(pswd_db, sizeof(char*) * new_size); // [CHANGE]: fixed reallocation for pswd_db
for (int i = db_max; i < new_size; i++){
user_db[i] = (char*)malloc(MAX_USERNAME_LENGTH);
pswd_db[i] = (char*)malloc(MAX_PASSWORD_LENGTH);
}
db_max = new_size;
}
The quote on quote "database" is just 2 arrays of size db_max. If there are ever too many users, the size is then doubled to accommodate.
Loop through both arrays allcating memory.
// constants defined in login.h
#define MAX_USERNAME_LENGTH 15
#define MAX_PASSWORD_LENGTH 12
static bool check_password(int index, const char* password){
return strcmp(pswd_db[index], password) == 0;
}
static int find_account(const char* username){
for (int i = 0; i < size; i++){
if (strcmp(user_db[i], username) == 0)
return i;
}
return -1;
}
bool create_account(const char* username, const char* password){
if (find_account(username) != -1)
return false; // account already exists
if (size >= db_max)
expand_db();
strcpy(user_db[size], username);
strcpy(pswd_db[size], password);
size++;
return true;
}
bool del_account(const char* username){
int index = find_account(username);
if (index == -1)
return false;
for (int i = index; i < size - 1; i++){
strcpy(user_db[i], user_db[i + 1]);
strcpy(pswd_db[i], pswd_db[i + 1]);
}
size--;
return true;
}
I believe these 4 functions are pretty self explanatory.
Accounts are sorted by the time of creation
once deleted, every indexed account afterwards will just be rolled forward once.
int client_Homemenu(int client_fd){ // [CHANGE]: Uses client_fd for I/O.
char choice[10];
while (1) {
client_send(client_fd, "=============================================\n");
client_send(client_fd, "Welcome to BASIC CHATROOM, please enter:\n");
client_send(client_fd, " (1) to login\n");
client_send(client_fd, " (2) to create an account\n");
client_send(client_fd, " (3) to exit\n");
client_send(client_fd, "Choice: ");
if (client_readline(client_fd, choice, sizeof(choice)) <= 0)
return -1;
if (choice[0] == '1') {
int index = client_loginmenu(client_fd);
if (index != -1)
return index;
} else if (choice[0] == '2') {
client_createmenu(client_fd);
} else if (choice[0] == '3') {
client_send(client_fd, "Goodbye!\n");
return -1;
} else {
client_send(client_fd, "Invalid choice. Please try again.\n");
}
}
}
This is the homescreen, it shows some simple text and prompts for an input. Which then shows the corresponding screen.
The rest of the functions really are quite simple, so I'll just leave the github here for those interested.
The standard way to get the ping of a user, is to basically send a message or "ping" there, log the time, then await for a "pong" response back. The time difference inbetween is the ping.
But since we're using netcat for connecting, we can't code anything to the client's side, so there's no way to get the client send a "pong" signal automatically when we send "ping".
Next best thing: commands
When the user's logged in, any text that starts with '/' will be a command and won't be broadcasted to other users.
/hello -> greets the user with "Why hello!"
/online -> return a list of every user that's currently online
Possible issues I randomly thought of: