Try   HackMD

2025q1 Homework4 (quiz3+4)

contributed by < BigMickey69 >

quiz1

測驗一

目標:

  • 解釋上述程式碼運作原理
  • 這項實作依賴遞迴呼叫,請避免使用並降低記憶體開銷
  • 改進最大公因數的運算效率

第一件事是將程式碼整合成一 .c 檔,再將空格 AAAA 到 EEEE 全數填完後,再補上缺的 c 標頭檔。幾週前的小考因此容易忘記空格答案,執行後有錯就更動,反覆操作直到 debug 完成。

教授幽默人又好,提供的程式碼很貼心的已經附帶每個函式的 testcase ,若跑不過就能知道有問題

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

通過截圖

為什麼要使用 mpi ?
c 內建的預設整數 int 只有32位元,因此只能表示2147483647 ~ –2147483648 的數字,即便用 unsigned 最大也只能表示 4294967295 。

若希望表示更大的整數則可使用 long long ,一共有64個位元可運用,將可表示整數的範圍擴充到 -2^31 to 2^31 – 1,但這樣還是會不夠用,而且如果每個整數都用 long long 來表示,則會造成很多不必要的記憶體空間浪費

mpi 全名 multi-precision integers,自訂的整數表示型別,利用 uint32_t 將記憶體分割成32位元的等分,而每格只使用31位元儲存資訊,需要時擴充、不必要時縮小空間,可有效率且高兼容的做整數運算

不過最大的缺點是速度比上內建整數運算慢了好幾倍

原理:

定義一新 struct 叫 mpi_t,裡面有一 uint32_t 指標來儲存資料、一 size_t 來紀錄大小
(我自己會把 size_t 改用 short int 來進一步節省記憶體,不過這樣或許會慢一些,每次進行操作時處理器得額外花時間將 short int 擴充至32位元)

typedef struct {
    uint32_t *data;
    size_t capacity;
} mpi_t[1];

寫 [1] 的好處是寫程式時不必在每次傳入 mpi_t 時使用 & 符號,矩陣本身就是個指標,而在創建此物件時編譯器會自動分配一點記憶體過去,省下 malloc 指標的必要,為寫程式的一種小技巧

雖然 data 指標中每格為32位元,但為了避免 overflow 的發生而書改到違法的記憶體,32位元中只會使用31為原來儲存資料

各函式作用:

  • mpi_init 初始化元素,mpi_clear 釋放記憶體。

  • mpi_enlarge 使用 realloc 將擴充元中 data 指標配置的記憶體量到指定大小,並將新的記憶體設為 0

  • mpi_compact 從 capacity 由後往前看 data 的每一格,若為 0 則釋放並紀錄新大小

  • mpi_set_u32 將 uint32_t 的整數存為 mpi_t 元素。如前面所述,因 uint32_t 為32位元而 data 中每格只能儲存 31 位元,因此還是得用 2 格來儲存資料。先存下31位元後再將數字向右移動 >>= 32 位元

  • mpi_set_u64 與 mpi_set_u32 同理但 data 會使用到 3 格,因 ceil_div(64,31) = 3

  • mpi_get_u64 將 mpi_t 轉為 uint64_t ,若大小超出 64 位元則無條件捨去掉,每次儲存31位元進 uint64_t 後再往左 bitshift <<= 31 位元繼續儲存

  • mpi_get_u32 與 64 同理不過會在 32 位元時就無條件捨去

#define INTMAX 0x7fffffff
// 只有 MSB 為1、其餘為0,為 mpi_t 中31位元的一格能表示的最大整數
  • mpi_add 實現兩 mpi_t 之間的加法運算,capacity 取兩數較大的(假設 n )。從n開始,取數字1與數字2中那個的31位元出來,儲存在 uint32_t 之中,兩數與變數 c 做家法,並紀錄溢出的部份於變數 c ,最後再將答案 AND INTMAX 來確保 MSB 為0,操作至第一格

  • mpi_sub 與 mpi_add 同理不過改為減法,一樣有變數 c來紀錄 carry

  • mpi_add_u64 將 mpi_t 元素加上 uint64_t,其實邏輯上與 mpi_add 一樣,只是會用 INTMAX 來 mask 前31位元以外的所有位元(當作是一格 data 操作),一次加法結束後再向右平移31位元,反覆操作

    Image Not Showing Possible Reasons
    • The image was uploaded to a note which you don't have access to
    • The note which the image was originally uploaded to has been deleted
    Learn More →

  • mpi_sub_u32 與 mpi_add_X 雷同,不過改用減法(注意:不支援負數!)

  • mpi_mul_naive 實現兩 mpi_t 之間的乘法運算,至於為什麼是 naive (天真),是因為邏輯上就跟我們人在紙上用的直式一樣,隨著數字越來越大,計算量會迅速膨脹(時間複雜度:O(n^2))。

  • mpi_mul_karatsuba 同樣計算兩 mpi_t 之間的乘法,但時間複雜度只有 ~O(n^1.585) 。卡拉楚巴演算法利用divide‑and‑conquer(分治)的方法,它比傳統的 O(n²)「長乘法」更快,運算複雜度降到約 O(n^1.585)。

    • 把問題分成三部份,前、中、後,若還是太大則繼續分割、處理,將大題不斷切成多個小問題:
      Image Not Showing Possible Reasons
      • The image was uploaded to a note which you don't have access to
      • The note which the image was originally uploaded to has been deleted
      Learn More →

      Image Not Showing Possible Reasons
      • The image was uploaded to a note which you don't have access to
      • The note which the image was originally uploaded to has been deleted
      Learn More →

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

非常糟糕的手繪示意圖

  • 較大的 capacity 平均分成兩半,
    • 下半部 (x0, y0):數字的下 31×m bits

    • 上半部 (x1, y1):數字的上31×m bits

    • 再用兩個 helper 函式協助, mpi_fdiv_r_2exp(, 31*m) 取餘數,mpi_fdiv_q_2exp(, 31*m) → 取商

名稱 運算 含義
z2 x1 × y1 上半部×上半部
z0 x0 × y0 下半部×下半部
z1 (x1+x0) × (y1+y0) 合併後乘積
​​​​z1 扣掉 z2 和 z0 為中間項
  • mpi_cmp 比較兩 mpi_t 元素的大小,若 capacity 不相同則很好比較,相同的話再將最左邊的31位元抓出來存進32位元整數再進行比較

  • mpi_setbit 利用mask主動顛倒指定位置的位元,1U 往左 <<= 指定位元數創建 mask 再用 OR 操作

  • 最後,mpi_gcd 為利用 gcd 中的輾轉相除法與遞迴找出最大公因數

改善記憶體使用、更概 gcd 使其不依賴遞迴

使用遞迴的函式只有 mpi_mul_karatsuba 與 mpi_gcd 兩個,若要mpi_mul_karatsuba 不使用遞迴會非常的複雜,得用好幾個堆疊 stack 去紀錄每個代做操作,要碼拆分要碼合併,紀錄每個要操作的元素與要操作的 data 部份、做 bit shift 並且記著原先的數值等等等

不管是速度上還是記憶體,感覺都不會勝於遞迴的版本,因此這邊專注於 gcd 函式的優化

mpi_gcd2:

void mpi_gcd2(mpi_t rop, const mpi_t op1, const mpi_t op2)
{
    if (op2->capacity != 0 && mpi_cmp_u32(op2, 0) == 0) {
        mpi_set(rop, op1);
        return;
    }

    mpi_t dividend, divisor, r;
    mpi_init(dividend);
    mpi_init(divisor);
    mpi_init(r);

    mpi_set(dividend, op1);
    mpi_set(divisor, op2);

    while(divisor->capacity != 0 && mpi_cmp_u32(divisor, 0) != 0){
        mpi_fdiv_r(r, dividend, divisor);
        //printf("%d / %d = eh ... %d\n", 
            //dividend->data[0], divisor->data[0], r->data[0]);
        mpi_set(dividend, divisor);
        mpi_set(divisor, r);

        //printf("new %d / %d\n", dividend->data[0], divisor->data[0]);
    }
    mpi_set(rop, dividend);
    //printf("final: %d\n", rop->data[0]);
    mpi_clear(dividend);
    mpi_clear(divisor);
    mpi_clear(r);
}

邏輯簡單,divisor 等於0之前不斷的輾轉相除。新增了一新 helper 函式 mpi_fdiv_r() ,邏輯上與 mpi_fdiv_qr() 一樣不過不會設定商(q)

遇到問題:

Image Not Showing Possible Reasons
  • The image was uploaded to a note which you don't have access to
  • The note which the image was originally uploaded to has been deleted
Learn More →

老朋友 Segmentation fault。
利用 printf 找問題,研究後發現問題出在這行:

while(mpi_cmp_u32(divisor, 0) != 0) { ...

mpi_fdiv_r(r, dividend, divisor); 似乎會在 r 為 0時將它設為 NULL ,這樣確實能省下一些記憶體,但就是有點怪,尤其是因為在原 gcd 函式中也有用 mpi_cmp_u32 來與 0 比對

多一部分偵測 capacity 是否為0就解決了

測驗二

從現在開始我會用英文撰寫共筆

Goal:

  1. 解釋上述程式碼運作原理
  2. 比較 Linux 核心原本 (與平台無關) 的實作和 memchr_opt,設計實驗來觀察隨著字串長度和特定 pattern 變化的效能影響
  3. 研讀 2022 年學生報告,在 Linux 核心原始程式碼找出 x86_64 或 arm64 對應的最佳化實作,跟上述程式碼比較,並嘗試舉出其中的策略和分析

下一步:貢獻程式碼到 Linux 核心
Phoronix 報導

First we must understand what exacty IS SIMD and SWAR

SIMD (Single Instruction Multiple Data),as opposed to SISD (Single Instruction Single Data) is when we process multiple data just with a singular instruction

Simply, SIMD is where you perform the same operation on multiple data in parallel.

Let's take taking the square of elements in an array as an example:

// Serial
void square(int *arr, size_t n) {
    for (size_t i = 0; i < n; ++i)
        arr[i] = arr[i] * arr[i];
}

// SIMD
#include <immintrin.h>

void simd_square_int(int *arr, size_t n) {
    size_t i = 0;
    
    // using AVX2 to process eight 32‑bit ints per __m256i
    const size_t stride = 8; 

    for (; i + stride <= n; i += stride) {
        __m256i vec = _mm256_loadu_si256((__m256i*)(input + i));
        __m256i squared = _mm256_mullo_epi32(vec, vec);
        _mm256_storeu_si256((__m256i*)(output + i), squared);
    }

    // Scalar fallback for remaining elements
    for (; i < n; i++) {
        arr[i] = arr[i] * arr[i];
    }
}

AVX2:

AVX2 (Advanced Vector Extensions 2) is an x86 CPU instruction‑set extension that provides 256‑bit SIMD support. Supporting gather/scatter memory operations and vector arithmetics, logical, and comparison instructions. Allowing a single instruction to process up to eight 32‑bit values in parallel, reducing instruction count and boosting throughput for data‑parallel tasks.

:::

A very basic concept is using a single-lane as multi-lanes. ex, using 16 bits as four 4 bit lanes, let's say we wanted to check if two sets of four 4 bit intergers/char are equal:
(SWAR example)

A: 0001 | 0010 | 0011 | 0100
B: 0100 | 0010 | 1011 | 0111

Step 1: A XOR B (save to A)

A -> 0101 | 0000 | 1000 | 0011

Lanes that are all 0 indicate a match

Step 2: Set new bits whose MSB of each lane is

B -> 1000 | 1000 | 1000 | 1000

Step 3: A OR B, making all MSB 1

A -> 1101 | 1000 | 1000 | 1011

Step 4: subtract 1 from each lane

A: 1101 | 1000 | 1000 | 1011
-0001 | 0001 | 0001 | 0001
= 1??? | 0111 | 0??? | 1???
Possible misjudge on lane 3

Step 5: NOT all bits of A, A AND initial A

  • Cleaning up results
    A -> 0??? | 1000 | 1??? | 0???
    -0101 | 0000 | 1000 | 0011
    = 0??? | 1000 | 0??? | 0???

RESULT: if MSB of said lane is 1, then it's a match

Linux kernal's original implementation of memchar

Linux source code

void *memchr(const void *s, int c, size_t n)
{
	const unsigned char *p = s;
	while (n-- != 0) {
        	if ((unsigned char)c == *p++) {
			return (void *)(p - 1);
		}
	}
	return NULL;
}

This code does the job in a more tradional way, it loops through the array and checks each 4 bits whether it's equal to varaible c past in the function. Though it is clever with how it returns the address of the match.

// SWAR approach
void *memchr_swar(const void *s, int c, size_t n) {
    const unsigned char *p = s;
    unsigned char uc = (unsigned char)c;

    // SWAR processes are slow if not byte aligned, so we'll check the traditional way
    while (n > 0 && ((uintptr_t)p % sizeof(size_t)) != 0) {
        if (*p == uc)
            return (void *)p;
        p++;
        n--;
    }

    if (n >= sizeof(size_t)) {
        size_t repeated = uc;
        repeated |= repeated << 8;
        repeated |= repeated << 16;
        if(SIZE_MAX == UINT64_MAX)
            repeated |= repeated << 32;

        // Magic constants!
        size_t ones, highs;
        if (sizeof(size_t) == 8) {
            ones  = 0x0101010101010101ULL;
            highs = 0x8080808080808080ULL;
        } else {
            ones  = 0x01010101U;
            highs = 0x80808080U;
        }

        while (n >= sizeof(size_t)) {
            size_t word = *(const size_t *)p;
            // XOR to find match, 0000 if a match
            size_t diff = word ^ repeated;

            // SWAR zero-detection
            if (((diff - ones) & ~diff & highs) != 0) {
                // find actual location if there's a match
                for (size_t i = 0; i < sizeof(size_t); i++) {
                    if (p[i] == uc)
                        return (void *)(p + i);
                }
            }
            p += sizeof(size_t);
            n -= sizeof(size_t);
        }
    }

    // Process any remaining bytes.
    while (n--) {
        if (*p == uc)
            return (void *)p;
        p++;
    }
    return NULL;
}
}

We check in 32 bit chunks(compiling in 32-bit) for matching unsigned char (8 bits).
After making sure that bytes are aligned (impacts efficiency greatly if not), we use logic from earlier but streamlined to check whether a match exists within 32 bits.

If there are any remaining bytes after SWAR, we make sure to also check them through the tried and true way.

Testing performance: memchr v.s. memchr_SWAR

We'll go through 100,000 iterations, with each pointer's minimum size at 1024 (1KiB), maximum at 102410248 (8MiB), slowly getting larger each iteration.

Randomly fill said arrays using rand(), with each iterations set the target at the end.

​​​​e.g. int target = buffer[size - 1]

Finally, we write all the data into .csv form and graph said data using matplotlib:
Figure_2
We can see that memchr_SWAR is a on average much more stable and faster than memchr, but we can't really see a steady increase in time or performance with the increase in buffer size. There may be a bit of noise in these results, though I will point out that both functions are ran in the same environment, so it's still a fair compaison giving us an idea of the performance difference.

size, memchr, memchr_SWAR
...
1795072,0.007167,0.002484
1803264,0.080287,0.017096
1811456,0.079536,0.017788
1819648,0.069977,0.012703
...
Looking at the .csv file manually, you'd find that almost every
record of memchar_SWAR is ~3x to 5x faster than memchr

Average memchr: 0.057593 seconds
Average memchr_SWAR: 0.011527 seconds

補充: 寫完才發現跟 quiz 中的程式碼幾乎一樣,但確實是自己拼湊 + 查資料出來的結果

測驗三

The first step of any assignment is too comprehend the code at hand, but too often I forget to log the steps I took to try and understand the code.
Since there's no additional assignments for this quiz, I'll walk through my process through enlightenment:

Env is the environment, storing info regarding the current thread. While ucontext_t contains execution context.

the line

enum { ENV_UNUSED, ENV_RUNNABLE, ENV_WAITING };

declares 3 anonymous enumerations.

My question upon seeing this is "What's the point?, Why not just use macros?"

My guess of why: Declaration can be done in just 1 line, automatic numbering, Scoped declaration

static int curenv;

The variable "curenv" is a global variable that keeps track of what Env is currently running.

static ucontext_t exiter = {0};

This declares a new ucontext_t object while quickly setting numbers to 0, pointers to NULL. (since NULL and 0 are the same thing in assembly)

const struct itimerspec ts = {
    {0, 0},
    {0, 10000000},
};

It's structure is defined as so:

struct timespec {
    time_t tv_sec; // tv stands for "time value"
    long   tv_nsec;  
};
// it stands for "interval timing"
struct itimerspec {
    struct timespec it_interval; // interval between firing
    struct timespec it_value;    // time before first fire
};

So this means that ts doesn't repeat, fires after 10 ms

mmap():

Stands for memory map. It's a system call that maps files or anonymous memory into a process's virtual address space, giving direct access to processes in user space.

...
// AAAA: 
envs[env].status = ENV_RUNNABLE;
...
    
... 
// BBBB
int candidate = (curenv + 1) % NENV;
...
    
...    
// CCCC
if (envs[curenv].state_reentered++ == 0) {
        /* Context successfully saved; schedule the next user-level thread to
         * run.
	 */
        coro_schedule();
    }
...
    
...
// DDDD
static void preempt(int signum UNUSED,
                    siginfo_t *si UNUSED,
                    void *context UNUSED)
{
    coro_yield();
}
...

sigaction structure

struct sigaction {
    void (*sa_handler)(int);
    void (*sa_sigaction)(int, siginfo_t *, void *);
    sigset_t sa_mask;
    int sa_flags;
    void (*sa_restorer)(void);
};

Explained by the prof in the section before the quiz.

quiz2

測驗一

CRC or Cyclic Redundancy Check is a common way to check if data is intact, most commonly seen in web/server data applications.

First time seeing it I had no idea what I was looking at, the code and the logic just didn't make any sense in my head. Upon revisiting the concept for week 4's assignment however, I already have a better grasp.

// naive implementaion
uint32_t crc32_naive(uint32_t crc, uint8_t v)
{
    crc ^= v;
    for (int bit = 0; bit < 8; bit++) {
        if (crc & 1)
            crc = (crc >> 1) ^ UINT32_C(0x82f63b78);
        else
            crc = (crc >> 1);
    }
    return crc;
}

The first thing to keep in mind is that the bits are reversed in this application. There are a few benefits that comes with this:

  1. bit ops on LSB are faster
  2. less overall operations are needed

Let's say we're processing 8 bits of data like the example above, xor happens and bit shifts happen only 8 times. Thus really, only the first 8 bits of CRC actually influenced the result.

This 2^8 = 256, thus we can use a lookup table to negate the need for calculating all together.

static const uint32_t crc32c_table[256] = {
    0x00000000L, 0xF26B8303L, 0xE13B70F7L, 0x1350F3F4L,
    ...
    0xBE2DA0A5L, 0x4C4623A6L, 0x5F16D052L, 0xAD7D5351L
};

uint32_t crc32c(uint32_t crc, const uint8_t *data, unsigned int length)
{
    while (length--) {
        crc = crc32c_table[(crc ^ *data++) & 0xFFL] ^ (crc >> 8);
    }
    return crc ^ 0xffffffff;
}

crc32c does crc on every byte of data, while examples earlier only focused on 1 byte.

Since the bits are reversed, we XOR with 0xFFL at the end to reverse the result to grab the index for the lookup table.

uint32_t crc32_u8(uint32_t crc, uint8_t v)
{
    crc ^= v;
    static const uint32_t crc32_table[] = {
        0x00000000, 0x105ec76f, 0x20bd8ede, 0x30e349b1, 
        0x417b1dbc, 0x5125dad3, 0x61c69362, 0x7198540d, 
        0x82f63b78, 0x92a8fc17, 0xa24bb5a6, 0xb21572c9, 
        0xC38D26C4, 0xD3D3E1AB, 0xE330A81A, 0xF36E6F75
    };
    
    crc = (crc >> 4) ^ crc32_table[crc & 0x0F];
    crc = (crc >> 4) ^ crc32_table[crc & 0x0F];
    return crc;
}

For a half-byte lookup table, we just change the implementation from checking 8 bits at once -> checking 4 bits, twice.

Goal:

  1. 在 Linux 核心原始程式碼中找到 CRC32 的應用案例至少 3 處,並探討其原始程式碼
  2. 設計實驗來分析 Fast CRC32 提出的手法 (要涵蓋 branchless 和查表的變形),並與硬體的 CRC32 指令效能進行比較。
  3. 以 x86-64 或 Arm64 為探討標的,說明 Linux 核心如何使用硬體加速的 CRC32 指令

1. Linux 中 CRC32 的應用案例

Q. How do we find exactly where crc32 is used in the Linux kernal?
A. Straight up download the kernal and search it through!

Steps took:

bigmickey@Linuxer:~$ sudo apt-get install linux-source
bigmickey@Linuxer:~$ cd /usr/src
bigmickey@Linuxer:~$ sudo tar -xf linux-source-6.8.0.tar.bz2

-> When we downloaded "linux-source", it downloaded a compressed file that contains everything. Thus we must use a tool named "tar" to unpack everything

bigmickey@Linuxer:~$ cd linux-source-6.8.0

Finally, the we seach through the files using the command grep (global regular expression print) for "crc32"

bigmickey@Linuxer:/usr/src/linux-source-6.8.0$ grep -Rn "crc32" kernel/

Output screen(partial):
Screenshot from 2025-03-27 21-57-54

With this, we're able to find the exact files and places where crc32 is used.

hid-playstation.c

(The name playstation caught my attention)
What is it?:

  • Under drivers/, it's a driver in linux kernal, as the name suggests it's the driver for playstation controllers
  • Translates HID reports(raw input data from USB/Bluetooth) into Linux input events
  • Does LED light control
  • Vibration motors
  • Reports battery status
  • Touchpad/gyrometer support and more!

So how is CRC32 used?
CRC32 is used mainly to verify data integrity
of Bluetooth for the controller

struct dualshock4_input_report_bt {
	uint8_t report_id; /* 0x11 */
	uint8_t reserved[2];
	struct dualshock4_input_report_common common;
	uint8_t num_touch_reports;
	struct dualshock4_touch_report touch_reports[4]; /* BT has 4 compared to 3 for USB */
	uint8_t reserved2[2];
	__le32 crc32;
} __packed;

Many structures like this one ARE the structures of data that's transmitted wirelessly through bluetooth, each one containing CRC32. The function ps_check_crc32() checks if CRC is correct, if not the input is ignored. (return -EILSEQ)

Source code

fastmap.c

What is it?:

  • Located at drivers/mtd/ubi/fastmap.c, so it's a driver as well, on top of MTD (Memory Technology Devices) & UBI (Unsorted Block Images). Very basically, it's a part of Linux kernel's support for raw flash memory, especially NAND flash chips
  • UBI manages wear-leveling, bad block handling, and provides a volume abstraction, while Fastmap is a performance optimization for UBI.
  • Without Fastmap, UBI has to scan the entire flash during mount to load all internal metadata
  • Fastmap stores a snapshot of UBI's internal metadata (mapping of eraseblocks, etc.) in a compact format

CRC32 is used to verify data integrity, detecting corruption in Fastmap data when it is read from flash.

Since flash devices (e.g. 隨身碟) are prone to bit flips, partial writes, or wear and tear corruption

...
    crc = crc32(UBI_CRC32_INIT, ubi->fm_buf, fm_size);
    if (crc != tmp_crc) {
        ubi_err(ubi, "fastmap data CRC is invalid");
        ubi_err(ubi, "CRC should be: 0x%x, calc: 0x%x",
            tmp_crc, crc);
        ret = UBI_BAD_FASTMAP;
        goto free_hdr;
    }
...

Source code

blockcheck.c

What is it?

  • It's a part of fileSytem (fs), under OCFS2 (Oracle Cluster File System version 2.), multiple nodes can access the same disk simultaneously
  • Checks storage blocks to see detect whether any block corruption has occurred

Q. If CRC32 is used on litterally every block, wouldn't that takle too much computing and storage?
A. Yes, for the sake of striking a balance with efficiency, only metadata blocks are checked.

// One of the functions
int ocfs2_block_check_validate(void *data, size_t blocksize,
			       struct ocfs2_block_check *bc,
			       struct ocfs2_blockcheck_stats *stats)
{
	int rc = 0;
	u32 bc_crc32e;
	u16 bc_ecc;
	u32 crc, ecc;

	ocfs2_blockcheck_inc_check(stats);

	bc_crc32e = le32_to_cpu(bc->bc_crc32e);
	bc_ecc = le16_to_cpu(bc->bc_ecc);

	memset(bc, 0, sizeof(struct ocfs2_block_check));

	/* Fast path - if the crc32 validates, we're good to go */
	crc = crc32_le(~0, data, blocksize);
	if (crc == bc_crc32e)
		goto out;

	ocfs2_blockcheck_inc_failure(stats);
	mlog(ML_ERROR,
	     "CRC32 failed: stored: 0x%x, computed 0x%x. Applying ECC.\n",
	     (unsigned int)bc_crc32e, (unsigned int)crc);

	/* Ok, try ECC fixups */
	ecc = ocfs2_hamming_encode_block(data, blocksize);
	ocfs2_hamming_fix_block(data, blocksize, ecc ^ bc_ecc);

	/* And check the crc32 again */
	crc = crc32_le(~0, data, blocksize);
	if (crc == bc_crc32e) {
		ocfs2_blockcheck_inc_recover(stats);
		goto out;
	}

	mlog(ML_ERROR, "Fixed CRC32 failed: stored: 0x%x, computed 0x%x\n",
	     (unsigned int)bc_crc32e, (unsigned int)crc);

	rc = -EIO;

out:
	bc->bc_crc32e = cpu_to_le32(bc_crc32e);
	bc->bc_ecc = cpu_to_le16(bc_ecc);

	return rc;
}

ECC -> Error-Correcting Code.
Simplified steps for the function:

  1. CRC32 and ECC of bc
  2. set bc to all 0
  3. generate the CRC32 for data
    if it succeeds → check!
    If not, log the failure and try to use ECC to restore the data

Source code

2. 設計實驗來分析

for (unsigned int j = 0; j < 8; j++)
    crc = (crc >> 1) ^ (-(int)(crc & 1) & Polynomial);

By smartly utilizing bit ops, we're able to negate the need of an "if" branch OR "tenary" operators.
Logic:

  1. AND 1 extracts the LSB of crc
  2. Casting to int and putting "-"
  3. if (crc & 1) == 0, then the result is 0
  4. if (crc & 1) == 1, then the result is -1, which as int in bit representation is 0xFFFFFFFF, making all bits 1.
  5. AND with all 1s does nothing, allowing the result through and XOR operation to go through

This is a very fast way to calculate CRC without using lookup tables

Before getting into data, I want to break down one of the algorithms provided that caught my eye:

const uint32_t Polynomial = 0xEDB88320;
// -----------------------------------------------
// standard initialization
// approx. 1.5 nanoseconds on my PC / GCC9-x64
for (unsigned int i = 0; i <= 0xFF; i++)
{
  uint32_t crc = i;
  for (unsigned int j = 0; j < 8; j++)
    crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
  Crc32Lookup[0][i] = crc;
}
// -----------------------------------------------
// code by a user named "Wolf"
// approx. 0.22 nanoseconds on my PC / GCC9-x64
Crc32Lookup[0][0] = 0;
// start with powers of two
uint32_t crc = Crc32Lookup[0][0x80] = Polynomial;
for (unsigned int step = 0x40; step != 0; step >>= 1)
{
  // each power of two has only a single bit set
  // so we need just one iteration
  crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
  Crc32Lookup[0][step] = crc;
  // and compute all not-yet-processed multiples
  for (unsigned int i = step * 3; i <= 0xFF; i += step << 1)
    Crc32Lookup[0][i] = crc ^ Crc32Lookup[0][i - step];
  // note: step*2 was already filled, therefore start at step*3
  //       and compute all odd multiples
}
// -----------------------------------------------
// my improvement of Wolf's code
// approx. 0.18 nanoseconds on my PC / GCC9-x64
Crc32Lookup[0][0] = 0;
// compute each power of two (all numbers with exactly one bit set)
uint32_t crc = Crc32Lookup[0][0x80] = Polynomial;
for (unsigned int next = 0x40; next != 0; next >>= 1)
{
  crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
  Crc32Lookup[0][next] = crc;
}
// the main idea is:
// table[a ^ b] = table[a] ^ table[b];
// therefore if we know all values up to a power of two called x
// we can compute table[x + y] where x > y:
// table[x + y] = table[x ^ y] = table[x] ^ table[y]
// example:
// the previous loop computed table[1] and table[2],
// so that x = 2 (a power of two) and y = 1 (fulfilling x > y)
// now x + y = x ^ y = 2 + 1 = 2 ^ 1
// and table[x + y] = table[x ^ y] = table[x] ^ table[y]
// which means table[3] = table[2] ^ table[1]
// compute all values between two powers of two
// i.e. 3, 5,6,7, 9,10,11,12,13,14,15, 17,...
for (unsigned int powerOfTwo = 2; powerOfTwo <= 0x80; powerOfTwo <<= 1)
{
  uint32_t crcExtraBit = Crc32Lookup[0][powerOfTwo];
  for (unsigned int i = 1; i < powerOfTwo; i++)
    Crc32Lookup[0][i + powerOfTwo] = Crc32Lookup[0][i] ^ crcExtraBit;
}

Breaking it down step by step:

// Normal way:
for (unsigned int i = 0; i <= 0xFF; i++) {
  uint32_t crc = i;
  for (unsigned int j = 0; j < 8; j++)
    crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
  Crc32Lookup[0][i] = crc;
}

// Wolf's way
Crc32Lookup[0][0] = 0;
uint32_t crc = Crc32Lookup[0][0x80] = Polynomial;
for (unsigned int step = 0x40; step != 0; step >>= 1) {
  crc = (crc >> 1) ^ ((crc & 1) * Polynomial);
  Crc32Lookup[0][step] = crc;
  for (unsigned int i = step * 3; i <= 0xFF; i += step << 1)
    Crc32Lookup[0][i] = crc ^ Crc32Lookup[0][i - step];
}

By knowing CRCs values of some bit set, you're able to build the CRC of other values by mixing and matching sets.

  • e.g. CRC[7] = CRC[4 ^ 2 ^ 1] = CRC[4] ^ CRC[2] ^ CRC[1]

!!!NOTE!!! : Mixing and matching only works with powers of 2!

Why? Because powers of 2 only have 1 bit set, thus we're able xor them together to create any number from 1 to 255 by "turning on" those bit positions.

step = 0x40(64) -> i = 192
step = 0x20(32) -> i = 96, 160, 224
step = 0x10(16) -> i = 48, 80, 112, 144, 176, 208, 240

eventaully filling out every slot!
image

Experiment:
I chose 4 functions to test and compare. The Original, Standard-table, Improved-branchless and Slicing-by-4

We run each function 5 times in varying sizes, from 1 Byte, 1 KB, 1 MB, 8 MB, 32 MB.

Very simple function to get the time difference:

double time_diff_in_sec(struct timespec start, struct timespec end) {
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

Results:
crc_comparison

The original implementation is by far the slowest, though it may not seem very obvious due to how spread out the graph is. Since the orignial has to calculate every bit then shift through, it hogs too much time resulting in slow speeds. Branch-free gets rid of the if statement and used (int) type casting for clever gains. But these are results that are expected.

What's far more interseting is the performance difference between Standard table v.s. Slicing-by-4

// Standard-table generation (optimized)
void init_crc32_table() {
    for (int i = 0; i < 256; i++) {
        uint32_t crc = i;
        for (int j = 0; j < 8; j++) {
            if (crc & 1)
                crc = (crc >> 1) ^ POLYNOMIAL;
            else
                crc = crc >> 1;
        }
        crc32_table[i] = crc;
    }
}

// Slicing-4-table generation (optimized)
void init_crc32_slicing4_table() {
    // First row: same as standard table.
    for (int i = 0; i < 256; i++) {
        crc32_slicing4_table[0][i] = crc32_table[i];
    }
    
    // Build rows 1..3.
    for (int row = 1; row < 4; row++) {
        for (int i = 0; i < 256; i++) {
            uint32_t crc = crc32_slicing4_table[row - 1][i];
            crc32_slicing4_table[row][i] = (crc >> 8) ^ crc32_table[crc & 0xFF];
        }
    }
}

Standard table is a table of size 256, while slicing4_table has of a whopping size of 4 * 256

While a bigger table eats up more memory, it also means less calculations are needed to get results. 4 bytes are processed simultaneously

Still, it pales in comparison to hardware accelerated CRC32 calculating. (Basically using lookup tables but using just hardware, achieving results in just 1 or 2 instructions.

3. 以 x86-64 或 Arm64 為探討標的,說明 Linux 核心如何使用硬體加速的 CRC32 指令

How is hardware-accelerated CRC32 done on ARM64?

The circuits for CRC calculation are already built on Arm64 cpus, it's implemented uses a special shift register called LFSR (Linear Feedback Shift Register).

🧱 Circuit components:
Shift Register: Holds the current CRC value.

XOR Gates: Used to "divide" the data polynomial.

Couple wires.

So with dediated hardware, despite manually calculating CRC values, it is still MUCH faster than lookup tables in software.

Is it possible to use lookup tables with hardware?

Yes, but that's a bit hard to test, especially without a FPGA board. Maybe it's something I can return to in the future.

測驗二

A basic Synthesizer in C.

code with blanks filled in
First, understanding each structure is in order:

  • synth_oscillator_t
    • The source of sound, generates waveforms
  • synth_envelope_t
    • Envelopes how a note's volume changes over time
  • synth_filter_t
    • Basic filter for smoothing
  • synth_mixer_t
    • Mixes up to 3 input signals
  • synth_node_type_t
    • Enumeration defining what type a node is
  • synth_node_t
    • Represents one node on the signal chain
    ​​​​typedef struct {
    ​​​​    /* Holds node-specific state (e.g., phase or envelope state).
    ​​​​                       This is reset when a note is triggered (gate event). */
    ​​​​    int32_t state;
    
    ​​​​    q15_t *gain;  /* Pointer to the gain input value. */
    ​​​​    q15_t output; /* Output value of the node. */
    ​​​​    synth_node_type_t type;
    ​​​​    uint8_t param1; /* Optional: potentially used for controlling gain range. */
    ​​​​    union {
    ​​​​        synth_oscillator_t osc;
    ​​​​        synth_envelope_t env;
    ​​​​        synth_filter_t filter;
    ​​​​        synth_mixer_t mixer;
    ​​​​    };
    ​​​​} synth_node_t;
    
    state is a 32 bit interger, with:
    • bit 31 holding Decay‑mode flag
    • bits 4~22 holding Envelope level(fixed part)
    • 0~3 holding Envelope level(fractional)
  • synth_voice_t
    • Represents a playable note!

q15_t & q7_t:

  • First bit represents the sign, while the rest dictates from -1.0 ~ 1.0.
  • q15_t is 16 bits, while q7_t is 8
  • q15_t offers better stepping due to being able to hold more information

q15_t synth_process()

q15_t synth_process()
{
    int32_t main_output = 0;
    for (int vi = 0; vi < SYNTH_VOICES; vi++) {
        synth_voice_t *voice = &synth_voices[vi];
        synth_node_t *nodes = voice->nodes;
        /* First iteration: compute each node's output based on its current
         * state. Temporarily store these calculated outputs before updating the
         * node state.
         */
        int32_t outputs[SYNTH_NODES];
        for (int i = 0; i < SYNTH_NODES && nodes[i].type != SYNTH_NODE_NONE;
             i++) {
            synth_node_t *node = &nodes[i];
            q15_t tmp;
            switch (node->type) {
            case SYNTH_NODE_OSCILLATOR:
                tmp = node->state & 0x7FFF; /* Limit to positive q15 range */
                outputs[i] = node->osc.wavegen(tmp);
                break;
            case SYNTH_NODE_ENVELOPE:
                outputs[i] = (node->state & 0x7FFFFF) >> 4;
                /* Apply squaring to the envelope */
                outputs[i] = (outputs[i] * outputs[i]) >> 15;
                /* If the sustain value is negative, invert the envelope output
                 */
                if (node->env.sustain < 0)
                    outputs[i] = -outputs[i];
                break;
            case SYNTH_NODE_FILTER_LP:
                /* Compute low-pass filter output by scaling the accumulator
                 * with a filter factor. The accumulator holds the previous
                 * state, and this equation applies the filter's smoothing
                 * effect.
                 */
                outputs[i] = (node->filter.accum * node->filter.factor) >> 15;
                break;
            case SYNTH_NODE_FILTER_HP:
                /* Compute as with low-pass filter then subtract from the input
                 * signal.
                 */
                outputs[i] = (node->filter.accum * node->filter.factor) >> 15;
                outputs[i] = *node->filter.input - outputs[i];
                break;
            case SYNTH_NODE_MIXER: {
                int32_t sum = 0;
                for (int j = 0; j < 3; j++) {
                    if (node->mixer.inputs[j])
                        sum += *node->mixer.inputs[j];
                }
                outputs[i] = sum;
                break;
            }
            default:
                break;
            }

            if (node->gain) {
                /* For now, apply a simple linear gain.
                 * Positive gain amplifies while negative gain attenuates the
                 * signal, ensuring compatibility with envelope generators.
                 */
                outputs[i] = (outputs[i] * *node->gain) >> 15;
            }
        }

        /* Second iteration: update each node's state based on the computed
         * outputs.
         */
        for (int i = 0; i < SYNTH_NODES && nodes[i].type != SYNTH_NODE_NONE;
             i++) {
            synth_node_t *node = &nodes[i];
            node->output = outputs[i];
            switch (node->type) {
            case SYNTH_NODE_OSCILLATOR:
                /* Increment the phase by the base value and adjust for any
                 * detuning.
                 */
                node->state += *node->osc.phase_incr;
                if (node->osc.detune)
                    node->state += *node->osc.detune;
                node->state &= 0x7ffff;  /* wrap around q15 */
                break;
            case SYNTH_NODE_ENVELOPE:
                /* The highest bit in the state indicates decay mode
                 * (post-attack).
                 */
                if (voice->gate) {
                    /* When the gate is active, process attack followed by decay
                     */
                    int mode_bit = node->state & 0x80000000;
                    int32_t value = node->state & 0x7fffffff;
                    if (mode_bit) {
                        /* Decay phase: decrement the state by the decay rate */
                        value -= node->env.decay;
                        const q15_t sustainAbs = node->env.sustain < 0
                                                     ? -node->env.sustain
                                                     : node->env.sustain;
                        if (value < sustainAbs << 4)
                            value = sustainAbs << 4;
                    } else {
                        /* Attack phase: increment the state by the attack rate
                         */
                        value += node->env.attack;
                        if (value > (int32_t) Q15_MAX << 4) {
                            value = (int32_t) Q15_MAX << 4;
                            /* Switch to decay mode once the peak is reached */
                            mode_bit = 0x80000000;
                        }
                    }
                    node->state = value | mode_bit;
                } else {
                    /* When the gate is inactive, perform the release phase */
                    /* Clear the decay mode bit so that the next gate triggers a
                     * fresh attack.
                     */
                    node->state &= 0x7fffffff; 
                    node->state -= node->env.release;
                    if (node->state < 0)
                        node->state = 0;
                }
                break;
            case SYNTH_NODE_FILTER_LP:
                /* Update the accumulator for the low-pass filter based on the
                 * input difference.
                 */
                node->filter.accum += (*node->filter.input - node->output);
                break;
            case SYNTH_NODE_FILTER_HP:
                /* Update the accumulator for the high-pass filter in a similar
                 * fashion.
                 */
                node->filter.accum += (*node->filter.input - node->output);
                break;
            case SYNTH_NODE_MIXER:
                /* Mixer nodes do not require state updates in this pass */
                break;
            default:
                break;
            }
        }

        /* Accumulate the output from the primary node of the voice into the
         * main output.
         */
        main_output += voice->nodes[0].output;
    }

    const q15_t main_mixer_gain = Q15_MAX / SYNTH_VOICES;
    return (main_output * main_mixer_gain) >> 15;
}

It's the heart of this synthesizer in c. It creates an audio sample running through every voice and every node.

For our song "Twinkle twinkle little stars", there're only 2 voices, defined by SYNTH_VOICES.

What's done each pass?

First Pass:

Synthesize Oscillator:

  • grab the 15 LSB bits and add to outputs

Synthesize Envelope:

  • Extract the envelope’s current level with ((state & 0x7FFFFF) >> 4)
  • Squares for a smooth curve, inverts if sustain is negative.

Low‑Pass Filter: (smooths rapid changes)

  • Calculate with accumulator by its coefficient for a smoothed output.
    High‑Pass Filter: (Emphasizes rapid changes)
  • Calculate with accumulator then subtract from the raw input by its coefficient for a smoothed output.

Second pass:

Oscillator's phase is increased by its base pitch (phase_incr) plus any detune, then wrap (& 0x7FFF) so it loops back to zero at the end of a cycle.

Envelope:

  • If the note gate=1, we advance through attack → decay phases
  • Once released at gate=0, decrement the state by the decay rate.

Both Filters update the accumulator

Goals:

  1. 解釋上述程式碼運作原理,搭配「信號與系統」教材
  2. 探討定點數的使用和其精確度
  3. 改進 MIDI 一閃一閃亮晶晶的音效輸出

The use of Fixed point

q15 and q7 are the formats for fixed point in this project. Just like mentioned above they represent 1.0 ~ -1.0

q15 has a resolution of 1/32768 or 3.05×10^−5

while 14 is only 1/128 or 7.8×10^−3

Why not use floats?
Interger operations are much quicker to do, bit shifts and such are also disallowed on floats. Also floats have a hard time representing integer numbers

But this also means that the smallest number that can be displayed IS the resolution of the fixed point structure. Factors like overflowing and underflowing must also be kept in mind

Changes to this MIDI

I didn't like how notes sounded, personally I prefer a crisper notes, kinda like a piano.
There are 2 voices

Voice 0:

synth_init_envelope_node(&voice->nodes[1], NULL, /* gain */
     300,                    /* attack */
     10,                    /* decay */
     Q15_MAX * 0.2,           /* sustain */
     50                     /* release */
);
...
synth_init_filter_lp_node(&voice->nodes[0], NULL,  /* gain */
     &voice->nodes[3].output, /* input */
     10000                     /* factor */
);

Voice 1:

synth_init_envelope_node(&voice->nodes[1], NULL, /* gain */
     1000,                    /* attack */
     200,                    /* decay */
     Q15_MAX * 0.4,          /* sustain */
     15                      /* release */
);
...
synth_init_filter_lp_node(&voice->nodes[0], NULL,  /* gain */
     &voice->nodes[2].output, /* input */
     8000                     /* factor */
);

For voice 0, I bumped the attack up and made the decay and release very quick. Voice 1's attack was bumped WAY up with an even smaller release.

Low pass filter for voice 0 was also increase to 10000.

...
if (note) {
    /* Voice 0 plays the note as given, and Voice 1 plays two
     * octaves lower.
     */
    synth_voice_note_on(&synth_voices[0], note);
    synth_voice_note_on(&synth_voices[1], note - 12);
}
...

Finally, instead of Voice 1 playing two octaves lower, it's only one.

const uint8_t twinkleTwinkleBeats[] = {
    4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5,
    4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4, 4, 3, 5,
};

The length of notes was also tweaked a little bit, the original had this awkward pause between each line. This version remedies this.

測驗三

Goal:

  1. 解釋上述程式碼運作原理
  2. 學習 jserv/ttt 程式碼,在不使用 (n)curses 一類現成的函式庫前提下,改進網路聊天室的文字呈現介面
  3. 在既有的程式碼之上,實作帳號登入 (login) 及基本管理功能,並該有線上與否的提示,留意 ping value 的處理機制

This is a simple chat program using socket and the "select()"" system call.

The code for this chat system is very self explanatory, so I'll just focus on parts I didn't initially understand.

struct sockaddr_in sin = {
    .sin_family = AF_INET,
    .sin_port = htons(port),
    .sin_addr = {.s_addr = htonl(INADDR_ANY)},
};

This is the stucture storing an instance of a socket. SIN stands for Socket-INternet.

  • sin_family holds a IPv4 address
  • sin_port is the port(converted using htons)
  • s_addr = htonl(INADDR_ANY) binds the address to all available networks
fd_set rfds;
FD_ZERO(&rfds); 
FD_SET(server_fd, &rfds);
int max_fd = server_fd + 1;
    while (select(max_fd, &rfds, NULL, NULL, NULL) >= 0) {
        /* Check if the server socket has become readable, which indicates an
         * incoming connection */
        if (FD_ISSET(server_fd, &rfds)) {
            ...

fd_set is a data type defined in <sys/select.h>, it represents a set of file descriptors. Which select() can then use to check if any are ready for I/O operations

select():

Takes in 5 parameters: nfds, readfds, writefds, exceptfds, timeout

nfds:

  • Integer representing the biggest file descriptor + 1

readfds:

  • A pointer to an array of fd_set that indicates which file descriptors to monitor reading

writefds:

  • Same as 'readfds' but for writing

exceptfds:

  • A pointer to an array of fd_set for monitoring exceptions

timeout:

  • A pointer to a struct timeval that specifies how long select() should wait

At NULL select() will wait indefinitely until a descriptor becomes ready

Behavior:

  • select() pauses the program until at least one file descriptor is ready, or until the timeout expires.

After returning, fd_set is updated too

When to use select()?
When handling multiple Connections, so instead of dedicating a thread or process, we can just use select(). This is especially useful for simple servers or applications where multithreading might add unnecessary complexity.

Efficiency:
select() has limited scalability, but it works well for small to moderate numbers of connections and is widely supported across different operating systems

Improved chat room

Looking at the tic-tac-toe program example, we can see an interesting output:
image
Despite just being a part of just a terminal, there's a background and colored characters, effectively simulating UI elements.

How is it done? Is it magic???

It is indeed magical! But it isn't magic.

'\033' // Escape character

This character tells the following characters are commands and not just plain text, this happens until reaching the character 'm'.

'[' // Control Sequence Introducer

This character then indicates that a Control Sequence command is coming

So for example:
"[47" sets the background to white,
"[31" sets the foreground text to red.
"[0" will reset the formatting si subsequent text aren't affected.

30: Black
31: Red
32: Green
33: Yellow
34: Blue
35: Magenta
36: Cyan
37: White

printf("\033[47m\033[31mX\033[0m");

So all together a line like this will print out:image
(A red X on a white background)

How to start the chatroom?

$ gcc chat.c -o chat -Wall -Wextra

Adding -Wall and -Wextra when compiling enables some common checks that check for potential issues that may be present in your code. Alerting you about potential issues.

$ ./chat 8080 // choose your own port

! ! Choose a port that isn't being used by other applications on the same network ! !

Opening up another terminal and connecting with:

nc localhost 8080

Since both terminals are running on litterally the same device, localhost will get the correct IP.
image

Back to the code, adding color

Goal:
Each client should have a distict color on a white background.

int client_colors[FD_SETSIZE] = {0};
int user_cnt = 0;
// Using an array to keep track of each client's color code
    ...
    conns[new_fd] = 1;

    // assign a color to this user
    client_colors[new_fd] = 30 + user_cnt++ % 7;

After recieving a valid connection, a color is assigned to the user.

...
else if (nread > 0) {
    
    // set the last character to '\0' instead of '\n'
    buf[nread-1] = '\0';

    char colored_msg[1024 + 20]; // extra space for escape characters
    
    // snprintf returns the len of new string, negating the need for strlen() later on
    int colored_len = snprintf(colored_msg, sizeof(colored_msg), "\033[47m\033[%dm%.*s\033[0m", client_colors[fd], nread, buf);


    printf("[%d] read: %.*s\n", fd, colored_len, colored_msg);

    /* loop over all our connections, and send stuff onto them!*/
    for (int dest_fd = 0; dest_fd < FD_SETSIZE; dest_fd++) {
        /* take active connections, but not ourselves */
        if (conns[dest_fd] && dest_fd != fd) {
            /* write to them */
            if (write(dest_fd, colored_msg, colored_len) < 0) {
                /* disconnect if it fails; they might have
                 * legitimately gone away without telling us */
                fprintf(stderr, "write(%d): %s\n", dest_fd,
                        strerror(errno));
                close(dest_fd);
                conns[dest_fd] = 0;
            }
        }
    }
...

Results:

Connection accross systems:
Windows ⬇️:
Untitled
Ubuntu ⬇️:
image

Testing different connection colors:
image

Source code

Adding accounts and login Functionality

File chat.c remains largely unchanged, most new features are added by new files "login.c" and "login.h"

Changes to chat.c:

if (new_fd < 0) {
                perror("accept");
            } else {
                printf("[%d] connect from %s:%d\n", new_fd,
                    inet_ntoa(sin.sin_addr), ntohs(sin.sin_port));
                
                // Welcome message
                client_printer(new_fd);
                // Show home menu
                int user_index = client_Homemenu(new_fd);
                if (user_index == -1) {
                    close(new_fd);
                    continue;
                }

                int onoff = 1;
                if (ioctl(new_fd, FIONBIO, &onoff) < 0) {
                    printf("fcntl(%d): %s\n", new_fd, strerror(errno));
                    close(new_fd);
                    continue;
                }
                conns[new_fd] = 1;
                // [CHANGE]: Color is now determined by the user’s account index.
                client_colors[new_fd] = 30 + (user_index % 7);
            }

client_printer sends the client a welcome message, client_Homemenu shows basically the homescreen.

login.h

#ifndef LOGIN_H
...
#endif

This "if" definition makes sure that inclusions of a header can only happen ONCE

login.c

The logic of this user system and "database"(it's quite a stretch to call this a database) is quite simple. I tried to keep it simple as the scale of this chatroom is quite small, it only needs to do it's job.

So fancy features like encryption for passwords on database, or hashmaps. Especially no red-black trees or B+ trees or sql. BUT I will do it if the goal was to implement a login chatroom that supports tens of thousands of users.

full code on Github

Helper functions

static void client_send(int client_fd, const char *msg) {
    write(client_fd, msg, strlen(msg));
}

static int client_readline(int client_fd, char *buffer, int maxlen) {
    int total = 0;
    while (total < maxlen - 1) {
        char c;
        int n = read(client_fd, &c, 1);
        if (n <= 0) {
            return n; // error or disconnect
        }
        if (c == '\n')
            break;
        buffer[total++] = c;
    }
    buffer[total] = '\0';
    return total;
}

client_send handles sending messenges to the client, while client_readline reads their input.

client_readline reads characters one by one, while may be a bit slow, it does prevent injections from overflowing strings.

int size = 0;
int db_max = 10;

char** user_db;
char** pswd_db;

void init_db(){
    user_db = (char**)malloc(sizeof(char*) * db_max);
    pswd_db = (char**)malloc(sizeof(char*) * db_max);
    for (int i = 0; i < db_max; i++){
        user_db[i] = (char*)malloc(MAX_USERNAME_LENGTH);
        pswd_db[i] = (char*)malloc(MAX_PASSWORD_LENGTH);
    }
}

void expand_db(){
    int new_size = db_max * 2;
    user_db = (char**)realloc(user_db, sizeof(char*) * new_size);
    pswd_db = (char**)realloc(pswd_db, sizeof(char*) * new_size);  // [CHANGE]: fixed reallocation for pswd_db
    for (int i = db_max; i < new_size; i++){
        user_db[i] = (char*)malloc(MAX_USERNAME_LENGTH);
        pswd_db[i] = (char*)malloc(MAX_PASSWORD_LENGTH);
    }
    db_max = new_size;
}

The quote on quote "database" is just 2 arrays of size db_max. If there are ever too many users, the size is then doubled to accommodate.

Loop through both arrays allcating memory.

// constants defined in login.h
#define MAX_USERNAME_LENGTH 15
#define MAX_PASSWORD_LENGTH 12
static bool check_password(int index, const char* password){
    return strcmp(pswd_db[index], password) == 0;
}

static int find_account(const char* username){
    for (int i = 0; i < size; i++){
        if (strcmp(user_db[i], username) == 0)
            return i;
    }
    return -1;
}

bool create_account(const char* username, const char* password){
    if (find_account(username) != -1)
        return false;  // account already exists
    if (size >= db_max)
        expand_db();
    strcpy(user_db[size], username);
    strcpy(pswd_db[size], password);
    size++;
    return true;
}

bool del_account(const char* username){
    int index = find_account(username);
    if (index == -1)
        return false;
    for (int i = index; i < size - 1; i++){
        strcpy(user_db[i], user_db[i + 1]);
        strcpy(pswd_db[i], pswd_db[i + 1]);
    }
    size--;
    return true;
}

I believe these 4 functions are pretty self explanatory.

Accounts are sorted by the time of creation
once deleted, every indexed account afterwards will just be rolled forward once.

int client_Homemenu(int client_fd){   // [CHANGE]: Uses client_fd for I/O.
    char choice[10];
    while (1) {
        client_send(client_fd, "=============================================\n");
        client_send(client_fd, "Welcome to BASIC CHATROOM, please enter:\n");
        client_send(client_fd, "    (1) to login\n");
        client_send(client_fd, "    (2) to create an account\n");
        client_send(client_fd, "    (3) to exit\n");
        client_send(client_fd, "Choice: ");
        if (client_readline(client_fd, choice, sizeof(choice)) <= 0)
            return -1;
        if (choice[0] == '1') {
            int index = client_loginmenu(client_fd);
            if (index != -1)
                return index;
        } else if (choice[0] == '2') {
            client_createmenu(client_fd);
        } else if (choice[0] == '3') {
            client_send(client_fd, "Goodbye!\n");
            return -1;
        } else {
            client_send(client_fd, "Invalid choice. Please try again.\n");
        }
    }
}

This is the homescreen, it shows some simple text and prompts for an input. Which then shows the corresponding screen.

The rest of the functions really are quite simple, so I'll just leave the github here for those interested.

Getting Online Users

The standard way to get the ping of a user, is to basically send a message or "ping" there, log the time, then await for a "pong" response back. The time difference inbetween is the ping.

But since we're using netcat for connecting, we can't code anything to the client's side, so there's no way to get the client send a "pong" signal automatically when we send "ping".

Next best thing: commands
When the user's logged in, any text that starts with '/' will be a command and won't be broadcasted to other users.

/hello -> greets the user with "Why hello!"
/online -> return a list of every user that's currently online
image

Results:

image

Possible issues I randomly thought of:

  • What if a user is logged in from 2 locations?
    • What if one of said user deletes the account, what happens to the other user?
    • Will 2 accounts logged in at the same time be able to see their own messages from the other terminal?
    • Should the same user even be able to login from 2 seperate instances?
  • What if the user sent escape characters to the chat too? Would that mess up the formatting for others
  • Should the messages being sent be filtered for junk, swear words or sexual/illegal content?
  • Just how many users can connect to this system before everything comes falling down?
  • Since this is a single core server with no scheduler, it can only handle one user at a time, so if the server was waiting for a response from a user, the entire chatroom just stops
    May 30, 2025