2022 年「資訊科技產業專案設計」作業 1

貢獻者: 起司堡-cheeseburger

模擬面試錄影(漢)
模擬面試錄影(英)

🎩：interviewer 🍔：interviewee

217. Contains Duplicate (漢語)

[影片 (漢語)]

面試過程與程式碼

Question Description
🎩: Given an integer array nums,if any value appears at least twice in the array return true, if every element is distinct return false

Input: nums = [1,2,3,1], int dataSize = 4
Output: true

Input: nums = [1,2,3,4], int dataSize = 4
Output: false

Constraints:
1 <= nums.length <= 10^5
-10^9 <= nums[i] <= 10^9

Repeat and Examples
🍔: 是否可以改寫或調動 input array 的內容?
🎩: 可以

🍔: 是否只要當重複 value 出現時，便可以直接輸出 True/False?
🎩: 是

🍔: 依照您所給的 Constraints，所以我不必考慮 nums = [] 或是若我以 C 語言實作並指定 pointer to int 來接收 input array，我暫時不必考慮被傳入 NULL pointer?
🎩: 不會被傳入空陣列或 NULL pointer，假設都傳入合法整數陣列

#Approach
🍔: 我想這個題目可以使用 Hashtable 來處理，(說明演算法同時註解，請參照影片)，不過由於 C 語言的 Standard header 並沒有提供 Hashtabled 的實作，為了方便演算法的實作與解說，我希望可以使用 uthash 這個第三方開發的 header file 裡的 macro 來輔助解題。就我所知，uthash 已經被 include 於目前的 LeetCode，並且 uthash 的官方文件中強調它大部分程式碼以 macro 撰寫，其 Add, find, delete 操作通常都能在常數時間完成。

Is uthash fast?
Add, find and delete are normally constant-time operations. This is influenced by your key domain and the hash function.

This hash aims to be minimalistic and efficient. It’s around 1000 lines of C. It inlines automatically because it’s implemented as macros. It’s fast as long as the hash function is suited to your keys.

🎩: 可以，用吧 (yes I know that, and you can use uthash header if that help you expain your implementation.)

Code

// 1) build a hashtable HT <int key: num, bool value: appear> 
// 2) for num in nums:
//        if num not in HT: // map_find()
//            add num into HT // map_add()
//        else
//            return true
//    endloop
// 3) return false // every element is distinct

//#include "uthash.h"
typedef struct {
    int key;           // id 
    UT_hash_handle hh; // next prev hh_next hh_prev;
} element_t;

element_t *head = NULL;

void add(int key)
{
	element_t *n = malloc(sizeof(element_t)); 
	n->key = key;
    HASH_ADD_INT(head, key, n);
}

void hash_free()
{
    element_t *pos, *n; // n = (pos) ? pos->app_next : NULL;
    HASH_ITER(hh, head, pos, n){
        HASH_DEL(head, pos);
        free(pos);
    }
    /* 原版 HASH_ITER 風格的實作如下：
    element_t *pos, *n;
    for (pos = head, n = (pos!=NULL) ? pos->hh.next : NULL; \
         pos; \
         pos = n, n = (n!=NULL) ? n->hh.next : NULL){
        HASH_DEL(head, pos);
        free(pos);
    }*/
}

// O(n)
bool containsDuplicate(int* nums, int numsSize)
{
    // O(n)
    // 1) loop num in nums	
    for (int i =0; i < numsSize; i++){
	// 2) check num in HT?
            // if not find then, add num to HT // find() add()
            // if find, return true 
            element_t *n;
            HASH_FIND_INT(head, &nums[i], n);
            if (n) {
                hash_free();
                return true;
            }
            else {
                add(nums[i]); // use HASH_ADD_INT(head, key, n);
            }
        }
    // 3) end of loop, return false // every num in nums is distinct.
	hash_free();
    return false;
}

補充: uthash 官方文件
補充: LeetCode: "uthash.h" is included by default

Test
🎩: 好，那麼分析一下時間複雜度吧
🍔: O(n)，improvement 策略：不需要 counts (假設一開始定義 element_t 時不小心定義 count 變數的話)

可能的 Follow-up
🎩: 你剛剛說你有看過 uthash 的官方文件，可以大致說明一下 uthash 的實作方式嗎?
🍔: uthash.h 和 linux kernel 所提供的 hashtable.h 的實作方式有些微不同，兩者都是建立一條單向鏈結串列作為 buckets，透過 uthash 提供的 hash function 可以將 key 對應至其中一個 bucket，接著每個 bucket 會有一個 head 指標指向一條雙向鏈結串列，每個 node 允許使用者封裝成自訂義物件，也就是我剛才所定義的 element_t。

我認為 uthash.h 為了增加 hashtable 的使用方便性，其實作相對 linux kernel 的 hashtable.h 複雜許多，uthash.h 提供的 UT_hash_handle 其實是多種指標、變數的複合結構體，bucket 和 element_t 的結構都要嵌入 UT_hash_handle，但不是每個變數或指標都會用上，若想要去更改 HASHTABLE 的實作細節，反而讓開發變得困難。而 linux kernel 提供的 hash table 相關實作比較像是各種小模組、小零件，結構設計上有專業分工，例如雙向鏈結串列的 hlist_node 只用於 element_t、單向鏈結串列的 hlist_head 只用於 buckets 指向各自的 hlist_node，對程式開發者而言，自由度更高。

//#include <linux/type.h>
struct hlist_head {struct hlist_node *first;};
struct hlist_node {struct hlist_node **pprev, *next;};

typedef struct{
    int key;
    int counts;
    struct hlist_node *node ;
}element_t;

自評檢討

面試過程

#Test 時忘記跑一次範例資料做驗證
要練習用 Tab 鍵來排版，習慣用空白鍵很沒效率
對於 Interviewer 應該要如何出題及引導，還不是很明白，因此影片中的 Interviewer 沒做什麼事情

影片中把 uthash 的 app_next 名稱講錯了 (雖然概念是對的)，在 uthash.h 中，app order 應該是 void *prev, *next; 而 bucket order 是 struct UT_hash_handle hh_prev, hh_next:

typedef struct UT_hash_handle {
   struct UT_hash_table *tbl;
   void *prev;                       /* prev element in app order      */
   void *next;                       /* next element in app order      */
   struct UT_hash_handle *hh_prev;   /* previous hh in bucket order    */
   struct UT_hash_handle *hh_next;   /* next hh in bucket order        */
   const void *key;                  /* ptr to enclosing struct's key  */
   unsigned keylen;                  /* enclosing struct's key len     */
   unsigned hashv;                   /* result of hash-fcn(key)        */
} UT_hash_handle;

程式部分

element_t 的變數命名成 key，似乎會有解說上的困擾，也許用別的名稱再搭配註解為 key 就好

程式碼 uthash 的 macro 引述與使用上有缺失

HASH_DEL(head, pos); // 模擬面試影片裡少傳入 head 參數

HASH_FIND_INT(head, &nums[i], n); 
// 模擬面試影片裡，省略 find function 之後，HASH_FIND_INT 不需要再賦值給 n，
// 因為 macro 已將 n 作為參數傳入其展開的式子之中

Google Docs 為演算法步驟上色雖然好看，但實戰中應該是多此一舉，尤其當我想直接複製演算法中的關鍵字時，複製的文字會帶有顏色，若還要去調色就是浪費時間、徒增困擾。

同儕互評

Interviewer
Interviewee
- 2:30 覺得把演算法的步驟列出的很棒，可以很清楚知道有怎樣的判斷

同學檢討02

Interviewee
- 1:56 有考慮到是否會傳入空陣列
- 2:25 先以pseudo code闡述自己的想法很好
- 6:55 感覺花了太多時間在說明header與如何使用
- 15:18 感覺不用寫出return那一行後再刪掉，先把HASH_FIND_INT說明完再直接替代掉會比較好。
Interviewer
- 25:40 有問到時間複雜度
整體檢討：因為使用c語言實作，所以感覺花了很多時間在實作function，整體上比較像是呈現面試者對於c語言下hash table如何實作，所以我看完會有點遺忘問題本身是什麼。

同學檢討03

Interviewee
- 1:06 有針對題目的限制或條件進行詢問
- 9:59 用不同顏色表示來凸顯重點，看得更清楚，很好
Interviewer
- 28:13 有把問題延伸很好

同學檢討04

Interviewee
- 有詳細的確認允許的操作及假設很棒
- 3:43 應把 find 改成 found，這裡是當形容詞用，以過去分詞當作形容詞
- approach 步驟有列出很好
- 有向 interviewer 詢問是否可用第三方工具且有陳述使用原因、優缺點
- 可以說明一下 UTHASH 用 macro 而不是 function 的原因，或是由 interviewer 發問
Interviewer
- 可以嘗試把 constraint 縮小到 num[i] 範圍只有到 1000 之類的，詢問是否有更好的作法而不用 hash table，或是有沒有除了 hash table 以外的作法
- 可以嘗試往 hashtable 的其他特性延伸，例如需要 resize 的情境、concurrent operation 等，或是問問看 bloom filter 的相關問題

393. UTF-8 Validation (漢語 & 英語)

[影片 (漢語)] [影片 (英語)]

面試過程與程式碼

Question Description

Number of Bytes   |        UTF-8 Octet Sequence
                  |              (binary)
------------------+-------------------------------------
            1     |   0xxxxxxx
            2     |   110xxxxx 10xxxxxx
            3     |   1110xxxx 10xxxxxx 10xxxxxx
            4     |   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
--------------------------------------------------------
x denotes a bit in the binary form of a byte that may be either 0 or 1.

🎩: 這個表格說明合法的 UTF-8 編碼規則，一個 UTF-8 字元可以表示成 1 至 4 bytes 的長度，基本上就是兩大規則。當 UTF-8 字元表示成 1 byte 長度時，第一個 bit 會是零;當 UTF-8 字元表示成 2 bytes 以上 (含) 長度時，會有 n 個 bits 的 1 及 n+1 bits 是 0，n 指的是長度為 n 個 bytes (n > 1)。所以你會有第一個整數陣列型態的 input (複製貼上到 google doc)，當然第二個參數是陣列的長度。

(example 1)
data           = [     197,     130,       1]
data_to_binary : [11000101 10000010 00000001]

int dataSize = 3
return true

Examples
🎩: 你會注意到 example 1 是符合規則的，因為前 2 個 bits 全是 1，第 3 個 bit 為 0 表示它是一個 2 bytes 字元。下一個 byte 是一個以 10 開頭的連續字節，這是正確的。第三個整數是 0 bit 開頭，表示是 1 byte 字元，也符合規則，故回傳 true。

它是 2 bytes 的字元跟 1 byte 字元組成的合法 utf-8 編碼，因為第一個整數轉二進制表示法開頭是 110，表示要看第二條規則；而第三個整數轉二進制開頭是 0，表示要看第一條規則。

而 example 2 是不合法的 utf-8 編碼，因為前 3 個 bits 全是 1，第 4 個 bit 為 0 表示它是一個 3 bytes 字元。
下一個 byte 是一個以 10 開頭的連續字節，這是正確的。
但是第二個連續字元不是以 10 開頭，所以是無效的。

(example 2)
data           = [     235,     140,       4]
data_to_binary : [11101011 10001100 00000100]

int dataSize = 3
return false

請問到目前有理解嗎?

Repeat and Examples
🍔: 所以我可以當作每一組 input data 是一條資料串流或文字檔案，然後我要循序解碼，第一個整數轉 binary 後，要看前面幾個 most significant bits 來判斷它要接續幾個整數…
🎩: 不是依 most significant bits 判斷，抱歉，這是一個剛剛沒提醒到的 caveat，只有每個整數的最後 8 bits 用於存儲資料，也就是 least significant 8 bits，換句話說每個整數僅代表 1 bytes 的資料。

LeetCode Description:
Note: The input is an array of integers. Only the least significant 8 bits of each integer is used to store the data. This means each integer represents only 1 byte of data.

🍔: 喔! 我懂了，所以整數轉成二進制之後只需要看最後 8 個 bits，再檢查倒數 0 到 5 個 bits，來判斷編碼字元長度，而接續的整數只要檢查轉換後相同規則下是否為 10 開頭即可。而第一個整數只要是 0 開頭就符合。
🎩: 是的，另外這是輸入資料的限制，輸入資料正整數，且至少包含一個數字。

Constraints:
1 <= data.length <= 2 * 10^4
0 <= data[i] <= 255

🎩: 所以你要做的事情就是寫個 function，可以讀取一條正整數陣列，並判斷是否符合 utf-8 編碼規則。

Approach

漢語影片與英語影片闡述演算法的方式不同，漢語影片嘗試使用有限狀態機 (finite-state machine)，英語影片使用文字描述的方式撰寫演算法步驟。
漢語影片：稍微帶一下題目跟我理解狀態機處理 stream data 的經驗類似，並說明 utf-8 encoding 可以如何拆解成狀態機中的狀態，並進行分析。同時帶入狀態機中遇到輸入 NULL 及狀態簡化的概念，找尋其他可能的資料型態案例，一邊帶入測試資料，一邊驗證整個演算流程的正確性。(#Test)
英語影片：帶入 HTTP 封包解析的經驗，說明利用 HTTP 協定傳送圖片時，接收方會需要先從 package header 得知 content-length，才能進一步解析正確 img stream 的例子，來解釋我接下來會用相似概念來做 utf-8 encoding。

Code

漢語影片: 因為討論狀態機並同時做測試驗證耗時較多，時間上較緊迫，因此設計成 interviewer 不要求寫出完整 code，但仍詢問判斷 num 是否符合 utf-8 encoding 規則的程式碼會如何實作。因時間關係，interviewee 選擇直接講使用 shift operation 的實作方式。

/* 漢語影片參考程式碼 */
// 1) whlie loop        // O(k + n-k) = O(n)
// 2)     if condition // O(1)
// 3)     whlie loop  // O(k) 

(START) → num → (STATE 1)
(START) → NULL → (GoodEnd)
(STATE 1: check first num rule, get byte_length, following_byte = byte_length - 1)
	“legal”, byte_length = 1 → (START)
	“legal”, byte_length > 1 → num, following_byte → (STATE 2)
	“legal”, byte_length > 1 → NULL → (END)
	“illegal”→ (END)
(STATE 2: check following num rule, following_byte--)
	“legal”, following_byte > 0 → num → (STATE 2)
	“legal”, following_byte > 0 → NULL → (END)
	“legal”, following_byte = 0 → (START)
	“illegal” → (END)
(END: return false;)
(GoodEnd: return true;)

// input:   11000101
// rule 1:  0xxxxxxx
// if ((input >> 7) == 0)

// input:   11000101
// rule 2:  110xxxxx
// if ((input >> 5) == 110)

英語影片: (本來直接實作應該要更省時間，但錄製模擬面試影片時忘記控管時間)
- 採用多條 if condition 來做規則過濾，並適時讓 interwiewer 給出關鍵提醒或提示，例如:
  - 🎩: Before you jump into code, why don't you walk me through more detail that how would you conceptionally do these three steps?#Test
  - 🎩: (提示當 input array 只有一個 integer 時，成式能否正確，或是要如何修改)#Test
- interviewee 實作「檢查 utf-8 rule」的方法時，額外討論 bitmask 與 shift operation 兩種方式。

/* 英語影片參考程式碼 */
bool validUtf8(int* data, int dataSize)
{
    // 1) read a num in data array
    //     int i = 0; while loop
    int i = 0;
    while(i < dataSize){
        // 2) check the byte_len of the utf-8, follow_byte = byte_len - 1 
        //	int follow_byte = 0;
        //      use “if ((num >> 7) == 0)” to find byte_len in 4 rule, follow_byte
        //      		if  num is valid && rule 1: continue 
        //		if  num is valid && rule 2~4: goto step 3.
        //		if  num is invalid: return false; 	
        int byte_len = 0, follow_byte = 0;
        if ((data[i] >> 3) == 0b11110)       {i++; byte_len = 4; follow_byte = byte_len - 1;} // rule 4
        else if ((data[i] >> 4) == 0b1110) {i++; byte_len = 3; follow_byte = byte_len - 1;} // rule 3
        else if ((data[i] >> 5) == 0b110)   {i++; byte_len = 2; follow_byte = byte_len - 1;} // rule 2
        else if ((data[i] >> 7) == 0b0)       {i++; continue;} // rule 1
        else {return false;}
        // 3) keep read a num with follow_byte times, check num meet rule of 10xxxxxx
        // 	while(follow_byte- -){
        // 		i++; // read a next num;
        //		if (follow_byte > (dataSize - i)){return false;}
        //   		like ““if ((num >> 7) == 0)”” method to check num meet rule of 10xxxxx
        //			if  num is valid: keep read
        //			if  num is invalid: return false; 
        //	}
        while(follow_byte--){
            if (follow_byte > (dataSize - i)){return false;}
            if ((data[i] >> 6) == 0b10) {i++; continue;}
            else {return false;}
        }
    }
    // 4) end of while: return true; 
    return true;
}

#Test
🍔: time complexity O(n),
🍔: improve from bitmask to bit shifting, == 可以用 !(^) 取代

自評檢討

面試過程

漢語影片中，legal 拼錯，寫成 “llegal”
若要使用有限狀態機，用文字方式來做討論比作圖更花時間，要控管時間
模擬面試時間太長、要注意時間

程式部分

英語影片中的撰寫的演算法步驟，太多廢話，看起來很雜亂

可嘗試改進部分
code

// O(n)
bool validUtf8(int* data, int dataSize){
    // 1) process input
    int followbytes = 0;
    unsigned char check8bits;
    int i = 0;
    while (i < dataSize){ // O(n)
        // 2) take one integer, determine byte length of utf-8
        check8bits = data[i];
        if ((check8bits >> 7) == 0b0) {i++; continue;}             // rule 1
        
        if ((check8bits >> 3) == 0b11110) {followbytes = 4-1;}     // rule 4
        else if ((check8bits >> 4) == 0b1110) {followbytes = 3-1;} // rule 3
        else if ((check8bits >> 5) == 0b110) {followbytes = 2-1;}  // rule 2
        else {return false;} // if ((check8bits >> 3) == 0b11111) {return false;}
        i++;
        // 3) check the following (n-1) utf-8 byte
        if (followbytes > (dataSize - i)) {return false;}
        for (int j = 0;  j < followbytes; j++){ // O(m), m included in n
            check8bits = data[i];
            if ((check8bits >> 6) == 0b10) {i++;}
            else {return false;}
        }
    }
    return true;
}

同學互評 1

Interviewer
- 感覺這類型的題目有點尷尬，很容易太過冗長，光是在制定策略前的釐清規則就花不少時間，interviewer 可以適當化簡或剪裁，針對想測試的部份挑出來討論就好
- 整體大部分時間花在實作上，比較難觸及 interviewee 的其他層面能力或經驗，比較不容易在有限的時間內了解這個人