資訊科技產業專案設計課程作業 5

# 資訊科技產業專案設計課程作業 5 ## Coding Interview - As an Interviewer :::success <span style="position: absolute; left: 50%; transform:translateX(-50%); font-size: 20px;">**Word Count Engine**</span> <div style="height: 20px"></div> Implement a document scanning function `std::vector<std::vector<std::string>> WordCountEngine(const std::string &document);`, which receives a string document and returns a list of all unique words in it and their number of occurrences, sorted by the number of occurrences in a descending order. If two or more words have the same count, they should be sorted according to their order in the original sentence. Assume that all letters are in english alphabet. You function should be case-insensitive, so for instance, the words “Perfect” and “perfect” should be considered the same word. The engine should strip out punctuation (even in the middle of a word) and use whitespaces to separate words. Analyze the time and space complexities of your solution. Try to optimize for time while keeping a polynomial space complexity. **Examples:** - **input:** `document = "Practice makes perfect. you'll only get Perfect by practice. just practice!"` - **output:** `[ ["practice", "3"], ["perfect", "2"], ["makes", "1"], ["youll", "1"], ["only", "1"], ["get", "1"], ["by", "1"], ["just", "1"] ]` **Important:** please convert the occurrence integers in the output list to strings (e.g. "3" instead of 3). We ask this because in compiled languages such as C#, Java, C++, C etc., it’s not straightforward to create mixed-type arrays (as it is, for instance, in scripted languages like JavaScript, Python, Ruby etc.). The expected output will simply be an array of string arrays. **Constraints:** + [time limit] 5000ms + [input] `string document` + [output] `array.array.string` ::: ### 自我嘗試 (C++) 首先，第一個遇到的問題是如何拆解 string。在 C 中，可以使用 `strtok(char *str, char *delimiter)` 然而，這邊會遇到的問題是**題目並未明說有哪些標點符號會出現**。於是，我想到可以先淨化(?) input string。由於題目傳入的是 `const std::string& document`, 必須複製一份，剛好可以在轉移途中利用 `std` 當中的 `std::isalpha()`、`std::isspace()`和 `std::to_lower()` 將字串轉為只由小寫字母和空白構成的空間。 ```cpp= // Sanitize string std::string str; for(auto c : document){ if(!std::isalpha(c) && !std::isspace(c)){ continue; } str.push_back(std::tolower(c)); } ``` 接著，需要將各個單字拆開。由於字串已經剩下空白區隔，我們可以將其轉換成 `std::istringstream` 並利用 `getline()`，將 deliminator 設定為空白字元紀錄單字順序，順便利用 `std::map<std::string, int>` 計算出現次數： ```cpp= // Count occurence of uique words with order kept std::unordered_map<std::string, int> m; std::istringstream iss(str); std::string temp; std::vector<std::string> words; int max = 0; while(std::getline(iss, temp, ' ')){ if(temp.empty()){ continue; } if(m.find(temp) == m.end()){ m.insert({temp, 1}); words.push_back(temp); if(max < 1){ max = 1; } } else{ m[temp]++; if(m[temp] > max){ max = m[temp]; } } } ``` > + 測試時發現會有因 iss 有連續兩個空白而取到空字串的問題，因此追加第 8~10 行的檢測。 > + 因 O(1) 的插入、尋找、更新而使用 `std::unordered_map` (實做為 hash table)。此時，`words` 內的單字由 index 0 開始為正序。計算完成單字數量後，接著進行排序。原先的實作為單純使用 `std::sort` 後搭配 `words` 中的單字順序塞進回傳的 `std::vector` 中，然而這個做法的時間複雜度為 $O(N\log{N})$。完成後發現 Hint 中寫到**可以利用 [bucket sort](https://www.cs.usfca.edu/~galles/visualization/BucketSort.html) 達成 $O(N)$ 的時間複雜度，同時維持 linear 空間複雜度**，於是試著實作： 1. 依照 `words` 的單字順序將各個出現次數分別 `push_back()` 至 `std::vector<std::vector<std::vector<string>>> s[max + 1]` 中。 + 題目要求以字串形式儲存出現次數，因此 `std::to_string()`。 3. 由於是根據 `words` 的順序加入，以倒序自 `s` 取出便可以維持 input 的單字出現順序。 ```cpp= // Move to vector (bucket) vector<vector<vector<string>>> s(max + 1); for(auto w = words.begin(); w != words.end(); ++w){ std::vector<std::string> v; v.push_back(*w); v.push_back(std::to_string(m[*w])); s[m[*w]].push_back(v); } // Sort vector<vector<string>> ret; for(int i = max; i > 0; --i){ for(auto v = s[i].begin(); v != s[i].end(); ++v){ ret.push_back(*v); } } return ret; ``` ### 預測問題 + 可能出現的標點符號有？ + 像 "You're" 這種標點在中間的單字如何處理？ ### 模擬面試 (Schedule: 01/12 18:00) 途中發現 interviewer 並沒有**非常**熟悉 C++ 與字串處理相關的語法，因此會出現不少問題，尤其是滿好用的 builtin-function 們: `std::isspace`，`std::isalpha`，`std::to_string` 等等，然而並非非知道不可的 function，因此等到 compile 環節才要求他修正。並且我忘記在途中告訴他在拆分字串時很好用的技巧： ```cpp= std::isstringstream iss("One Two Three"); std::string str << iss; while(!str.empty()){ std::cout << str << ", "; str << iss; } /* Output: One, Two, Three*/ ``` 但是已在事後以評論告知。 #### 實際被問的問題： + 像 “You’re” 這種標點在中間的單字如何處理？ + 如果出現在中間，直接移除並視為同個單字。例如，"don't" $\rightarrow$ "dont" + C++ 有哪些字元轉換的函式？ - `std::isalpha`，`std::isspace`，`std::to_lower` 等等，同時建議他可以去看一下，畢竟字串處理很常見。 ## Coding Interview - As an Interviewee ### Preperation 稍微複習了一下各 STL 的使用法。 ### 模擬面試 :::success <span style="position: absolute; left: 50%; transform:translateX(-50%); font-size: 20px;">**Find The Duplicates**</span> <div style="height: 20px"></div> Given two sorted arrays `arr1` and `arr2` of passport numbers, implement a function `findDuplicates` that returns an array of all passport numbers that are both in `arr1` and `arr2`. Note that the output array should be sorted in an ascending order. Let $N$ and $M$ be the lengths of `arr1` and `arr2`, respectively. Solve for two cases and analyze the time & space complexities of your solutions: + $M ≈ N$ + The array lengths are approximately the same. + $M ≫ N$ + `arr2` is much bigger than `arr1`. **Examples:** - **input:** `arr1 = [1, 2, 3, 5, 6, 7], arr2 = [3, 6, 7, 8, 20]` - **output:** `[3, 6, 7]`, since only these three values are both in arr1 and arr2 **Constraints:** + [time limit] 5000ms + [input] `array.integer arr1` $1 ≤$ `arr1.length` $≤ 100$ + [input] `array.integer arr2` $1 ≤$ `arr2.length` $≤ 100$ + [output] `array.integer` ::: ### 復述與舉例在聽完後首先向 interviewer 確認： + `arr1`，`arr2` 為升序。接著舉例，然而即使已經確認過，可能因為緊張還是舉了一個不正確的例子，自己看起來觀感有點糟 :cry: ### 提出思路首先，（可能是因為被上一題影響？）馬上想到了可以用 `unordered_map` （也就是一個 hash table）來記錄 `arr1` 的內容，便可以利用平均時間複雜度 $O(1)$ 的特性與 `arr2` 進行比對，但是這個方法完全忽視了 input 已經排序好的事實，因此我便出現了『有排序？可以用 `set`！來維持以排序好的狀態』的奇怪邏輯，便將上述的 container 改成 `set`。此時 interviewer 告訴我：『的確時間複雜度是 $O(N)$，但你可不可以想出一個空間複雜度除了回傳的 `vector` 以外，是 $O(1)$ 的解法？』幫助我靜下心來仔細思考。此時，我才發現這種題目很適合使用 two-pointers 的技巧，於是提出了以下的算法： 1. 將兩個指針 `ind_1`，`ind_2` 分別指向兩個陣列的第一個成員。 2. 將指向的成員進行比較， + 如果 `arr1[ind_1] > arr2[ind_2]`，將 `ind_2` 往後移。 + 如果 `arr1[ind_1] < arr2[ind_2]`，將 `ind_1` 往後移。 + 如果 `arr1[ind_1] == arr2[ind_2]`，將其記錄起來，並將兩個指針都往後移。 4. 重複步驟 2 直到某個陣列被完全讀完。 ### 撰寫 Code 邊說明邊撰寫幾乎讓我腦袋都沒在動，照著上述的算法寫出了下面亂七八糟的東西： ```cpp= vector<int> findDuplicates( const vector<int>& arr1, const vector<int>& arr2) { int ind_1 = 0, ind_2 = 0; vector<int> ret; while(ind_1 < arr1.size() && ind_2 < arr2.size()){ if(arr1[ind_1] == arr2[ind_2]){ ret.push_back(arr1[ind_1]); continue; } while(arr1[ind_1] < arr2[ind_2]){ ++arr1; continue; } while(arr1[ind_1] > arr2[ind_2]){ ++arr2; continue; } /* * ...省略 * / } } ``` 經過 interviewer 題點才修改出最後的成品： ```cpp= vector<int> findDuplicates( const vector<int>& arr1, const vector<int>& arr2) { int ind_1 = 0, ind_2 = 0; vector<int> ret; while(ind_1 < arr1.size() && ind_2 < arr2.size()){ if(arr1[ind_1] == arr2[ind_2]){ ret.push_back(arr1[ind_1++]); ind_2++; } else if(arr1[ind_1] < arr2[ind_2]){ ++ind_1; } else{ ++ind_2; } } return ret; } ``` ### 時間／空間複雜度分析考慮到兩個狀況（`arr1.length()`$=M$，`arr2.length()`$=N$）： + $M ≈ N$ + 適用上述的解答，由於所有成員都必須被遍歷一次，時間複雜度為 $O(M+N)$；只宣告了兩個額外的指針，空間複雜度不受 input 大小影響，為 $O(1)$。 + $M ≫ N$ + 這種狀況，適用『在 `arr1` 中用 binary search 搜尋 `arr2` 的成員』的方法。使用這個方法，由於 binary search 只需要 $O(\log{M})$ 的時間複雜度，時間複雜度為 $O(N\log {M})$，但由於 $N$ 極小，可以視為 $1$，因此時間複雜度為$O(\log{M})$。</br></br>空間複雜度與前者相同，僅需宣告兩個額外的指針，空間複雜度不受 input 大小影響，為 $O(1)$。 ### 反省 + 不應該詢問是否開始撰寫程式碼，太急了反而會讓自己忽視很多東西。 + 撰寫程式碼的時候靜下心來，否則會像這次一樣寫出亂七八糟的東西，忘記**一次做一件事情**的原則。 ![](https://i.imgur.com/AN7mL8e.png) ## Behavior Questions ### Preperation 先配對一次試試水溫，第一次回答除了自我介紹的題目，了解一下問題內容，未錄影。身為 Interviewer 問 Set 9 身為 Interviewee 被問 Set 4 :::success <span style="position: absolute; left: 50%; transform:translateX(-50%); font-size: 20px;">**Behavior Question Set 9**</span> <div style="height: 20px"></div> + Tell me a bit about your professional experience in 2-3 minutes. + What are you looking for in your next role? + What’s an example of a difficult problem you solved? Be specific about how the problem was diagnosed and your process for approaching it. + What’s an example of when you demonstrated leadership or ownership? + Tell me about a challenging interaction with a team member. How do you handle disagreements with coworkers? + Apart from professional knowledge, what did you learn in your last role? ---- <span style="position: absolute; left: 50%; transform:translateX(-50%); font-size: 20px;">**Behavior Question Set 4**</span> <div style="height: 40px"></div> + Tell me a bit about your professional experience in 2-3 minutes. + What are you looking for in your next role? + What’s an example of a difficult problem you solved? Be specific about how the problem was diagnosed and your process for approaching it. + Speak about a project you’re most passionate about or one that you felt exhibited your best work. + Your manager comes into your office and gives you an assignment and asks for an estimate of how long it will take. You give them an estimate and they accept it. Halfway through the project you realize that you’ve underestimated the time it’ll take; it’ll actually take three times as long as you originally told your manager. What do you do? ::: ### 模擬面試被問上方 Set 9 的問題集，列出回答的大綱： + **Tell me a bit about your professional experience in 2-3 minutes.** + CS 專攻的 4 年級學生 + 對 low-level programming，embedded system 和 linux kernel 有興趣 + 也有接觸並進行過 web-programming，computer graphics 的小規模專案 + 專業外也對語言學習有興趣，能以英語、中文、日文流暢溝通及翻譯 + **What are you looking for in your next role?** + 與學校相比能夠在不影響成果發揮一點自由 + 能開放的溝通的工作環境 + **What’s an example of a difficult problem you solved? Be specific about how the problem was diagnosed and your process for approaching it.** + 在電腦影像進行液體模擬的專案時，研讀物理論文並將其擴展再轉換成程式的經驗。 + **What’s an example of when you demonstrated leadership or ownership?** + **Tell me about a challenging interaction with a team member. How do you handle disagreements with coworkers?** + 進行專案時**討論期間不出現也不參與的成員**以及**不太擅長但很努力的成員**之間的相處方式。 + **Apart from professional knowledge, what did you learn in your last role?** + 團體作業中與不同人合作及溝通的方式