2017q3 Homework2 (作業區)

C04 : Prefix-search

tags: `jserv`, `homework`

Reviewed by `yusihlin`

Github 上的 commit 內容有誤，語法可以再加強
共筆中有多處錯字，建議每次更新後重新瀏覽修改的部分
"針對 Tries 做解釋"這部分既然提及與題目相關連度高，應同時提及與此份作業程式碼的何處相關聯

開發環境

Architecture:          i686
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              37
Model name:            Intel(R) Core(TM) i5 CPU         650  @ 3.20GHz
製程：              5
CPU MHz：             1200.000
CPU max MHz:           3201.0000
CPU min MHz:           1200.0000
BogoMIPS:              6383.71
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           4096K

請先列出開發環境
課程助教

已補上謝謝williamchangTW

一開始的準備

processing problem

一開始遇到的問題：當我 make 時，在終端機出現以下問題：






test_cpy.c: In function 'main':
test_cpy.c:124:17:error: 'for' loop initial declaration are only allowed in C99 mode
                        for (int i = 0; i < sidx; i++)

test_cpy.c:124:17: note: use option -std=c99 or -std=gnu99 to compile your code
make: *** [test_cpy.o] Error 1

而我去找了資料後給了兩個方法，問題出現在 test_cpy.c & test_ref.c：
- gcc -std=c99 test_cpy.c
- gedit test_cpy.c file & add line below


int i;
for (i = 0; i < stdx; i++)

我選擇後者，先宣告int i在做for loop

Ternary search tree (TST)

再來我先看了 ternary search tree ，針對其定義做一些重點標記：
- 最多三個子代
- 是 tries的一種
- lower than Prefix-search but more space-efficiency
- Insertion random order 常常造出 well-blanced tree
- 跟 BST 比， TST 以速度為代價，搜尋空間的效率更高，但不像 tries 一樣浪費空間又有其速度快為優點
- Hash tables 不允許多個 ternary-search-tree 因為 hash maps 經常性的使用較大的記憶體範圍但不像 tries 一樣多，因為 hash maps 需要把整個資料都搜尋過一遍，不像 TST 一樣只需搜尋較少的字元，相對較慢
- 可用 Associative map 結構，有效率的解決增長的 String search algorithm
- 可以用在 spell checker & auto-completion
- 與程式碼對照，這裡插入一點解釋針對三個 child 出現的地方








//tst.c file
typedef struct tst_node {
    char key;
    unsigned refcnt;
    struct tst_node *lokid;
    struct tst_node *eqkid;
    struct tst_node *hikid;
} tst_node;

我們預設字母 A~Z 的排序
- lower child < search value => *lokid
- equal child = search value => *erqkid
- higher child > search value => *hikid

Time Complexity

	Average-case running time	Worst-case running time
Lookup	O(logn)	O(n)
Insertion	O(logn)	O(n)
Delete	O(logn)	O(n)

改用 HackMD 的語法重制上圖
"jserv"

已修正，謝謝
"williamchangTW"

spell checker 因為 TST 為 tries 的一種，所以可以找尋相關字比如可以找出錯誤的地方然後取相關字。
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

Auto-completion 這裡很明顯的可以看出他是如何在 TST 上運作的，子代都是其父代的相關聯字
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

找字系統如同 google search
Image Not Showing Possible Reasons
- The image file may be corrupted
- The server hosting the image is unavailable
- The image path is incorrect
- The image format is not supported
Learn More →

最後做總結，參考了link：

優點：
- 搜尋空間效率更高
- 有 binary search tree 空間利用上的優點以及 trie 搜尋速度上的優勢
- 效率高而且容易实现
- 效率比 hash 快，當資料相當大 collision increase
- 增加及減少資料容易，若 hash 改變大小則要重新操作
- 可以作為字典輸出，trie也能做，不过很费时

針對 Tries 做解釋（跟題目相關連度高）

是一個動態配置或關聯性配置的資料結構，常用於 String
優點：
- 搜尋速度快
- 從父親開始，其子代下去都是以他為開頭的相關性子代，所以相連性強，e.g t(parent)->ta(child)
- 樹根是 empty string
- 是一個 prefix-search of space-optmized presentation
- 可以應用在 hash tables
- 在 worst case 中可達到 O(m)，其中 m 為字串最長長度，比 hash table time complexity batter ，因為 hash table 有 collision problem ，因此 worst case bound in O(n)，顯得比較好一些
- 沒有 collision event
- bucket 的一個值能對應到多個值
- 不需提供 hash function or change，當有更多的值加入時
- 可提供按照字母順序的搜尋
缺點：
- 在一些事件上搜尋的時間較慢，當直接面對 hard disk 做存取時，還有 secondary memory 的 random access time 比較時間較長時
- 有時候空間需求會大於 hash table ，當 memory 要分配給每個字元一個記憶體空間而不是直接分配一塊記憶體給他時
- floating pointer 會造成長鍊的搜尋是無意義的儘管他可以支持 standard IEEE 單精度及倍精度浮點數
- 平均空間需求大

FIXME

先看一開始給的資訊



enum { INS, DEL, WRDMAX = 256, STKMAX = 512, LMAX = 1024 };
#define REF INS
#define CPY DEL

enum 是列舉，用來宣告元素定義整數，根據規範若沒有賦予其值，就是 0->1->2…以此類推，所以目前 INS = 0 = REF ， DEL = 1 = CPY

把 REF 改成 1 觀察其輸出#define REF DEL，明顯出現錯誤
choice : s

suggest[1001] : Ta
suggest[1002] : Ta
suggest[1003] : Ta
suggest[1004] : Ta
suggest[1005] : Ta
suggest[1006] : Ta
suggest[1007] : Ta
suggest[1008] : Ta
suggest[1009] : Ta
suggest[1010] : Ta
suggest[1011] : Ta
suggest[1012] : Ta
suggest[1013] : Ta
suggest[1014] : Ta
suggest[1015] : Ta
suggest[1016] : Ta
suggest[1017] : Ta
suggest[1018] : Ta
suggest[1019] : Ta
suggest[1020] : Ta
suggest[1021] : Ta
suggest[1022] : Ta
suggest[1023] : Ta

只有輸出 fscanf,%S,word 所取得的長度，很明顯的沒有傳入整串字串給其輸出

原本輸出應為

suggest[947] : Taylorsville,
suggest[948] : Taylorville,
suggest[949] : Tayoltita,
suggest[950] : Taypano,
suggest[951] : Tayport,
suggest[952] : Taysan,
suggest[953] : Tayshet,
suggest[954] : Taytay,
suggest[955] : Taytayan,
suggest[956] : Taytsy,
suggest[957] : Tayturka,
suggest[958] : Tayu,
suggest[959] : Tayud,
suggest[960] : Tayug,
suggest[961] : Tayum,
suggest[962] : Tayuman,
suggest[963] : Taywanak
suggest[964] : Tayzhina,
suggest[965] : Taza,
suggest[966] : Tazacorte,
suggest[967] : Tazewell,
suggest[968] : Tazlău,
suggest[969] : Tazoult-Lambese,
suggest[970] : Tazovskiy,

從錯誤的地方開始著手，choice : s 的時候發生錯誤


for (int i = 0; i < sidx; i++)
    printf("suggest[%d] : %s\n", i, sgl[i]);

可見它是輸出 sgl[i] 內的內容，而 i 是它找到的數字內有的，它是關聯性記憶體配置，所以在附近的都是關聯性教高的

查找 sgl[] 內的資料如何生成，不應該指現單一字頭，往上找 tst_ins_del 函式，在 tst.c 檔中發現一段程式碼中出現了 CPY

在上述中有提到 CPY 的值可能為 1 or 0，若我要跑 CPY 程式則設 1 ，反之則 0，這邊的 CPY 是由 test_ref.c 所提供，所以我把程式裡的 CPY changed to REF












 if (*p++ == 0) {
            if (cpy) { /* allocate storage for 's' */
                const char *eqdata = strdup(*s);
                if (!eqdata)
                    return NULL;
                curr->eqkid = (tst_node *) eqdata;
                return (void *) eqdata;
            } else { /* save pointer to 's' (allocated elsewhere) */
                curr->eqkid = (tst_node *) *s;
                return (void *) *s;
            }
        }

上述程式中，若為 CPY = 1 則會給它一塊記憶體空間存取值，若不為 1 則會走另外一個判斷式，因此錯誤在這個地方

先找出 FIXME 標注的地方(共3處)
把 CPY 換成 REF 可以正常輸出











t1 = tvgetf();
while ((rtn - fscanf(fp, "%s", word)) != EOF) {
    char *p - word;
    /*FIXME: insert reference to each string*/
    if (!tst_ins_del(&root, &p, INS, CPY)) {
        fprintf(stderr, "error: memory exhausted, tst_insert.\n");
        fclose(fp);
        return 1;
    }
    idx++;
}

找到 tst.h 中有他宣告的格式，int cpy維持不變


void *tst_ins_del(tst_node **root, char *const *s, const int del, const int cpy);

參考其註解，得知以下重點：
- 其中 's' 為從 scanf 中取得，存入word中之值
- 有兩個動作一個刪除一個增加
- 一開始存入所有點到 leaf
- 若 'del' 不為空值則刪除該值
- 'cpy' 若沒有配置記憶體給 's' 否則 save pointer to 's'，若已存在該字串，則增加 node->refcnt 然後回傳該值
refcnt 在以下兩個檔案中可以看見，把 node 指向 refcnt







/*tst.h*/ 
unsigned tst_get_refcnt(const tst_node *node);
/*tst.c*/
unsigned tst_get_refcnt(const tst_node *node)
{
    return node->refcnt;
}

test_cpy & test_ref compare

在還沒做出任何改變時兩個檔案效能資訊，沒有太大的差異
perf stat ./test_ref

225.444266      task-clock (msec)         #    0.024 CPUs utilized          
                25      context-switches          #    0.111 K/sec                  
                 2      cpu-migrations            #    0.009 K/sec                  
             6,615      page-faults               #    0.029 M/sec                  
       773,503,788      cycles                    #    3.431 GHz                      (82.26%)
        67,672,385      stalled-cycles-frontend   #    8.75% frontend cycles idle     (83.88%)
       304,831,532      stalled-cycles-backend    #   39.41% backend  cycles idle     (34.86%)
       549,998,612      instructions              #    0.71  insns per cycle        
                                                  #    0.55  stalled cycles per insn  (50.80%)
       108,195,625      branches                  #  479.922 M/sec                    (66.50%)
         3,064,576      branch-misses             #    2.83% of all branches          (82.95%)

       9.289633678 seconds time elapsed

perf stat ./test_cpy

226.348106      task-clock (msec)         #    0.034 CPUs utilized          
                22      context-switches          #    0.097 K/sec                  
                 2      cpu-migrations            #    0.009 K/sec                  
             6,614      page-faults               #    0.029 M/sec                  
       745,684,917      cycles                    #    3.294 GHz                      (82.32%)
        64,710,984      stalled-cycles-frontend   #    8.68% frontend cycles idle     (82.35%)
       301,841,280      stalled-cycles-backend    #   40.48% backend  cycles idle     (35.89%)
       540,944,385      instructions              #    0.73  insns per cycle        
                                                  #    0.56  stalled cycles per insn  (53.77%)
       109,386,057      branches                  #  483.265 M/sec                    (68.70%)
         2,994,135      branch-misses             #    2.74% of all branches          (84.59%)

       6.668136718 seconds time elapsed

文字訊息請避免用圖片表示
"jserv"

好的老師，已經全數換成字來顯示資訊
"williamchangTW"

城市與國家問題

觀察 cities.txt file，Taipei , Tiawan有一件事可以作為區隔城市與國家，當中用"，"區隔開來，想法是可以利用此分割點作為讀區到此點及返回找城市，或從此點開始對後面做搜尋找國家

Prefix-search changes where is not suitable

使用過後大概題一下所有的功能簡介:
- a 隨意加入你想加入的資料
- f 一定要打完整的字且大小寫有分，因為在 ASCII 中，英文字母的大小寫呈現方式不一樣，所以必須區分大小寫，值得提醒的一點無法找城市，只能找國家
- s prefix-search 的找，無法區分城市跟國家，找出所有相符的資料出來
- d 隨意的刪除資料
- q 結束

why used forward declaration

- Forward Declaration 根據其中提到的觀點，可以了解到為什麼要使用這個技巧，因為可以避免不必要的重新定義，當專案大到一定的程度時，會嚴重拖延編譯的時間，影響效能

這說法太片面，而且事實上編譯器技術進步很多，上述考量已經不是重點了
用物件導向和軟體模組化的觀點來回覆

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

jserv

在研讀你所部知道的C語言：物件導向篇得了以下幾點的重點及吸收：
- Encapsulation在此篇了解到，物件導向是一種包裝技術，將實體細節隱藏起來而不去讓使用者更動它
- Alias這個名詞曾經在白算盤書中看到，在那裡稱之為"Aliasing problem"簡易而論，也就是資訊不對稱，有可能同時存取同樣記憶體位置，當其中一方改變，而另一方拿取錯誤的資訊，造成不如預期的錯誤稱之

列出論文時，應該明確提及標題 (+超連結) Glib-C: C as an alternative Object OrientedEnvironment。論文閱讀和內文摘要是大學生很重要的訓練，可試著先把重點列出 (就像你已經做的事) 並透過實例解說和討論
"jserv"

已更正此部份所發生的不明確問題，並於其他地方發生過類似問題的地方做修正
"williamchangTW"

在詳讀 Glib-C: C as an alternative Object OrientedEnvironment 頁數介於 3~13 對於 OO 詳細解釋後歸內出以下幾點
- Objected-Oriented is sort of trap.
- Encapsulation ,objects, classes, inheritance and polymorph-ism may be regarded as the essential basic for OO.
- Encapsulation and data hiding is important for OO.
- An object is described by a class, where the class describes a type. So an object is an instance of such a type.
- Inheritance is concept in which a type can be extended: a type can be specialized.
- Polymorph-ism in the context of traditional OOPLs is the possibility of an object to take different forms at runtime
- Exception handling, garbage collecting, bulit-in operator overloading, runtime intrspection and reflection are kind of OO.
- Garbage collection系統在執行中把未用到的記憶體，開發者手動的管理它
- ADT是 OO 的子集合，稱之為 Object Base programming