Sorting HW2 - HackMD

--- title: 'Sorting HW2' --- # **Sorting HW2** ###### tags: `sorting` `python` Q1. 傳統的 Quick sort 是 1 個 Pivot、2 個 Partition，若改為 2 個 Pivot、3 個 Partition，演算過程要如何調整？演算速度是否會變快？ --- Q2. 如果數值 1~1000，共 1000000000 個整數，如果要排序，該如何設計？ --- **Algo** 首先假設數列無法全部放入 memory，因此無法全部讀入再排序。同理，輸出也一樣需要分段輸出，依排序寫入到同一個 file。由於數列內容為純數值，且 K（位數 = 3）<< N（資料個數 = $10^9$），所以適合 Counting Sort。演算法步驟： 1. 先產生一組隨機 1~1000，共 1000000000 個整數的檔案作為測資，每個數之間以「，」分開。 2. 宣告一個 array（Count），大小為 1001 用來記錄 1~1000 數字個別出現多少次。 3. 宣告一個 String（num_str）用來記錄目前讀到的數字的字串。 4. 每次只讀取一個字元，while 讀取直到讀到空字元停止。 5. 如果讀到非「，」字元則將該字元加到當前讀到的數字字串（num_str）中，代表讀取的是同一個數字。 6. 如果讀到「，」字元，代表目前記錄的數字字串是要排序的數字。將 num_str 轉成整數，並在對應的 Count 位置 +1。 7. 讀完之後，從小到大依據 Count 紀錄的個數字出現次數寫入 output file，一次只寫入一個數字。 8. 寫一個 check_answer.py 執行步驟 3. 4. 5. 6.，不同的是每次只比較當前數字是否大於等於上一次的數字，並在最後統計個數是否等於 $10^9$。 **Code** Q2_SortingTesting.c ```c= #include <stdio.h> #include <stdlib.h> #include <time.h> #define RANGE 1000 #define SAMEPLE_TIMES 1000000000 void randomArray(int arraySize, FILE* output_file) { int i; int x; x = rand() % arraySize; fprintf(output_file, "%d", x); for(i = 1 ; i < SAMEPLE_TIMES ; i++) { x = rand() % arraySize + 1; fprintf(output_file, ",%d", x); } } int main(int argc, char *argv[]) { int i; int numArray[RANGE]; const char* file_name = "SortingTesting3.txt"; FILE* output_file = fopen(file_name, "a+"); srand(time(NULL)); randomArray(RANGE, output_file); fclose(output_file); return 0; } ``` Q2_Sort.py ```python= INPUT_FILE_PATH = "./SortingTesting_Q2_test.txt" OUTPUT_FILE_PATH = "./SortedData_Q2_test.txt" if __name__ == "__main__": count = [0] * 1001 in_f = open(input_file_path, "r") readChar = in_f.read(1) num_str = "" + readChar while(readChar != ''): readChar = in_f.read(1) if(readChar == ","): count[int(num_str)] += 1 num_str = "" elif (readChar == ''): break else: num_str += readChar count[int(num_str)] += 1 in_f.close() out_f = open(output_file_path, "a+") for i in range(1001): if count[i] > 0: out_f.write(str(i)) count[i] -= 1 break for i in range(1001): while count[i] > 0: out_f.write("," + str(i)) count[i] -= 1 out_f.close() ``` Q2_check_answer.py ```python= SORTEDDATA_FILE_PATH = "./SortedData_Q2_test.txt" NUMBERS = 1000000 if __name__ == "__main__": in_f = open(SORTEDDATA_FILE_PATH, "r") l = 0 lastNum = 0 readChar = in_f.read(1) numStr = "" + readChar while(readChar != ''): readChar = in_f.read(1) if(readChar == ","): l += 1 if int(numStr) < lastNum: print("FAIL AT " + l + "TH NUMBER") break if l > (NUMBERS): print("TOO MANY NUMBERS") break lastNum = int(numStr) numStr = "" elif (readChar == ''): break else: numStr += readChar l += 1 if int(numStr) < lastNum: print("FAIL AT " + l + "TH NUMBER") elif l < (NUMBERS): print("LOST NUMBERS") elif l > (NUMBERS): print("TOO MANY NUMBERS") else: print("SUCCESS") in_f.close() ``` Q3. 如果全球統一了，身分證規格相同，都是一個字母加上 12 碼 0~9 的數字，要排序，該如何設計？ --- 情境假設 : 固態硬碟1TB以上 (75億筆字串約110GB) 記憶體32G (1億筆以物件儲存約7G) 執行步驟 : 1.原檔案約110GB。 2.考量到記憶體使用，批次讀取檔案每次一億筆。 3.以一億筆資料為一個Chunk使用radix sort 排序。 4.排序完寫入暫存檔案，並釋放記憶體。 5.重複步驟2~5，直到所有資料被排序，並產生n個chunk檔案。 6.對n個chunk檔案使用merge sort並寫入最後的檔案。 ```python= import pandas as pd import numpy as np import random import sys import os id_list=[] for i in range (1500): id_list.append(i) random.shuffle(id_list) file_object = open('sample.csv', 'a') for i in range(len(id_list)): file_object.write(str(id_list[i])) file_object.close() def read_in_chunks(filePath, chunk_size=10*1): file_object = open(filePath) while True: chunk_data = file_object.read(chunk_size) if not chunk_data: break yield chunk_data print(chunk_data) ``` Q4. 呈 2，如果數量不變，但數值改成長度不固定的字串（長度至少 5），要排序，該如何設計？ ---