--- tags: 2023DS --- # 2023 Data Structure - Final Project ## Score and Requirement | Term | Score | |:------------:|:-----:| | Problem 1 | 40% | | Problem 2 | 32% | | Coding Style | 18% | | Report (**pdf**) | 10% + 5% | Each group has to compress the relevant files for this project and submit the compressed file onto Elearning. Each group’s compressed file has to contain (1) All the source codes (2) The final report (no more than 20 pages in word file). In this report, you have to describe how you solve the problem and what results (可用截圖呈現) you obtain for each question. In the report, also make sure to describe each student’s role in the final project. In this project, when you have to find the maximum, minimum and median values, the requirement is that you have to **implement C++ classes for Heapsort and use these classes for sorting** (**you have to write C++ classes, rather than just using C language for this project**) (**實作出heapsort 的 C++ classes 並用其做 sorting 為此專題之基本要求**). Additionally, you may use other libraries to solve the problem for comparison (可額外加分). * Number of people per group: **5 ~ 6** * Time for demo per group: **15** minutes * Demo date: * 6/16, Fri, all groups * 報告應包含內容 1. 回答所有問題 2. **專題開發過程 (Must)** 3. **效能分析 (Optinal)** * 有關程式開發、報告配分: * 程式開發 - 18% * 基本 heapsort:10% * 使用其他資料結構:8% * 報告 - 10% + **5%**(TA 追加) * 回答 Prob 1, 2 - 已另外計分 * 講述開發過程:10% * 效能分析:**5%** ## Problem 1 (40 %) ### Single financial product problem -- daily prices of TWII (TSEC weighted index, 臺灣加權股價指數) * Dataset: TWII_withRepeatedData.xlsx The dataset contains data from 5 columns as follows: | Column 1 | Column 2 | Column 3 | Colymn 4 | Column 5 | | -------- | ---------- | ---------- | --------- | ----------- | | Date | Open_price | High_price | Low_price | Close_price | #### Please write C++ codes for the following tasks: ### Task A (20%, 2% per subquestion) (1) Determine how many unique dates are in the dataset. After removing the data from repeated dates, please use the closing price (i.e., the column for “Close_price”) to solve (2) – (9): (2) Find the 10 smallest prices and which dates contain these smallest prices. (3) Find the 10 largest prices and which dates contain these largest prices. (4) Find the median price and its occurring date (5) Compute the daily return for every day (except the first day). Then determine what the maximum and minimum returns (return could be a negative value) are and on which day(s) they occur. Daily returns is defined as: $[\frac{P(t+1)-P(t)}{P(t)}]\cdot100\%$, where $t$ is date. X.XXXX % (6) Compute the intraday return for every day. Then determine what the maximum and minimum returns (return could be a negative value) are and on which day(s) they occur. Intraday returns is defined as: $[\frac{Close\_price(t)-Open\_price(t)}{Open\_price(t)}]\cdot100\%$, where t is date. X.XXXX % (7) Make a **plot** of the closing price over time, in which the x-axis is the **day index** and y-axis is the price. (8) Make a **plot** of the daily return over time, in which the x-axis is the **day index** and y-axis is the daily return. (9) Make a **plot** of the intraday return over time, in which the x-axis is the **day index** and y-axis is the intraday return. (10) Find the maximum, minimum and median prices using all the 4 columns of prices (i.e., Open_price, High_price, Low_price and Close_price) and determine on which date they occur. * Hint: * 假設整理後,共有 6100 天,則總共有 $6100 * 4 = 24400$ 個價錢,要從這 22400 個價錢查詢最大、最小、中位數 * ~~若是偶數天,~~ 中位數請寫出「中間兩個」的價錢(**因為必定是偶數個價錢**) ### Task B (20%, 2% per subquestion) For the same dataset used in Task (A), after removing the data from repeated dates, please generate a set of **sampled** data at the interval of every five days, then do the same tasks for (1)-(10) of Task (A). * Hint: * 題目敘述為「取樣 (**sampled**)」,意指應「每五天**選**一天」 * 請固定選每五天的「第一天」 ## Problem 2 (32 %) #### Multiple financial products problem -- tick-based price data of options (選擇權) * Datasets: * OptionsDaily_2017_05_15.csv * OptionsDaily_2017_05_16.csv * OptionsDaily_2017_05_17.csv * OptionsDaily_2017_05_18.csv * OptionsDaily_2017_05_19.csv Each dataset contains data from 9 columns as follows: | Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6 | Column 7 | Column 8 | Column 9 | | :----------: | :----------: | :----------: | :----------------: | :----------: | :-----------: | :----------: | :------------------: | :-------------: | | 成交日期 | 商品代號 | 履約價格 | 到期月份(週別) | 買賣權別 | 成交時間 | 成交價格 | 成交數量(B or S) | 開盤集合競價 | A financial product is defined by a unique combination of columns 2 to column 5. I.e., if a value generated by combining the values of column 2, 3, 4, and 5 is unique, then it defines a unique product. For example, in **OptionsDaily_2017_05_17.csv, the 3rd row contains the following data:** **Column 2 = CAO Column 3 = 70 Column 4 = 201706 Column 5 = P** **So the product “CAO_70_201706_P” is a unique options product.** Please write C++ codes by OOP coding style for the following tasks: (1) Determine, totally, how many unique products exist in all these five datasets. **(3%)** (2) Determine if the product **TXO_1000_201706_P** exists in these datasets. **(3%)** (3) Determine if the product **TXO_9500_201706_C** exists in these datasets. **(3%)** (4) Determine if the product **GIO_5500_201706_C** exists in these datasets. **(3%)** (5) For **TXO_9900_201705_C**, * a. Find the 10 smallest prices and what time (determined by the unique combination of Column 1 (成交日期) and Column 6 (成交時間)) these smallest prices show up. **(5%)** * b. Find the 10 largest prices and what time these largest prices show up. **(5%)** * c. ***Find the median price of this product.*** **(5%)** * d. Compute the ticked-based return (except the first tick) using the values in Column 7 (成交價格). Then determine what the maximum and minimum **returns** are and when they occur. **(5%)** * Hint: * tick? * [tick difinition](https://rich01.com/what-is-1-tick/) * [什麼是 tick?](https://www.investopedia.com/terms/t/tick.asp) * Definition of **tick-based return**: $\frac{P_{t} - P_{t - 1}}{P_{t - 1}} \cdot 100\%$ * $P_t$: * the price of a certain trade. * 某一筆交易價錢 * $P_{t - 1}$: * the price of the previous trade (before the "certain" trade.) * 「某一筆」之前的一筆交易價錢 * $t$: * a certain time * 某一時間點 * 使用尚未取的原始資料,去處理 5 a) ~ c) * [重點]:意指即使有兩列的資料**完全一模一樣,仍要保留下來** ### Hint 第二大題,第四欄的 201705W4 不會影響計算商品種類的數量 結論:第四欄有 201705W4 一類的資料是正常的,請根據原本的資料去處理第二大題 ## Hint about programming and demo * 是否不能使用 C++ 內建的資料結構函式庫(STL)? * 建議自己寫(自己撰寫分數高,使用到 STL 分數不高) * 報告應包含內容? 1. 回答所有問題 2. **專題開發過程 (MUST)** 3. **效能分析 (Optinal)** * demo & 簡報應展示內容? * 用何種資料結構解決、**不要講答案**,但要說明如何解決 * Python 畫圖可以! * 說明每一個組員的工作內容 * Prob 2 的資料,第九欄全為空? * 空的欄位正常、不用處理 * 資料的每個欄位是否都會使用到? * 有可能不需要某些欄位 * 注意 1 ~ 2 人報告簡報、Demo即可 --- ### Git & Github ### Python ```matplotlib``` ### 有關編譯器(建議) 因為每個人習慣的開發環境不同,編譯器也會不同 * Visual C++ (Visual Studio) * GNU GCC * Clang 不同的編譯器會導致產生的機器碼不同,執行同一份程式會有落差,建議確保所有人使用同一款編譯器 #### 編譯器「最佳化」 請統一討論好要大家一起使用 ```-O0, -O1, -O2, -O3```,但 TA 建議 ```-O0``` 即可(即不使用最佳化) 因為最佳化會__________ (又是複數理由),可以開,但請固定開一種等級的最佳化。**更好的情況是了解一點最佳化的事情再使用最佳化** ### 有關 6/16 的報告 * 可以貼程式碼嗎? * 可以,但**建議只附上「關鍵程式碼」** * **大部分時間應說明解決問題的過程** * 建議**說明與程式碼**的比例 * **9 : 1** * 注意其實大多數資訊系教授(不只高大),大多數已脫離寫程式碼的階段(著重在教授自己專業領域、論文發表),早已忘記寫程式的基本(更何況程式語言上百種),但會在意的是學生「**發現 & 解決問題**」 * 報告參考項目(**注意此項目及建議時間僅供參考,並非盲目照著下列所述事項報告**) * 如何分工? * 如何分析資料集的內容 * 解決第一大題的過程(7 ~ 8 mins) * 解決第二大題的過程(7 ~ 8 mins) * 分別說明兩大題使用何種資料結構解決 * 使用某資料結構的考量點、關鍵 * 遭遇困難 * 效能分析 ``` |-DS_FinalProject_第X組.zip (.rar) | |-promblem1 | |-promblem2 | |-report.pdf | |-slide.pptx or slide.pdf | |-(And other files...) ``` ### 效能分析 SAMPLE * Problem 1 ![](https://i.imgur.com/DV8xtxv.png) ![](https://i.imgur.com/f8wsREQ.png) ![](https://i.imgur.com/s2BFFiO.png) ### 關於 Word 報告跟 PPT 簡報 * 繳交時,都請轉成 **pdf** 檔後再上傳 * 不論簡報或是報告,都**請記得標記頁碼** * 報告內容原則上的篇幅上限為 20 頁 * 封面、目錄(可能共 2 頁)可不計入計算,**頁數只計算實質內容(作答、專題開發過程)** * 若有涉及**效能分析**,**可以超過 20 頁** * 撰寫內容參考已在 moodle 上放置報告範例 (但以 Google doc 撰寫,請製成 Word 檔後轉 PDF),請前往參閱。 * **可貼上關鍵程式碼來說明,但程式碼不應佔據太多的篇幅** * 開發過程的撰寫細節: * [Important] 說明分析資料集、使用到的資料結構、設計/使用考量等 * [Important] 說明如何以 C++ 解題 * 驗證過程 (Optinal) * 大多數人或許會以 Python 來驗證答案,若還有篇幅可以說明如何撰寫 Python 程式碼來驗證答案 * 注意**不應**在 Word/PPT 裡貼上程式碼的同時,又**附加註解** * Source code 內可附上註解 * **若想在 Word 報告貼上關鍵程式碼說明,請移除註解,並在 Word 內文說明其功用** * **PPT 同理,若貼上程式碼,請以少量文字說明或以口頭說明** * 效能分析的細節: * 這部分是助教**追加的加分內容**,試比較不同的資料結構間,對於解題的幫助,並分析時間、空間上的差異性 * 如何分析時間、空間由各組自行決定,**而時間差異應容易以圖表視覺化呈現出來**,可參考前述 **初步效能分析SAMPLE** 來製圖 * **注意SAMPLE的分析時間的方式必定與各組有所差異,不應盲目照著上述的方式走** * 配分的細節: * 除了 Word 檔報告內容外,助教會根據各組的**工作分配**,對每個人的**分數做一點調整**。所以雖然報告以組為單位繳交,每個人的分數仍可能有些許差異