---
tags: statistics
---

# Introduction to Data Science (資料科學入門)
<font size = +1 color = "gray">Former course title: Learning Statistics & Programming (當統計學與程式相遇)</font>

<center>

![](https://i2.wp.com/www.prosancons.com/wp-content/uploads/2018/04/041518_1320_Prosandcons1.png =400x)

</center>

<p style="text-align: right">
"The purpose of computing is insight, not numbers."<br>-- Richard W. Hamming
</p>

<p style="text-align: right">
"Data do not speak for themselves;<br> there is always an interpreter, or a translator."<br>-- John W. Ratcliffe
</p>

<p style="text-align: right">
"Remember that all models are wrong;<br> the practical question is how wrong do they have to be to not be useful."<br>-- George Box
</p>

<p style="text-align: right">
"It is easy to lie with statistics, but easier to lie without them."<br>-- Frederick Mosteller
</p>

<p style="text-align: right">
"Science is more than a body of knowledge;<br> it is a <font color = "red">way of thinking</font>.<br> The method of science, as stodgy and grumpy as it may seem,<br> is far more important than the findings of science."<br>-- Carl Sagan
</p>

### Instructor
- 盧政良 (Zheng-Liang Lu, Arthur)
- Contact: arthurzllu@gmail.com

### Working Environment
- Google Colab https://colab.research.google.com/

:::warning
This course is not tied to any programming language, but demonstrations use [Python](https://hackmd.io/@arthurzllu/HJNXq84SO). Participants may work in Excel, R, or MATLAB instead, but must find the corresponding tools on their own.
:::

### Prerequisites
- Arithmetic and basic algebra
- Everyday life experience and civic ethics
- (Optional) Calculus
    - 朱樺 (Department of Mathematics, National Taiwan University), [微積分](http://www.math.ntu.edu.tw/~hchu/Calculus/)
    - 3Blue1Brown, [Essence of Calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr) ![](https://i.imgur.com/1ZDNtsh.png =200x)
- (Optional) Linear algebra
    - Stephen H. Friedberg, Arnold J. Insel, Lawrence E.
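Spence, [Linear Algebra](https://www.amazon.com/Linear-Algebra-5th-Stephen-Friedberg/dp/0134860241/), 5/e, 2018 ![](https://i.imgur.com/Si0JvQb.png =100x)
    - 3Blue1Brown, [Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) on YouTube ![](https://i.imgur.com/tnuPzSf.png =200x)

:::warning
For the mathematics touched on in this course, it is enough for now to understand the context and the results; there is no need to worry about the details of derivations or calculations. We connect theoretical results to programs and hand the tedious computation to the computer through Python packages (or the corresponding tools on other platforms). Note that this does not mean the mathematical details are unimportant: <font color = "red">the mathematical details matter a great deal</font>, they are just not our focus right now. If time permits later, this theory will be the foundation on which you continue toward becoming a data scientist.
:::

### Learning Objectives
* Statistics
    - Understand statistical tools and the principles behind their computation
    - Interpret statistical results correctly
    - Make reasonable predictions of trends in data
    - Rule out statistical fallacies
* Programming
    - Master the data processing workflow
    - Learn to build your own tools

<center>

![](https://i.imgur.com/nVXc2TN.png =300x)

</center>

### Grading

#### In-person course
- Final presentation
    - Key points of the report: question, data collection and visualization, model assumptions, experimental results, conclusion.
    - Groups: in principle one person per group; present with **slides** or a **jupyter notebook**.
- Participants who complete the five programming assignments and the final report receive the course certificate.

#### Online course
- Participants who complete the five programming assignments receive the course certificate.

### Intended Audience
1. University and college students, researchers, and engineers who want to learn **statistical methods** and **quantitative research**
2. Junior and senior high school students are welcome; prior study of basic statistics is preferred (Probability and Statistics I in grade 11 and Probability and Statistics II in grade 12 under the 108 curriculum guidelines)

<center>

![](https://i.imgur.com/UOBvF5W.png =200x)

</center>

## Main Materials
- Steven S Skiena, [The Data Science Design Manual](https://www.springer.com/gp/book/9783319554433), 2017 ![](https://i.imgur.com/JrFu6Hh.png =100x)
- 陳旭昇, [統計學：應用與進階](http://homepage.ntu.edu.tw/~sschen/Book/Book1.html), 3rd edition, [三民書局](https://www.sanmin.com.tw/Product/index/005083419) ![](https://i.imgur.com/SveUJDN.png =100x)

## Course Outline
0. Python programming basics
1. Data acquisition and visualization
2. Introduction to probability theory and common probability models
3. Statistical tests
4. Point estimation and interval estimation
5. Law of large numbers and central limit theorem
6. Regression models
7. Time series analysis
8. Bayesian probability
9. Introduction to machine learning
10. Statistical practice

<center>

![](https://i.imgur.com/fj7sI9u.png =400x)

</center>

## Course Content

### Python Crash Course
- Python crash course: [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_0_python_programming.ipynb)
    - Data types and basic operations
    - Conditional statements
    - Loops
    - Functions
- Additional information related to programming skills [pdf](https://www.csie.ntu.edu.tw/~d00922011/python/slides/cs_preliminary_knowledge.pdf)

<center>

![](https://i.imgur.com/sMkTwsp.png =300x)

</center>

### Data Acquisition and Visualization
- Pandas: [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_data_acquisition_visualization.ipynb)
    - Python data analysis library [link](https://pandas.pydata.org/)
    - (FYR) https://www.kaggle.com/learn/pandas
    - (FYR) Cheat sheet: [link1](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf), [link2](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf)
- Data preprocessing
    - Case 1: [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_1.ipynb)
    - Case 2: [financial time series](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_2_financial_time_series.ipynb)
    - Case 3: [json file](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_3_json.ipynb)
    - (FYR) String processing
        - Regular expressions
            - A good interactive tutorial site: https://regexone.com/
            - Built-in module: https://docs.python.org/3/library/re.html
    - (FYR) Pythonic data cleaning with numpy and pandas [link]()
    - (FYR) https://www.kaggle.com/learn/data-cleaning
- Visualization
    - Official Matplotlib documentation [link](https://matplotlib.org/contents.html)
    - (FYR) http://scipy-lectures.org/intro/matplotlib/index.html
    - Cheat sheets by DataCamp: [pdf](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
    - A good tutorial: [Nicolas P.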
Rougier](http://www.labri.fr/perso/nrougier/teaching/matplotlib/)
    - (FYR) https://www.kaggle.com/learn/data-visualization

:::success
<b><font size = 4>++Lab 1++</font> Computing portfolio returns with Pandas</b>

Use the ffn package to fetch the closing prices of FAANG ("fb, aapl, amzn, nflx, goog") from the first trading day of 2020 to the latest trading day (today). Suppose I buy one US dollar of each stock on the first trading day and follow a buy-and-hold strategy, holding all five assets until today. What is the time series of the cumulative return of the portfolio's value? Plot the cumulative return of the position as a line chart, and export the price series of the five assets together with the value series of the position to an Excel file. The return rate formula you will need, where $P_t$ is the price on trading day $t$, is:
<br>
<center>

Return rate (%) = $\dfrac{P_t - P_{t - 1}}{P_{t - 1}} \times 100$

</center>

Use rebase() on the DataFrame for the calculation.
<br>
<center>

![](https://i.imgur.com/1MhEszO.png =600x)
<!-- [Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab1_demo.ipynb) -->

</center>

:::

### Probability Theory
- Classical probability [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture1_ProbabilityModel_2019.pdf)
    - Some important terms: sample space, event, probability axioms, probability measure, conditional probability, independent events
- Random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture2_RandomVariable_2019.pdf) [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_2_probability_models_and_random_number_generators.ipynb)
    - Discrete random variables: Bernoulli distribution, binomial distribution
    - Continuous random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture5_Normal_2019.pdf): uniform distribution, normal distribution, $\chi^2$ distribution, Student's t distribution, F distribution
    - Probability models that are already implemented can be found in the SciPy documentation:
        - https://docs.scipy.org/doc/scipy/reference/stats.html
        - https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
    - **iid** (**i**ndependent and **i**dentically **d**istributed)
    - Family tree of probability distributions [pic](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Relationships_among_some_of_univariate_probability_distributions.jpg/1920px-Relationships_among_some_of_univariate_probability_distributions.jpg)
    - Poisson distribution [pdf](http://www.stats.ox.ac.uk/~filippi/Teaching/psychology_humanscience_2015/lecture5.pdf)
- Expected value and multivariate random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture4_Moments_2019.pdf)
    - (FYR) Taylor expansion [wiki](https://en.wikipedia.org/wiki/Taylor_series)
    - Central tendency: arithmetic mean, geometric mean, median, mode
    - Dispersion: variance, standard deviation, range
    - Higher moments: skewness, kurtosis
    - (FYR) Moment generating function (mgf) [pdf](https://web.ma.utexas.edu/users/gordanz/notes/mgf_color.pdf)
    - Covariance and correlation
        - Does zero correlation imply independence?
    - Conditional expectation and conditional variance
        - Law of Total Variance [wiki](https://en.wikipedia.org/wiki/Law_of_total_variance)

<center>

![](https://i.imgur.com/HyJiGlg.png =500x)

</center>

:::success
<b><font size = 4>++Lab 2++</font> Random number generation</b>

Implement a <font color ="red">pseudo</font>random number generator based on the [linear congruential generator](https://en.wikipedia.org/wiki/Linear_congruential_generator) algorithm. The algorithm uses the recurrence $x_n = (a \times x_{n - 1} + c) \bmod m$, which lets a computer efficiently generate random numbers that follow a uniform distribution. For convenience, use the parameters $a = 1664525$, $c = 1013904223$, and $m = 2^{32}$. Your task is to write a function whose inputs are the seed $x_0$ (any positive integer) and the number of random numbers $n$, and whose output is $n$ random numbers between $0$ and $1$. To check the distribution of the samples, plot a histogram for a visual check, or run the test scipy.stats.kstest(samples, "uniform").
<br>
<center>

![](https://i.imgur.com/NzNnpcb.png)
<!-- [Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab2_demo.ipynb) -->

</center>

Further reading: pseudorandomness [link](https://en.wikipedia.org/wiki/Pseudorandomness)
:::

:::success
<b><font size = 4>++Lab 3++</font> Expected value</b>

Suppose the random variable $Y$ follows the distribution below:
<center>

![](https://i.imgur.com/rBFWcDO.png =300x)

</center>

Then $\mathbb{E}(Y) = 0.9$. Write a program that simulates drawing samples from this distribution and shows that the sample mean approaches the expected value as the sample size grows from 1 to 3000.
<center>

![](https://i.imgur.com/6rX2hqO.png =500x)
<!-- [Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab3_demo.ipynb) -->

</center>

:::

### Statistical Framework
- Sampling methods and sampling distributions [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture6_Sampling_2019.pdf)
    - https://en.wikipedia.org/wiki/Sampling_(statistics)
- Statistical tests [pdf](http://www.stats.ox.ac.uk/~filippi/Teaching/psychology_humanscience_2015/lecture8.pdf)
    - Keywords: null/alternative hypothesis, p-value, significance level, rejection region, type I / II / III errors
    - Examples with SciPy [link](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/)
    - Additional reading [pdf1](https://www2.isye.gatech.edu/~yxie77/isye2028/lecture8.pdf), [pdf2](http://www.sci.utah.edu/~arpaiva/classes/UT_ece3530/hypothesis_testing.pdf)
    - Cases:
        - $\chi^2$ independence test [link](https://www.statology.org/chi-square-test-of-independence/) <font color = "red" size = -1>new</font>

<center>

![](https://imgs.xkcd.com/comics/null_hypothesis.png =200x)

</center>

- Linear regression [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_3_linear_regression.ipynb) <!--[pdf](https://www.csie.ntu.edu.tw/~d00922011/python/slides/linear_regression.pdf)-->
    - Python package **statsmodels** [link](https://www.statsmodels.org/stable/regression.html)
    - Supplementary notes:
        - Interpreting Results from Linear Regression – Is the data appropriate? [link](https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate)
        - About errors and residuals [wiki](https://en.wikipedia.org/wiki/Errors_and_residuals)
        - Normality tests [pdf](http://webspace.ship.edu/pgmarr/Geo441/Lectures/Lec%205%20-%20Normality%20Testing.pdf)
            - (FYR) Seier (2014): [Normality Tests: Power Comparison](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_421)
            - (FYR) Jarque (2014): [Jarque-Bera Test](https://link.springer.com/referenceworkentry/10.1007/978-3-642-04898-2_319)
            - (FYR) Bowman and Shenton (2014): [Omnibus Test](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_426)
        - How to detect multicollinearity?
            - [Variance inflation factor](https://en.wikipedia.org/wiki/Variance_inflation_factor) (VIF)
        - How to detect heteroscedasticity?
            - Weighted least squares (WLS), a special case of generalized least squares (GLS)
    - More cases:
        - Buffett's alpha by AQR Capital Management [link](https://www.aqr.com/Insights/Research/Journal-Article/Buffetts-Alpha) [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Buffett_Alpha.pdf) [explanation of Buffett's alpha on 方格子](https://vocus.cc/tarcy2801/5d6fc9fdfd89780001ca4e42)
        - More on regression with categorical data [link](https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/)
        - Data transformations [link](http://www.biostathandbook.com/transformation.html)
            - Log transformation: for size data
            - Square-root transformation: for count data
            - Arcsine transformation
        - [Ramsey RESET test](https://en.wikipedia.org/wiki/Ramsey_RESET_test)
            > If the proposed model is adequate, then the standardized residuals should be a white noise.
            > It tests whether non-linear combinations of the fitted values help explain the response variable.

<center>

![](https://imgs.xkcd.com/comics/linear_regression_2x.png =300x)

</center>

- Parameter estimation
    - Point estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture8_PointEst_2019.pdf)
        - Methods
            - Analogy principle
            - Method of moments
            - Maximum likelihood estimation (MLE)
        - A good estimator has at least three properties:
            - Unbiased
            - Efficient
            - Consistent
        - Best linear unbiased estimator (BLUE)
            - Gauss-Markov Theorem
                > ... the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero.
        - Sufficient statistic and uniformly minimum-variance unbiased estimator (UMVUE)
    - Interval estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture9_IntervalEst_2019.pdf)
        - What is a 95% confidence interval?
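
One way to make the question about the 95% confidence interval concrete is to simulate it: draw many samples from a population whose mean is known, build an interval from each sample, and count how often the interval contains the true mean. The sketch below is only an illustration and is not taken from the course notebooks. The function name `ci_coverage`, the normal population, and the assumption that $\sigma$ is known (so a z-interval applies) are all choices made here for simplicity. It uses only the Python standard library.

```python
import random
from statistics import NormalDist, fmean


def ci_coverage(mu=10.0, sigma=2.0, n=30, trials=2000, level=0.95, seed=0):
    """Return the fraction of simulated z-intervals that contain the true mean mu.

    Hypothetical helper for illustration; sigma is assumed to be known.
    """
    rng = random.Random(seed)
    # Two-sided critical value of the standard normal (about 1.96 for level = 0.95).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Half-width of the interval: z * sigma / sqrt(n).
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        # Check whether this sample's interval covers the true mean.
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / trials


coverage = ci_coverage()
print(f"Fraction of intervals covering the true mean: {coverage:.3f}")
```

With the default arguments the printed fraction should land near 0.95. The 95% describes the long-run behaviour of the interval-building procedure, not the probability that any single computed interval contains the mean.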
<!-- - Bootstrapping [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_3_bootstrapping.ipynb) -->

<center>

![](https://i.imgur.com/BXhjfPu.png =300x)

</center>

- Analysis of variance (ANOVA) [pdf](http://amath2.nchu.edu.tw/honda/605Lecture/Lecture10_ANOVA.pdf) [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_4_ANOVA_example.ipynb)
    - Why not t-test? [link](https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/)
        > Another measure to compare the samples is called a t-test. When we have only two samples, t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests for comparing more than two samples, it will have a **confounding effect** on the error rate of the result.
    - Confounding effect: https://www.scribbr.com/methodology/confounding-variables/
    - More cases:
        - One-way ANOVA: https://www.pythonfordatascience.org/anova-python/
        - Two-way ANOVA: http://www.pybloggers.com/2016/03/three-ways-to-do-a-two-way-anova-with-python/
- Design of experiments (DOE)
    - Three basic principles: randomization, replication, blocking
    - Removing the influence of nuisance variables on the response variable:
        - Unknown and uncontrollable: full randomization
        - Known but uncontrollable: analysis of covariance (ANCOVA)
        - Known and controllable: blocking (a local-control method that lowers SSE and increases precision)
    - Latin square design (LSD)

<center>

![](https://i.imgur.com/pp5QR5A.png =400x)

</center>

- Asymptotic theory [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture7_Asymptotics_2019.pdf)
    - Convergence
    - Law of large numbers
    - Central limit theorem
        > The fact that sampling distributions can approximate a normal distribution has critical implications. In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test. Consequently, you might think that these tests are not valid when the data are nonnormally distributed.
However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are nonnormally distributed, as long as your sample size is large enough. See [link](https://statisticsbyjim.com/basics/central-limit-theorem/).

:::success
<b><font size = 4>++Lab 4++</font> Verifying the central limit theorem</b>

Write a program that draws samples of different sizes from each of the distributions below and finds the smallest sample size at which the sampling distribution is not rejected by a normality test:
1. Standard uniform distribution
2. Chi-squared distribution ($\chi^2$ distribution with df = 3)
3. Poisson distribution with $\mu = 3$
4. Standard Cauchy distribution

<center>

![](https://i.imgur.com/KlEobFi.png =300x)
![](https://i.imgur.com/X6D5TA9.png =300x)
![](https://i.imgur.com/U6B6bFG.png =300x)
![](https://i.imgur.com/ShQeTfB.png =300x)
<!-- [Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab4_demo.ipynb) -->

</center>

:::

<center>

![](https://i.imgur.com/EZwESD1.png =250x)

</center>

- Time series analysis [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_4_time_series.ipynb)
    - Autocorrelation
    - Stationarity and unit root tests
    - Autoregressive (AR) model
    - Moving average (MA) model
    - ARMA$(p, q)$ and ARIMA$(p, d, q)$ models
- Bayesian probability [pdf](http://pillowlab.princeton.edu/teaching/mathtools16/slides/lec13_BayesRule.pdf)
    - Marc Garcia, [Bayesian inference tutorial: a hello world example](https://datapythonista.me/blog/bayesian-inference-tutorial-a-hello-world-example.html), 2020
    - Samuel Hinton, [Bayesian Linear Regression in Python](https://cosmiccoding.com.au/tutorials/bayes_lin_reg), 2019
    - References:
        - (FYR) 從經驗中學習 - 直觀理解貝氏定理及其應用 [link](https://leemeng.tw/intuitive-understandind-of-bayes-rules-and-learn-from-experience.html)
        - (FYR) 別再瞎猜、靠運氣！NASA、微軟都在用「貝式理論」做決策 [link](https://buzzorange.com/techorange/2019/07/24/nasa-how-to-make-right-decision/)
        - (FYR) Chapter 12: Bayesian Inference,
Statistical Machine Learning, CMU [pdf](http://www.stat.cmu.edu/~larry/=sml/Bayes.pdf)
        - [Introduction to Bayesian Modeling with PyMC3](https://juanitorduz.github.io/intro_pymc3/)
        - [Bayes’ Rule With Python](http://jim-stone.staff.shef.ac.uk/BookBayes2012/bookbayesch01WithPython.pdf)
        - [Monty Hall Problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)
        - https://www.astronomy.swin.edu.au/~cblake/StatsLecture4.pdf
        - https://astrostatistics.psu.edu/RLectures/IntroBayes-1.pdf
        - https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_bayesiandecision.pdf
        - [貝氏統計學的概念.pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/貝氏統計學的概念.pdf)

### Introduction to Machine Learning
- (FYR) Deep Mind: A Documentary File [youtube](https://www.youtube.com/watch?v=WXuK6gekU1Y)
- Regression [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_5_machine_learning_tutorial.ipynb)
    - Ridge regression
    - LASSO regression
    - Logistic regression
- Support vector machine (SVM)
- Decision tree and random forest
- Principal component analysis (PCA)
    - https://setosa.io/ev/principal-component-analysis/
- K-means clustering
- Reinforcement learning: Q-Learning
- Deep learning
    - https://www.kaggle.com/learn/intro-to-deep-learning
- Case study: Jacky Hsueh, [為什麼需要經濟理論來預測經濟趨勢:比較機器學習與計量經濟](http://economicsnote.com/%E7%82%BA%E4%BB%80%E9%BA%BC%E9%9C%80%E8%A6%81%E7%B6%93%E6%BF%9F%E7%90%86%E8%AB%96%E4%BE%86%E9%A0%90%E6%B8%AC%E7%B6%93%E6%BF%9F%E8%B6%A8%E5%8B%A2%E6%AF%94%E8%BC%83%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92/), 2021.2.26
- [Vapnik–Chervonenkis dimension](https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension)
- [Receiver operating characteristic (ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic)

<center>

![](https://i.imgur.com/PY21A4V.png =350x)

</center>

:::success
<b><font size = 4>++Lab 5++</font> Implementing the K-Means algorithm</b>

Write a program that implements the K-Means
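algorithm. The [template notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab5_template.ipynb) already generates test data, as shown in the left figure below. The basic idea of K-Means is to group points by how near or far they are in **distance**. For the steps of the algorithm, see this [link](https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm_(naive_k-means)). The clustering result is shown in the right figure below, where each red diamond marks the arithmetic center of a cluster. My goal is for participants to implement the basic K-Means algorithm. Note that the clustering result (right) is not guaranteed to match the correct answer (left), so the focus of this lab is on implementing the algorithm.
<br>
<center>

![](https://i.imgur.com/329c7JV.png =700x)

</center>

:::

### Statistical Practice
- Nonparametric analysis
    - Rank correlation
        - Spearman rank correlation coefficient
        - Kendall rank correlation coefficient
    - Single population
        - Sign test
        - Wilcoxon signed-rank test
    - Two dependent populations
        - Paired sign test
        - Wilcoxon paired signed-rank test
    - Two independent populations
        - Wilcoxon rank-sum test
        - Mann-Whitney U test
    - Multiple independent populations
        - Kruskal-Wallis test
    - Multiple dependent populations
        - Friedman test
    - Tests of randomness
        - Runs test
- Kernel density estimation (KDE) [link](https://scikit-learn.org/stable/modules/density.html)

<center>

![](https://i.imgur.com/ATECGO7.png =400x)

</center>

- Small-sample analysis
    - Fisher's exact test [link](http://blog.pulipuli.info/2017/05/fishers-exact-test-example.html)

<center>

![](https://i.imgur.com/JyiaDyG.jpg =250x)

</center>

- Bimodal/multimodal distribution
    - https://en.wikipedia.org/wiki/Mixture_model
- Extreme value theory (EVT)
    - https://en.wikipedia.org/wiki/Heavy-tailed_distribution
    - https://en.wikipedia.org/wiki/Extreme_value_theory

## Candidate Topics
- Chinese culture has a tradition of eating nourishing tonic foods in winter. If this winter is unusually cold, does that push up the prices of food-sector stocks?
- Markov Chain Monte Carlo (MCMC)
- When day trading accounts for a growing share of the day's trading value, does that signal that the market is about to turn bearish?
- Epidemiological models https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
- Is it really a power shortage, or overconsumption?
- What share of electricity consumption goes to cryptocurrency mining?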
- A time-series version of $R^2$.
- $R^2$ follows a beta distribution; after taking the $n$-th root it follows a normal distribution.

## Data Sources

### Taiwan government open data
- Government open data platform: https://data.gov.tw/
- Taipei City data platform: https://data.taipei/
- Central Weather Bureau open data: https://opendata.cwb.gov.tw/dataset/observation
- 薪情體驗 (salary statistics): https://earnings.dgbas.gov.tw/experience_sub_01.aspx
- https://www.numbeo.com/cost-of-living/
- National Development Council population projection query system: https://pop-proj.ndc.gov.tw/index.aspx
- Department of Statistics, Ministry of the Interior: https://www.moi.gov.tw/stat/index.aspx
- Ministry of the Interior real estate actual-price query: https://lvr.land.moi.gov.tw/homePage.action
    - 用程式分析房地產可行嗎？房價分析看這裡！ by [FinLab](https://www.finlab.tw/real-estate-analasys-histograms/)
- Ministry of Culture open data service https://opendata.culture.tw
- Taiwan Power Company https://www.taipower.com.tw/tc/index.aspx
    - Dataset list on the government open data platform: Taiwan Power Company [link](https://sheethub.com/data.gov.tw/%E6%94%BF%E5%BA%9C%E8%B3%87%E6%96%99%E9%96%8B%E6%94%BE%E5%B9%B3%E8%87%BA%E8%B3%87%E6%96%99%E9%9B%86%E6%B8%85%E5%96%AE/i/44/%E5%8F%B0%E7%81%A3%E9%9B%BB%E5%8A%9B%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8)
- Lottery
    - 超讚的樂透網: https://zan01.com/
    - 樂透堂: http://www.9800.com.tw/

### International open data sources
- U.S.
Census Bureau: https://www.census.gov/
- World Bank: https://data.worldbank.org/
- NASA: https://nasa.github.io/data-nasa-gov-frontpage/data_visualizations.html
- Data World: https://data.world/
- Human Development Reports: http://www.hdr.undp.org/en
- Sports Reference: https://www.baseball-reference.com/
- Bank of England statistics database: https://www.bankofengland.co.uk/statistics

### Analysis Platforms
- Google Data Studio: https://datastudio.google.com/

### Competition Platforms
- https://www.kaggle.com/datasets/
- https://zindi.africa/competitions

## References

### Books

#### Textbooks used previously
- Thomas Haslwanter, [An Introduction to Statistics with Python](https://www.springer.com/us/book/9783319283159), 2016 <font size = -1 color = "gray">Downloadable from within the NTU campus IP range!</font> ![](https://i.imgur.com/ypOd5fQ.png =100x)
- José Unpingco, [Python for Probability, Statistics, and Machine Learning](https://link.springer.com/book/10.1007%2F978-3-319-30717-6), 2/e [link](https://link.springer.com/book/10.1007%2F978-3-030-18545-9) <font size = -1 color = "gray">Downloadable from within the NTU campus IP range!</font> ![](https://i.imgur.com/vdfeIre.png =100x)
- Jake VanderPlas, [Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/), 2016 <font size = -1>[online](https://github.com/jakevdp/PythonDataScienceHandbook)</font> <font size = -1>[github](https://github.com/jakevdp/PythonDataScienceHandbook)</font> ![](https://i.imgur.com/Bdf0PTa.png =100x)

#### Probability Theory
- Sheldon Ross, [Introduction to Probability Models](https://www.elsevier.com/books/introduction-to-probability-models/ross/978-0-12-814346-9), 12/e, 2019 ![](https://i.imgur.com/57W4fGF.jpg =100x)

#### Mathematical Statistics
- Robert V. Hogg, Joseph W. McKean, and Allen T. Craig, [Introduction to Mathematical Statistics](https://www.amazon.com/-/zh_TW/dp/0134686993/), 8/e, 2019 ![](https://i.imgur.com/2juN4uq.png =100x)
- George Casella and Roger L.
Berger, [Statistical Inference](https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126), 2/e, 2001 ![](https://i.imgur.com/D9fxPRb.png =100x)

#### Design of Experiments
- Douglas C. Montgomery, [Design and Analysis of Experiments](https://www.wiley.com/en-gb/Design+and+Analysis+of+Experiments,+9th+Edition-p-9781119320937), 9/e, 2017 ![](https://i.imgur.com/KH0xEM8.jpg =100x)
- Angela Dean, Daniel Voss, and Danel Draguljić, [Design and Analysis of Experiments](https://link.springer.com/book/10.1007/978-3-319-52250-0), 2017 ![](https://i.imgur.com/5JKXVjP.jpg =100x)

#### General Statistics
- Barbara Blatchley, [Statistics in Context](https://www.amazon.com/Statistics-Context-Barbara-Blatchley/dp/0190278951), 2018 ![](https://i.imgur.com/MFDwoYZ.png =100x)
- 張翔 and 廖崇智, ++提綱挈領學統計++, 8th edition, 2019/6/14 ![](https://i.imgur.com/DyXbeFs.png =130x)
- 許誠哲, ++統計學：重點觀念與題解++, 2018/3/1 ![](https://i.imgur.com/BdkBLQ0.png =90x) ![](https://i.imgur.com/2w0llwW.png =90x)
- David M. Lane et al., ++Online Statistics Education++: http://onlinestatbook.com/Online_Statistics_Education.pdf

#### Time Series
- 陳旭昇, [時間序列分析 - 總體經濟與財務金融之應用](http://homepage.ntu.edu.tw/~sschen/Book/Book2.htm), 2nd edition ![](https://i.imgur.com/b2Ubehb.png =100x)
- ++Introduction to Econometrics with R++: https://www.econometrics-with-r.org/index.html

#### Machine Learning
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, [An Introduction to Statistical Learning with Applications in R](https://link.springer.com/book/10.1007/978-1-4614-7138-7), 2013 ![](https://i.imgur.com/sCY7WYs.png =100x)
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](https://link.springer.com/book/10.1007/978-0-387-84858-7), 2009 ![](https://i.imgur.com/u1Mi1mt.png =100x)

### Popular Science Reading
- History of Statistics: https://www.york.ac.uk/depts/maths/histstat/ ![](https://i.imgur.com/UzJP2o3.png =100x)
- 安德魯·維克斯, ++34個讓你豁然開朗的統計學小故事++, 2019/03/28 ![](https://i.imgur.com/QYCD6Fo.png =140x)
- 羅伯特·艾貝爾森, ++一位耶魯大學教授的統計箴言++, 2019/05/28 ![](https://i.imgur.com/okSuBrz.png =100x)
- 塚本邦尊 et al., ++東京大學資料科學家養成全書：使用Python動手學習資料分++ ![](https://i.imgur.com/FebESuA.png =150x)

### Courses in Taiwan and Abroad
- Prof. Shiu-Sheng Chen (http://homepage.ntu.edu.tw/~sschen/)
- Dr. Shao-Wei Cheng (http://www.stat.nthu.edu.tw/~swcheng/)
    - ++Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2820/index.php
    - ++Probability++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2810/index.php
    - ++Experimental Design and Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5510/index.php
    - ++Mathematical Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat3875/index.php
    - ++Discrete Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/index.html
- ++STA-663-2017++: http://people.duke.edu/~ccc14/sta-663-2017/
- Dr. Peter Kempthorne, ++Lecture notes on Probability and Statistics++: http://users.encs.concordia.ca/~doedel/courses/comp-233/slides.pdf
- ++Mathematical Statistics++: https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/
- Emmanuel Candès, https://statweb.stanford.edu/~candes/
    - ++Theory of Statistics++: https://statweb.stanford.edu/~candes/teaching/stats300c/
    - ++Modern Markov Chain++: https://statweb.stanford.edu/~candes/teaching/stats318/
- Oleg Melnikov, ++Introduction to Statistical Inference++: http://stats200.stanford.edu/
- Liam Paninski, ++Computational Statistics++: http://www.stat.columbia.edu/~liam/teaching/compstat-spr19/
- David Aldous, ++Probability and the Real World++: https://www.stat.berkeley.edu/~aldous/157
- Bocheng Jing, ++Sta102 - Intro Biostatistics++: https://www2.stat.duke.edu/courses/Spring13/sta102.001/

### Miscellaneous
- Secretary problem
    - https://zh.wikipedia.org/wiki/%E7%A7%98%E6%9B%B8%E5%95%8F%E9%A1%8C
    - https://style.udn.com/style/story/8073/1452739
- http://www.statslife.org.uk/images/pdf/timeline-of-statistics.pdf
- 王超辰: 醫學統計學 https://bookdown.org/ccwang/medical_statistics6/
- Ioane
Muni Toke, ++An Introduction to Hawkes Processes with Applications to Finance++, 2011: http://lamp.ecp.fr/MAS/fiQuant/ioane_files/HawkesCourseSlides.pdf
- https://ourworldindata.org/grapher/annual-working-hours-vs-gdp-per-capita-pwt
- https://wol.iza.org/uploads/articles/228/pdfs/female-education-and-its-impact-on-fertility.pdf

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=1782321283&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "700px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=691888123&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "650px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=673696095&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "400px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=1391095507&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "500px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=0&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "500px"></iframe>