Introduction to Data Science (資料科學入門)

Former course title: Learning Statistics & Programming (當統計學與程式相遇)

"The purpose of computing is insight, not numbers."
-- Richard W. Hamming

"Data do not speak for themselves;
there is always an interpreter, or a translator."
-- John W. Ratcliffe

"Remember that all models are wrong;
the practical question is how wrong do they have to be to not be useful."
-- George Box

"It is easy to lie with statistics, but easier to lie without them."
-- Frederick Mosteller

"Science is more than a body of knowledge;
it is a way of thinking.
The method of science, as stodgy and grumpy as it may seem,
is far more important than the findings of science."
-- Carl Sagan

Instructor

Zheng-Liang Lu (盧政良, Arthur)
Contact: arthurzllu@gmail.com

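Working environment

Google Colab https://colab.research.google.com/

This course is not tied to any programming language, but all demonstrations use Python. Students may follow along with Excel, R, or MATLAB, provided they find the corresponding tools for each problem on their own.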

Prerequisites

Arithmetic and basic algebra
Everyday experience and civic ethics
(Optional) Calculus
- Calculus by Prof. 朱樺 (Department of Mathematics, National Taiwan University)
- 3Blue1Brown, Essence of Calculus
(Optional) Linear algebra
- Stephen H. Friedberg, Arnold J. Insel, Lawrence E. Spence, Linear Algebra, 5/e, 2018
- 3Blue1Brown, Essence of Linear Algebra on YouTube

For the mathematics in this course, it is enough for now to understand the context and the results; whether to master the derivations and computational details is up to each student's interest and ability. Python packages already implement most of the mathematical results, so tedious calculations can be left to the computer.

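Learning objectives

Statistics
- Understand statistical tools and the principles behind their computation
- Interpret statistical results correctly
- Make reasonable predictions of trends in data
- Rule out statistical fallacies
Programming
- Master the data-processing workflow
- Learn to build your own tools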

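Grading

In-person version

Final project presentation
- Key components: question, data collection and visualization, model assumptions, experimental results, conclusion.
- Grouping: one person per group; present with slides or a Jupyter notebook.
Students who complete the five programming assignments or the final project report receive the course certificate.
Please email assignments to arthurzllu@gmail.com, stating the course name and your name.

Online version

Students who complete the five programming assignments receive the course certificate.
Submit assignments by uploading the Jupyter notebook to NTU COOL.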

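Intended audience

University and college students, researchers, and engineers who want to learn statistical methods and quantitative research.
Junior and senior high school students are welcome; prior exposure to basic statistics is a plus (under the 108 curriculum guidelines: Probability and Statistics I in the second year of senior high school and Probability and Statistics II in the third year).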

Main references

Steven S. Skiena, The Data Science Design Manual, 2017
Laura Igual and Santi Seguí, Introduction to Data Science, 2017
陳旭昇，統計學：應用與進階，第三版，2015 (Statistics: Applications and Advanced Topics, 3rd edition)

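Course outline

Python programming basics
Data acquisition and visualization
Introduction to probability theory and common probability models
Statistical tests
Point estimation and interval estimation
Law of large numbers and central limit theorem
Regression models
Time series analysis
Bayesian probability
Introduction to machine learning
(Optional) Statistical practice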

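Course content

Python crash course

Python crash course notebook
- Data types and basic operations
- Conditional statements
- Loops
- Functions
Supplementary material
- Additional notes on programming skills pdf
- Apps for self-study
  - Learn Python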

Data acquisition and visualization

Pandas notebook
- Python data analysis library link
  - (FYR) https://www.kaggle.com/learn/pandas
  - (FYR) Cheat sheet: link1, link2
Data preprocessing
- Case 1: Data preprocessing code
- Case 2: Financial time series code
- Case 3: JSON files code
- Case 4: Merging DataFrames link
- Case 5: Cross tabulation code
- (FYR) String processing
  - Regular expressions
    - Interactive tutorial site https://regexone.com/
  - Python package: https://docs.python.org/3/library/re.html
- (FYR) Pythonic data cleaning with numpy and pandas link
- (FYR) https://www.kaggle.com/learn/data-cleaning
Data visualization
- Matplotlib official documentation link
- (FYR) http://scipy-lectures.org/intro/matplotlib/index.html
- Cheat sheets by DataCamp: pdf
- A good tutorial by Nicolas P. Rougier
- (FYR) https://www.kaggle.com/learn/data-visualization

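Lab 1 Using Pandas

(1) For each stock, count the number of days in the past year on which the daily change exceeded $\pm 9\%$.
(2) Rank the assets by that count.
(3) Save the ranking as a csv file.
(4) Plot the time series of the top 3 winners and the top 3 losers; normalize each series with rebase() to compare the assets' relative performance (that is, compute the profit and loss).
(5) Make a scatter plot to see whether there is any relationship between the profit and loss and the count (the number of days on which the daily change exceeded $\pm 9\%$).
(6) Save the scatter plot as a pdf file.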

Template / reference solution
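The counting, ranking, and normalization steps of Lab 1 could be sketched roughly as follows. This is not the course's reference solution: the price data are simulated, the tickers A to E are made up, and the `rebase` helper here is a hypothetical stand-in that simply rescales each series to start at 100.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one year of daily closing prices for five made-up tickers.
rng = np.random.default_rng(0)
dates = pd.bdate_range("2023-01-02", periods=250)
returns = rng.normal(0.0, 0.05, size=(len(dates), 5))
prices = pd.DataFrame(100 * np.exp(returns.cumsum(axis=0)),
                      index=dates, columns=list("ABCDE"))

def rebase(series, base=100.0):
    """Rescale a price series so that its first value equals `base` (assumed helper)."""
    return series / series.iloc[0] * base

# (1) Count the days on which the daily change exceeds +/- 9 %.
daily_change = prices.pct_change().dropna()
counts = (daily_change.abs() > 0.09).sum()

# (2) Rank the assets by that count, largest first.
ranking = counts.sort_values(ascending=False)

# (4) Profit and loss of each asset relative to a common base of 100.
pnl = prices.apply(rebase).iloc[-1] - 100.0

summary = pd.DataFrame({"count": counts, "pnl": pnl}).loc[ranking.index]
print(summary)
```

Steps (3) and (6) would then be a matter of calling `ranking.to_csv(...)` and saving a Matplotlib scatter plot of `summary["count"]` against `summary["pnl"]` with `savefig(...)`; the file names are left to the student.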

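Probability theory

Classical probability pdf
- Key terms: sample space, event, probability axioms, probability measure, conditional probability, independent events
- Does probability zero mean the event cannot happen?
  - Cantor set
  - Prof. 陳俊全 (Department of Mathematics, National Taiwan University) offers an amusing analogy: the Cantor set is like a poor man's pocket. When he needs to eat or see a doctor, the pocket always yields some money; but ask for his total net worth, and the sum is zero.
    - Collected quotes of Prof. 陳俊全: https://disp.cc/b/181-1wia
Random variables pdf / code
- Discrete random variables: Bernoulli distribution, binomial distribution
- Continuous random variables pdf: uniform distribution, normal distribution, $χ^{2}$ distribution, Student's t distribution, F distribution
- Probability models already implemented in SciPy are listed in its documentation:
  - https://docs.scipy.org/doc/scipy/reference/stats.html
- Family tree of probability distributions pic
  - Poisson distribution pdf
- Random number generation (RNG)
  - Pseudo-randomness link
  - random – Generate pseudo-random numbers
  - Inverse transform sampling: generates random samples from any probability distribution given its cumulative distribution function.
Expectation and multivariate random variables pdf
- Central tendency: arithmetic mean, geometric mean, median, mode
- Dispersion: variance, standard deviation, range
- Higher-order moments: skewness, kurtosis
  - Positive (negative) skewness: the mean is typically above (below) the median
  - Kurtosis > 3: heavier tails than the normal distribution
- (FYR) Moment generating function (mgf) pdf
  - Taylor expansion wiki
  - What is Moment?
- Covariance and correlation
  - Zero correlation implies independence?
- Conditional expectation / variance
  - Law of Total Variance wiki
- Independent and identically distributed (iid)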
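As an illustration of inverse transform sampling from the random number generation topic above, here is a minimal sketch. The exponential distribution is chosen only as an example (an assumption, not taken from the course material) because its CDF $F(x) = 1 - e^{-\lambda x}$ inverts in closed form to $F^{-1}(u) = -\ln(1 - u)/\lambda$.

```python
import numpy as np

def sample_exponential(rate, size, rng):
    """Draw exponential samples by inverting the CDF F(x) = 1 - exp(-rate * x)."""
    u = rng.random(size)            # uniform draws on [0, 1)
    return -np.log1p(-u) / rate     # F^{-1}(u) = -ln(1 - u) / rate

rng = np.random.default_rng(42)
samples = sample_exponential(rate=2.0, size=100_000, rng=rng)

# The sample mean should be close to the theoretical mean 1 / rate.
print(samples.mean())
```

The same pattern applies to any distribution whose CDF can be inverted, analytically or numerically: draw a uniform number, then push it through the inverse CDF.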

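Lab 2 What is an expected value?

Suppose the random variable $Y$ follows the distribution below:

Then $E(Y) = 0.9$. Write a program that draws samples from this distribution and shows that the sample mean approaches the expected value as the sample size grows from 1 to 3000.

Demo Code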
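A possible shape for such a simulation is sketched below. The probability table used here is hypothetical: it is picked only so that $E(Y) = 0 \cdot 0.3 + 1 \cdot 0.5 + 2 \cdot 0.2 = 0.9$, and it should be replaced by the distribution actually given in the lab.

```python
import numpy as np

# Hypothetical distribution, chosen only so that E(Y) = 0.9.
values = np.array([0, 1, 2])
probs = np.array([0.3, 0.5, 0.2])
expected = float(values @ probs)

rng = np.random.default_rng(1)
draws = rng.choice(values, size=3000, p=probs)

# Running sample mean for sample sizes 1, 2, ..., 3000.
sizes = np.arange(1, draws.size + 1)
running_mean = np.cumsum(draws) / sizes

# Compare the expectation with the sample mean at n = 1, 100, and 3000.
print(expected, running_mean[[0, 99, 2999]])
```

Plotting `running_mean` against `sizes` with a horizontal line at `expected` gives the usual picture of the law of large numbers at work.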

Statistical framework

Sampling methods and sampling distributions pdf
- https://en.wikipedia.org/wiki/Sampling_(statistics)
Statistical tests pdf
- Keywords: null / alternative hypothesis, p-value, significance level, rejection region, type I / II / III errors
- Examples in SciPy link
- Further reading pdf1, pdf2
- Cases:
  - $χ^{2}$ test of independence new
    - Lesson 8: Chi-Square Test for Independence, code
    - SPSS tutorials: Chi-Square Test of Independence
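The $χ^{2}$ test of independence can be tried in a few lines with `scipy.stats.chi2_contingency`. The contingency table below is made up for illustration; the manual computation of the expected counts is included only to show where the statistic comes from.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 3 contingency table (rows: groups, columns: categories).
observed = np.array([[30, 20, 10],
                     [20, 25, 35]])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.4f}")

# Under independence, expected count = row total * column total / grand total.
manual = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
manual_chi2 = ((observed - manual) ** 2 / manual).sum()
```

A small p-value leads to rejecting the null hypothesis that the row and column variables are independent at the chosen significance level.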

Linear regression notebook
- Python package statsmodels link
- Additional notes:
  - Interpreting Results from Linear Regression – Is the data appropriate? link
  - About errors and residuals wiki
  - Normality tests pdf
    - (FYR) Seier (2014): Normality Tests: Power Comparison
    - (FYR) Jarque (2014): Jarque-Bera Test proposed by Jarque and Bera (1980).
    - (FYR) Bowman and Shenton (2014): Omnibus Test proposed by D’Agostino (1973).
  - How to detect multicollinearity?
    - Variance inflation factor (VIF)
  - How to detect heteroscedasticity?
    - Weighted least squares (WLS), a special case of generalized least squares (GLS)
- More cases:
  - Financial market
    - Buffett's alpha by AQR Capital Management link pdf; an explanation of Buffett's alpha on 方格子
    - Beta coefficient for TSM ADR on Yahoo Finance link code
    - Re: [新聞] 限空令+國安基金！台股盤中漲300點攻上
  - More on regression with categorical data link
  - Data transformations link
    - Log transformation: for size data
    - Square-root transformation: for count data
    - Arcsine transformation: for proportions between 0 and 1, for example, the (cumulative) volume distribution across trading hours.
  - Ramsey RESET test
    
    If the proposed model is adequate, then the standardized residuals should be white noise.
    The test checks whether non-linear combinations of the fitted values help explain the response variable.
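To make the variance inflation factor concrete, here is a small NumPy sketch that regresses each regressor on the others and reports $1 / (1 - R_j^2)$. The data and the helper name `vif` are illustrative assumptions; statsmodels also offers `variance_inflation_factor` in `statsmodels.stats.outliers_influence`, which would be the natural choice in the course notebook.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (X has no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all the other columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                   # unrelated regressor
factors = vif(np.column_stack([x1, x2, x3]))
print(factors)
```

A common rule of thumb treats a VIF above about 10 as a sign of problematic multicollinearity; in this simulated example the first two regressors should land far above that and the third close to 1.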

Lab 3 Simple linear regression

TBA

Parameter estimation
- Point estimation pdf
  - Methods
    - Analogy principle
    - Method of moments
    - Maximum likelihood estimation (MLE)
  - A good estimator has at least three properties:
    - Unbiased
    - Efficient
    - Consistent
  - Best linear unbiased estimator (BLUE)
    - Gauss-Markov Theorem
    … the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero.
  - Sufficient statistics and the uniformly minimum-variance unbiased estimator (UMVUE)
    - See mathematical statistics for details.
- Interval estimation pdf
  - What is a 95% confidence interval?
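One way to see what "95%" means for a confidence interval is to repeat the experiment many times and count how often the interval covers the true mean. The sketch below assumes normally distributed data with a known true mean of 5 (all numbers are illustrative) and uses the usual t-based interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, n, trials = 5.0, 30, 2000

covered = 0
for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=2.0, size=n)
    # Half-width of the 95% t-interval: t_{0.975, n-1} * s / sqrt(n).
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    covered += abs(sample.mean() - true_mean) <= half_width

coverage = covered / trials
print(coverage)   # fraction of intervals that contain the true mean
```

The 95% refers to the procedure: across repeated samples, about 95% of the intervals built this way contain the fixed true mean. It is not the probability that one particular computed interval contains it.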

Analysis of variance (ANOVA) pdf notebook
- Why not t-test? link
  
  Another measure to compare the samples is called a t-test. When we have only two samples, t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests for comparing more than two samples, it will have a confounding effect on the error rate of the result.
  - Confounding effect: https://www.scribbr.com/methodology/confounding-variables/
- More cases:
  - One-way ANOVA: https://www.pythonfordatascience.org/anova-python/
  - Two-way ANOVA: http://www.pybloggers.com/2016/03/three-ways-to-do-a-two-way-anova-with-python/
  - Design of experiments (DoE)
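    - Three basic principles: randomization, replication, blocking
    - Removing the influence of nuisance variables on the response variable:
      - Unknown and uncontrollable: full randomization
      - Known but uncontrollable: analysis of covariance (ANCOVA)
      - Known and controllable: blocking (a local-control method that lowers SSE and increases precision)
        
        Latin square design (LSD)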
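The "Why not t-test?" point above can be checked numerically. The sketch below uses simulated data (an assumption) to show that with two groups the equal-variance t-test and one-way ANOVA agree ($F = t^2$), and then computes how quickly the chance of at least one false positive grows when many pairwise tests are run. The growth formula $1 - (1 - \alpha)^m$ assumes independent tests, so for pairwise comparisons that share data it is only a rough guide.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=40)
b = rng.normal(0.5, 1.0, size=40)

# With two groups, one-way ANOVA and the equal-variance t-test agree: F = t^2.
t_stat, t_p = stats.ttest_ind(a, b)
f_stat, f_p = stats.f_oneway(a, b)
print(t_stat**2, f_stat, t_p, f_p)

# With m independent tests at level alpha, the chance of at least one
# false positive is 1 - (1 - alpha)^m.
alpha = 0.05
for groups in (3, 4, 5):
    m = groups * (groups - 1) // 2      # number of pairwise t-tests
    print(groups, m, 1 - (1 - alpha) ** m)
```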

Asymptotic theory pdf
- Convergence
- Law of Large Numbers (LLN)
- Central Limit Theorem (CLT)
  
  The fact that sampling distributions can approximate a normal distribution has critical implications. In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test. Consequently, you might think that these tests are not valid when the data are nonnormally distributed. However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are nonnormally distributed—as long as your sample size is large enough. See link.
- Supplementary material:
  - https://python.quantecon.org/lln_clt.html

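Lab 4 Checking the central limit theorem
Write a program that draws samples of different sizes from each of the distributions below and finds the smallest sample size at which the sampling distribution of the sample mean is not rejected by a normality test:

Standard uniform distribution
$χ^{2}$ distribution with df = 3
Poisson distribution with $μ = 3$
Standard Cauchy distribution

Demo Code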
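A starting point for this lab might look like the sketch below. It is not the course's demo code: it scans only a handful of sample sizes rather than searching for the smallest one, the number of replications is an arbitrary choice, and `scipy.stats.normaltest` (the D'Agostino-Pearson omnibus test) is just one of several possible normality tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

samplers = {
    "uniform": lambda size: rng.random(size),
    "chi2(df=3)": lambda size: rng.chisquare(3, size),
    "poisson(mu=3)": lambda size: rng.poisson(3, size),
    "cauchy": lambda size: rng.standard_cauchy(size),
}

def normality_pvalue(sampler, n, replications=2000):
    """p-value of the normality test applied to `replications` sample means of size n."""
    means = sampler((replications, n)).mean(axis=1)
    return stats.normaltest(means).pvalue

results = {name: {n: normality_pvalue(draw, n) for n in (1, 5, 30, 100)}
           for name, draw in samplers.items()}
for name, row in results.items():
    print(name, {n: round(float(p), 4) for n, p in row.items()})
```

The Cauchy case is the instructive one: it has no finite mean or variance, so the central limit theorem does not apply, and the mean of Cauchy draws is itself Cauchy distributed no matter how large the sample is.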

Time series analysis notebook
- Autocorrelation
- Stationarity and unit root tests
- Autoregressive (AR) model
- Moving average (MA) model
- ARMA$(p, q)$ and ARIMA$(p, d, q)$ models
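To connect autocorrelation with the autoregressive model, the following sketch simulates a stationary AR(1) process $x_t = \phi x_{t-1} + \varepsilon_t$ and compares its sample autocorrelations with the theoretical values $\phi^k$. The coefficient 0.7 and the helper names are illustrative assumptions.

```python
import numpy as np

def simulate_ar1(phi, n, rng, sigma=1.0):
    """Simulate x_t = phi * x_{t-1} + eps_t with Gaussian white noise eps_t."""
    x = np.zeros(n)
    eps = rng.normal(0.0, sigma, size=n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

def autocorr(x, lag):
    """Sample autocorrelation of x at the given positive lag."""
    x = x - x.mean()
    return float(x[lag:] @ x[:-lag] / (x @ x))

rng = np.random.default_rng(5)
x = simulate_ar1(phi=0.7, n=5000, rng=rng)

# Theoretical autocorrelations of a stationary AR(1) are phi**k: 0.7, 0.49, 0.343.
print([round(autocorr(x, k), 3) for k in (1, 2, 3)])
```

With $|\phi| < 1$ the process is stationary and the autocorrelation decays geometrically; at $\phi = 1$ it becomes a random walk, which is the unit-root case that unit root tests are designed to detect.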
Bayesian probability pdf
- Marc Garcia, Bayesian inference tutorial: a hello world example, 2020
- Samuel Hinton, Bayesian Linear Regression in Python, 2019
- References:
  - (FYR) 從經驗中學習 - 直觀理解貝氏定理及其應用 (Learning from experience: an intuitive look at Bayes' theorem and its applications) link
  - (FYR) 別再瞎猜、靠運氣！NASA、微軟都在用「貝式理論」做決策 (Stop guessing and relying on luck: NASA and Microsoft make decisions with Bayesian theory) link
  - (FYR) Chapter 12: Bayesian Inference, Statistical Machine Learning, CMU pdf
  - Introduction to Bayesian Modeling with PyMC3
  - Bayes’ Rule With Python
  - Monty Hall Problem
  - https://www.astronomy.swin.edu.au/~cblake/StatsLecture4.pdf
  - https://astrostatistics.psu.edu/RLectures/IntroBayes-1.pdf
  - https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_bayesiandecision.pdf
  - 貝氏統計學的概念.pdf (Concepts of Bayesian statistics)
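The Monty Hall problem listed above is easy to check by simulation. The sketch below assumes the standard rules: three doors, the host knows where the car is and always opens a goat door other than the contestant's pick.

```python
import numpy as np

def monty_hall(trials, rng):
    """Return the win rates of the 'stay' and 'switch' strategies."""
    prize = rng.integers(0, 3, size=trials)    # door hiding the car
    choice = rng.integers(0, 3, size=trials)   # contestant's first pick
    stay_wins = prize == choice
    # The host always opens a goat door, so switching wins exactly
    # when the first pick was wrong.
    switch_wins = ~stay_wins
    return stay_wins.mean(), switch_wins.mean()

rng = np.random.default_rng(11)
stay, switch = monty_hall(100_000, rng)
print(stay, switch)   # theory: 1/3 and 2/3
```

In Bayesian terms, the host's action is informative: the prior probability 1/3 on the first pick is unchanged by the reveal, so the remaining closed door carries the other 2/3.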

Introduction to machine learning

(FYR) Deep Mind: A Documentary File youtube
Regression notebook
- Ridge regression
- LASSO regression
- Logistic regression
Support vector machine (SVM)
Decision trees and random forests
Principal component analysis (PCA)
- https://setosa.io/ev/principal-component-analysis/
K-means clustering
Reinforcement learning: Q-Learning
Deep learning
- https://www.kaggle.com/learn/intro-to-deep-learning
Case study: Jacky Hsueh, 為什麼需要經濟理論來預測經濟趨勢:比較機器學習與計量經濟 (Why economic theory is needed to forecast economic trends: comparing machine learning and econometrics), 2021.2.26
Vapnik–Chervonenkis dimension
Receiver operating characteristic (ROC)

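Lab 5 Implementing the K-Means algorithm
Write a program that implements the K-Means algorithm. The template program already generates test data, shown in the left figure below. The basic idea of K-Means is to form groups according to distance. The steps of the algorithm are described at this link. The clustering result is shown in the right figure below, where the red diamonds mark the arithmetic center of each cluster. The goal is for students to implement the basic K-Means algorithm. Note that the clustering result (right figure) is not guaranteed to match the true labels (left figure), so the focus of this lab is the implementation of the algorithm.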
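For orientation, a basic K-Means loop (Lloyd's algorithm) might be sketched as follows. This is not the course template: the test data here are three simulated Gaussian blobs, and the initialization (k distinct data points picked at random) is one simple choice among several.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Basic K-Means (Lloyd's algorithm): returns (centers, labels)."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each center to the mean of its points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical test data: three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in true_centers])
centers, labels = kmeans(X, k=3, rng=rng)
print(centers)
```

Because the result depends on the random initialization, the algorithm can settle in a local optimum; running it from several starting points and keeping the solution with the smallest total within-cluster distance is a common remedy.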

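Statistical practice

Nonparametric analysis
- Rank correlation
  - Spearman's rank correlation coefficient
  - Kendall's rank correlation coefficient
- Single population
  - Sign test
  - Wilcoxon signed-rank test
- Two dependent populations
  - Paired sign test
  - Wilcoxon matched-pairs signed-rank test
- Two independent populations
  - Wilcoxon rank-sum test
  - Mann-Whitney U test
- Multiple independent populations
  - Kruskal-Wallis test
- Multiple dependent populations
  - Friedman test
- Tests of randomness
  - Runs test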
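As a small illustration of why rank correlation is useful, the sketch below compares Pearson's coefficient with the Spearman and Kendall coefficients on simulated data where $y = e^{x}$ is a monotone but non-linear function of $x$ (an example chosen for illustration only).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(size=200)
y = np.exp(x)          # monotone but non-linear in x

pearson, _ = stats.pearsonr(x, y)
spearman, _ = stats.spearmanr(x, y)
kendall, _ = stats.kendalltau(x, y)
print(pearson, spearman, kendall)
```

The rank-based coefficients depend only on the ordering of the data, so they equal 1 for any strictly increasing relationship, while Pearson's coefficient measures linear association and falls below 1 here.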
Kernel density estimation (KDE) link

Small-sample analysis
- Fisher test link

Bimodal/multimodal distributions
- https://en.wikipedia.org/wiki/Mixture_model
Extreme value theory (EVT)
- https://en.wikipedia.org/wiki/Heavy-tailed_distribution
- https://en.wikipedia.org/wiki/Extreme_value_theory

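Candidate topics

Chinese culture has a tradition of eating nourishing tonic foods in winter. If this winter is unusually cold, does that push up the prices of food-sector stocks?
Markov Chain Monte Carlo (MCMC)
When day trading accounts for a growing share of the day's trading value, does it signal that the market is about to turn bearish?
Epidemiological models https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
Is it really a power shortage, or overconsumption?
- What share of electricity consumption goes to cryptocurrency mining?
A time-series version of $R^{2}$.
- $R^{2}$ follows a beta distribution; after taking the n-th root it follows a normal distribution.
Brownian motion
- Brownian bridge
- Arcsine law: https://scipython.com/blog/the-arcsine-law/
War in Ukraine: https://ourworldindata.org/ukraine-war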

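Data sources

Open data from the Taiwan government

Government open data platform: https://data.gov.tw/
Taipei City data platform (臺北市資料大平臺): https://data.taipei/
Central Weather Bureau open data: https://opendata.cwb.gov.tw/dataset/observation
薪情體驗 (salary explorer): https://earnings.dgbas.gov.tw/experience_sub_01.aspx
https://www.numbeo.com/cost-of-living/
National Development Council population projections query system: https://pop-proj.ndc.gov.tw/index.aspx
Department of Statistics, Ministry of the Interior: https://www.moi.gov.tw/stat/index.aspx
Ministry of the Interior real estate actual-price registration query: https://lvr.land.moi.gov.tw/homePage.action
- 用程式分析房地產可行嗎？房價分析看這裡！ (Is it feasible to analyze real estate with code? Housing price analysis) by FinLab
Ministry of Culture open data service https://opendata.culture.tw
Taiwan Power Company https://www.taipower.com.tw/tc/index.aspx
- Dataset list on the government open data platform - Taiwan Power Company link
Lottery
- 超讚的樂透網: https://zan01.com/
- 樂透堂: http://www.9800.com.tw/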

International open data sources

U.S. Census Bureau: https://www.census.gov/
World Bank: https://data.worldbank.org/
NASA: https://nasa.github.io/data-nasa-gov-frontpage/data_visualizations.html
Data World: https://data.world/
Human Development Reports: http://www.hdr.undp.org/en
Sports Reference: https://www.baseball-reference.com/
Data bank of Bank of England: https://www.bankofengland.co.uk/statistics

Analysis platforms

Google Data Studio: https://datastudio.google.com/

Competition platforms

References

Books

Textbooks used previously

Thomas Haslwanter, An Introduction to Statistics with Python, 2016 (downloadable from within the NTU campus IP range)
José Unpingco, Python for Probability, Statistics, and Machine Learning, 2/e, 2016
link (downloadable from within the NTU campus IP range)
Jake VanderPlas, Python Data Science Handbook, 2016 online github

Probability theory

Sheldon Ross, Introduction to Probability Models, 12/e, 2019

Mathematical statistics

Robert V. Hogg, Joseph W. McKean, and Allen T. Craig, Introduction to Mathematical Statistics, 8/e, 2019
George Casella and Roger L. Berger, Statistical Inference, 2/e, 2001

Design of experiments

Douglas C. Montgomery, Design and Analysis of Experiments, 9/e, 2017
Angela Dean, Daniel Voss, and Danel Draguljić, Design and Analysis of Experiments, 2017

General statistics

Barbara Blatchley, Statistics in Context, 2018
張翔與廖崇智，提綱挈領學統計，第八版，2019/6/14
許誠哲，統計學：重點觀念與題解，2018/3/1
David M. Lane et al., Online Statistics Education: http://onlinestatbook.com/Online_Statistics_Education.pdf