Introduction to Data Science (資料科學入門)
Former course title: Learning Statistics & Programming (當統計學與程式相遇)
"The purpose of computing is insight, not numbers."
-- Richard W. Hamming
"Data do not speak for themselves;
there is always an interpreter, or a translator."
-- John W. Ratcliffe
"Remember that all models are wrong;
the practical question is how wrong do they have to be to not be useful."
-- George Box
"It is easy to lie with statistics, but easier to lie without them."
-- Frederick Mosteller
"Science is more than a body of knowledge;
it is a way of thinking.
The method of science, as stodgy and grumpy as it may seem,
is far more important than the findings of science."
-- Carl Sagan
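Instructor
- Zheng-Liang Lu (盧政良, Arthur)
- Contact: arthurzllu@gmail.com
Working Environment
The course does not require a particular programming language, but all demonstrations use Python. Students may work in Excel, R, or MATLAB instead, provided they find the equivalent tools for each exercise themselves.
Prerequisites
- Arithmetic and elementary algebra
- Everyday experience and civic ethics
- (Optional) Calculus
- 朱樺 (Department of Mathematics, National Taiwan University), Calculus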
- 3Blue1Brown, Essence of Calculus
- (Optional) Linear algebra
- Stephen H. Friedberg, Arnold J. Insel, Lawrence E. Spence, Linear Algebra, 5/e, 2018
- 3Blue1Brown, Essence of Linear Algebra, on YouTube
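For now, students only need to understand the context and the results of the mathematics used in this course; whether to master the derivations and computational details is up to each student's interest and ability. Python packages already implement most of these results, so tedious calculations can be left to the computer.
Learning Objectives
- Statistics
- Understand statistical tools and the principles behind their computation
- Interpret statistical results correctly
- Make reasonable predictions of trends in data
- Avoid statistical fallacies
- Programming skills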
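Grading
In-person course
- Final project presentation
- Key components: question posed, data collection and visualization, model assumptions, experimental results, and conclusions.
- Grouping: one person per group; presentations are given with slides or a Jupyter notebook.
- Students who complete the five programming assignments or the final project report receive the course certificate.
- Submit assignments by email to arthurzllu@gmail.com, stating the course name and your name.
Online course
- Students who complete the five programming assignments receive the course certificate.
- Submit assignments by uploading the Jupyter notebook to NTU COOL.
Intended Audience
- College and university students, researchers, and engineers who want to learn statistical methods and quantitative research.
- Junior and senior high school students are welcome; prior exposure to basic statistics is a plus (108 Curriculum: Probability and Statistics I in grade 11 and Probability and Statistics II in grade 12).
Main References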
- Steven S Skiena, The Data Science Design Manual, 2017
- Laura Igual and Santi Seguí, Introduction to Data Science, 2017
- 陳旭昇, 統計學:應用與進階 (Statistics: Applications and Advanced Topics), 3rd edition, 2015
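Course Outline
- Python programming basics
- Data acquisition and visualization
- Introduction to probability theory and common probability models
- Statistical tests
- Point estimation and interval estimation
- Law of large numbers and central limit theorem
- Regression models
- Time series analysis
- Bayesian probability
- Introduction to machine learning
- (Optional) Statistics in practice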
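Course Content
Python Crash Course
- Python crash course notebook
- Data types and basic operations
- Conditional statements
- Loops
- Functions
- Supplementary material
- Additional notes on programming skills pdf
- Apps for learning to program on your own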
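Data Acquisition and Visualization
- Pandas notebook
- Python data analysis library link
- Data preprocessing
- Data visualization
Lab 1 Using Pandas
(1) For each stock, count the number of days in the past year on which the daily rate of change exceeded a given threshold.
(2) Rank the assets by that count.
(3) Save the ranking as a csv file.
(4) Plot the time series of the top 3 winners and the top 3 losers; compare the relative performance of these assets (that is, their profit and loss) by normalizing each time series with rebase().
(5) Draw a scatter plot to see whether there is any relationship between profit and loss and the count (the number of days on which the daily rate of change exceeded the threshold).
(6) Save the scatter plot as a pdf file.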
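The non-plotting steps of Lab 1 can be sketched as follows. This is only an illustration under stated assumptions: the prices are synthetic, the tickers are made up, the threshold of 3% is hypothetical (the lab's own value is not reproduced here), the change is measured in absolute value, and `rebase` is a hand-written helper rather than a function from any particular package.

```python
import numpy as np
import pandas as pd

# Synthetic daily closing prices for five made-up tickers (assumption: real data
# would come from a CSV file or a data provider instead).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2023-01-02", periods=250)
tickers = ["AAA", "BBB", "CCC", "DDD", "EEE"]
daily_returns = rng.normal(loc=0.0005, scale=0.02, size=(len(dates), len(tickers)))
prices = pd.DataFrame(100 * np.cumprod(1 + daily_returns, axis=0),
                      index=dates, columns=tickers)

THRESHOLD = 0.03  # hypothetical threshold, chosen only for this sketch


def rebase(series, base=100.0):
    """Rescale a price series so that its first value equals `base`."""
    return series / series.iloc[0] * base


# (1) Count the days on which the absolute daily change exceeds the threshold.
changes = prices.pct_change().dropna()
counts = (changes.abs() > THRESHOLD).sum()

# (2) Rank the assets by that count, most frequent first.
ranking = counts.sort_values(ascending=False)

# (3) Save the ranking (uncomment to write the file).
# ranking.to_csv("ranking.csv", header=["count"])

# (4) Rebase every series to 100 so that relative performance is comparable.
rebased = prices.apply(rebase)
pnl = rebased.iloc[-1] - 100.0  # profit/loss in percent of the starting value

summary = pd.DataFrame({"count": counts, "pnl": pnl}).sort_values("pnl", ascending=False)
print(summary)
```

For steps (5) and (6), one option is `ax = summary.plot.scatter(x="count", y="pnl")` followed by `ax.figure.savefig("scatter.pdf")`, which requires matplotlib.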
Template / reference solution
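Probability Theory
- Classical probability pdf
- Key terms: sample space, event, probability axioms, probability measure, conditional probability, independent events
- Does probability zero mean that an event cannot happen?
- Cantor set
- 陳俊全 of the NTU Department of Mathematics offers an amusing analogy: the Cantor set is like a poor person's pocket. When it is time to eat or see a doctor, the pocket always yields some money; but ask for the total net worth, and the sum is zero.
- Random variables pdf / code
- Discrete random variables: Bernoulli distribution, binomial distribution
- Continuous random variables pdf: uniform distribution, normal distribution, $\chi^2$ distribution, Student's t distribution, F distribution
- The probability models already implemented can be found in the SciPy documentation:
- Family tree of probability distributions pic
- Random number generation (RNG)
- Expectation and multivariate random variables pdf
- Central tendency: arithmetic mean, geometric mean, median, mode
- Dispersion: variance, standard deviation, range
- Higher moments: skewness, kurtosis
- Positive (negative) skewness: the mean is typically above (below) the median
- Kurtosis > 3: heavier tails than the normal distribution
- (FYR) Moment generating function (mgf) pdf
- Covariance and correlation coefficient
- Zero correlation implies independence?
- Conditional expectation and conditional variance
- Law of Total Variance wiki
- Independent and identically distributed (iid)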
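The question of whether zero correlation implies independence can be explored numerically. The sketch below uses a standard counterexample chosen for illustration (it is not taken from the course material): with X standard normal and Y = X², the covariance is E[X³] = 0, yet Y is completely determined by X.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(100_000)
y = x ** 2  # y is a deterministic function of x, so the two are dependent

corr_xy = np.corrcoef(x, y)[0, 1]

# Dependence shows up once we condition on x: E[y | |x| > 1] differs from E[y].
mean_y = y.mean()
mean_y_given_large_x = y[np.abs(x) > 1].mean()

print(f"corr(x, y)     = {corr_xy:.4f}")
print(f"E[y]           = {mean_y:.4f}")
print(f"E[y | |x| > 1] = {mean_y_given_large_x:.4f}")
```

The sample correlation comes out close to zero while the conditional mean of y is clearly larger than its unconditional mean, so correlation alone only rules out linear dependence.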

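Lab 2 What Is an Expected Value?
Suppose a random variable follows the distribution below:

Its expected value then follows from the distribution. Write a program that simulates samples drawn from this distribution and shows that the sample mean approaches the expected value as the sample size grows from 1 to 3000.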
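One way to set up such a simulation is sketched below. The three-point distribution used here is a hypothetical stand-in chosen for illustration, not the distribution of the lab; the variable names are likewise assumptions.

```python
import numpy as np

# Hypothetical discrete distribution (an assumption for illustration only).
values = np.array([1, 2, 3])
probs = np.array([0.2, 0.5, 0.3])
expected_value = float(np.dot(values, probs))  # 1*0.2 + 2*0.5 + 3*0.3 = 2.1

rng = np.random.default_rng(0)
max_n = 3000
samples = rng.choice(values, size=max_n, p=probs)

# running_mean[k - 1] is the sample mean of the first k draws, k = 1, ..., 3000.
sample_sizes = np.arange(1, max_n + 1)
running_mean = np.cumsum(samples) / sample_sizes

for n in (1, 10, 100, 1000, 3000):
    print(f"n = {n:4d}: sample mean = {running_mean[n - 1]:.4f}")
print(f"expected value = {expected_value:.4f}")
```

Plotting `running_mean` against `sample_sizes` with a horizontal line at `expected_value` makes the convergence visible.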

Demo Code
Statistical Framework
- Sampling methods and sampling distributions pdf
- Statistical tests pdf
- Keywords: null / alternative hypothesis, p-value, significance level, rejection region, type I / II / III errors
- Examples in SciPy link
- Additional reading pdf1, pdf2
- Examples:
- Test of independence ($\chi^2$ independence test) new
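A test of independence on a contingency table can be run with `scipy.stats.chi2_contingency`. The table below is made-up data used only to show the mechanics; it is not the course's example.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up 2x3 contingency table: rows are two groups, columns are three categories.
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

# Null hypothesis: the row variable and the column variable are independent.
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print(f"chi-square statistic = {chi2_stat:.3f}")
print(f"degrees of freedom   = {dof}")
print(f"p-value              = {p_value:.5f}")
print("expected counts under independence:")
print(expected)
```

The expected counts are the row total times the column total divided by the grand total; a small p-value means the observed counts are far from what independence would predict.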

- Linear regression notebook
- Python package statsmodels link
- Notes:
- Interpreting Results from Linear Regression – Is the data appropriate? link
- About errors and residuals wiki
- Normality tests pdf
- How to detect multicollinearity?
- How to detect heteroscedasticity?
- Weighted least squares (WLS), a special case of generalized least squares (GLS)
- More examples:
- Financial market
- More on regression with categorical data link
- Data transformations link
- Log transformation: for size data
- Square-root transformation: for count data
- Arcsine transformation: for proportions between 0 and 1, for example the cumulative volume distribution across trading hours.
- Ramsey RESET test
If the proposed model is adequate, then the standardized residuals should behave like white noise.
The test checks whether non-linear combinations of the fitted values help explain the response variable.
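The idea behind the RESET test can be sketched with plain NumPy and SciPy: fit the linear model, add powers of the fitted values as extra regressors, and compare the two fits with an F-test. The helper names `ols_rss` and `reset_test` and the simulated data are assumptions made for this illustration; this is not the statsmodels implementation.

```python
import numpy as np
from scipy import stats


def ols_rss(X, y):
    """Return fitted values and residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, float(np.sum((y - fitted) ** 2))


def reset_test(x, y, max_power=3):
    """RESET-style F-test: do powers of the fitted values add explanatory power?"""
    n = len(y)
    X_restricted = np.column_stack([np.ones(n), x])
    fitted, rss_restricted = ols_rss(X_restricted, y)

    # Augment the design matrix with fitted^2, ..., fitted^max_power.
    extra = np.column_stack([fitted ** p for p in range(2, max_power + 1)])
    X_full = np.column_stack([X_restricted, extra])
    _, rss_full = ols_rss(X_full, y)

    q = extra.shape[1]              # number of added regressors
    df_resid = n - X_full.shape[1]  # residual degrees of freedom of the full model
    f_stat = ((rss_restricted - rss_full) / q) / (rss_full / df_resid)
    p_value = stats.f.sf(f_stat, q, df_resid)
    return f_stat, p_value


rng = np.random.default_rng(1)
x = rng.uniform(0, 4, size=200)
noise = rng.normal(scale=0.5, size=200)

y_linear = 1.0 + 2.0 * x + noise                 # a straight line is adequate
y_curved = 1.0 + 2.0 * x + 1.5 * x ** 2 + noise  # a straight line misses the curvature

f_lin, p_lin = reset_test(x, y_linear)
f_cur, p_cur = reset_test(x, y_curved)
print(f"linear data: F = {f_lin:.2f}, p = {p_lin:.4f}")
print(f"curved data: F = {f_cur:.2f}, p = {p_cur:.2e}")
```

For the curved data the p-value is extremely small, which signals misspecification of the straight-line model; for the linear data the test usually does not reject.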

Lab 3 Simple Linear Regression
TBA

- Parameter estimation
- Point estimation pdf
- Methods
- Analog method
- Method of moments
- Maximum likelihood estimation (MLE)
- A good estimator has at least three properties:
- Unbiased
- Efficient
- Consistent
- Best linear unbiased estimator (BLUE)
… the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, if the errors in the linear regression model are uncorrelated, have equal variances and expectation value of zero.
- Sufficient statistic and uniformly minimum-variance unbiased estimator (UMVUE)
- Interval estimation pdf
- What is a 95% confidence interval?
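The meaning of a 95% confidence interval can be illustrated by repeated sampling: the procedure, not any single interval, captures the true mean about 95% of the time. The sketch below assumes a normal population with made-up parameters and uses the usual t-interval for the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)
true_mean, true_sd = 10.0, 2.0  # hypothetical population parameters
n, n_experiments = 30, 10_000

# Each row is one repeated experiment of n observations.
data = rng.normal(true_mean, true_sd, size=(n_experiments, n))
sample_means = data.mean(axis=1)
standard_errors = data.std(axis=1, ddof=1) / np.sqrt(n)

# Two-sided 95% t-interval: mean +/- t_{0.975, n-1} * standard error.
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = sample_means - t_crit * standard_errors
upper = sample_means + t_crit * standard_errors

coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or it does not; the 95% refers to the long-run fraction of intervals that do.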

- Analysis of variance (ANOVA) pdf notebook
- Why not t-test? link
Another measure to compare the samples is called a t-test. When we have only two samples, t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests for comparing more than two samples, it will have a compounded effect on the error rate of the result.
- More examples:
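The two claims in the t-test remark above can be checked numerically. The sketch below uses simulated groups (an assumption for illustration) to show that one-way ANOVA and the equal-variance t-test coincide for two samples, and then computes how the chance of at least one false rejection grows with the number of pairwise tests under the simplifying assumption that the tests are independent.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(0.0, 1.0, size=40)
group_b = rng.normal(0.5, 1.0, size=40)

# With two groups, one-way ANOVA and the equal-variance t-test agree: F = t^2.
t_stat, t_p = stats.ttest_ind(group_a, group_b)  # equal_var=True by default
f_stat, f_p = stats.f_oneway(group_a, group_b)
print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.4f}, ANOVA p = {f_p:.4f}")

# With k groups there are m = k(k-1)/2 pairwise t-tests. If each is run at level
# alpha and the tests were independent, the chance of at least one false
# rejection would be 1 - (1 - alpha)^m.
alpha = 0.05
familywise = {}
for k in (2, 3, 5, 10):
    m = k * (k - 1) // 2
    familywise[k] = 1 - (1 - alpha) ** m
    print(f"k = {k:2d} groups, m = {m:2d} tests: error rate = {familywise[k]:.3f}")
```

Pairwise tests on shared groups are not actually independent, so the formula is only a rough guide, but it shows why a single ANOVA is preferred over many t-tests.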

- Asymptotic theory pdf
- Convergence
- Law of Large Numbers (LLN)
- Central Limit Theorem (CLT)
The fact that sampling distributions can approximate a normal distribution has critical implications. In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test. Consequently, you might think that these tests are not valid when the data are nonnormally distributed. However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are nonnormally distributed—as long as your sample size is large enough. See link.
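- Supplementary material:
Lab 4 Testing the Central Limit Theorem
Write a program that draws samples of various sizes from each of the distributions below and finds the smallest sample size at which the sampling distribution of the sample mean is not rejected by a normality test:
- Standard uniform distribution
- Chi-square distribution ($\chi^2$ distribution with df = 3)
- Poisson distribution (with a given $\lambda$)
- Cauchy distribution (the standard Cauchy distribution)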
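A possible skeleton for this experiment is sketched below. Several choices are assumptions made for illustration: the Shapiro-Wilk test as the normality test, 500 sample means per test, the grid of sample sizes, and a Poisson rate of 1 (the lab's own value is not reproduced here).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(123)

# Each sampler takes a shape (n_means, n) and returns an array of that shape.
samplers = {
    "standard uniform": lambda size: rng.uniform(0.0, 1.0, size=size),
    "chi-square (df = 3)": lambda size: rng.chisquare(3, size=size),
    "Poisson (lambda = 1, assumed)": lambda size: rng.poisson(1.0, size=size),
    "standard Cauchy": lambda size: rng.standard_cauchy(size=size),
}

ALPHA = 0.05
N_MEANS = 500  # number of sample means fed to each normality test
SAMPLE_SIZES = [1, 2, 5, 10, 30, 100, 300]


def smallest_passing_size(sampler):
    """Return the first n whose sample means are not rejected by Shapiro-Wilk."""
    for n in SAMPLE_SIZES:
        means = sampler((N_MEANS, n)).mean(axis=1)
        _, p_value = stats.shapiro(means)
        if p_value > ALPHA:
            return n
    return None  # rejected at every size tried


results = {name: smallest_passing_size(sampler) for name, sampler in samplers.items()}
for name, n in results.items():
    print(f"{name:32s} -> {n}")
```

A single test run is noisy, so the reported size varies with the seed; repeating the experiment and averaging gives a steadier picture. The Cauchy distribution has no finite mean, so its sample means stay Cauchy-distributed and the central limit theorem does not apply to it.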


Demo Code

Introduction to Machine Learning

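Lab 5 Implementing the K-Means Algorithm
Write a program that implements the K-Means algorithm. The template program already generates test data, shown in the left figure below. The basic idea of K-Means is to group points by distance. The steps of the algorithm are described at this link. The clustering result is shown in the right figure below, where each red diamond marks the arithmetic center (centroid) of a cluster. The goal is for students to implement the basic K-Means algorithm. Note that the clustering result (right figure) is not guaranteed to match the true grouping (left figure), so the focus of this exercise is the implementation of the algorithm itself.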
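For orientation, a bare-bones version of the iteration (often called Lloyd's algorithm) might look like the sketch below. It is not the course template: the function name `kmeans`, the random initialization from data points, and the synthetic test blobs are all assumptions made for this illustration.

```python
import numpy as np


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd iteration: assign to nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Initialise with k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid: shape (n, k).
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = arithmetic mean of its members; keep the old one if empty.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilised
        centroids = new_centroids
    return labels, centroids


# Synthetic test data: three Gaussian blobs around made-up centers.
rng = np.random.default_rng(1)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers])

labels, centroids = kmeans(points, k=3)
print(centroids)
```

The result depends on the initial centroids, so different seeds can give different groupings; running the algorithm several times and keeping the solution with the smallest total within-cluster distance is a common remedy.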

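Statistics in Practice
- Nonparametric analysis
- Rank correlation
- Spearman's rank correlation coefficient
- Kendall's rank correlation coefficient
- One population
- Sign test
- Wilcoxon signed-rank test
- Two dependent populations
- Paired sign test
- Wilcoxon matched-pairs signed-rank test
- Two independent populations
- Wilcoxon rank-sum test
- Mann-Whitney U test
- Several independent populations
- Several dependent populations
- Test of randomness
- Kernel density estimation (KDE) link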
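Several of the nonparametric tools listed above are available in `scipy.stats`. The sketch below runs a few of them on simulated data; all data and parameter values are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Rank correlation: y is a monotone but non-linear function of x plus small noise.
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(scale=0.1, size=200)
pearson_r = stats.pearsonr(x, y)[0]
spearman_rho = stats.spearmanr(x, y)[0]
kendall_tau = stats.kendalltau(x, y)[0]
print(f"Pearson = {pearson_r:.3f}, Spearman = {spearman_rho:.3f}, Kendall = {kendall_tau:.3f}")

# Two independent samples with heavy tails and a location shift of 1.
a = rng.standard_t(df=2, size=80)
b = rng.standard_t(df=2, size=80) + 1.0
u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {u_p:.4f}")

# Paired samples: Wilcoxon signed-rank test on the paired differences.
before = rng.normal(50, 5, size=40)
after = before + rng.normal(1.5, 2.0, size=40)
w_stat, w_p = stats.wilcoxon(after, before)
print(f"Wilcoxon W = {w_stat:.1f}, p = {w_p:.4f}")

# Kernel density estimate of a bimodal sample, evaluated on a grid.
bimodal = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
kde = stats.gaussian_kde(bimodal)
grid = np.linspace(-5, 5, 201)
density = kde(grid)
```

The rank-based coefficients pick up the monotone relationship more fully than Pearson's coefficient because they depend only on the ordering of the values, and `gaussian_kde` selects its bandwidth by Scott's rule unless told otherwise.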

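- Small-sample analysis

- Bimodal/multimodal distributions
- Extreme value theory (EVT)
Candidate Topics
Data Sources
Taiwan government open data
International open data sources
Analysis platforms
Competition platforms
References
Books
Textbooks used previously
Probability theory
Mathematical statistics
Experimental design
General statistics
Time series
Machine learning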
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, 2013

- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2009

- Ovidiu Calin, Deep Learning Architectures, 2020

Popular science reading
Domestic and international courses
Miscellaneous