---
tags: statistics
---

# Introduction to Data Science (資料科學入門)
<font size = +1 color = "gray">Former course title: Learning Statistics & Programming (當統計學與程式相遇)</font>

<center>

![](https://i2.wp.com/www.prosancons.com/wp-content/uploads/2018/04/041518_1320_Prosandcons1.png =400x)

</center>

<p style="text-align: right">
"The purpose of computing is insight, not numbers."<br>-- Richard W. Hamming
</p>

<p style="text-align: right">
"All models are wrong, but some are useful."<br>-- George Box
</p>

<p style="text-align: right">
"A data scientist is someone who knows more statistics than a computer scientist<br>and more computer science than a statistician."<br>-- Josh Blumenstock
</p>

<p style="text-align: right">
"It is easy to lie with statistics, but easier to lie without them."<br>-- Frederick Mosteller
</p>

### Instructor
- Zheng-Liang Lu (盧政良, Arthur)
- Contact: arthurzllu@gmail.com

### Working environment
- Google Colab https://colab.research.google.com/

:::warning
Any programming language is acceptable, but the course uses [Python](https://hackmd.io/@arthurzllu/HJNXq84SO) for its examples. You may work in Excel, R, or MATLAB instead, but you will need to find the corresponding tools for each task yourself.
:::

### Prerequisites
- Arithmetic and algebra
- Everyday life experience and civic ethics
- Calculus and linear algebra (optional)
    - Calculus
        - Prof. 朱樺 (Department of Mathematics, National Taiwan University), [Calculus](http://www.math.ntu.edu.tw/~hchu/Calculus/)
        - 3Blue1Brown, [Essence of Calculus](https://www.youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr) ![](https://i.imgur.com/1ZDNtsh.png =200x)
    - Linear algebra
        - Stephen H. Friedberg, Arnold J. Insel, Lawrence E.
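          Spence, [Linear Algebra](https://www.amazon.com/Linear-Algebra-5th-Stephen-Friedberg/dp/0134860241/), 5/e, 2018 ![](https://i.imgur.com/Si0JvQb.png =100x)
        - 3Blue1Brown, [Essence of Linear Algebra](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) on YouTube ![](https://i.imgur.com/tnuPzSf.png =200x)

:::warning
For the mathematics involved in this course, you only need to understand the context and the results; there is no need to worry about the details of derivations or calculations. The aim of the course is to hand these tedious calculations over to the computer. This is not to say that mathematics is unimportant: <font color = "red">mathematics is very important</font>, but it is not our focus right now. If time allows later on, this mathematical theory will be the foundation for your continued path toward becoming a data scientist.
:::

### Learning objectives
* Statistics
    - Understand statistical tools and the principles behind their computation
    - Interpret statistical results correctly
    - Make reasonable predictions of trends in data
    - Rule out statistical fallacies
* Programming
    - Master the data-processing workflow
    - Learn to build your own tools

<center>

![](https://i.imgur.com/nVXc2TN.png =300x)

</center>

### Final presentation
- Key points of the report
    - Question
    - Data collection and visualization
    - Model assumptions
    - Experimental results
    - Conclusion
- Grouping: in principle, one person per group. Present with **slides** or a **Jupyter notebook**.

### Target audience
1. College and university students, researchers, and engineers who want to learn to use ++statistical methods++ and ++quantitative research++
2. Junior and senior high school students are welcome, preferably those who have already studied basic statistics (Probability and Statistics I in grade 11 and Probability and Statistics II in grade 12 under Taiwan's 108 curriculum guidelines)

<center>

![](https://i.imgur.com/UOBvF5W.png =200x)

</center>

## Main materials
- Steven S Skiena, [The Data Science Design Manual](https://www.springer.com/gp/book/9783319554433), 2017 ![](https://i.imgur.com/JrFu6Hh.png =100x)
- Shiu-Sheng Chen (陳旭昇), [統計學:應用與進階](http://homepage.ntu.edu.tw/~sschen/Book/Book1.html) (Statistics: Applications and Advanced Topics), 3rd edition ![](https://i.imgur.com/SveUJDN.png =100x)

## Course outline
0. Python programming basics
1. Data acquisition and visualization
2. Introduction to probability theory and common probability models
3. Statistical hypothesis testing
4. Point estimation and interval estimation
5. Law of large numbers and central limit theorem
6. Regression models
7. Time series analysis
8. Bayesian probability
9. Introduction to machine learning
10.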
    Statistical practice

<center>

![](https://i.imgur.com/fj7sI9u.png =400x)

</center>

## Course content

### Python programming crash course
- Python programming crash course: [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_0_python_programming.ipynb)
    - Data types and basic operations
    - Conditional statements
    - Loops
    - Functions
- Additional background related to programming [pdf](https://www.csie.ntu.edu.tw/~d00922011/python/slides/cs_preliminary_knowledge.pdf)

<center>

![](https://i.imgur.com/sMkTwsp.png =300x)

</center>

### Data acquisition and visualization
- Pandas: [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_data_acquisition_visualization.ipynb)
    - Python data analysis library [link](https://pandas.pydata.org/)
    - (FYR) https://www.kaggle.com/learn/pandas
    - (FYR) Cheat sheet: [link1](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf), [link2](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Pandas_Cheat_Sheet_2.pdf)
- Data preprocessing
    - Case 1: [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_1.ipynb) with [data.csv](https://www.csie.ntu.edu.tw/~d00922011/stats/data/data.csv)
    - Case 2: [financial time series](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_2_financial_time_series.ipynb)
    - Case 3: [json file](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_1_pandas_tutorial_3_json.ipynb)
    - (FYR) String processing
        - Regular expressions
            - A good interactive tutorial site: https://regexone.com/
            - Built-in module: https://docs.python.org/3/library/re.html
    - (FYR) Pythonic data cleaning with numpy and pandas [link]()
    - (FYR) https://www.kaggle.com/learn/data-cleaning
- Visualization
    - Matplotlib official documentation [link](https://matplotlib.org/contents.html)
    - (FYR) http://scipy-lectures.org/intro/matplotlib/index.html
    - Cheat sheets by DataCamp: [pdf](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Python_Matplotlib_Cheat_Sheet.pdf)
    - A good tutorial: [Nicolas P.
      Rougier](http://www.labri.fr/perso/nrougier/teaching/matplotlib/)
    - (FYR) https://www.kaggle.com/learn/data-visualization

:::success
<b><font size = 4>++Lab 1++</font> Computing investment returns with pandas</b>

Use the ffn package to fetch the closing prices of FAANG ("fb, aapl, amzn, nflx, goog") from the first trading day of 2020 through the most recent trading day (today). Suppose I buy one US dollar of each stock on the first trading day and follow a buy-and-hold strategy, holding all five assets until today. What is the time series of the cumulative return on the value of these assets? Plot the time series of the cumulative return on the value of your position as a line chart, and export the price time series of the five assets, together with the value time series of your position, to an Excel file. The formula for the rate of return is:

<br>
<center>

Return rate (%) = $\dfrac{P_t - P_{t - 1}}{P_{t - 1}} \times 100$

</center>

Use the DataFrame's rebase() method for the calculation.

<br>
<center>

![](https://i.imgur.com/LRFvq4F.png =500x)

[Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab1_demo.ipynb)

</center>
:::

### Probability theory
- Classical probability [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture1_ProbabilityModel_2019.pdf)
    - Some important terms: sample space, event, probability axioms, probability measure, conditional probability, independent events
- Random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture2_RandomVariable_2019.pdf) [code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_2_probability_models_and_random_number_generators.ipynb)
    - Discrete random variables: Bernoulli, binomial, and Poisson distributions [pdf](http://www.stats.ox.ac.uk/~filippi/Teaching/psychology_humanscience_2015/lecture5.pdf)
    - Continuous random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture5_Normal_2019.pdf): uniform, normal, $\chi^2$, Student's t, and F distributions
    - Probability models already implemented in SciPy are listed in its documentation:
        - https://docs.scipy.org/doc/scipy/reference/stats.html
        - https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
    - Relationships among probability distributions [pic](https://upload.wikimedia.org/wikipedia/commons/thumb/6/69/Relationships_among_some_of_univariate_probability_distributions.jpg/1920px-Relationships_among_some_of_univariate_probability_distributions.jpg)
- Expectation and multivariate random variables [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture4_Moments_2019.pdf)
    - (FYR) Taylor expansion [wiki](https://en.wikipedia.org/wiki/Taylor_series)
    - Central tendency: arithmetic mean, median, mode
    - Dispersion: variance and its square root (standard
      deviation)
    - Higher-order moments: skewness, kurtosis
    - Covariance and correlation
        - Does zero correlation imply independence?
    - Independence
        - So-called **iid** samples (**i**ndependent and **i**dentically **d**istributed)
    - Conditional expectation and conditional variance
        - Law of Total Variance [wiki](https://en.wikipedia.org/wiki/Law_of_total_variance)

<center>

![](https://i.imgur.com/HyJiGlg.png =500x)

</center>

:::success
<b><font size = 4>++Lab 2++</font> Random number generation</b>

Implement a <font color ="red">pseudo</font>random number generator based on the linear congruential algorithm ([linear congruential generator](https://en.wikipedia.org/wiki/Linear_congruential_generator)). The algorithm uses the recurrence $x_n = (a \times x_{n - 1} + c) \bmod m$, which lets a computer efficiently generate random numbers that follow a uniform distribution. For convenience, use the parameters $a = 1664525$, $c = 1013904223$, and $m = 2^{32}$. Your task is to write a function that takes a random seed $x_0$ (any positive integer) and the number of random numbers $n$ as input, and outputs $n$ random numbers between $0$ and $1$. To check the distribution of the samples, you can plot a histogram for a visual check, or run a test with scipy.stats.kstest(samples, "uniform").

<br>
<center>

![](https://i.imgur.com/NzNnpcb.png)

[Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab2_demo.ipynb)

</center>

Further reading: pseudorandomness [link](https://en.wikipedia.org/wiki/Pseudorandomness)
:::

:::success
<b><font size = 4>++Lab 3++</font> Expected value</b>

Suppose the random variable $Y$ follows the distribution below:

<center>

![](https://i.imgur.com/rBFWcDO.png =300x)

</center>

It follows that $\mathbb{E}(Y) = 0.9$. Write a program that simulates draws from this distribution and shows that the sample mean approaches the expected value as the sample size grows from 1 to 3000.

<center>

![](https://i.imgur.com/6rX2hqO.png =400x)

[Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab3_demo.ipynb)

</center>
:::

### Statistical framework
- Sampling methods and sampling distributions [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture6_Sampling_2019.pdf)
    - https://en.wikipedia.org/wiki/Sampling_(statistics)
- Statistical hypothesis testing [pdf](http://www.stats.ox.ac.uk/~filippi/Teaching/psychology_humanscience_2015/lecture8.pdf)
    - Keywords: null/alternative hypothesis, p-value, significance level, rejection region, type I/II/III errors
    - Examples with SciPy
      [link](https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/)
    - Supplementary materials [pdf1](https://www2.isye.gatech.edu/~yxie77/isye2028/lecture8.pdf), [pdf2](http://www.sci.utah.edu/~arpaiva/classes/UT_ece3530/hypothesis_testing.pdf)
    - A case from the PTT forum: a real-world test of splitting the bill on a first date [link](https://www.ptt.cc/bbs/Boy-Girl/M.1596763829.A.89D.html)
    - Useful tests
        - Normality tests [pdf](http://webspace.ship.edu/pgmarr/Geo441/Lectures/Lec%205%20-%20Normality%20Testing.pdf) [link](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_421)
            - Jarque-Bera test [link](https://link.springer.com/referenceworkentry/10.1007/978-3-642-04898-2_319)
            - Omnibus test [link](https://link.springer.com/referenceworkentry/10.1007%2F978-3-642-04898-2_426)
        - $\chi^2$ test of independence [link](https://www.statology.org/chi-square-test-of-independence/) <font color = "red" size = -1>new</font>

<center>

![](https://imgs.xkcd.com/comics/null_hypothesis.png =200x)

</center>

- Linear regression [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_3_linear_regression.ipynb) [pdf](https://www.csie.ntu.edu.tw/~d00922011/python/slides/linear_regression.pdf)
    - Python package **statsmodels** [link](https://www.statsmodels.org/stable/regression.html)
    - Notes:
        - Interpreting Results from Linear Regression – Is the data appropriate?
          [link](https://www.accelebrate.com/blog/interpreting-results-from-linear-regression-is-the-data-appropriate)
        - About errors and residuals [wiki](https://en.wikipedia.org/wiki/Errors_and_residuals)
    - More examples:
        - Buffett's alpha by AQR Capital Management [link](https://www.aqr.com/Insights/Research/Journal-Article/Buffetts-Alpha) [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Buffett_Alpha.pdf) [An explanation of Buffett's alpha on Vocus (方格子)](https://vocus.cc/tarcy2801/5d6fc9fdfd89780001ca4e42)
        - More on regression with categorical data [link](https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/)
    - Data transformations [link](http://www.biostathandbook.com/transformation.html)
        - Log transformation: for size data
        - Square-root transformation: for count data
        - Arcsine transformation
    - [Ramsey RESET test](https://en.wikipedia.org/wiki/Ramsey_RESET_test)
        > It tests whether non-linear combinations of the fitted values help explain the response variable.

<center>

![](https://imgs.xkcd.com/comics/linear_regression_2x.png =300x)

</center>

- Parameter estimation
    - Point estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture8_PointEst_2019.pdf)
        - Analogy method
        - Method of moments
        - Maximum likelihood estimation (MLE)
    - Interval estimation [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture9_IntervalEst_2019.pdf)
        - What is a 95% confidence interval?
        - Bootstrapping [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_3_bootstrapping.ipynb)

<center>

![](https://i.imgur.com/BXhjfPu.png =300x)

</center>

- Analysis of variance (ANOVA) [pdf](http://amath2.nchu.edu.tw/honda/605Lecture/Lecture10_ANOVA.pdf) [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/stat_4_ANOVA_example.ipynb)
    - Why not a t-test? [link](https://www.analyticsvidhya.com/blog/2018/01/anova-analysis-of-variance/)
        > Another measure to compare the samples is called a t-test.
        When we have only two samples, t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than 2 samples. If we conduct multiple t-tests for comparing more than two samples, it will have a **confounding effect** on the error rate of the result.
        - Confounding effect: https://www.scribbr.com/methodology/confounding-variables/
    - More examples:
        - One-way ANOVA: https://www.pythonfordatascience.org/anova-python/
        - Two-way ANOVA: http://www.pybloggers.com/2016/03/three-ways-to-do-a-two-way-anova-with-python/

<center>

![](https://i.imgur.com/pp5QR5A.png =400x)

</center>

- Asymptotic theory [pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/Chen/Lecture7_Asymptotics_2019.pdf)
    - Convergence
    - Law of large numbers
    - Central limit theorem
        > The fact that sampling distributions can approximate a normal distribution has critical implications. In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test. Consequently, you might think that these tests are not valid when the data are nonnormally distributed. However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution. This fact allows you to use these hypothesis tests even when your data are nonnormally distributed—as long as your sample size is large enough. See [link](https://statisticsbyjim.com/basics/central-limit-theorem/).

:::success
<b><font size = 4>++Lab 4++</font> Testing the central limit theorem</b>

Write a program that simulates drawing samples of different sizes from each of the distributions below, and find the smallest sample size at which the sampling distribution is not rejected by a normality test:

1. Standard uniform distribution
2. Chi-squared distribution ($\chi^2$ distribution with df = 3)
3. Poisson distribution (with $\mu = 3$)
4.
   Cauchy distribution (the standard Cauchy distribution)

<center>

![](https://i.imgur.com/tbSdeux.png =200x) ![](https://i.imgur.com/QaZbsMg.png =200x) ![](https://i.imgur.com/mc4zGvn.png =200x) ![](https://i.imgur.com/q1sD7p6.png =200x)

[Demo Code](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_lab4_demo.ipynb)

</center>
:::

<center>

![](https://i.imgur.com/EZwESD1.png =250x)

</center>

- Time series analysis [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_4_time_series.ipynb)
    - Autocorrelation
    - Stationarity and unit root tests
    - ARMA$(p, q)$ model
        - Necessary conditions for ARMA$(p, q)$ model: https://math.unice.fr/~frapetti/CorsoP/chapitre_23_IMEA_1.pdf
    - Granger causality test
    - Structural breaks [pdf](https://ssc.wisc.edu/~bhansen/crete/crete5.pdf) <font color = "red" size = -1>new</font>
- Bayesian probability [pdf](http://pillowlab.princeton.edu/teaching/mathtools16/slides/lec13_BayesRule.pdf)
    - Marc Garcia, [Bayesian inference tutorial: a hello world example](https://datapythonista.me/blog/bayesian-inference-tutorial-a-hello-world-example.html), 2020
    - Samuel Hinton, [Bayesian Linear Regression in Python](https://cosmiccoding.com.au/tutorials/bayes_lin_reg), 2019
    - References:
        - (FYR) 從經驗中學習 - 直觀理解貝氏定理及其應用 (Learning from experience: an intuitive understanding of Bayes' theorem and its applications) [link](https://leemeng.tw/intuitive-understandind-of-bayes-rules-and-learn-from-experience.html)
        - (FYR) 別再瞎猜、靠運氣!NASA、微軟都在用「貝式理論」做決策 (Stop guessing and relying on luck: NASA and Microsoft use Bayesian theory to make decisions) [link](https://buzzorange.com/techorange/2019/07/24/nasa-how-to-make-right-decision/)
        - (FYR) Chapter 12: Bayesian Inference, Statistical Machine Learning, CMU [pdf](http://www.stat.cmu.edu/~larry/=sml/Bayes.pdf)
        - [Introduction to Bayesian Modeling with PyMC3](https://juanitorduz.github.io/intro_pymc3/)
        - [Bayes’ Rule With Python](http://jim-stone.staff.shef.ac.uk/BookBayes2012/bookbayesch01WithPython.pdf)
        - [Monty Hall Problem](https://en.wikipedia.org/wiki/Monty_Hall_problem)
        - https://www.astronomy.swin.edu.au/~cblake/StatsLecture4.pdf
        -
          https://astrostatistics.psu.edu/RLectures/IntroBayes-1.pdf
        - https://cse.buffalo.edu/~jcorso/t/CSE555/files/lecture_bayesiandecision.pdf
        - [貝氏統計學的概念.pdf](https://www.csie.ntu.edu.tw/~d00922011/stats/slides/貝氏統計學的概念.pdf) (Concepts of Bayesian statistics)

### Introduction to machine learning
- (FYR) Deep Mind: A Documentary File [youtube](https://www.youtube.com/watch?v=WXuK6gekU1Y)
- Regression analysis [notebook](https://www.csie.ntu.edu.tw/~d00922011/stats/notebook/lsp_5_machine_learning_tutorial.ipynb)
    - Ridge regression
    - LASSO regression
    - Logistic regression
- Support vector machine (SVM)
- Decision trees and random forests
- Principal component analysis (PCA)
    - https://setosa.io/ev/principal-component-analysis/
- K-means clustering
- Reinforcement learning: Q-Learning
- Deep learning
    - https://www.kaggle.com/learn/intro-to-deep-learning
- Case study: Jacky Hsueh, [為什麼需要經濟理論來預測經濟趨勢:比較機器學習與計量經濟](http://economicsnote.com/%E7%82%BA%E4%BB%80%E9%BA%BC%E9%9C%80%E8%A6%81%E7%B6%93%E6%BF%9F%E7%90%86%E8%AB%96%E4%BE%86%E9%A0%90%E6%B8%AC%E7%B6%93%E6%BF%9F%E8%B6%A8%E5%8B%A2%E6%AF%94%E8%BC%83%E6%A9%9F%E5%99%A8%E5%AD%B8%E7%BF%92/) (Why economic theory is needed to forecast economic trends: comparing machine learning and econometrics), 2021.2.26
- [Vapnik–Chervonenkis dimension](https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension)

<center>

![](https://i.imgur.com/PY21A4V.png =350x)

</center>

### Statistical practice
- Nonparametric analysis
    - Rank correlation
        - Spearman rank correlation coefficient
        - Kendall rank correlation coefficient
    - Single population
        - Sign test
        - Wilcoxon signed-rank test
    - Two dependent populations
        - Paired sign test
        - Wilcoxon matched-pairs signed-rank test
    - Two independent populations
        - Wilcoxon rank-sum test
        - Mann-Whitney U test
    - Multiple independent populations
        - Kruskal-Wallis test
    - Multiple dependent populations
        - Friedman test
    - Tests of randomness
        - Runs test
- Kernel density estimation (KDE) [link](https://scikit-learn.org/stable/modules/density.html)

<center>

![](https://i.imgur.com/ATECGO7.png =400x)

</center>

- Small-sample analysis
    - Fisher's exact test [link](http://blog.pulipuli.info/2017/05/fishers-exact-test-example.html)

<center>

![](https://i.imgur.com/JyiaDyG.jpg =250x)

</center>

- Bimodal/multimodal distributions
    - https://en.wikipedia.org/wiki/Mixture_model
- Modeling of extremes (Extreme value
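  theory, EVT)
    - https://en.wikipedia.org/wiki/Heavy-tailed_distribution
    - https://en.wikipedia.org/wiki/Extreme_value_theory

## Candidate topics
- Chinese culture has a tradition of eating nourishing tonic foods in winter. If this winter is unusually cold, does that push up the prices of food-sector stocks?
- Markov Chain Monte Carlo (MCMC)
- When day trading accounts for a growing share of the daily trading value, does it signal that the market is about to turn bearish?
- Epidemiological models https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology
- Is it really a power shortage, or overconsumption?
    - What share of electricity consumption goes to cryptocurrency mining?

## Data sources

### Taiwan government open data
- Government open data platform: https://data.gov.tw/
- Taipei City data platform: https://data.taipei/
- Central Weather Bureau open data: https://opendata.cwb.gov.tw/dataset/observation
- Earnings explorer (薪情體驗): https://earnings.dgbas.gov.tw/experience_sub_01.aspx
    - https://www.numbeo.com/cost-of-living/
- National Development Council population projection query system: https://pop-proj.ndc.gov.tw/index.aspx
- Department of Statistics, Ministry of the Interior: https://www.moi.gov.tw/stat/index.aspx
- Ministry of the Interior real estate actual-price transaction query: https://lvr.land.moi.gov.tw/homePage.action
    - Is analyzing real estate with code feasible? A look at housing price analysis, by [FinLab](https://www.finlab.tw/real-estate-analasys-histograms/)
- Ministry of Culture open data service: https://opendata.culture.tw
- Taiwan Power Company: https://www.taipower.com.tw/tc/index.aspx
    - Dataset list on the government open data platform: Taiwan Power Company [link](https://sheethub.com/data.gov.tw/%E6%94%BF%E5%BA%9C%E8%B3%87%E6%96%99%E9%96%8B%E6%94%BE%E5%B9%B3%E8%87%BA%E8%B3%87%E6%96%99%E9%9B%86%E6%B8%85%E5%96%AE/i/44/%E5%8F%B0%E7%81%A3%E9%9B%BB%E5%8A%9B%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B8)
- Lottery
    - 超讚的樂透網 (lottery site): https://zan01.com/
    - 樂透堂 (lottery site): http://www.9800.com.tw/

### International open data sources
- U.S.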
  Census Bureau: https://www.census.gov/
- World Bank: https://data.worldbank.org/
- NASA: https://nasa.github.io/data-nasa-gov-frontpage/data_visualizations.html
- Data World: https://data.world/
- Human Development Reports: http://www.hdr.undp.org/en
- Sports Reference: https://www.baseball-reference.com/
- Data bank of Bank of England: https://www.bankofengland.co.uk/statistics

### Analytics platforms
- Google Data Studio: https://datastudio.google.com/

### Competition platforms
- https://www.kaggle.com/datasets/
- https://zindi.africa/competitions

## References

### Books

#### Textbooks used previously
- Thomas Haslwanter, [An Introduction to Statistics with Python](https://www.springer.com/us/book/9783319283159), 2016 <font size = -1 color = "gray">Downloadable from within the NTU campus IP range.</font> ![](https://i.imgur.com/ypOd5fQ.png =100x)
- José Unpingco, [Python for Probability, Statistics, and Machine Learning](https://link.springer.com/book/10.1007%2F978-3-319-30717-6), 2/e [link](https://link.springer.com/book/10.1007%2F978-3-030-18545-9) <font size = -1 color = "gray">Downloadable from within the NTU campus IP range.</font> ![](https://i.imgur.com/vdfeIre.png =100x)
- Jake VanderPlas, [Python Data Science Handbook](https://www.oreilly.com/library/view/python-data-science/9781491912126/), 2016 <font size = -1>[online](https://github.com/jakevdp/PythonDataScienceHandbook)</font> <font size = -1>[github](https://github.com/jakevdp/PythonDataScienceHandbook)</font> ![](https://i.imgur.com/Bdf0PTa.png =100x)

#### Probability theory
- Sheldon Ross, [Introduction to Probability Models](https://www.elsevier.com/books/introduction-to-probability-models/ross/978-0-12-814346-9), 12/e, 2019 ![](https://i.imgur.com/57W4fGF.jpg =100x)

#### Mathematical statistics
- Robert V. Hogg, Joseph W. McKean, and Allen T. Craig, [Introduction to Mathematical Statistics](https://www.amazon.com/-/zh_TW/dp/0134686993/), 8/e, 2019 ![](https://i.imgur.com/2juN4uq.png =100x)
- George Casella and Roger L.
  Berger, [Statistical Inference](https://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126), 2/e, 2001 ![](https://i.imgur.com/D9fxPRb.png =100x)

#### Design of experiments
- Douglas C. Montgomery, [Design and Analysis of Experiments](https://www.wiley.com/en-gb/Design+and+Analysis+of+Experiments,+9th+Edition-p-9781119320937), 9/e, 2017 ![](https://i.imgur.com/KH0xEM8.jpg =100x)
- Angela Dean, Daniel Voss, and Danel Draguljić, [Design and Analysis of Experiments](https://link.springer.com/book/10.1007/978-3-319-52250-0), 2017 ![](https://i.imgur.com/5JKXVjP.jpg =100x)

#### General statistics
- Barbara Blatchley, [Statistics in Context](https://www.amazon.com/Statistics-Context-Barbara-Blatchley/dp/0190278951), 2018 ![](https://i.imgur.com/MFDwoYZ.png =100x)
- 張翔 and 廖崇智, ++提綱挈領學統計++ (Learning Statistics: The Essentials), 8th edition, 2019/6/14 ![](https://i.imgur.com/DyXbeFs.png =130x)
- 許誠哲, ++統計學:重點觀念與題解++ (Statistics: Key Concepts and Worked Solutions), 2018/3/1 ![](https://i.imgur.com/BdkBLQ0.png =90x) ![](https://i.imgur.com/2w0llwW.png =90x)
- David M. Lane et al., ++Online Statistics Education++: http://onlinestatbook.com/Online_Statistics_Education.pdf

#### Time series
- Shiu-Sheng Chen (陳旭昇), [時間序列分析 - 總體經濟與財務金融之應用](http://homepage.ntu.edu.tw/~sschen/Book/Book2.htm) (Time Series Analysis: Applications in Macroeconomics and Finance), 2nd edition ![](https://i.imgur.com/b2Ubehb.png =100x)
- ++Introduction to Econometrics with R++: https://www.econometrics-with-r.org/index.html

#### Machine learning
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, [An Introduction to Statistical Learning with Applications in R](https://link.springer.com/book/10.1007/978-1-4614-7138-7), 2013 ![](https://i.imgur.com/sCY7WYs.png =100x)
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman, [The Elements of Statistical Learning: Data Mining, Inference, and Prediction](https://link.springer.com/book/10.1007/978-0-387-84858-7), 2009 ![](https://i.imgur.com/u1Mi1mt.png =100x)

### Popular science reading
- History of Statistics: https://www.york.ac.uk/depts/maths/histstat/ ![](https://i.imgur.com/UzJP2o3.png =100x)
- Andrew Vickers (安德魯·維克斯), ++34個讓你豁然開朗的統計學小故事++ (34 short statistics stories that make things click), 2019/03/28 ![](https://i.imgur.com/QYCD6Fo.png =140x)
-
  Robert Abelson (羅伯特·艾貝爾森), ++一位耶魯大學教授的統計箴言++ (A Yale professor's maxims on statistics), 2019/05/28 ![](https://i.imgur.com/okSuBrz.png =100x)
- 塚本邦尊 et al., ++東京大學資料科學家養成全書:使用Python動手學習資料分++ (a University of Tokyo guide to training data scientists, hands-on with Python) ![](https://i.imgur.com/FebESuA.png =150x)

### Courses in Taiwan and abroad
- Prof. Shiu-Sheng Chen (http://homepage.ntu.edu.tw/~sschen/)
- Dr. Shao-Wei Cheng (http://www.stat.nthu.edu.tw/~swcheng/)
    - ++Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2820/index.php
    - ++Probability++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/math2810/index.php
    - ++Experimental Design and Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5510/index.php
    - ++Mathematical Statistics++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat3875/index.php
    - ++Discrete Analysis++: http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5230/index.html
- ++STA-663-2017++: http://people.duke.edu/~ccc14/sta-663-2017/
- Dr. Peter Kempthorne, ++Lecture notes on Probability and Statistics++: http://users.encs.concordia.ca/~doedel/courses/comp-233/slides.pdf
- ++Mathematical Statistics++: https://ocw.mit.edu/courses/mathematics/18-655-mathematical-statistics-spring-2016/lecture-notes/
- Emmanuel Candès, https://statweb.stanford.edu/~candes/
    - ++Theory of Statistics++: https://statweb.stanford.edu/~candes/teaching/stats300c/
    - ++Modern Markov Chain++: https://statweb.stanford.edu/~candes/teaching/stats318/
- Oleg Melnikov, ++Introduction to Statistical Inference++: http://stats200.stanford.edu/
- Liam Paninski, ++Computational Statistics++: http://www.stat.columbia.edu/~liam/teaching/compstat-spr19/
- David Aldous, ++Probability and the Real World++: https://www.stat.berkeley.edu/~aldous/157
- Bocheng Jing, ++Sta102 - Intro Biostatistics++: https://www2.stat.duke.edu/courses/Spring13/sta102.001/

### Miscellaneous
- Secretary problem
    - https://zh.wikipedia.org/wiki/%E7%A7%98%E6%9B%B8%E5%95%8F%E9%A1%8C
    - https://style.udn.com/style/story/8073/1452739
- http://www.statslife.org.uk/images/pdf/timeline-of-statistics.pdf
- 王超辰, Medical Statistics (醫學統計學): https://bookdown.org/ccwang/medical_statistics6/
- Ioane
  Muni Toke, ++An Introduction to Hawkes Processes with Applications to Finance++, 2011: http://lamp.ecp.fr/MAS/fiQuant/ioane_files/HawkesCourseSlides.pdf
- https://ourworldindata.org/grapher/annual-working-hours-vs-gdp-per-capita-pwt
- https://wol.iza.org/uploads/articles/228/pdfs/female-education-and-its-impact-on-fertility.pdf

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=691888123&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "650px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=673696095&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "400px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=1391095507&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "500px"></iframe>

<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQy4hVPBShjYibAF4SvDXxctkOVNOIiho5Wx56nXTP4sonS-XFHZL2galgDbGdRzc4DrjQPo_8pabrK/pubhtml?gid=0&amp;single=true&amp;widget=true&amp;headers=false" width = "700px" height = "500px"></iframe>
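
As a supplementary note to Lab 2, the linear congruential recurrence with $a = 1664525$, $c = 1013904223$, and $m = 2^{32}$ can be sketched roughly as follows. This is only a minimal illustration and not the course's demo code: the function name `lcg_uniform` is hypothetical, and dividing each state by $m$ to map it into $[0, 1)$ is an assumed convention that the lab statement does not spell out.

```python
def lcg_uniform(seed, n, a=1664525, c=1013904223, m=2**32):
    """Return n pseudorandom floats in [0, 1) from a linear congruential generator.

    `lcg_uniform` is a hypothetical name used for illustration only.
    """
    samples = []
    x = seed
    for _ in range(n):
        # Recurrence from Lab 2: x_n = (a * x_{n-1} + c) mod m
        x = (a * x + c) % m
        # Assumed scaling: x lies in {0, ..., m - 1}, so x / m lies in [0, 1)
        samples.append(x / m)
    return samples


if __name__ == "__main__":
    # The same seed always reproduces the same sequence.
    print(lcg_uniform(seed=42, n=5))
```

The returned list can then be checked as the lab suggests, for example with a histogram or with scipy.stats.kstest(samples, "uniform").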