# MIT 6.0002

You can find the complete Python code at [MIT 6.0002 (Introduction to Computational Thinking and Data Science) - Fall 2016](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/).

## introduction

### computational models

1. optimization
    * an objective function to be maximized (or minimized)
    * a set of constraints - may be empty; more constraints often shrink the search space and simplify the problem
    * local "optimum" - found with a greedy algorithm
    * dynamic programming can find the correct solution (the global optimum)
    * optional study:
        * [stanford for high school](https://web.stanford.edu/group/sisl/k12/optimization/MO-unit1-pdfs/1.1optimization.pdf)
        * linear programming
        * Newton's method
        * penalty method
        * ...etc
        * [wikipedia](https://en.wikipedia.org/wiki/Mathematical_optimization#Major_subfields)
2. statistical
3. simulation
    * [Stochastic Thinking and Random Walks](https://hackmd.io/EeTggTSOSnmt2eUcerjDEA?view#Stochastic-Thinking-and-Random-Walks)

## graph-theoretic models

Two building blocks of a graph:

1. node (vertex)
2. edge (arc)
    * undirected
    * directed
    * weighted or not

#### using python

```python
class node(object):
    def __init__(self, name):
        self.name = name

    def get_name(self):
        return self.name

    def __str__(self):
        return self.name


class edge(object):
    def __init__(self, src, dst):
        self.src = src
        self.dst = dst

    def get_source(self):
        return self.src

    def get_destination(self):
        return self.dst

    def __str__(self):
        return self.src.get_name() + '->'\
               + self.dst.get_name()
```

### Common representations of Digraphs

A `digraph` maps each node to a list of its children (an adjacency list).

#### using python

```python
class digraph(object):
    """map each node to a list of its children (adjacency list)"""
    def __init__(self):
        self.edges = {}

    def add_node(self, node):
        if node in self.edges:
            print("duplicate node")
        else:
            self.edges[node] = []

    def add_edge(self, edge):
        src = edge.get_source()
        dst = edge.get_destination()
        # add any endpoint that is not yet in the graph
        if src not in self.edges:
            self.add_node(src)
        if dst not in self.edges:
            self.add_node(dst)
        self.edges[src].append(dst)

    def children_of(self, node):
        return self.edges[node]

    def has_node(self, node):
        return node in self.edges

    def get_node(self, name):
        for n in self.edges:
            if n.get_name() == name:
                return n
        raise NameError(name)

    def __str__(self):
        ret = ''
        for src in self.edges:
            for dst in self.edges[src]:
                ret = ret + src.get_name() + '->'\
                      + dst.get_name() + '\n'
        return ret[:-1]
```

```python
class graph(digraph):
    def add_edge(self, e):
        """an undirected graph: store each edge in both directions"""
        digraph.add_edge(self, e)
        rev = edge(e.get_destination(), e.get_source())
        digraph.add_edge(self, rev)
```

Can use the graph to find the shortest path:

1. DFS
2. left-first
3. [BFS](http://www.geeksforgeeks.org/breadth-first-traversal-for-a-graph/) - see the sketch after the DFS code below

See also: [A* search](http://www.geeksforgeeks.org/a-search-algorithm/)

#### dfs

```python
def printPath(path):
    """Assumes path is a list of nodes; joins their names with '->'"""
    return '->'.join(str(node) for node in path)

def DFS(graph, start, end, path, shortest, toPrint):
    path = path + [start]
    if toPrint:
        print('path:', printPath(path))
    if start == end:
        return path
    for node in graph.children_of(start):
        if node not in path:
            if shortest == None or len(path) < len(shortest):
                newPath = DFS(graph, node, end, path, shortest, toPrint)
                if newPath != None:
                    shortest = newPath
            elif toPrint:
                print("visited", node)
    return shortest

def shortest_path(graph, s, end, toPrint = False):
    return DFS(graph, s, end, [], None, toPrint)
```
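The notes above only link to BFS, so here is a minimal sketch of a breadth-first shortest-path search, written against the `digraph` class and the `printPath` helper defined above (a standard queue-of-paths approach, not code from these notes):

```python
def BFS(graph, start, end, toPrint=False):
    """Returns a shortest path (fewest edges) from start to end, or None.
    Keeps a queue of partial paths and always extends the oldest one first."""
    pathQueue = [[start]]
    while len(pathQueue) != 0:
        # get and remove the oldest path in the queue
        tmpPath = pathQueue.pop(0)
        if toPrint:
            print('Current BFS path:', printPath(tmpPath))
        lastNode = tmpPath[-1]
        if lastNode == end:
            return tmpPath
        for nextNode in graph.children_of(lastNode):
            if nextNode not in tmpPath:
                pathQueue.append(tmpPath + [nextNode])
    return None
```

Because BFS explores paths in order of increasing length, the first path that reaches `end` is guaranteed to be shortest in terms of the number of edges.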
## Stochastic Thinking and Random Walks

#### about the physical world (physics)

* Newtonian Mechanics
* Copenhagen Interpretation
* nondeterminism - outcomes cannot be predicted exactly ...

So:

### Stochastic Processes

* unpredictable
* random process

```python
import random

def rollDie():
    """returns a random int between 1 and 6"""
    return random.choice([1, 2, 3, 4, 5, 6])

def testRoll(n = 10):
    result = ''
    for i in range(n):
        result = result + str(rollDie())
    print(result)
```

* Probability of Various Results
    * ${1\over 6^n}$ for a particular sequence of $n$ rolls
    * a probability always lies between 0 and 1
* Independence
    * each event does not influence the others
* A Simulation of Die Rolling

```python
def runSim(goal, numTrials, txt):
    total = 0
    for i in range(numTrials):
        result = ''
        for j in range(len(goal)):
            result += str(rollDie())
        if result == goal:
            total += 1
    estProbability = round(total/numTrials, 8)
    print('Estimated probability of', txt, '=', estProbability)
    actProbability = round(1/(6**len(goal)), 8)
    print('Actual probability of', txt, '=', actProbability)

runSim('11111', 1000, '11111')
```

* with only 1000 trials the estimated probability is usually 0 - far too few trials for an event whose actual probability is only about 0.0001286
* Approximating Using a Simulation
* The Birthday Problem
    * The probability of at least two people in a group having the same birthday
    * birthday paradox
    * persons $i$ and $j$ both fall on a fixed day $r$
        * $\Pr\{ b_i = r\ \&\ b_j = r \} = {1 \over n^2}$, if birthdays are uniformly distributed
    * persons $i$ and $j$ fall on the same day
        * $\Pr\{ b_i = b_j \} = \sum_{s=1}^n \Pr\{ b_i = b_j = s \} = \sum_{s=1}^n{1 \over n^2} = {1\over n}$, if birthdays are uniformly distributed
    * this assumes all birthdays are equally likely
    * $1 - {366!\over 366^N\,(366-N)!}$
        * the probability is capped at 1; by the pigeonhole principle a shared birthday is certain once $N > 366$
        * ${366!\over 366^N\,(366-N)!}$ is the probability that all $N$ birthdays are different
    * general derivation, following Introduction to Algorithms (Cormen et al.)
        * $\Pr\{ B_k\} = \Pr\{B_{k-1}\}\Pr\{A_k|B_{k-1}\}$, where $B_k = \bigcap_{i=1}^k{A_i}$ and $A_k=\{b_k \neq b_j\ \forall j<k \}$
        * by conditional probability, $\Pr\{A_k|B_{k-1}\} = {n-k+1 \over n}$
        * then $\Pr\{ B_k \} = \Pr\{ B_1\}\Pr\{A_2|B_1\}\cdots\Pr\{A_k|B_{k-1}\} = 1\cdot(1-{1\over n})(1-{2 \over n})\cdots(1-{k-1 \over n})$
        * if $n = 365$, then $\Pr\{ B_k\} = {365!\over 365^k\,(365-k)!}$, so the probability of a shared birthday is $1 - {365!\over 365^k\,(365-k)!}$
        * a simpler bound via an inequality
            * $1-{k-1 \over n} \leq e^{-(k-1)/n}$, since $1-x \leq e^{-x}$ (Taylor expansion)
            * so $\Pr\{B_k\} \leq e^{-k(k-1)/2n}$
    * using indicator random variables (Introduction to Algorithms, Cormen et al.)
        * $E[X_{ij}] = {1\over n}$, where $X_{ij} = I\{$person $i$ and person $j$ have the same birthday$\}$
        * $X = \sum_{i=1}^k\sum_{j=i+1}^k{X_{ij}}$, so $E[X] =\sum_{i=1}^k\sum_{j=i+1}^k{E[X_{ij}]} = \binom{k}{2}{1 \over n} = {k(k-1)\over 2n}$

```python
import math
import random

def sameDate(numPeople, numSame):
    possibleDates = range(366)
    birthdays = [0]*366
    for p in range(numPeople):
        birthDate = random.choice(possibleDates)
        birthdays[birthDate] += 1
    return max(birthdays) >= numSame

def birthdayProb(numPeople, numSame, numTrials):
    numHits = 0
    for t in range(numTrials):
        if sameDate(numPeople, numSame):
            numHits += 1
    return numHits/numTrials

for numPeople in [10, 20, 40, 100]:
    print('For', numPeople,
          'est. prob. of a shared birthday is',
          birthdayProb(numPeople, 2, 10000))
    numerator = math.factorial(366)
    denom = (366**numPeople)*math.factorial(366 - numPeople)
    print('Actual prob. for N =', numPeople, '=', 1 - numerator/denom)
```

* the exact math gets hard quickly
    * 2 people sharing a birthday
        * $1 -$ (probability that every birthday is different)
    * 3 or more people sharing a birthday
        * requires subtracting more overlapping cases (Principle of Inclusion and Exclusion)
    * but the simulation stays easy
* All Dates Are Not Equally Likely... ?
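One way to probe that question is to drop the equal-likelihood assumption in the simulation. A minimal sketch (my own, not the lecture's code) that replaces `random.choice` with `random.choices` and a purely made-up weight list:

```python
import random

def sameDateWeighted(numPeople, numSame, weights):
    """Like sameDate above, but birthdays are drawn with the given relative weights."""
    possibleDates = range(366)
    birthdays = [0]*366
    # random.choices draws with replacement, proportionally to the weights
    for birthDate in random.choices(possibleDates, weights=weights, k=numPeople):
        birthdays[birthDate] += 1
    return max(birthdays) >= numSame

# illustrative, made-up weights: pretend summer dates are twice as likely
weights = [2 if 150 <= d < 240 else 1 for d in range(366)]
hits = sum(sameDateWeighted(20, 2, weights) for _ in range(10000))
print('Est. prob. of a shared birthday with unequal dates:', hits/10000)
```

Skewing the weights can only make collisions more likely, so the uniform-birthday answer is a lower bound on the true probability.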
* Simulations
    * "All models are wrong, but some are useful." - George Box
    * https://en.wikipedia.org/wiki/Category:Stochastic_processes
* Random Walk
    * [Brownian Motion, book by Peter Mörters and Yuval Peres](https://www.stat.berkeley.edu/~peres/bmbook.pdf)
    * [the drunkard's walk explained](https://medium.com/i-math/the-drunkards-walk-explained-48a0205d304)

---

### Random walk

The random walk shows how to use a simulation to estimate the probability of an event.

* drunkard's walk
    * at each step the drunk can move in one of four directions: up, down, left, or right
    * after n steps, how far from the start should we expect the drunk to be?
    * use simulation
* set location and get location

```python
class Location(object):
    def __init__(self, x, y):
        """x and y are floats"""
        self.x = x
        self.y = y

    def move(self, deltaX, deltaY):
        """deltaX and deltaY are floats
        The move method creates and returns a new Location object."""
        return Location(self.x + deltaX, self.y + deltaY)

    def getX(self):
        return self.x

    def getY(self):
        return self.y

    def distFrom(self, other):
        """get the distance between two Location objects"""
        xDist = self.x - other.getX()
        yDist = self.y - other.getY()
        return (xDist**2 + yDist**2)**0.5

    def __str__(self):
        return '<' + str(self.x) + ', '\
               + str(self.y) + '>'
```

* drunk - UsualDrunk or MasochistDrunk

```python
class Drunk(object):
    """a base class, not meant to be used on its own"""
    def __init__(self, name = None):
        """Assumes name is a str"""
        self.name = name

    def __str__(self):
        if self.name != None:
            return self.name
        return 'Anonymous'
```

```python
import random

class UsualDrunk(Drunk):
    """inherits from Drunk; wanders with no preferred direction"""
    def takeStep(self):
        # up, down, right, left
        stepChoices = [(0, 1), (0, -1), (1, 0), (-1, 0)]
        return random.choice(stepChoices)

class MasochistDrunk(Drunk):
    """inherits from Drunk; biased to drift upward"""
    def takeStep(self):
        stepChoices = [(0.0, 1.1), (0.0, -0.9), (1.0, 0.0), (-1.0, 0.0)]
        return random.choice(stepChoices)
```

* Field

```python
class Field(object):
    def __init__(self):
        """map each drunk to its current location"""
        self.drunks = {}

    def addDrunk(self, drunk, loc):
        """add a drunk to the field"""
        if drunk in self.drunks:
            raise ValueError('Duplicate drunk')
        else:
            self.drunks[drunk] = loc

    def getLoc(self, drunk):
        if drunk not in self.drunks:
            raise ValueError('Drunk not in field')
        return self.drunks[drunk]

    def moveDrunk(self, drunk):
        if drunk not in self.drunks:
            raise ValueError('Drunk not in field')
        # apply the usual or masochist step
        xDist, yDist = drunk.takeStep()
        # use Location.move to compute and store the new location
        self.drunks[drunk] =\
            self.drunks[drunk].move(xDist, yDist)
```

* Simulation
    * Simulating a Single Walk
        * Assumes:
            * `f` is a Field
            * `d` is a Drunk in `f`
            * `numSteps` is an integer >= 0
        * Moves `d` numSteps times
        * returns the distance between the final location and the location at the start of the walk

```python
def walk(f, d, numSteps):
    start = f.getLoc(d)
    for s in range(numSteps):
        f.moveDrunk(d)
    return start.distFrom(f.getLoc(d))
```
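A small usage sketch (not from the lecture) that ties the classes together: put a single `UsualDrunk` on a `Field` at the origin and run one 1000-step walk.

```python
# illustrative usage of the classes defined above
homer = UsualDrunk('Homer')
field = Field()
field.addDrunk(homer, Location(0, 0))
print('Distance from start after 1000 steps:', round(walk(field, homer, 1000), 1))
```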
* Simulating Multiple Walks
    * Assumes:
        * `numSteps` is an int >= 0
        * `numTrials` is an int > 0
        * `dClass` is a subclass of Drunk
    * Simulates numTrials walks of numSteps steps each
    * `walk()` (above) simulates a single walk
    * Returns a list of the final distances for each trial

```python
def simWalks(numSteps, numTrials, dClass):
    Homer = dClass()
    origin = Location(0, 0)
    distances = []
    for t in range(numTrials):
        f = Field()
        f.addDrunk(Homer, origin)
        distances.append(round(walk(f, Homer, numSteps), 1))
    return distances
```

* Testing
    * [lect5.py](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/lect5.py)
    * you can observe it by yourself~
* Mathematical meaning
    * [Markov chain view of a random walk](https://en.wikipedia.org/wiki/Random_walk#As_a_Markov_chain)
    * [about Brownian motion (in Chinese)](http://episte.math.ntu.edu.tw/articles/mm/mm_09_3_03/page2.html)

---

## Monte Carlo simulation

### Background

* Inferential statistics
    * Population: a set of examples
    * Sample: a proper subset of a population
    * Key fact: a random sample tends to exhibit the same properties as the population from which it is drawn
* Monte Carlo simulation is an important estimation technique for getting approximate answers
    * e.g., flip a coin many times and use the observed fraction of heads as an estimate of the probability of heads
* Confidence in the estimate depends on:
    1. Size of the sample
    2. Variance of the sample
* Roulette simulation
    * game class

```python
import random

class FairRoulette():
    def __init__(self):
        self.pockets = []
        for i in range(1, 37):
            self.pockets.append(i)
        self.ball = None
        self.pocketOdds = len(self.pockets) - 1

    def spin(self):
        self.ball = random.choice(self.pockets)

    def betPocket(self, pocket, amt):
        if str(pocket) == str(self.ball):
            return amt*self.pocketOdds
        else:
            return -amt

    def __str__(self):
        return 'Fair Roulette'
```

* Simulation method

```python
def playRoulette(game, numSpins, pocket, bet, toPrint):
    totPocket = 0
    for i in range(numSpins):
        game.spin()
        totPocket += game.betPocket(pocket, bet)
    if toPrint:
        print(numSpins, 'spins of', game)
        print('Expected return betting', pocket, '=',
              str(100*totPocket/numSpins) + '%\n')
    return totPocket/numSpins
```

* Use it for a Monte Carlo simulation

```python
game = FairRoulette()
for numSpins in (100, 1000000):
    for i in range(3):
        playRoulette(game, numSpins, 2, 1, True)
```

```
100 spins of Fair Roulette
Expected return betting 2 = -100.0%
100 spins of Fair Roulette
Expected return betting 2 = 44.0%
100 spins of Fair Roulette
Expected return betting 2 = -28.0%
1000000 spins of Fair Roulette
Expected return betting 2 = -0.046%
1000000 spins of Fair Roulette
Expected return betting 2 = 0.602%
...etc
```

* Observation
    * with a larger number of spins, the estimated return (`totPocket/numSpins`) is closer to the true expected return
    * "In repeated independent tests with the same actual probability `p` of a particular outcome in each test, the chance that the fraction of times that outcome occurs differs from `p` converges to zero as the number of trials goes to infinity" - from the lecture slide (the law of large numbers)
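The coin-flip idea from the Background above gives an even simpler way to see this convergence. A minimal sketch (my own, not from the lecture code):

```python
# illustrative sketch: the estimated probability of heads converges to 0.5
# as the number of flips grows (law of large numbers)
random.seed(0)
for numFlips in (100, 10000, 1000000):
    heads = sum(random.choice((0, 1)) for _ in range(numFlips))
    print(numFlips, 'flips, fraction of heads =', heads/numFlips)
```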
* You should know, from statistics:
    * Gambler's Fallacy
        * [wikipedia](https://en.wikipedia.org/wiki/Gambler%27s_fallacy)
    * Do not confuse:
        1. Independent events
        2. Non-independent events
* Roulette variants - you can see the simulation results in the lecture

```python
class EuRoulette(FairRoulette):
    def __init__(self):
        FairRoulette.__init__(self)
        self.pockets.append('0')

    def __str__(self):
        return 'European Roulette'

class AmRoulette(EuRoulette):
    def __init__(self):
        EuRoulette.__init__(self)
        self.pockets.append('00')

    def __str__(self):
        return 'American Roulette'
```

* Regression to the Mean (RTM)
    * [Reading (in Chinese) - 現代啟示錄](https://ccjou.wordpress.com/2014/06/10/回歸均值/)
* Standard Deviation

```python
def getMeanAndStd(X):
    mean = sum(X)/float(len(X))
    tot = 0.0
    for x in X:
        tot += (x - mean)**2
    std = (tot/len(X))**0.5
    return mean, std
```

* Variance
* Confidence interval - next section
* Central limit theorem - next section
* Empirical rule
* Applying the Empirical Rule - you can find the simulation results in the lecture

```python
def findPocketReturn(game, numTrials, trialSize, toPrint):
    pocketReturns = []
    for t in range(numTrials):
        trialVals = playRoulette(game, trialSize, 2, 1, toPrint)
        pocketReturns.append(trialVals)
    return pocketReturns

resultDict = {}
games = (FairRoulette, EuRoulette, AmRoulette)
for G in games:
    resultDict[G().__str__()] = []

numTrials = 20
for numSpins in (100, 1000, 10000):
    print('\nSimulate betting a pocket for', numTrials,
          'trials of', numSpins, 'spins each')
    for G in games:
        pocketReturns = findPocketReturn(G(), numTrials, numSpins, False)
        mean, std = getMeanAndStd(pocketReturns)
        resultDict[G().__str__()].append((numSpins, 100*mean, 100*std))
        print('Exp. return for', G(), '=',
              str(round(100*mean, 3)) + '%,',
              '+/-' + str(round(100*1.96*std, 3)) + '% with 95% confidence')
```

* Assumptions Underlying the Empirical Rule
    * The mean estimation error is zero
    * The errors are normally distributed
    * [Distribution](https://en.wikipedia.org/wiki/Probability_distribution)
    * [PDF](https://en.wikipedia.org/wiki/Probability_density_function)
    * [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
* Normal distribution
    * the PDF is $P(x) = {1\over \sigma\sqrt{2\pi}}\,e^{-{(x-\mu)^2 \over 2\sigma^2}}$, where $e$ is Euler's number
* Sampling Space of Possible Outcomes
    * Depends upon ==++variability++== in the underlying distribution

---

### Confidence-interval

[The Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution)

* Generate normally distributed data
    * [random.gauss](https://docs.python.org/3/library/random.html#random.gauss)
    * [numpy.random.normal](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.normal.html)
    * [scipy.stats.norm](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)
    * [self-learning](https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html)
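A minimal sketch (my own, not from the lecture) that generates data with `random.gauss` and checks the empirical rule: roughly 95% of draws should land within 1.96 standard deviations of the mean.

```python
import random

random.seed(0)
mu, sigma = 0, 1
# draw 100,000 samples from a standard normal distribution
samples = [random.gauss(mu, sigma) for _ in range(100000)]
within = sum(abs(s - mu) <= 1.96*sigma for s in samples)
print('Fraction within 1.96 std devs:', within/len(samples))  # ~0.95
```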
* Central Limit Theorem (CLT)
    1. given a sufficiently large sample
    2. of independent random variables
    3. the means of the samples will be approximately normally distributed

```python
import random
import pylab

def plotMeans(numDice, numRolls, numBins, legend, color, style):
    means = []
    for i in range(numRolls//numDice):
        vals = 0
        for j in range(numDice):
            vals += 5*random.random()
        means.append(vals/float(numDice))
    pylab.hist(means, numBins, color = color, label = legend,
               weights = [1/len(means)]*len(means),
               hatch = style)
    return getMeanAndStd(means)
```

* when computing a confidence interval
    * if the sample size is sufficiently large, the CLT lets us apply the empirical rule even when the underlying distribution is not normal
* [Buffon's needle](https://en.wikipedia.org/wiki/Buffon%27s_needle)
    * we can use a Monte Carlo simulation to estimate $\pi$

```python
import random
import numpy

def throwNeedles(numNeedles):
    inCircle = 0
    for Needles in range(1, numNeedles + 1, 1):
        x = random.random()
        y = random.random()
        if (x*x + y*y)**0.5 <= 1.0:
            inCircle += 1
    # counting needles in one quadrant only, so multiply by 4
    return 4*(inCircle/float(numNeedles))

def getEst(numNeedles, numTrials):
    estimates = []
    for t in range(numTrials):
        piGuess = throwNeedles(numNeedles)
        estimates.append(piGuess)
    sDev = numpy.std(estimates)
    curEst = sum(estimates)/len(estimates)
    print('Est. = ' + str(curEst) +
          ', Std. dev. = ' + str(round(sDev, 6)) +
          ', Needles = ' + str(numNeedles))
    return (curEst, sDev)

def estPi(precision, numTrials):
    numNeedles = 1000
    sDev = precision
    while sDev >= precision/2:
        curEst, sDev = getEst(numNeedles, numTrials)
        numNeedles *= 2
    return curEst

random.seed(0)
estPi(0.005, 100)
```

## Sampling and standard error

* recall
    * "With Monte Carlo simulation we can generate lots of random samples, and use them to compute confidence intervals" - `lecture slide`

### Probability Sampling

* each member of the population has a known, nonzero chance of being chosen; in simple random sampling, every member has the same chance

#### Stratified

* split the population into subgroups (strata) and sample each subgroup
* `TBD`

### conclusion of sampling

* a larger sample gives a more reliable estimate (smaller standard error)

## Understanding Experimental Data

* Statistics meets experimental science
* Hooke's law
    * $F = -kd$
    * if $k$ = 35,000 N/m
        * $F$ = 350 N when $d$ = 0.01 m
    * how much mass does it take to compress the spring by 1 cm?
        * $F = ma$ (mass times acceleration)
        * $Mass = F/g = {350\,N \over 9.81\,m/s^2} \approx 35.68$ kilograms (since $N = kg \cdot m/s^2$)
    * data for finding $k$

```
Distance (m)   Mass (kg)
0.0865         0.1
0.1015         0.15
0.1106         0.2
0.1279         0.25
0.1892         0.3
0.2695         0.35
0.2888         0.4
0.2425         0.45
0.3465         0.5
0.3225         0.55
0.3764         0.6
0.4263         0.65
0.4562         0.7
```

* you can see the data distribution in the lecture slide
* Fitting curves to data
    * objective function
    * how to measure the distance from an observed point to the prediction
        * example: the distance can be measured along the x-axis, along the y-axis, or perpendicular to the fitted line (see the figure in the lecture slide)
    * [Least squares objective function](https://en.wikipedia.org/wiki/Least_squares)
        * $\sum^{len(observed)-1}_{i=0}{(observed[i]-predicted[i])^2}$
    * [Regression](https://en.wikipedia.org/wiki/Regression_analysis)
        * can use [linear regression](https://en.wikipedia.org/wiki/Linear_regression) to solve it
        * use pylab.polyfit(x, y, n)

```python
def fitData(fileName):
    # getData and labelPlot are helpers from the lecture code
    # (a sketch of getData follows this block)
    xVals, yVals = getData(fileName)
    xVals = pylab.array(xVals)
    yVals = pylab.array(yVals)
    xVals = xVals*9.81  # convert mass to force
    # blue points in the lecture slide
    pylab.plot(xVals, yVals, 'bo', label = 'Measured points')
    labelPlot()
    # use polyfit to do a degree-1 (linear) least-squares fit of Y against X
    a, b = pylab.polyfit(xVals, yVals, 1)
    # Y = aX + b
    estYVals = a*pylab.array(xVals) + b
    print('a =', a, 'b =', b)
    pylab.plot(xVals, estYVals, 'r',
               label = 'Linear fit, k = ' + str(round(1/a, 5)))
    # or use polyval to calculate the fitted values:
    #   model = pylab.polyfit(xVals, yVals, 1)
    #   estYVals = pylab.polyval(model, xVals)
    #   pylab.plot(xVals, estYVals, 'r',
    #              label = 'Linear fit, k = ' + str(round(1/model[0], 5)))
    pylab.legend(loc = 'best')
```
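`fitData` relies on `getData` and `labelPlot` from the lecture file. Here is a minimal sketch of what such a `getData` could look like, assuming a whitespace-separated file with one header line, laid out like the spring-data table above (the lecture's own helper may differ):

```python
def getData(fileName):
    """A sketch: read (distance, mass) pairs from a two-column text file
    with a single header line, and return (masses, distances)."""
    distances = []
    masses = []
    with open(fileName, 'r') as dataFile:
        dataFile.readline()  # discard the header line
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
    return (masses, distances)
```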
* higher-degree fits
    * use pylab.polyfit(x, y, n) with $n \geq 2$
    * use the mean squared error to compare different fits

```python
def aveMeanSquareError(data, predicted):
    error = 0.0
    for i in range(len(data)):
        error += (data[i] - predicted[i])**2
    return error/len(data)
```

* [r-squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
    * ${R^2 = 1 - {\sum_i{(y_i-p_i)^2}\over \sum_i{(y_i-\mu)^2}}}$

```python
import numpy

def rSquared(observed, predicted):
    """observed and predicted are numpy arrays"""
    error = ((predicted - observed)**2).sum()
    meanError = error/len(observed)
    return 1 - (meanError/numpy.var(observed))
```

* cross validation
    * [overfitting](https://en.wikipedia.org/wiki/Overfitting)
    * split the dataset into dataset1 and dataset2
    * use dataset1 (dataset2) to predict dataset2 (dataset1), and use r-squared to pick the best model
    * helps avoid overfitting

## Introduction to machine learning

* Early definition
    * [Samuel, Arthur L. Computer Games I. Springer, New York, NY. pp. 335-365.](https://link.springer.com/chapter/10.1007/978-1-4613-8716-9_14)

#### What is machine learning

![](https://i.imgur.com/NjVOMMz.png)

reference: [MIT6_0002F16_lec11.pdf](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/MIT6_0002F16_lec11.pdf) - p.5

1. Observe a set of examples
    * training data
2. Infer something about the process that generated the training data
3. Use that inference to make predictions about previously unseen data
    * test data

* Variations on the paradigm
    * `Supervised`:
        * [the training set is a set of feature/label pairs](https://en.wikipedia.org/wiki/Supervised_learning)
    * `Unsupervised`:
        * [the training set has only feature vectors, no labels](https://en.wikipedia.org/wiki/Unsupervised_learning)
    * Example of classifying (supervised) and clustering (unsupervised)
        * data: ![](https://i.imgur.com/1VFtV2t.png)
        * classifying: ![](https://i.imgur.com/kLszSI4.png)
        * clustering: ![](https://i.imgur.com/nHJ9SPI.png)

#### Clustering is an optimization problem

* two popular methods
    * hierarchical clustering - worst case $O(n^3)$
        1. assign each item to its own cluster
        2. find the closest pair of clusters and merge them
            * [Minkowski distance](https://en.wikipedia.org/wiki/Minkowski_distance)
            * [distance metrics in ML (in Chinese)](http://www.cnblogs.com/heaad/archive/2011/03/08/1977733.html)
        3. repeat step 2 until only N clusters remain
    * k-means clustering - a greedy algorithm
        1. randomly choose k examples as the initial cluster centroids
        2. assign each example to the nearest centroid, then recompute each centroid as the mean of its cluster
        3. repeat step 2 until the clusters stop changing
        * it only finds a local optimum; different initial centroids can give different results
        * How to choose K
            * run hierarchical clustering on a subset of the data

#### Classification

* Supervised learning
    * Regression
        * e.g., linear regression
    * Classification
        * predicting the label
            1. measure the (average) distance from the new example to each labeled cluster
            2. choose the nearest cluster
* Advantages and drawbacks (of nearest-neighbor style classification)
    * advantages
        * no explicit training
        * fast to set up
    * drawbacks
        * memory intensive, and predictions can take a long time
        * [brute force](https://www.slideshare.net/ssuserf88631/knn-51511604)

## Damned lies and statistics

* Garbage in, Garbage out