# MIT 6.0002
You can find the complete Python code from [MIT 6.0002 (Introduction to Computational Thinking and Data Science), Fall 2016](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/)
## introduction
### computational models
1. optimization
    * an objective function to be maximized (or minimized)
    * a set of constraints
        - may be empty; adding constraints often shrinks the search space
    * a greedy algorithm finds a local "optimum"
    * dynamic programming can find the correct solution (the global optimum); see the knapsack sketch after this list
* optional study:
* [stanford for high school](https://web.stanford.edu/group/sisl/k12/optimization/MO-unit1-pdfs/1.1optimization.pdf)
* linear programming
* newton’s method
* penalty method
* ...etc
* [wikipedia](https://en.wikipedia.org/wiki/Mathematical_optimization#Major_subfields)
2. statistical
3. simulation
* [Stochastic Thinking and Random Walks](https://hackmd.io/EeTggTSOSnmt2eUcerjDEA?view#Stochastic-Thinking-and-Random-Walks)
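To make the greedy-versus-dynamic-programming contrast above concrete, here is a minimal 0/1 knapsack sketch (my own illustrative example in the spirit of the course's knapsack lectures; the item names, values, and weights are made up):
```python
def greedyKnapsack(items, maxWeight, keyFunction):
    """Greedy 0/1 knapsack: take items in decreasing order of keyFunction
    (e.g. value/weight) while they still fit. items: (name, value, weight) tuples."""
    taken, totalValue, totalWeight = [], 0, 0
    for name, value, weight in sorted(items, key=keyFunction, reverse=True):
        if totalWeight + weight <= maxWeight:
            taken.append(name)
            totalWeight += weight
            totalValue += value
    return taken, totalValue

def dpKnapsackValue(items, maxWeight):
    """Dynamic programming 0/1 knapsack (integer weights): best total value."""
    best = [0]*(maxWeight + 1)
    for name, value, weight in items:
        for w in range(maxWeight, weight - 1, -1):
            best[w] = max(best[w], best[w - weight] + value)
    return best[maxWeight]

items = [('A', 60, 10), ('B', 100, 20), ('C', 120, 30)]  # (name, value, weight)
print(greedyKnapsack(items, 50, lambda item: item[1]/item[2]))  # greedy by value density
print(dpKnapsackValue(items, 50))                               # global optimum
```
Greedy by value density picks items A and B (total value 160), while dynamic programming finds that B and C together (total value 220) are the global optimum.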
## graph-theoretic models
a graph is built from two kinds of components:
1. nodes (vertices)
2. edges (arcs)
* undirected
* directed
* weighted or not
#### using python
```python
class node(object):
def __init__(self,name):
self.name = name
def get_name(self):
return self.name
def __str__(self):
return self.name
class edge(object):
def __init__(self,src,dst):
self.src = src
self.dst = dst
def get_source(self):
return self.src
def get_destination(self):
return self.dst
def __str__(self):
return self.src.get_name() + '->'\
+ self.dst.get_name()
```
### Common representations of Digraphs
A digraph is a directed graph; a common representation is an adjacency list that maps each node to a list of its children, as in the code below.
#### using python
```python
class digraph(object):
"""
map each node to a list
"""
def __init__(self):
self.edges={}
def add_node(self,node):
if node in self.edges:
print("duplicate node")
else:
self.edges[node] = []
def add_edge(self, edge):
src = edge.get_source()
dst = edge.get_destination()
        if not (src in self.edges and dst in self.edges):
self.add_node(src)
self.add_node(dst)
self.edges[src].append(dst)
def children_of(self, node):
return self.edges[node]
def has_node(self,node):
return node in self.edges
def get_node(self,name):
for n in self.edges:
if n.get_name() == name:
return n
raise NameError(name)
def __str__(self):
ret = ''
for src in self.edges:
for dst in self.edges[src]:
ret = ret + src.get_name() + '->'\
+ dst.get_name() + '\n'
return ret[:-1]
```
```python
class graph(digraph):
    def add_edge(self, an_edge):
        """An undirected graph stores each edge in both directions."""
        digraph.add_edge(self, an_edge)
        rev = edge(an_edge.get_destination(), an_edge.get_source())
        digraph.add_edge(self, rev)
```
Can use the graph to find the shortest path:
1. DFS (depth-first search; the lecture's version enumerates paths left-first, depth-first)
2. [BFS](http://www.geeksforgeeks.org/breadth-first-traversal-for-a-graph/) (breadth-first search)

Supplementary: [A* search](http://www.geeksforgeeks.org/a-search-algorithm/)
#### dfs
```python
def DFS(graph, start, end, path, shortest, toPrint):
    path = path + [start]
    if toPrint:
        print('path:', printPath(path))  # printPath is a helper from the lecture code
    if start == end:
        return path
    for node in graph.children_of(start):
        if node not in path:  # avoid cycles
            if shortest == None or len(path) < len(shortest):
                newPath = DFS(graph, node, end, path, shortest, toPrint)
                if newPath != None:
                    shortest = newPath
        elif toPrint:
            print("visited", node)
    return shortest
def shortest_path(graph, s, end, toPrint = False):
return DFS(graph, s, end, [], None, toPrint)
```
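The lecture also covers breadth-first search. Because BFS explores paths in order of increasing length, the first path it finds to `end` is a shortest one. A minimal sketch (my own, assuming the same `digraph` API with `children_of` as above):
```python
def BFS(graph, start, end):
    """Returns a shortest path (fewest edges) from start to end, or None."""
    pathQueue = [[start]]
    while len(pathQueue) != 0:
        tmpPath = pathQueue.pop(0)  # explore the oldest partial path first
        lastNode = tmpPath[-1]
        if lastNode == end:
            return tmpPath
        for nextNode in graph.children_of(lastNode):
            if nextNode not in tmpPath:  # avoid cycles
                pathQueue.append(tmpPath + [nextNode])
    return None
```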
## Stochastic Thinking and Random Walks
#### about the physical world (physics)
* Newtonian mechanics: the world is deterministic
* Copenhagen interpretation (quantum mechanics)
    * nondeterminism
    * outcomes cannot be predicted exactly
... so we turn to stochastic models
### Stochastic Processes
* unpredictable
* random process
```python
import random
def rollDie():
"""returns a random int between 1 and 6"""
return random.choice([1,2,3,4,5,6])
def testRoll(n = 10):
result = ''
for i in range(n):
result = result + str(rollDie())
print(result)
```
* Probability of Various Results
    * the probability of rolling one particular sequence of $n$ dice is ${1\over 6^n}$
    * probabilities always lie in the range 0 to 1
* Independence
* each event does not influence the others
* A Simulation of Die Rolling
```python
def runSim(goal, numTrials, txt):
total = 0
for i in range(numTrials):
result = ''
for j in range(len(goal)):
result += str(rollDie())
if result == goal:
total += 1
    print('Actual probability of', txt, '=', round(1/(6**len(goal)), 8))
    estProbability = round(total/numTrials, 8)
    print('Estimated probability of', txt, '=', round(estProbability, 8))
runSim('11111', 1000, '11111')
```
* with only 1000 trials the estimate is often 0, because the true probability $1/6^5 \approx 0.0001286$ is much smaller than $1/1000$
* Approximating Using a Simulation
* The Birthday Problem
* The probability of at least two people in a group having the same birthday
* birthday paradox
    * probability that persons $i$ and $j$ both have their birthday on day $r$
        * Pr$\{ b_i = r\ \&\ b_j = r \} = { 1 \over n^2}$, assuming birthdays are uniformly distributed
    * probability that they fall on the same (unspecified) day
        * Pr$\{ b_i = b_j \} = \sum_{s=1}^n{1 \over n^2} = {1\over n}$, assuming birthdays are uniformly distributed
    * this assumes all birthdays are equally likely
    * probability that at least two of $N$ people share a birthday: $1 - {366!\over 366^N(366-N)!}$
        * if $N > 366$, the probability is 1 by the pigeonhole principle
        * ${366!\over 366^N(366-N)!}$ is the probability that all $N$ birthdays are different
    * general derivation, from Introduction to Algorithms (Cormen et al.)
        * let $A_k$ be the event that person $k$'s birthday differs from those of all persons $j < k$, and $B_k = \bigcap_{i=1}^k{A_i}$ (all $k$ birthdays distinct); then $Pr \{ B_k\} = Pr\{B_{k-1}\}Pr\{A_k|B_{k-1}\}$
        * by conditional probability, $Pr\{A_k|B_{k-1}\} = {n-k+1 \over n}$
        * then $Pr \{ B_k \} = Pr \{ B_1\}Pr\{A_2|B_1\}Pr\{A_3|B_2\}\cdots Pr\{A_k|B_{k-1}\} = 1\cdot(1-{1\over n})(1-{2 \over n})\cdots(1-{k-1 \over n})$
        * if $n = 365$, then $Pr \{ B_k\} = {365!\over 365^k(365-k)!}$, and the probability of a shared birthday is $1 - Pr\{B_k\}$
        * we can also bound it with an inequality
            * $1-{k-1 \over n} \leq e^{-(k-1)/n}$, since $1+x \leq e^x$ (Taylor expansion)
            * therefore $Pr\{B_k\} \leq e^{-k(k-1)/2n}$
    * using indicator random variables, also from Introduction to Algorithms (Cormen et al.)
        * $E[X_{ij}] = {1\over n}$, where $X_{ij} = I\{$person $i$ and person $j$ have the same birthday$\}$
        * $X = \sum_{i=1}^k\sum_{j=i+1}^k{X_{ij}}$, so $E[X] =\sum_{i=1}^k\sum_{j=i+1}^k{E[X_{ij}]} = \binom{k}{2}{1 \over n} = {k(k-1)\over 2n}$ expected shared-birthday pairs
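A quick numerical sanity check (a sketch I added, not lecture code) comparing the exact shared-birthday probability with the lower bound implied by $Pr\{B_k\} \leq e^{-k(k-1)/2n}$:
```python
import math

def exactSharedProb(k, n=365):
    """Probability that at least two of k people share a birthday."""
    allDistinct = 1.0
    for i in range(k):
        allDistinct *= (n - i)/n
    return 1 - allDistinct

def boundSharedProb(k, n=365):
    """Lower bound on the shared probability from Pr{B_k} <= e^(-k(k-1)/(2n))."""
    return 1 - math.exp(-k*(k - 1)/(2*n))

for k in (10, 23, 40):
    print(k, round(exactSharedProb(k), 4), round(boundSharedProb(k), 4))
```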
```python
import random
import math

def sameDate(numPeople, numSame):
    possibleDates = range(366)
    birthdays = [0]*366
for p in range(numPeople):
birthDate = random.choice(possibleDates)
birthdays[birthDate] += 1
return max(birthdays) >= numSame
def birthdayProb(numPeople, numSame, numTrials):
numHits = 0
for t in range(numTrials):
if sameDate(numPeople, numSame):
numHits += 1
return numHits/numTrials
for numPeople in [10, 20, 40, 100]:
print('For', numPeople,'est. prob. of a shared birthday is',
birthdayProb(numPeople, 2, 10000))
    numerator = math.factorial(366)
    denom = (366**numPeople)*math.factorial(366 - numPeople)
    print('Actual prob. for N =', numPeople, 'is', 1 - numerator/denom)
```
* the math gets hard for more general questions
    * at least two people sharing a birthday
        * $1 - P(\text{all birthdays are different})$
    * at least three people sharing a birthday
        * requires subtracting the overlapping cases, i.e. the Principle of Inclusion and Exclusion
    * but a simulation handles these variations easily
* All Dates Are Not Equally Likely... ?
* Simulations
* “All models are wrong, but some are useful.” – George Box
* https://en.wikipedia.org/wiki/Category:Stochastic_processes
* Random Walk
* [Brownian Motion, book by Peter Mörters and Yuval Peres](https://www.stat.berkeley.edu/~peres/bmbook.pdf)
* [drunkards walk explained](https://medium.com/i-math/the-drunkards-walk-explained-48a0205d304)
---
### Random walk
Random walks show how to use simulation to estimate the probability of an event.
* drunkard's walk
    * at each step the drunk can move in one of four directions: up, down, left, or right
    * how far from the origin after n steps?
    * use simulation to find out
* Location class: stores a position, and can compute moved positions and distances
```python
class Location(object):
def __init__(self, x, y):
"""x and y are floats"""
self.x = x
self.y = y
def move(self, deltaX, deltaY):
"""deltaX and deltaY are floats
The move method will create and
return a Location object.
"""
return Location(self.x + deltaX,
self.y + deltaY)
def getX(self):
return self.x
def getY(self):
return self.y
def distFrom(self, other):
"""
get the distance between two location objects
"""
xDist = self.x -other.getX()
yDist = self.y -other.getY()
return (xDist**2 + yDist**2)**0.5
def __str__(self):
return '<' + str(self.x) + ', '\
+str(self.y) + '>'
```
* drunk - Usual Drunk or Masochist Drunk
```python
"""
This is a virtual object
"""
class Drunk(object):
def __init__(self, name = None):
"""Assumes name is a str"""
self.name = name
def __str__(self):
        if self.name != None:
return self.name
return 'Anonymous'
```
```python
import random
""" inherite Drunk object """
class UsualDrunk(Drunk):
def takeStep(self):
stepChoices = [(0,1), (0,-1), (1, 0), (-1, 0)]
""" up down right left """
return random.choice(stepChoices)
class MasochistDrunk(Drunk):
def takeStep(self):
stepChoices = [(0.0,1.1), (0.0,-0.9), (1.0, 0.0), (-1.0, 0.0)]
""" intend to go up """
return random.choice(stepChoices)
```
* Field
```python
class Field(object):
def __init__(self):
"""create a drunks list"""
self.drunks = {}
def addDrunk(self, drunk, loc):
"""add a drunk into list"""
if drunk in self.drunks:
raise ValueError('Duplicate drunk')
else:
self.drunks[drunk] = loc
def getLoc(self, drunk):
if drunk not in self.drunks:
raise ValueError('Drunk not in field')
return self.drunks[drunk]
def moveDrunk(self, drunk):
if drunk not in self.drunks:
raise ValueError('Drunk not in field')
"""apply usual or masochist method"""
        xDist, yDist = drunk.takeStep()
        """use the move method of Location to compute the new location"""
        self.drunks[drunk] =\
            self.drunks[drunk].move(xDist, yDist)
```
* Simulation
* Simulating a Single Walk
    * Assumes:
        * `f` is a Field
        * `d` is a Drunk in `f`
        * `numSteps` is an integer >= 0
    * Moves `d` numSteps times
    * Returns the distance between the final location and the location at the start of the walk
```python
def walk(f, d, numSteps):
start = f.getLoc(d)
for s in range(numSteps):
f.moveDrunk(d)
return start.distFrom(f.getLoc(d))
```
* Simulating Multiple Walks
* Assumes:
* `numSteps` is an int >= 0
* `numTrials` is an int > 0
* `dClass` is a subclass of Drunk
* Simulates numTrials walks of numSteps steps each.
    * `walk()` simulates a single walk
* Returns a list of the final distances for each trial
```python
def simWalks(numSteps, numTrials, dClass):
Homer = dClass()
origin = Location(0, 0)
distances = []
for t in range(numTrials):
f = Field()
f.addDrunk(Homer, origin)
        distances.append(round(walk(f, Homer, numSteps), 1))
return distances
```
* Testing
* [lect5.py](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/lect5.py)
    * you can experiment with it yourself; a small driver sketch follows below
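A small driver sketch in the spirit of the lecture's `drunkTest` (the name and walk lengths are illustrative; it reuses `simWalks` and `UsualDrunk` from above):
```python
def drunkTest(walkLengths, numTrials, dClass):
    """For each walk length, simulate numTrials walks and report
    the mean, max, and min final distance from the origin."""
    for numSteps in walkLengths:
        distances = simWalks(numSteps, numTrials, dClass)
        print(dClass.__name__, 'walk of', numSteps, 'steps:')
        print('  mean =', round(sum(distances)/len(distances), 4),
              ' max =', max(distances), ' min =', min(distances))

drunkTest((10, 100, 1000, 10000), 100, UsualDrunk)
```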
* Mathematical meaning
* [Markov chain in random walk](https://en.wikipedia.org/wiki/Random_walk#As_a_Markov_chain)
* [about the Brownian motion in Chinese](http://episte.math.ntu.edu.tw/articles/mm/mm_09_3_03/page2.html)
---
## Monte Carlo simulation
### Background
* Inferential statistics
* Population: a set of examples
* Sample: a proper subset of a population
* Key fact: a random sample tends to exhibit the same properties as the population from which it is drawn
* Monte Carlo simulation is an important estimation method for obtaining approximate solutions
    * e.g., you can flip a coin many times and use the observed fraction of heads as an estimate of the probability of heads
* Confidence?
* depend on
1. Size of sample
2. Variance of sample
* Roulette simulation
* object class
```python
import random

class FairRoulette():
def __init__(self):
self.pockets = []
for i in range(1,37):
self.pockets.append(i)
self.ball = None
self.pocketOdds = len(self.pockets) -1
def spin(self):
self.ball = random.choice(self.pockets)
def betPocket(self, pocket, amt):
if str(pocket) == str(self.ball):
return amt*self.pocketOdds
else:
return -amt
def __str__(self):
return 'Fair Roulette'
```
* Simulation method
```python
def playRoulette(game, numSpins, pocket, bet, toPrint):
totPocket = 0
for i in range(numSpins):
game.spin()
totPocket += game.betPocket(pocket, bet)
if toPrint:
print(numSpins, 'spins of', game)
        print('Expected return betting', pocket, '=',
              str(100*totPocket/numSpins) + '%\n')
return (totPocket/numSpins)
```
* Uses it for Monte Carlo Simulation
```python
game = FairRoulette()
for numSpins in (100, 1000000):
for i in range(3):
playRoulette(game, numSpins, 2, 1, True)
```
100 spins of Fair Roulette Expected return betting 2 = -100.0%
100 spins of Fair Roulette Expected return betting 2 = 44.0%
100 spins of Fair Roulette Expected return betting 2 = -28.0%
1000000 spins of Fair Roulette Expected return betting 2 = -0.046%
1000000 spins of Fair Roulette Expected return betting 2 = 0.602%
...etc
* Observation
    * with a larger number of spins, the per-spin return gets closer to the expected value
    * In repeated independent tests with the same actual probability `p` of a particular outcome in each test, the chance that the fraction of times that outcome occurs differs from `p` converges to zero as the number of trials goes to infinity (the Law of Large Numbers) - from lecture slide
* You should know in statistics
* Gambler’s Fallacy
* [wikipedia](https://en.wikipedia.org/wiki/Gambler%27s_fallacy)
    * Do not confuse:
1. Independent events
2. Non-independent events
* Roulette variants - the results are shown in the lecture
```python
class EuRoulette(FairRoulette):
def __init__(self):
FairRoulette.__init__(self)
self.pockets.append('0')
def __str__(self):
return 'European Roulette'
class AmRoulette(EuRoulette):
def __init__(self):
EuRoulette.__init__(self)
self.pockets.append('00')
def __str__(self):
return 'American Roulette'
```
* Regression to the Mean(RTM)
* [Reading from nctu - 現代啟示錄](https://ccjou.wordpress.com/2014/06/10/回歸均值/)
* Standard Deviation
```python
def getMeanAndStd(X):
mean = sum(X)/float(len(X))
tot = 0.0
for x in X:
tot += (x -mean)**2
std = (tot/len(X))**0.5
return mean, std
```
* Variance
* Confidence interval - next section
* Central limit theorem - next section
* Empirical rule
* Applying Empirical Rule - you can find the simulation result on the lecture
```python
def findPocketReturn(game, numTrials, trialSize, toPrint):
pocketReturns = []
for t in range(numTrials):
trialVals = playRoulette(game,
trialSize, 2, 1,
toPrint)
pocketReturns.append(trialVals)
return pocketReturns
numTrials = 20
resultDict = {}
games = (FairRoulette, EuRoulette, AmRoulette)
for G in games:
    resultDict[G().__str__()] = []
for numSpins in (100, 1000, 10000):
    print('\nSimulate betting a pocket for', numTrials, 'trials of', numSpins, 'spins each')
    for G in games:
        pocketReturns = findPocketReturn(G(), numTrials, numSpins, False)
mean, std = getMeanAndStd(pocketReturns)
resultDict[G().__str__()].append((numSpins, 100*mean, 100*std))
print(
'Exp. return for', G(), '=',
str(round(100*mean, 3)) +
'%,', '+/-' +
str(round(100*1.96*std, 3)) +
'% with 95% confidence')
```
* Assumptions Underlying Empirical Rule
* The mean estimation error is zero
    * The distribution of the estimation errors is normal
* [Distribution]( https://en.wikipedia.org/wiki/Probability_distribution)
* [PDF](https://en.wikipedia.org/wiki/Probability_density_function)
* [CDF](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
* Normal distribution
    * the PDF is $P(x) = {1 \over \sigma\sqrt{2\pi}}\, e^{-{(x-\mu)^2 \over 2\sigma^2}}$, where $e$ is Euler's number (see the integration check after this list)
* Sampling Space of Possible Outcomes
* Depends upon ==++variability++== in underlying distribution
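A small sketch (assumed, not taken verbatim from the lecture) that checks the PDF above integrates to roughly 1, using `scipy.integrate.quad` as in the linked SciPy tutorial:
```python
import math
import scipy.integrate

def gaussian(x, mu, sigma):
    """Normal distribution PDF."""
    factor1 = 1.0 / (sigma * math.sqrt(2 * math.pi))
    factor2 = math.exp(-((x - mu)**2) / (2 * sigma**2))
    return factor1 * factor2

# integrate over +/- 6 standard deviations; the result should be ~1.0
area, _ = scipy.integrate.quad(gaussian, -30, 30, args=(0, 5))
print('Area under PDF for mu=0, sigma=5:', round(area, 4))
```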
---
### Confidence-interval
[The Gaussian distribution](https://en.wikipedia.org/wiki/Normal_distribution)
* Generate normally distributed data
* [random.gauss](https://docs.python.org/3/library/random.html#random.gauss)
* [numpy.random.normal](https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.normal.html)
* [scipy.stats.norm](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html)
* [self-learning](https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html)
* Central Limit Theorem (CLT)
1. sufficiently large sample
2. independent random variables
3. the means will be approximately normally distributed
```python
def plotMeans(numDice, numRolls, numBins, legend, color, style):
means = []
for i in range(numRolls//numDice):
vals = 0
for j in range(numDice):
vals += 5*random.random()
means.append(vals/float(numDice))
pylab.hist(means, numBins, color = color, label = legend,
weights = [1/len(means)]*len(means),
hatch = style)
return getMeanAndStd(means)
```
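A possible way to drive `plotMeans` and see the CLT in action, in the spirit of the lecture demo (assuming `random` and `pylab` are imported, which the definition above also needs):
```python
import random
import pylab

random.seed(0)
pylab.figure()
# one continuous die vs. the mean of 50 dice: the means look normal
mean1, std1 = plotMeans(1, 100000, 11, '1 die', 'w', '*')
mean50, std50 = plotMeans(50, 100000, 11, 'Mean of 50 dice', 'gray', '//')
pylab.title('Rolling Continuous Dice')
pylab.xlabel('Value')
pylab.ylabel('Probability')
pylab.legend()
print('Mean of rolling 1 die =', round(mean1, 4), 'Std =', round(std1, 4))
print('Mean of 50 dice =', round(mean50, 4), 'Std =', round(std50, 4))
pylab.show()
```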
* when computing confidence intervals
    * if the sample size is sufficiently large, the CLT allows us to apply the empirical rule
* [Buffon's needle](https://en.wikipedia.org/wiki/Buffon%27s_needle)
    * we can use Monte Carlo simulation to estimate $\pi$: the code below samples random points in the unit square and counts those that land inside the quarter circle
```python
def throwNeedles(numNeedles):
inCircle = 0
for Needles in range(1, numNeedles + 1, 1):
x = random.random()
y = random.random()
if (x*x + y*y)**0.5 <= 1.0:
inCircle += 1
    # the points fall in one quadrant of the unit circle, so multiply by 4
    return 4*(inCircle/float(numNeedles))
def getEst(numNeedles, numTrials):
estimates = []
for t in range(numTrials):
piGuess = throwNeedles(numNeedles)
estimates.append(piGuess)
sDev = numpy.std(estimates)
curEst = sum(estimates)/len(estimates)
print('Est. = ' + str(curEst) +\
', Std. dev. = ' + str(round(sDev, 6))\
+ ', Needles = ' + str(numNeedles))
return (curEst, sDev)
def estPi(precision, numTrials):
numNeedles = 1000
sDev = precision
while sDev >= precision/2:
curEst, sDev = getEst(numNeedles,
numTrials)
numNeedles *= 2
return curEst
random.seed(0)
estPi(0.005, 100)
```
## Sampling and standard error
* recall
* With Monte Carlo simulation we can generate lots of random samples, and use them to compute confidence intervals - `lecture slide`
### Probability Sampling
* each member of the population has the same probability of being chosen (simple random sampling)
#### Stratified
* partition the population into subgroups (strata) and take a simple random sample from each
* useful when small subgroups should be represented
`TBD`
### conclusion of sampling
* a larger sample gives a smaller standard error, so it is better (with diminishing returns); a sketch of estimating the standard error follows below
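A sketch (my own, not lecture code) of estimating the standard error empirically; it reuses `getMeanAndStd` from above, and the printed values can be compared against the theoretical $\sigma/\sqrt{n}$:
```python
import random

def estimateStandardError(population, sampleSize, numSamples):
    """Draw many samples and return the standard deviation of their means,
    which approximates the standard error of the mean."""
    sampleMeans = []
    for _ in range(numSamples):
        sample = random.sample(population, sampleSize)
        sampleMeans.append(sum(sample)/sampleSize)
    return getMeanAndStd(sampleMeans)[1]

# synthetic population: 10,000 draws from a Gaussian with sigma = 10
population = [random.gauss(0, 10) for _ in range(10000)]
for n in (10, 100, 1000):
    print('sample size', n, '-> est. standard error',
          round(estimateStandardError(population, n, 500), 3),
          'vs sigma/sqrt(n) =', round(10/n**0.5, 3))
```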
## Understanding Experimental Data
* Statistics meet experimental science
* Hooke's law
* $F = -kd$
    * example: if k = 35,000 N/m, then compressing the spring by d = 0.01 m takes F = 350 N
    * how much mass does it take to compress the spring by 1 cm?
    * $F = ma$; here the force is the weight of the mass under gravity
        * $Mass = F/g = {350N \over 9.81m/s^2} \approx 35.68$ kilograms $(1N = 1kg \cdot m/s^2)$
* data for finding k
```
Distance Mass
0.0865 0.1
0.1015 0.15
0.1106 0.2
0.1279 0.25
0.1892 0.3
0.2695 0.35
0.2888 0.4
0.2425 0.45
0.3465 0.5
0.3225 0.55
0.3764 0.6
0.4263 0.65
0.4562 0.7
```
* you can see the data distribution in the lecture slide
* Fitting curves to data
* objective function
    * how to measure the distance between an observed point and the fitted curve
        * example: from an observed point one can measure the displacement along the X axis, along the Y axis, or perpendicular to the line (P); least squares uses the vertical (Y) displacement
* [Least squares objective function](https://en.wikipedia.org/wiki/Least_squares)
* $\sum^{len(observed)-1}_{i=0}{(observed[i]-predicted[i])^2}$
* [Regression](https://en.wikipedia.org/wiki/Regression_analysis)
    * Can use [linear regression](https://en.wikipedia.org/wiki/Linear_regression) to solve it
* use pylab.polyfit(x,y,n)
```python
def fitData(fileName):
xVals, yVals = getData(fileName)
xVals = pylab.array(xVals)
yVals = pylab.array(yVals)
xVals = xVals*9.81 #get force
"""
blue point in lecture slide
"""
pylab.plot(xVals, yVals, 'bo',
label = 'Measured points')
labelPlot()
"""
    use polyfit to fit a degree-1 polynomial (a line) to X and Y
"""
a,b = pylab.polyfit(xVals, yVals, 1)
"""
Y = aX + b
"""
estYVals = a*pylab.array(xVals) + b
print('a =', a, 'b =', b)
pylab.plot(xVals, estYVals, 'r',
label = 'Linear fit, k = '
+ str(round(1/a, 5)))
""" or use polyyval to caculate
estYVals = pylab.polyval(model, xVals)
pylab.plot(xVals, estYVals, 'r',
label = 'Linear fit, k = '
+ str(round(1/model[0], 5)))
"""
pylab.legend(loc = 'best')
```
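`fitData` relies on `getData` and `labelPlot` helpers from the lecture's code. A minimal `getData` sketch, assuming a whitespace-separated file with a one-line header like the table above; it returns masses as the x values and distances as the y values, which is what `fitData` expects:
```python
def getData(fileName):
    """Read (distance, mass) pairs from a whitespace-separated file,
    skipping the header line. Returns (masses, distances)."""
    distances = []
    masses = []
    with open(fileName, 'r') as dataFile:
        dataFile.readline()  # discard the header line
        for line in dataFile:
            d, m = line.split()
            distances.append(float(d))
            masses.append(float(m))
    return (masses, distances)
```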
* higher-degree fits
    * use pylab.polyfit(x,y,n) with $n \geq 2$
    * use the mean squared error to compare fits of different degree
```python
def aveMeanSquareError(data, predicted):
error = 0.0
for i in range(len(data)):
error += (data[i] - predicted[i])**2
return error/len(data)
```
* [r-squared](https://en.wikipedia.org/wiki/Coefficient_of_determination)
* ${R^2 = 1 - {\sum_i{(y_i-p_i)^2}\over \sum_i{(y_i-\mu)^2}}}$
```python
def rSquared(observed, predicted):
error = ((predicted - observed)**2).sum()
meanError = error/len(observed)
return 1 - (meanError/numpy.var(observed))
```
* cross validation
    * used to detect and avoid [overfitting](https://en.wikipedia.org/wiki/Overfitting)
    * split the dataset into dataset1 and dataset2
    * fit a model on one half, predict the other half, and use r-squared to choose the best model; a sketch follows below
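A minimal sketch of this split-and-compare idea (my own illustration, not the lecture's exact code); it reuses the `rSquared` function above and pylab's `polyfit`/`polyval`:
```python
import random
import pylab

def crossValidate(xVals, yVals, degree):
    """Fit a degree-`degree` model on a random half of the data and
    report R^2 on the held-out half."""
    indices = list(range(len(xVals)))
    random.shuffle(indices)
    half = len(indices)//2
    trainIdx, testIdx = indices[:half], indices[half:]
    trainX = pylab.array([xVals[i] for i in trainIdx])
    trainY = pylab.array([yVals[i] for i in trainIdx])
    testX = pylab.array([xVals[i] for i in testIdx])
    testY = pylab.array([yVals[i] for i in testIdx])
    model = pylab.polyfit(trainX, trainY, degree)
    predicted = pylab.polyval(model, testX)
    return rSquared(testY, predicted)
```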
## Introduction to machine learning
* Early definition
* [Samuel, Arthur L. Computer Games I. Springer, New York, NY. pp. 335–365.](https://link.springer.com/chapter/10.1007/978-1-4613-8716-9_14)
#### What is machine learning

reference: [MIT6_0002F16_lec11.pdf](https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-0002-introduction-to-computational-thinking-and-data-science-fall-2016/lecture-slides-and-files/MIT6_0002F16_lec11.pdf) - p.5
1. Observe a set of examples (the training data)
2. Infer something about the process that generated that data
3. Use the inference to make predictions about previously unseen data (the test data)
* Variations on paradigm
* `Supervised`:
        * [the training data is a set of feature vector / label pairs](https://en.wikipedia.org/wiki/Supervised_learning)
* `Unsupervised`:
* [a set only has feature vectors](https://en.wikipedia.org/wiki/Unsupervised_learning)
* Example of classifying(Supervised) and clustering(Unsupervised)
    * data, classifying, and clustering examples: see the figures in the lecture slides
#### Clustering is an optimization problem
* two popular methods
    * hierarchical clustering - worst case $O(n^3)$
        1. assign each item to its own cluster
        2. find the closest pair of clusters and merge them
            * distance metric, e.g. [Minkowski distance](https://en.wikipedia.org/wiki/Minkowski_distance) (see the sketch after this list)
            * [distances in machine learning (in Chinese)](http://www.cnblogs.com/heaad/archive/2011/03/08/1977733.html)
        3. repeat step 2 until only N clusters remain
    * k-means clustering - a greedy algorithm
        1. randomly choose k examples as the initial cluster centroids
        2. assign each example to the closest centroid, then recompute each centroid as the mean of its cluster
        3. repeat step 2 until the assignments stop changing
        * greedy, so different initial centroids can lead to different local optima
* How to choose K
* run hierarchical clustering on subset of data
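A minimal Minkowski distance sketch in the spirit of the lecture code ($p = 1$ gives Manhattan distance, $p = 2$ Euclidean):
```python
def minkowskiDist(v1, v2, p):
    """Minkowski distance between two equal-length sequences of numbers."""
    dist = 0.0
    for i in range(len(v1)):
        dist += abs(v1[i] - v2[i])**p
    return dist**(1/p)

print(minkowskiDist([0, 0], [3, 4], 2))  # Euclidean distance: 5.0
```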
#### Classification
* Supervised learning
* Regression
* e.g., linear regression
* Classification
        * predicting the label of a new example
            1. compute the distance from the new example to the labeled examples (or cluster centers)
            2. choose the label of the nearest one(s), as in k-nearest neighbors
    * Advantages and disadvantages of nearest-neighbor classification
        * advantages
            * no explicit training needed
            * learning is fast
        * disadvantages
            * memory intensive, and predictions can take a long time
* [brute force](https://www.slideshare.net/ssuserf88631/knn-51511604)
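A minimal 1-nearest-neighbor classification sketch (my own illustration; the lecture builds a richer framework). It reuses the `minkowskiDist` sketch above, and the training data is made up:
```python
def nearestNeighborLabel(example, labeledExamples):
    """labeledExamples is a list of (featureVector, label) pairs.
    Returns the label of the closest labeled example (1-nearest neighbor)."""
    bestLabel, bestDist = None, None
    for features, label in labeledExamples:
        d = minkowskiDist(example, features, 2)
        if bestDist is None or d < bestDist:
            bestDist, bestLabel = d, label
    return bestLabel

training = [([170, 60], 'A'), ([150, 45], 'B'), ([180, 80], 'A')]
print(nearestNeighborLabel([175, 70], training))  # -> 'A'
```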
## Damned lies and statistics
* Garbage in, Garbage out