# Day 77 - Advanced - Linear Regression and Data Visualisation with Seaborn
###### tags: `Python100`
[TOC]
## 646. Day 77 Goals: what you will make by the end of the day
- Film budgets and revenue.
- Do higher film budgets lead to more box office revenue?
- Should film studios spend more money on making films?
## 647. Explore and Clean the Data
- import/load data
```python=
import pandas as pd
import matplotlib.pyplot as plt
pd.options.display.float_format = '{:,.2f}'.format
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
data = pd.read_csv('cost_revenue_dirty.csv')
```
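- How many rows and columns does the dataset contain?
- Are there any NaN values?
- Are there any duplicate rows?
- What are the data types of the columns?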
- input1:
```python=
data.shape
data.head()
```
- output1:
```python=
(5391, 6)
```
![](https://i.imgur.com/L7uqiiw.png)
- input2:
```python=
data.isna().values.any()  # True if any cell is NaN
```
- output2:
```python=
False
```
- input3:
```python=
data.duplicated().values.any()
```
- output3:
```python=
False
```
- input4:
```python=
data.info()
```
- output4:
![](https://i.imgur.com/0076xF2.png)
- Remove the ```$``` and ```,``` characters and convert the ```USD_Production_Budget```, ```USD_Worldwide_Gross```, and ```USD_Domestic_Gross``` columns to a numeric format.
- input1:
```python=
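columns = ["USD_Production_Budget", "USD_Worldwide_Gross", "USD_Domestic_Gross"]
for col in columns:
    # regex=False strips "," and "$" literally ("$" is otherwise a regex end-of-string anchor)
    data[col] = data[col].astype(str).str.replace(",", "", regex=False)
    data[col] = data[col].str.replace("$", "", regex=False)
    data[col] = pd.to_numeric(data[col])
data.head()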
```
- output1:
![](https://i.imgur.com/0ZusZsA.png)
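The cleaning step can be tried in isolation on a tiny frame. This is a minimal sketch, assuming the raw values look like `$110,000`; the `sample` DataFrame and its values are made up for illustration and are not rows from the dataset.

```python
import pandas as pd

# Hypothetical sample in the same "$1,234" string format as the raw CSV columns
sample = pd.DataFrame({"USD_Production_Budget": ["$110,000", "$385,907", "$200,000"]})

cleaned = (
    sample["USD_Production_Budget"]
    .str.replace(",", "", regex=False)   # drop thousands separators
    .str.replace("$", "", regex=False)   # drop the currency symbol literally
)
sample["USD_Production_Budget"] = pd.to_numeric(cleaned)
print(sample.dtypes)
```

Passing `regex=False` matters for `"$"`: treated as a regular expression it matches the end of the string, so the symbol would stay in place and `pd.to_numeric` would fail.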
- Convert the ```Release_Date``` column to a ```Pandas``` datetime type.
- input1:
```python=
data.Release_Date=pd.to_datetime(data.Release_Date)
data.head()
```
- output1:
![](https://i.imgur.com/DnpRnrh.png)
- input2:
```python=
data.info()
```
- output2:
```python=
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5391 entries, 0 to 5390
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 5391 non-null int64
1 Release_Date 5391 non-null datetime64[ns]
2 Movie_Title 5391 non-null object
3 USD_Production_Budget 5391 non-null int64
4 USD_Worldwide_Gross 5391 non-null int64
5 USD_Domestic_Gross 5391 non-null int64
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 252.8+ KB
```
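Once a column is a datetime type, its parts can be read through the `.dt` accessor. A small sketch, assuming month/day/year strings; the dates and the explicit `format` argument are illustrative assumptions, not taken from the CSV.

```python
import pandas as pd

# Hypothetical date strings; the month/day/year layout is an assumption
dates = pd.Series(["8/2/1915", "12/24/1916", "5/1/2018"])
parsed = pd.to_datetime(dates, format="%m/%d/%Y")

print(parsed.dtype)             # a datetime64 dtype
print(parsed.dt.year.tolist())  # year component of each date
```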
## 648. Investigate the Films that had Zero Revenue
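- 1. What is the average production budget of the films in the dataset? 31,113,737.58
- 2. What is the average worldwide gross revenue of the films? 88,855,421.96
- 3. What are the minimum worldwide and domestic revenues? 0 and 0
- 4. Are the bottom 25% of films actually profitable, or do they lose money? They lose money.
- 5. What are the highest production budget and the highest worldwide gross revenue of any film? 425,000,000.00 and 2,783,918,982.00
- 6. How much revenue did the lowest- and highest-budget films make? Their budgets were 1,100.00 and 425,000,000.00.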
- input1:
```python=
data.describe()
```
- output1:
![](https://i.imgur.com/z3V7OlB.png)
- How many films grossed 0 USD domestically (in the United States)? Which film with no domestic revenue had the highest budget? 512; don gato el inicio de la pandilla
- input1:
```python=
data[data["USD_Domestic_Gross"]==0].sort_values("USD_Production_Budget",ascending=False)
```
- output1:
![](https://i.imgur.com/adEj32N.png)
**The dataset was compiled on May 1, 2018.**
- How many films grossed 0 USD worldwide? Which film with no worldwide revenue had the highest budget? 357; The Ridiculous 6
- input1:
```python=
data[data["USD_Worldwide_Gross"]==0].sort_values("USD_Production_Budget",ascending=False)
```
- output1:
![](https://i.imgur.com/nQJeCmf.png)
## 649. Filter on Multiple Conditions: International Films
- Create a subset of films that had worldwide box office revenue but zero revenue in the United States.
- input1:
```python=
international_releases = data.loc[(data.USD_Domestic_Gross == 0) & (data.USD_Worldwide_Gross != 0)]
international_releases
```
- output1:
![](https://i.imgur.com/XAJJEiW.png)
- The same filter using ```.query()```.
- input1:
```python=
international_releases = data.query("USD_Domestic_Gross == 0 and USD_Worldwide_Gross != 0")
international_releases
```
- output1:
![](https://i.imgur.com/04h7f9C.png)
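Both filters should select the same rows. A minimal sketch on a made-up four-row frame (the `films` DataFrame and its titles are hypothetical) to check that the boolean-mask form and the `.query()` form agree:

```python
import pandas as pd

films = pd.DataFrame({
    "Movie_Title": ["A", "B", "C", "D"],      # hypothetical titles
    "USD_Domestic_Gross": [0, 0, 500, 0],
    "USD_Worldwide_Gross": [0, 1200, 900, 300],
})

# Boolean masks combined with & need parentheses around each condition
by_loc = films.loc[(films.USD_Domestic_Gross == 0) & (films.USD_Worldwide_Gross != 0)]
# .query() takes the same condition as a string and uses the word "and"
by_query = films.query("USD_Domestic_Gross == 0 and USD_Worldwide_Gross != 0")

print(by_loc.equals(by_query))
```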
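- Which films had not yet been released as of the data collection date (May 1, 2018)? How many films in the dataset had not yet had a chance to screen at the box office? (Create the ```data_clean``` DataFrame.)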
- input1:
```python=
# Date of Data Collection
scrape_date = pd.Timestamp('2018-05-01')
after_scape_release = data[data.Release_Date >= scrape_date]
after_scape_release
# len(after_scape_release)  # 7
- output1:
![](https://i.imgur.com/2smdwhc.png)
- input2:
```python=
data_clean = data.drop(after_scape_release.index)
data_clean
```
- output2:
![](https://i.imgur.com/NAxVEG4.png)
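The filter-then-drop pattern can be sketched on a small frame. The three films below are invented for illustration; only the cutoff date of May 1, 2018 comes from these notes.

```python
import pandas as pd

scrape_date = pd.Timestamp("2018-05-01")  # data collection date from the notes

films = pd.DataFrame({
    "Movie_Title": ["Old", "Recent", "Upcoming"],  # hypothetical titles
    "Release_Date": pd.to_datetime(["1999-03-31", "2018-04-27", "2018-12-31"]),
})

# Rows released on or after the collection date had no chance to earn revenue
unreleased = films[films.Release_Date >= scrape_date]
# Dropping by index label keeps every other row untouched
released = films.drop(unreleased.index)
print(released)
```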
- What percentage of films had production costs that exceeded their worldwide gross revenue?
- input1:
```python=
data_money_lost= data_clean[data_clean.USD_Production_Budget>data_clean.USD_Worldwide_Gross]
data_money_lost
```
- output1:
![](https://i.imgur.com/VCUFzzX.png)
- input2:
```python=
lost_percentage="{:.2%}".format(len(data_money_lost)/len(data_clean))
print(lost_percentage)
```
- output2:
```python=
37.28%
```
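An alternative way to get the same kind of percentage is to take the mean of the boolean mask, since `True` counts as 1 and `False` as 0. A sketch on made-up numbers (the `films` frame is hypothetical):

```python
import pandas as pd

films = pd.DataFrame({
    "USD_Production_Budget": [100, 200, 300, 400],
    "USD_Worldwide_Gross": [50, 500, 250, 1000],
})

# True where the film cost more than it earned
lost_money = films.USD_Production_Budget > films.USD_Worldwide_Gross
share = lost_money.mean()  # fraction of True values
print(f"{share:.2%}")
```

Two of the four hypothetical films lose money, so the share here is one half.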
## 650. Seaborn Data Visualisation: Bubble Charts
- Scatter plot
- input1:
```python=
import seaborn as sns  # Seaborn is built on top of Matplotlib
plt.figure(figsize=(8,4), dpi=200)
ax = sns.scatterplot(data=data_clean, x='USD_Production_Budget', y='USD_Worldwide_Gross')
ax.set(ylim=(0, 3000000000), xlim=(0, 450000000),
       ylabel='Revenue in $ billions', xlabel='Budget in $100 millions')
plt.show()
```
- output1:
![](https://i.imgur.com/vKHEbLe.png)
- Bubble chart
- input1:
```python=
plt.figure(figsize=(8,4), dpi=200)
# set styling on a single chart
with sns.axes_style('darkgrid'):  # style
    ax = sns.scatterplot(data=data_clean,
                         x='USD_Production_Budget',
                         y='USD_Worldwide_Gross',
                         hue='USD_Worldwide_Gross',   # color
                         size='USD_Worldwide_Gross')  # dot size
    ax.set(ylim=(0, 3000000000),
           xlim=(0, 450000000),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')
```
- output1:
![](https://i.imgur.com/ykV8vd3.png)
## 651. Floor Division: A Trick to Convert Years to Decades
- Convert years to decades
- input1:
```python=
dt_index = pd.DatetimeIndex(data_clean.Release_Date) #Create a DatetimeIndex object
dt_index
```
- output1:
![](https://i.imgur.com/FZMoyfe.png)
- input2:
```python=
years = dt_index.year
years
```
- output2:
![](https://i.imgur.com/3SgmdJB.png)
- input3:
```python=
data_clean['Decade']=years//10*10
data_clean.Decade
```
- output3:
![](https://i.imgur.com/YYNtvvk.png)
- input4:
```python=
data_clean.head()
```
- output4:
![](https://i.imgur.com/RYQ0LtR.png)
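The floor-division trick works because `//` discards the last digit of the year and `* 10` puts a zero back in its place. A short sketch on a few arbitrary years:

```python
import pandas as pd

years = pd.Series([1915, 1969, 1970, 1999, 2017])  # arbitrary example years

# 1969 // 10 is 196, and 196 * 10 is 1960
decades = years // 10 * 10
print(decades.tolist())
```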
- Split the data into two DataFrames at 1970
- input1:
```python=
before_data=data_clean[data_clean.Decade<1970]
before_data
```
- output1:
![](https://i.imgur.com/tYjZSGe.png)
- input2:
```python=
after_data=data_clean[data_clean.Decade>=1970]
after_data
```
- output2:
![](https://i.imgur.com/GbRWF8j.png)
- input3:
```python=
before_data.describe()
```
- output3:
![](https://i.imgur.com/3xGm8JE.png)
- input4:
```python=
after_data.describe()
```
- output4:
![](https://i.imgur.com/nEuOVEk.png)
## 652. Plotting Linear Regressions with Seaborn
- Visualise the relationship between film budget and worldwide revenue with a linear regression using ```.regplot()```
- Before 1970
- input1:
```python=
sns.regplot(data=before_data,
            x='USD_Production_Budget',
            y='USD_Worldwide_Gross')
```
- output1:
![](https://i.imgur.com/1DkHLW0.png)
- input2:
```python=
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):  # grid
    sns.regplot(data=before_data,
                x='USD_Production_Budget',
                y='USD_Worldwide_Gross',
                scatter_kws={'alpha': 0.4},
                line_kws={'color': 'black'})
```
- output2:
![](https://i.imgur.com/qxuqHrf.png)
- The relationship between production budget and film revenue is not strong.
- After 1970
- input1:
```python=
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    ax = sns.regplot(data=after_data,
                     x='USD_Production_Budget',
                     y='USD_Worldwide_Gross',
                     color='#2f4b7c',
                     scatter_kws={'alpha': 0.3},
                     line_kws={'color': '#ff7c43'})
    ax.set(ylim=(0, 3000000000),
           xlim=(0, 450000000),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')
```
- output1:
![](https://i.imgur.com/atK1rWC.png)
- Reading off the regression line, a film with a budget of 150 million USD earns roughly 500 million USD in revenue.
## 653. Use scikit-learn to Run Your Own Regression
- Use the scikit-learn linear regression model
![](https://i.imgur.com/wd5kMgn.png)
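- Find the model's estimates of theta.
- Run a linear regression on ```after_data```. Calculate the intercept, slope, and r-squared. How much of the variance in film revenue does the linear model explain in this case?
- y-intercept: the revenue the model predicts for a film with a budget of 0.
- Slope: how much additional revenue each extra 1 USD of budget brings.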
- input1:
```python=
from sklearn.linear_model import LinearRegression
regression = LinearRegression()
# Explanatory Variable(s) or Feature(s)
X = pd.DataFrame(after_data, columns=['USD_Production_Budget'])
# Response Variable or Target
y = pd.DataFrame(after_data, columns=['USD_Worldwide_Gross'])
# Find the best-fit line
regression.fit(X, y)
```
- output1:
```python=
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```
- input2:
```python=
regression.intercept_ # theta 0
regression.coef_ # theta 1
```
- output2:
```python=
array([-8650768.00661027])
array([[3.12259592]])
```
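With the two estimates printed above, a prediction is just intercept plus slope times budget. A minimal sketch that plugs in the reported values by hand; the 150 million budget is only an example input.

```python
# Estimates copied from the regression output in these notes
intercept = -8650768.00661027  # theta 0
slope = 3.12259592             # theta 1

budget = 150_000_000  # example budget in USD
predicted_revenue = intercept + slope * budget
print(f"{predicted_revenue:,.0f}")
```

The result is about 460 million USD, in the same range as the value read off the regression plot.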
- Each additional 1 USD of budget is associated with about 3 USD more revenue.
- input1:
```python=
# R-squared
regression.score(X, y)
```
- output1:
- The r-squared is about 0.558.
- The model explains about 56% of the variance in film revenue.
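For a linear regression, `.score()` returns the coefficient of determination, 1 minus the ratio of residual to total sum of squares. The sketch below computes it by hand on four made-up points, using `np.polyfit` as a stand-in for the scikit-learn fit; the data is illustrative only.

```python
import numpy as np

# Made-up points, not film data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 5.0, 8.0])

# Least-squares line; polyfit returns the highest-degree coefficient first
slope, intercept = np.polyfit(x, y, deg=1)
predictions = intercept + slope * x

ss_res = np.sum((y - predictions) ** 2)  # variation the line fails to explain
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

A value near 1 means the line accounts for most of the variation in `y`; the 0.558 reported above leaves a little under half unexplained.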
## 654. Learning Points & Summary