# Day 77 - Advanced - Linear Regression and Data Visualisation with Seaborn
###### tags: `Python100`

[TOC]

## 646. Day 77 Goals: what you will make by the end of the day
- Movie budgets and revenue.
- Do higher film budgets lead to more box-office revenue?
- Should film studios spend more money on making their films?

## 647. Explore and Clean the Data
- import/load data
```python=
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # used for the charts in sections 650 and 652

pd.options.display.float_format = '{:,.2f}'.format

from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

data = pd.read_csv('cost_revenue_dirty.csv')
```
- How many rows and columns does the dataset contain?
- Are there any NaN values?
- Are there any duplicate rows?
- What are the data types of the columns?
- input1:
```python=
data.shape
data.head()
```
- output1:
```python=
(5391, 6)
```
![](https://i.imgur.com/L7uqiiw.png)
- input2:
```python=
data.isna().values.any()  # any NaN values?
```
- output2:
```python=
False
```
- input3:
```python=
data.duplicated().values.any()
```
- output3:
```python=
False
```
- input4:
```python=
data.info()
```
- output4:
![](https://i.imgur.com/0076xF2.png)
- Remove the `$` and `,` characters and convert the `USD_Production_Budget`, `USD_Worldwide_Gross` and `USD_Domestic_Gross` columns to a numeric format.
- input1:
```python=
columns = ["USD_Production_Budget", "USD_Worldwide_Gross", "USD_Domestic_Gross"]
for col in columns:
    # regex=False treats "$" as a literal character rather than a regex anchor
    data[col] = data[col].astype(str).str.replace(",", "", regex=False)
    data[col] = data[col].str.replace("$", "", regex=False)
    data[col] = pd.to_numeric(data[col])
data.head()
```
- output1:
![](https://i.imgur.com/0ZusZsA.png)
- Convert the `Release_Date` column to the `Pandas` datetime type.
- input1:
```python=
data.Release_Date = pd.to_datetime(data.Release_Date)
data.head()
```
- output1:
![](https://i.imgur.com/DnpRnrh.png)
- input2:
```python=
data.info()
```
- output2:
```python=
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5391 entries, 0 to 5390
Data columns (total 6 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Rank                   5391 non-null   int64
 1   Release_Date           5391 non-null   datetime64[ns]
 2   Movie_Title            5391 non-null   object
 3   USD_Production_Budget  5391 non-null   int64
 4   USD_Worldwide_Gross    5391 non-null   int64
 5   USD_Domestic_Gross     5391 non-null   int64
dtypes: datetime64[ns](1), int64(4), object(1)
memory usage: 252.8+ KB
```

## 648. Investigate the Films that had Zero Revenue
1. What is the average production budget of the films in the dataset? 31,113,737.58
2. What is the average worldwide gross revenue of the films? 88,855,421.96
3. What are the minimum worldwide and domestic revenues? 0 and 0
4. Are the bottom 25% of films actually profitable, or do they lose money? They lose money.
5. What are the highest production budget and the highest worldwide gross of any film? 425,000,000.00 and 2,783,918,982.00
6. What are the revenues of the lowest- and highest-budget films? (The values recorded here, 1,100.00 and 425,000,000.00, are the lowest and highest budgets themselves.)
- input1:
```python=
data.describe()
```
- output1:
![](https://i.imgur.com/z3V7OlB.png)
- How many films grossed $0 domestically (in the United States)? Which is the highest-budget film with no revenue? 512; don gato el inicio de la pandilla
- input1:
```python=
data[data["USD_Domestic_Gross"] == 0].sort_values("USD_Production_Budget", ascending=False)
```
- output1:
![](https://i.imgur.com/adEj32N.png)

**The dataset was compiled on 1 May 2018.**
- How many films grossed $0 worldwide? Which is the highest-budget film with no international revenue? 357; The Ridiculous 6
- input1:
```python=
data[data["USD_Worldwide_Gross"] == 0].sort_values("USD_Production_Budget", ascending=False)
```
- output1:
![](https://i.imgur.com/nQJeCmf.png)

## 649. Filter on Multiple Conditions: International Films
- Create a subset of the films that had some worldwide gross revenue but zero revenue in the United States.
- input1:
```python=
international_releases = data.loc[(data.USD_Domestic_Gross == 0) &
                                  (data.USD_Worldwide_Gross != 0)]
international_releases
```
- output1:
![](https://i.imgur.com/XAJJEiW.png)
- The same filter using `.query()`.
- input1:
```python=
international_releases = data.query("USD_Domestic_Gross == 0 and USD_Worldwide_Gross != 0")
international_releases
```
- output1:
![](https://i.imgur.com/04h7f9C.png)
- Which films had not yet been released when the data was collected (1 May 2018)? How many films in the dataset have not had a chance at the box office yet? (Create a DataFrame called `data_clean` without them.)
- input1:
```python=
# Date of Data Collection
scrape_date = pd.Timestamp('2018-5-1')
after_scape_release = data[data.Release_Date >= scrape_date]
after_scape_release
# len(after_scape_release)  # 7
```
- output1:
![](https://i.imgur.com/2smdwhc.png)
- input2:
```python=
data_clean = data.drop(after_scape_release.index)
data_clean
```
- output2:
![](https://i.imgur.com/NAxVEG4.png)
- What percentage of films had production costs that exceeded their worldwide gross revenue?
- input1:
```python=
data_money_lost = data_clean[data_clean.USD_Production_Budget > data_clean.USD_Worldwide_Gross]
data_money_lost
```
- output1:
![](https://i.imgur.com/VCUFzzX.png)
- input2:
```python=
lost_percentage = "{:.2%}".format(len(data_money_lost) / len(data_clean))
print(lost_percentage)
```
- output2:
```python=
37.28%
```

## 650. Seaborn Data Visualisation: Bubble Charts
- Scatter plot
- input1:
```python=
# Seaborn is built on top of Matplotlib
plt.figure(figsize=(8,4), dpi=200)
ax = sns.scatterplot(data=data_clean,
                     x='USD_Production_Budget',
                     y='USD_Worldwide_Gross')
ax.set(ylim=(0, 3000000000),
       xlim=(0, 450000000),
       ylabel='Revenue in $ billions',
       xlabel='Budget in $100 millions')
plt.show()
```
- output1:
![](https://i.imgur.com/vKHEbLe.png)
- Bubble chart
- input1:
```python=
plt.figure(figsize=(8,4), dpi=200)

# set styling on a single chart
with sns.axes_style('darkgrid'):  # style
    ax = sns.scatterplot(data=data_clean,
                         x='USD_Production_Budget',
                         y='USD_Worldwide_Gross',
                         hue='USD_Worldwide_Gross',   # colour
                         size='USD_Worldwide_Gross')  # dot size
    ax.set(ylim=(0, 3000000000),
           xlim=(0, 450000000),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')
```
- output1:
![](https://i.imgur.com/ykV8vd3.png)

## 651. Floor Division: A Trick to Convert Years to Decades
- Convert release years to decades
- input1:
```python=
dt_index = pd.DatetimeIndex(data_clean.Release_Date)  # create a DatetimeIndex object
dt_index
```
- output1:
![](https://i.imgur.com/FZMoyfe.png)
- input2:
```python=
years = dt_index.year
years
```
- output2:
![](https://i.imgur.com/3SgmdJB.png)
- input3:
```python=
# floor division drops the last digit of the year, e.g. 1999 // 10 * 10 == 1990
data_clean['Decade'] = years // 10 * 10
data_clean.Decade
```
- output3:
![](https://i.imgur.com/YYNtvvk.png)
- input4:
```python=
data_clean.head()
```
- output4:
![](https://i.imgur.com/RYQ0LtR.png)
- Split the data into two DataFrames at 1970
- input1:
```python=
before_data = data_clean[data_clean.Decade < 1970]
before_data
```
- output1:
![](https://i.imgur.com/tYjZSGe.png)
- input2:
```python=
after_data = data_clean[data_clean.Decade >= 1970]
after_data
```
- output2:
![](https://i.imgur.com/GbRWF8j.png)
- input3:
```python=
before_data.describe()
```
- output3:
![](https://i.imgur.com/3xGm8JE.png)
- input4:
```python=
after_data.describe()
```
- output4:
![](https://i.imgur.com/nEuOVEk.png)

## 652. Plotting Linear Regressions with Seaborn
- Use `.regplot()` to visualise the relationship between film budget and worldwide revenue as a linear regression.
- Before 1970
- input1:
```python=
sns.regplot(data=before_data,
            x='USD_Production_Budget',
            y='USD_Worldwide_Gross')
```
- output1:
![](https://i.imgur.com/1DkHLW0.png)
- input2:
```python=
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style("whitegrid"):  # grid
    sns.regplot(data=before_data,
                x='USD_Production_Budget',
                y='USD_Worldwide_Gross',
                scatter_kws={'alpha': 0.4},
                line_kws={'color': 'black'})
```
- output2:
![](https://i.imgur.com/qxuqHrf.png)
- For films before 1970, the relationship between production budget and revenue is weak.
- After 1970
- input1:
```python=
plt.figure(figsize=(8,4), dpi=200)
with sns.axes_style('darkgrid'):
    ax = sns.regplot(data=after_data,
                     x='USD_Production_Budget',
                     y='USD_Worldwide_Gross',
                     color='#2f4b7c',
                     scatter_kws={'alpha': 0.3},
                     line_kws={'color': '#ff7c43'})
    ax.set(ylim=(0, 3000000000),
           xlim=(0, 450000000),
           ylabel='Revenue in $ billions',
           xlabel='Budget in $100 millions')
```
- output1:
![](https://i.imgur.com/atK1rWC.png)
- Reading off the regression line, a film with a budget of $150 million makes roughly $500 million in revenue.

## 653. Use scikit-learn to Run Your Own Regression
- Use the scikit-learn linear regression model.
![](https://i.imgur.com/wd5kMgn.png)
- Find the model's estimates of theta.
- Challenge: run a linear regression on `before_data` and calculate the intercept, the slope and r-squared. How much of the variance in film revenue can the linear model explain in that case?
- y-intercept: the revenue a film would make if its budget were 0.
- Slope: how much extra revenue a film gets for each additional $1 of budget.
- The code recorded below fits the model on `after_data`; the same steps apply to `before_data` by swapping the DataFrame.
- input1:
```python=
from sklearn.linear_model import LinearRegression

regression = LinearRegression()

# Explanatory Variable(s) or Feature(s)
X = pd.DataFrame(after_data, columns=['USD_Production_Budget'])

# Response Variable or Target
y = pd.DataFrame(after_data, columns=['USD_Worldwide_Gross'])

# Find the best-fit line
regression.fit(X, y)
```
- output1:
```python=
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
```
- input2:
```python=
regression.intercept_  # theta 0
regression.coef_       # theta 1
```
- output2:
```python=
array([-8650768.00661027])
array([[3.12259592]])
```
- For every extra $1 of budget, film revenue increases by about $3.
- input1:
```python=
# R-squared
regression.score(X, y)
```
- output1:
    - r-squared is about 0.558.
    - The model explains about 56% of the variance in film revenue for the `after_data` films.

## 654. Learning Points & Summary
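
As a recap of three tricks from this note (stripping currency symbols, floor division to get a decade, and reading a prediction off the fitted line), here is a minimal sketch that uses only the standard library. The helper names `clean_currency`, `to_decade` and `predict_revenue` are hypothetical and do not appear in the original notebook; the intercept and slope are the values reported in section 653 for the `after_data` fit.

```python
# Hypothetical helpers for illustration; none of these names come from the notebook.

# Intercept (theta 0) and slope (theta 1) as reported in section 653.
THETA_0 = -8650768.00661027
THETA_1 = 3.12259592


def clean_currency(text: str) -> int:
    """Strip '$' and ',' from a currency string and return an integer."""
    return int(text.replace("$", "").replace(",", ""))


def to_decade(year: int) -> int:
    """Floor division drops the last digit; multiplying by 10 restores the scale."""
    return year // 10 * 10


def predict_revenue(budget: float) -> float:
    """Estimate worldwide gross from the line: revenue = theta_0 + theta_1 * budget."""
    return THETA_0 + THETA_1 * budget


print(clean_currency("$110,000,000"))          # plain integer, no symbols
print(to_decade(1999))                         # 1990
print(f"{predict_revenue(150_000_000):,.0f}")  # estimate for a $150 million budget
```

For a budget of $150 million the fitted line gives roughly $460 million, which is in line with the "about $500 million" read off the regression plot in section 652.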