# **【SF Salaries 薪水分析】ft. Kaggle - Python** :::info [【資料分析作品集 My data analysis portfolio】](https://hackmd.io/jAchN4s6SOG4KekFGtSFtA?view) - 一、瞭解【資料內容】 Checking【data content】 - 使用 Python NumPy、Pandas, Matplolib、Plotly - 資料來源 Data source - 導入資料、查看基本訊息 Import data 、View basic information - 二、資料清理 Data cleaning - 刪除不需要欄位 Delete unnecessary fields - 檢查含有 NaN、Not Provided欄位 Check for fields containing NaN and Not Provided - 資料清理含有 NaN、Not Provided欄位 Data cleaning contains NaN and Not Provided fields - 三、轉換資料型態 Converting 【data type】 - 查看資料,把需要做計算的轉換為數字 Check the data and convert the required calculations into numbers - 四、分析 Analyzing - 關於職位 About the job title - 有幾種職位? How many positions are there? (2159種) - 前十高職位的人數 top_10_job (Transit Operator 7036、Special Nurse 4389...) - 2014 年只有一個人的職缺總數 Total number of job openings with only one person in 2014 (175個) - 職稱中含有'chief'的 Job title containing 'chief' (薪水最高為職位 Chief Investment Officer,平均薪資 $339653.70) - 是否職稱越長的,薪水越高? Does the longer the job title, the higher the salary? (否 No) - 關於員工姓名 About employee name - 有幾種姓名? How many names are there? (118010種) - 前十大姓名的人數 top_10_employee_names (Kevin Lee 13、William Wong 11...) - 重複的姓名 Duplicate name (72113 筆姓名重複、1137 筆姓名, 職稱, 年份 都重複) - 關於底薪 About base pay - 平均底薪最高前五種職位 The top five jobs with the highest basic salary (Chief of Police $309767.683333、Chief, Fire Department $304232.340000) - 關於加班費 About over time pay - 付出最高的加班費 the highest overtime pay ($245131.88) - 每年付出的加班費平均 Average overtime pay per year (2011 年 : 4531.07、2012 年 : 5023.42、2013 年 : 5281.64、2014 年 : 5401.99) - 關於薪資總額 About total pay - 查看年份 View year (2011, 2012, 2013, 2014) - 四年付出的薪資總額 Total salary paid over four years ($11114610134.38) - 每年付出的薪資總額 Total annual salary paid ($2011 年 : $2594195051.88、2012 年 : $2724848200.44、2013 年 : $2918655930.8、2014 年 : $2876910951.26) - 薪資總額平均最高前十種職位 The top ten positions with the highest average salary (GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY $399211.28、Chief Investment Officer $339653.70、Chief of Police $329183.65...) - 薪資最高薪 highest salary (NATHANIEL FORD $567595.43) - 薪資最低薪 lowest salary (Joe Lopez $-618.13) - 薪資總平均 Overall average salary ($74770.33) - 每年薪資總平均 Average annual salary (Year 2011 $71744.10、2012 $74113.26、2013 $77611.44、2014 $75471.84) - 每年領最高薪前三 Top three with the highest salary every year - JobTitle中含有 'engineer' 字樣的人數佔總人數多少? 薪資占總薪資比例多少? How many people have the word 'engineer' in their JobTitle account for the total number of people? What is the proportion of salary to total salary? (Engineers make up 3.20 % of the total employees.Engineers earn 4.25 % of the total salary.) ::: 因為我用jupyter notebook,.py檔的要自己加上 print()、plt.show() <br/> ### 資料來源 Data source [Kaggle : SF Salaries](https://www.kaggle.com/datasets/kaggle/sf-salaries) ![](https://hackmd.io/_uploads/rJZDfyXla.png) <br/> ### 一、瞭解【資料內容】 Checking【data content】 ### 導入資料 查看基本訊息 ### Import data 、View basic information - df ```= import numpy as np import pandas as pd github_url = "https://raw.githubusercontent.com/06Cata/Analyzing_SF_Salaries/main/Salaries.csv" salary = pd.read_csv(github_url) #salary.head() salary ``` ![](https://hackmd.io/_uploads/BkunfkXxp.png) <br/> - 數量 quantity ```= # 等同 (len(salary.index) , len(salary.columns)) salary.shape # 會印出 (148654, 13) ``` <br/> - 資訊 information BasePay、OvertimePay、OtherPay、Benefits type 都是object,如果要計算需要轉type ```= salary.info() ``` ![](https://hackmd.io/_uploads/rJCH7yXgp.png) <br/> - 統計訊息 Statistical summary information ```= salary.describe() ``` ![](https://hackmd.io/_uploads/r1PEmJ7g6.png) <br/> - 查看唯一值 ```= salary.nunique() ``` ![](https://hackmd.io/_uploads/ryl2ysJWp.png) <br/> > PS 一開始也可以直接透過Pandas,存進Sqlites ```= # 讀取 CSV 文件 csv_file = 'https://raw.githubusercontent.com/06Cata/Analyzing_SF_Salaries/main/Salaries.csv' df = pd.read_csv(csv_file) # 使用 pandas 的 to_sql 方法將 DataFrame 寫入 SQLite 資料庫中 df.to_sql('sf_salaries', conn, if_exists='replace', index=False) # 連線到 SQLite 資料庫 database = "kaggle.db" conn = sqlite3.connect(database) # 查詢所有table,data = pd.read_sql('select * from sqlite_master where type="table"',conn) Salaries = pd.read_sql('select * from sf_salaries',conn) Salaries.head() ``` <br/> ### 二、資料清理 Data cleaning ### 刪除不需要欄位 Delete unnecessary fields 設一個新表,不動到原表 ```= salary_new = salary salary_new = salary_new.drop(['Id','Notes','Agency','Status'], axis=1) salary_new ``` ![](https://hackmd.io/_uploads/H1rq5qyZT.png) <br/> ### 檢查含有 NaN、Not Provided欄位 ### Check for fields containing NaN and Not Provided - 欄位是否有NaN ```= # 等同 salary.isnull().sum() salary.isna().sum() ``` ![](https://hackmd.io/_uploads/S1nCSqy-p.png) <br/> - 假設我想看缺失值的圖 ```= import seaborn as sns import matplotlib.pyplot as plt missing_percentage = (salary.isnull().sum() / len(salary)) * 100 # Create a data frame containing percentage information missing_info = pd.DataFrame({'Column': salary.columns, 'MissingPercentage': missing_percentage}) missing_info = missing_info.sort_values(by='MissingPercentage', ascending=False) plt.figure(figsize=(12, 6)) sns.barplot(x='MissingPercentage', y='Column', data=missing_info) # plt.barh(missing_info['Column'], missing_info['MissingPercentage']) plt.title('Missing Values Percentage') plt.xlabel('Percentage') plt.ylabel('Column') ``` ![](https://hackmd.io/_uploads/rJ4HK91bT.png) <br/> - 欄位是否包含Not Provided ```= (salary == "Not Provided").any() ``` ![](https://hackmd.io/_uploads/rkJhUcyW6.png) 數量 ```= (salary == "Not Provided").sum() ``` ![](https://hackmd.io/_uploads/BJeRdoJbT.png) <br/> ### 資料清理含有 NaN、Not Provided欄位 ### Data cleaning contains NaN and Not Provided fields - NaN -> 0 ```= salary_new['BasePay'].fillna(0, inplace=True) salary_new['OvertimePay'].fillna(0, inplace=True) salary_new['OtherPay'].fillna(0, inplace=True) salary_new['TotalPay'].fillna(0, inplace=True) salary_new ``` - Not provided -> drop ```= # 删除包含“Not provided”的列 delete columns contain "Not provided" # 三種方法都可以 Three ways # salary_new = salary_new[~salary_new['BasePay'].astype(str).str.contains('Not provided')] # salary_new = salary_new[~salary_new['OvertimePay'].astype(str).str.contains('Not provided')] # salary_new = salary_new[~salary_new['OtherPay'].astype(str).str.contains('Not provided')] # salary_new = salary_new[~salary_new['TotalPay'].astype(str).str.contains('Not provided')] # salary_new = salary_new[salary_new['BasePay']!="Not Provided"] # salary_new = salary_new[salary_new['OvertimePay']!="Not Provided"] # salary_new = salary_new[salary_new['OtherPay']!="Not Provided"] # salary_new = salary_new[salary_new['TotalPay']!="Not Provided"] index_drop = salary_new[(salary_new['BasePay'] == "Not Provided") | (salary_new['OvertimePay'] == "Not Provided") | (salary_new['OtherPay'] == "Not Provided") | (salary_new['TotalPay'] == "Not Provided")].index salary_new = salary_new.drop(index_drop) salary_new ``` ![](https://hackmd.io/_uploads/r1EHaskZp.png) <br/> ### 三、轉換資料型態 Converting 【data type】 ### 查看資料,把需要做計算的轉換為數字 Check the data and convert the required calculations into numbers ```= salary_new['BasePay'] = pd.to_numeric(salary_new['BasePay'], errors='coerce').astype(float) salary_new['OvertimePay'] = pd.to_numeric(salary_new['OvertimePay'], errors='coerce').astype(float) salary_new['OtherPay'] = pd.to_numeric(salary_new['OtherPay'], errors='coerce').astype(float) salary_new['Benefits'] = pd.to_numeric(salary_new['Benefits'], errors='coerce').astype(float) salary_new.info() ``` ![](https://hackmd.io/_uploads/H14iejJZp.png) <br/> 檢查是否包含非int或float的值 Check if it contains a value other than int or float ```= non_numeric_basepay = salary_new[~salary_new['BasePay'].apply(lambda x: isinstance(x, (int, float)))]['BasePay'] non_numeric_overtimepay = salary_new[~salary_new['OvertimePay'].apply(lambda x: isinstance(x, (int, float)))]['OvertimePay'] non_numeric_otherpay = salary_new[~salary_new['OtherPay'].apply(lambda x: isinstance(x, (int, float)))]['OtherPay'] print("Non-numeric values in 'BasePay' column:") print(non_numeric_basepay) print("Non-numeric values in 'OvertimePay' column:") print(non_numeric_overtimepay) print("Non-numeric values in 'OtherPay' column:") print(non_numeric_otherpay) ``` ![](https://hackmd.io/_uploads/BkWy12JZp.png) <br/> 轉換為數字欄位、新增一欄名為['SALARY_AMOUNT'],用來確認薪資總額是否正確 Convert to a numeric field, add a new column named ['SALARY_AMOUNT'], and use it to confirm whether the total salary is correct ```= salary_new['SALARY_AMOUNT'] = salary_new['BasePay'] + salary_new['OvertimePay'] + salary_new['OtherPay'] ``` <br/> 設置容忍度,新增一欄名為['SALARY_AMOUNT_status'],確認 salary['SALARY_AMOUNT''] == salary['TotalPay'] Set the tolerance, add a new column named ['SALARY_AMOUNT_status'], and confirm salary['SALARY_AMOUNT''] == salary['TotalPay'] ```= tolerance = 0.01 # 容忍度 salary_new['SALARY_AMOUNT_status'] = abs(salary_new['SALARY_AMOUNT'] - salary_new['TotalPay']) < tolerance salary_new[salary_new['SALARY_AMOUNT_status'] == False] ``` ![](https://hackmd.io/_uploads/Sy_MRsyW6.png) <br/> ### 四、分析 Analyzing ### 關於職位 About job title - 有幾種職位? How many positions are there? ```= salary['JobTitle'].nunique() # 等同 len(salary['JobTitle'] .value_counts()) # 會得到 2159 ``` <br/> - 前十高職位的人數 top_10_job ```= top_10_jobs = salary['JobTitle'].value_counts().head(10) ``` ![](https://hackmd.io/_uploads/HkTjZylba.png) <br/> 繪圖,.index 找到索引、.values 找到次數 ```= import matplotlib.pyplot as plt top_10_jobs = salary['JobTitle'].value_counts().head(10) job_titles = top_10_jobs.index # 找到索引 job_salaries = top_10_jobs.values # 找到次數 plt.figure(figsize=(12, 6)) plt.bar(job_titles, job_salaries) plt.xticks(rotation=60) plt.xlabel('Job Title') plt.ylabel('Count') plt.title('Top 10 Job Titles by Salary') plt.show() ``` ![](https://hackmd.io/_uploads/Skblfcyba.png) <br/> - 2014 年只有一個人的職缺總數 Total number of job openings with only one person in 2014 ```= sum(salary_new[salary_new['Year']==2014]['JobTitle'].value_counts() == 1) ``` <br/> 找到value_counts() == 1,設.index.tolist() ```= unique_job_titles_2014 = salary_new[salary_new['Year'] == 2014]['JobTitle'].value_counts() == 1 unique_job_titles_2014 = unique_job_titles_2014[unique_job_titles_2014].index.tolist() filtered_one_person = salary_new[(salary_new['Year'] == 2014) & (salary_new['JobTitle'].isin(unique_job_titles_2014))] filtered_one_person ``` ![](https://hackmd.io/_uploads/ryHr6JeWa.png) <br/> 也可以 ```= result = salary_new[salary_new['Year'] == 2014]['JobTitle'].value_counts() == 1 result[result] # <> # false_results = result[~result] # print(false_results) ``` ![](https://hackmd.io/_uploads/rk5_61xZ6.png) <br/> - 職稱中含有'chief'的 Job title containing 'chief' ```= def chief_string(title): if 'chief' in title.lower().split(): return True else: return False salary_new[salary_new['JobTitle'].apply(lambda nam: chief_string(nam))] ``` ![](https://hackmd.io/_uploads/SJks8PebT.png) <br/> ```= # 等同 # salary_new[salary_new['JobTitle'].apply(lambda x: # chief_string(x))].groupby('JobTitle').size().reset_index() # chief_job_counts.columns = ['JobTitle', 'Count'] # chief_job_counts chief_job_counts = salary_new[salary_new['JobTitle'].apply(lambda x: chief_string(x))]['JobTitle'].value_counts().reset_index() # 提取查看 chief_job_counts.columns = ['JobTitle', 'Count'] chief_job_counts ``` ![](https://hackmd.io/_uploads/HJ9JDDeZT.png) <br/> 依照薪水平均高低排序 Sort by average salary ```= chief_jobs = salary_new[salary_new['JobTitle'].apply(lambda x: chief_string(x))] average_pay_by_job = chief_jobs.groupby('JobTitle')['TotalPay'].mean().sort_values(ascending=False).reset_index() # 提取查看 average_pay_by_job.columns = ['JobTitle', 'AverageTotalPay'] average_pay_by_job ``` ![](https://hackmd.io/_uploads/By3vFweZa.png) 繪圖,前十位平均薪水高的 Top ten with highest average salary ```= import matplotlib.pyplot as plt plt.figure(figsize=(12, 6)) job_titles = average_pay_by_job['JobTitle'].head(10) average_pay = average_pay_by_job['AverageTotalPay'].head(10) plt.plot(job_titles, average_pay, marker='o', linestyle='-', color='b') plt.title('Average Total Pay by Job Title') plt.xlabel('Job Title') plt.ylabel('Average Total Pay') plt.xticks(rotation=90) plt.grid(True) plt.tight_layout() plt.show() ``` ![](https://hackmd.io/_uploads/S1YQcQbZT.png) <br/> >PS 如果求平均薪資最高值 the highest value ```= chief_jobs = salary_new[salary_new['JobTitle'].apply(lambda x: chief_string(x))] average_pay_by_job = chief_jobs.groupby('JobTitle')['TotalPay'].mean().reset_index() highest_average_pay = average_pay_by_job[average_pay_by_job['TotalPay'] == average_pay_by_job['TotalPay'].max()] highest_average_pay ``` ![](https://hackmd.io/_uploads/H12eqwxZa.png) <br/> - 是否職稱越長的,薪水越高? Does the longer the job title, the higher the salary? 否 No ```= salary_new['title_len'] = salary_new['JobTitle'].apply(len) salary_new ``` ![](https://hackmd.io/_uploads/Byn43wg-a.png) <br/> ```= salary_new[['TotalPay', 'title_len']].corr() ``` ![](https://hackmd.io/_uploads/ryJ8hwg-a.png) <br/> ```= import matplotlib.pyplot as plt correlation = salary_new[['TotalPay', 'title_len']].corr().iloc[0, 1] plt.figure(figsize=(8, 6)) plt.scatter(salary_new['title_len'], salary_new['TotalPay'], alpha=0.5) plt.title(f'Correlation between Job Title Length and Total Pay\nCorrelation Coefficient: {correlation:.2f}') plt.xlabel('Job Title Length') plt.ylabel('Total Pay') # +趨勢線 z = np.polyfit(salary_new['title_len'], salary_new['TotalPay'], 1) p = np.poly1d(z) plt.plot(salary_new['title_len'], p(salary_new['title_len']), "r--") plt.show() # 會印出 ``` ![](https://hackmd.io/_uploads/rkzD2wxbT.png) <br/> ### 關於員工姓名 About employee name - 有幾種姓名? How many names are there? 148654人中,110810是唯一值 ```= salary_new['EmployeeName'].nunique() # 會印出 118010 ``` <br/> - 前十大姓名的人數 top_10_employee_names ```= top_10_employee_names = salary_new['EmployeeName'].value_counts().head(10) top_10_employee_names ``` ![](https://hackmd.io/_uploads/SkIVi6kZT.png) ```= plt.figure(figsize=(10, 6)) top_10_employee_names.plot(kind='bar') plt.title('Top 10 Employee Names') plt.xlabel('Employee Names') plt.ylabel('Count') plt.xticks(rotation=45) plt.show() ``` ![](https://hackmd.io/_uploads/By9vi6JZT.png) <br/> - 重複的姓名 Duplicate name 有 72113 筆姓名重複 There are 72113 duplicate names ```= duplicate_names = salary_new[salary_new['EmployeeName'].duplicated(keep=False)] duplicate_names ``` ![](https://hackmd.io/_uploads/HyPXp6Jba.png) ![](https://hackmd.io/_uploads/B1q7aa1-p.png) <br/> 有 1137 筆姓名, 職稱, 年份 都重複 There are 1137 names, titles, and years repeated ```= duplicate_names_job = salary_new[salary_new.duplicated(subset=['EmployeeName', 'JobTitle','Year'], keep=False)] duplicate_names_job ``` ![](https://hackmd.io/_uploads/rkcO51eZp.png) ![](https://hackmd.io/_uploads/rk0d9keZa.png) <br/> ### 關於底薪 About base pay - 平均底薪最高前五種職位 The top five jobs with the highest basic salary ```= top_5_basepay = salary_new.groupby('JobTitle')['BasePay'].mean().sort_values(ascending=False).head(5) top_5_basepay ``` ![](https://hackmd.io/_uploads/ryCfUJlWa.png) <br/> 轉換為新的dataframe,繪製成圖 ```= import matplotlib.pyplot as plt top_5_basepay = salary_new.groupby('JobTitle')['BasePay'].mean().sort_values(ascending=False).head(5) top_5_basepay_df = top_5_basepay.reset_index() top_5_basepay_df ``` ![](https://hackmd.io/_uploads/rkHpSyl-T.png) <br/> ```= plt.figure(figsize=(10, 6)) plt.plot(top_5_basepay_df['JobTitle'], top_5_basepay_df['BasePay'], marker='o', linestyle='-', color='b') plt.xlabel('Job Title') plt.ylabel('Average BasePay') plt.title('Top 5 Job Titles by Average BasePay') plt.xticks(rotation=45, ha='right') plt.grid(True) plt.tight_layout() plt.show() ``` ![](https://hackmd.io/_uploads/rkwgIkxZp.png) <br/> ### 關於加班費 About over time pay - 付出最高的加班費 the highest overtime pay ```= # 單純知道數字 salary_new['OvertimePay'].max() salary_new.loc[salary_new['OvertimePay'].idxmax()] ``` ![](https://hackmd.io/_uploads/rkRHo0J-p.png) 或是 ```= salary_new[salary_new['OvertimePay'] == salary_new['OvertimePay'].max()] ``` ![](https://hackmd.io/_uploads/B1D9iCkbT.png) <br/> - 每年付出的加班費平均 Average overtime pay per year ```= OvertimePay_2011 = round(salary_new[salary_new['Year'] == 2011]['OvertimePay'].mean(), 2) OvertimePay_2012 = round(salary_new[salary_new['Year'] == 2012]['OvertimePay'].mean(), 2) OvertimePay_2013 = round(salary_new[salary_new['Year'] == 2013]['OvertimePay'].mean(), 2) OvertimePay_2014 = round(salary_new[salary_new['Year'] == 2014]['OvertimePay'].mean(), 2) print(f"2011 年 : {OvertimePay_2011}") print(f"2012 年 : {OvertimePay_2012}") print(f"2013 年 : {OvertimePay_2013}") print(f"2014 年 : {OvertimePay_2014}") ``` ![](https://hackmd.io/_uploads/By_f-BZZ6.png) <br/> ```= import matplotlib.pyplot as plt years = [2011, 2012, 2013, 2014] OvertimePay_max = [OvertimePay_2011, OvertimePay_2012, OvertimePay_2013, OvertimePay_2014] plt.figure(figsize=(10, 6)) plt.plot(years, OvertimePay_max, marker='o', linestyle='-', color='b') plt.title('Average Over time pay over the years') plt.xlabel('Year') plt.ylabel('Average BasePay') plt.grid(True) plt.show() ``` ![](https://hackmd.io/_uploads/HkO_hC1Z6.png) <br/> ### 關於薪資總額 About total pay - 查看年份 View year ```= salary_new['Year'].unique() # 會印出 array([2011, 2012, 2013, 2014], dtype=int64) ``` - 四年付出的薪資總額 Total salary paid over four years ```= pd.options.display.float_format = '{:.2f}'.format salary_new['TotalPay'].sum() # 會印出 11114610134.380001 ``` <br/> - 每年付出的薪資總額 Total annual salary paid ```= pd.options.display.float_format = '{:.2f}'.format totalpay_2011_sum = round(salary_new[salary_new['Year'] == 2011]['TotalPay'].sum(), 2) totalpay_2012_sum = round(salary_new[salary_new['Year'] == 2012]['TotalPay'].sum(), 2) totalpay_2013_sum = round(salary_new[salary_new['Year'] == 2013]['TotalPay'].sum(), 2) totalpay_2014_sum = round(salary_new[salary_new['Year'] == 2014]['TotalPay'].sum(), 2) print(f"2011 年 : {totalpay_2011_sum}") print(f"2012 年 : {totalpay_2011_sum}") print(f"2013 年 : {totalpay_2011_sum}") print(f"2014 年 : {totalpay_2011_sum}") ``` ![](https://hackmd.io/_uploads/S1lP-Bb-6.png) ```= import matplotlib.pyplot as plt years = [2011, 2012, 2013, 2014] basepay_sum = [totalpay_2011_sum, totalpay_2012_sum, totalpay_2013_sum, totalpay_2014_sum] plt.figure(figsize=(10, 6)) plt.plot(years, basepay_sum, marker='o', linestyle='-', color='b', label='Yearly Summary') plt.title('Summary total pay Over the Years') plt.xlabel('Year') plt.ylabel('Average total pay') plt.grid(True) plt.show() ``` ![](https://hackmd.io/_uploads/BJE5z0y-6.png) <br/> >PS 快速寫法 ```= salary_new.groupby('Year')['TotalPay'].sum().plot() ``` ![](https://hackmd.io/_uploads/HJS0p0kZa.png) <br/> - 薪資總額平均最高前十種職位 The top ten positions with the highest average salary ```= salary_new.groupby('JobTitle')['TotalPay'].mean().sort_values(ascending=False).head(10) ``` ![](https://hackmd.io/_uploads/r1NszkgbT.png) 轉換為新的dataframe,繪製成圖 ```= top_10_job_salaries = salary_new.groupby('JobTitle')['TotalPay'].mean().sort_values(ascending=False).head(10) top_10_job_salaries_df = top_10_job_salaries.reset_index() top_10_job_salaries_df ``` ![](https://hackmd.io/_uploads/SkzGEkl-p.png) ```= plt.figure(figsize=(12, 6)) plt.plot(top_10_job_salaries_df['JobTitle'], top_10_job_salaries_df['TotalPay'], marker='o', linestyle='-', color='b') plt.xlabel('Job Title') plt.ylabel('Average Total Pay') plt.title('Top 10 Job Titles by Average Total Pay') plt.xticks(rotation=45, ha='right') plt.grid(True) plt.tight_layout() plt.show() ``` ![](https://hackmd.io/_uploads/SyCBEJlWp.png) <br/> - 薪資最高薪 highest salary ```= # 單純知道數字 salary_new['TotalPay'].max() salary_new.loc[salary_new['TotalPay'].idxmax()] ``` ![](https://hackmd.io/_uploads/HJGmGAyW6.png) <br/> - 薪資最低薪 lowest salary ```= salary_new.loc[salary_new['TotalPay'].idxmin()] ``` ![](https://hackmd.io/_uploads/SyzXM0kb6.png) <br/> >PS 假設要找2011最高薪 ```= max_totalpay_2011 = salary_new[salary_new['Year'] == 2011]['TotalPay'].idxmax() highest_totalpay_2011 = salary_new.loc[max_totalpay_2011] highest_totalpay_2011 ``` <br/> - 薪資總平均 Overall average salary ```= salary_new['TotalPay'].mean() # 會印出 74770.33390097546 ``` <br/> - 每年薪資總平均 Average annual salary ```= salary_new.groupby('Year')['TotalPay'].mean() ``` ![](https://hackmd.io/_uploads/ByMSzS--T.png) <br/> ```= salary_new.groupby('Year')['TotalPay'].mean().plot() ``` ![](https://hackmd.io/_uploads/SkbmyygWa.png) <br/> ```= import matplotlib.pyplot as plt plt.figure(figsize=(10, 6)) plt.hist(salary_new['TotalPay'], bins=30, edgecolor='k', alpha=0.7) plt.xlabel('Total Pay') plt.ylabel('Frequency') plt.title('Histogram of Total Pay') totalpay_mean = salary_new['TotalPay'].mean() totalpay_25th_percentile = salary_new['TotalPay'].quantile(0.25) totalpay_75th_percentile = salary_new['TotalPay'].quantile(0.75) # line for the overall average plt.axvline(totalpay_mean, color='r', linestyle='dashed', linewidth=2, label=f'Mean: {totalpay_mean:.2f}') # middle 50% of the salary range plt.axvspan(totalpay_25th_percentile, totalpay_75th_percentile, color='yellow', alpha=0.3, label='50% Salary Range') plt.legend() plt.show() ``` ![](https://hackmd.io/_uploads/ByT8O0kWT.png) <br/> - 每年領最高薪前三 Top three with the highest salary every year ```= sorted_salary = salary_new.sort_values(by=['Year', 'TotalPay'], ascending=[True, False]) top_3_salary_by_year = sorted_salary.groupby('Year').head(3) top_3_salary_by_year[['Year', 'EmployeeName', 'TotalPay']] ``` ![](https://hackmd.io/_uploads/HJany1g-a.png) <br/> - 每年薪資結構占比的平均值 The average annual salary structure proportion ```= numeric_columns = ['BasePay', 'OvertimePay', 'OtherPay', 'TotalPay'] average_numeric_values_2011 = salary_new[salary_new['Year'] == 2011][numeric_columns].mean() average_numeric_values_2012 = salary_new[salary_new['Year'] == 2012][numeric_columns].mean() average_numeric_values_2013 = salary_new[salary_new['Year'] == 2013][numeric_columns].mean() average_numeric_values_2014 = salary_new[salary_new['Year'] == 2014][numeric_columns].mean() ``` ```= import pandas as pd average_values_df = pd.DataFrame({ 'Year': [2011, 2012, 2013, 2014], 'BasePay': [ average_numeric_values_2011['BasePay'], average_numeric_values_2012['BasePay'], average_numeric_values_2013['BasePay'], average_numeric_values_2014['BasePay'] ], 'OvertimePay': [ average_numeric_values_2011['OvertimePay'], average_numeric_values_2012['OvertimePay'], average_numeric_values_2013['OvertimePay'], average_numeric_values_2014['OvertimePay'] ], 'OtherPay': [ average_numeric_values_2011['OtherPay'], average_numeric_values_2012['OtherPay'], average_numeric_values_2013['OtherPay'], average_numeric_values_2014['OtherPay'] ], 'TotalPay': [ average_numeric_values_2011['TotalPay'], average_numeric_values_2012['TotalPay'], average_numeric_values_2013['TotalPay'], average_numeric_values_2014['TotalPay'] ] }) average_values_df ``` ![](https://hackmd.io/_uploads/r1-vgklWp.png) ```= plt.figure(figsize=(10, 6)) plt.plot(average_values_df['Year'], average_values_df['BasePay'], marker='o', label='BasePay') plt.plot(average_values_df['Year'], average_values_df['OvertimePay'], marker='o', label='OvertimePay') plt.plot(average_values_df['Year'], average_values_df['OtherPay'], marker='o', label='OtherPay') plt.plot(average_values_df['Year'], average_values_df['TotalPay'], marker='o', label='TotalPay') plt.xlabel('Year') plt.ylabel('Average Pay') plt.title('Average Pay by Year') plt.legend() plt.grid(True) plt.show() ``` ![](https://hackmd.io/_uploads/S1jClye-T.png) <br/> - JobTitle中含有 'engineer' 字樣的人數佔總人數多少? 薪資占總薪資比例多少? How many people have the word 'engineer' in their JobTitle account for the total number of people? What is the proportion of salary to total salary? ```= engineer_count = salary_new[salary_new['JobTitle'].str.contains('engineer', case=False)]['EmployeeName'].count() engineer_total_pay = salary_new[salary_new['JobTitle'].str.contains('engineer', case=False)]['TotalPay'].sum() total_count = salary_new['EmployeeName'].count() total_pay = salary_new['TotalPay'].sum() ``` ```= engineer_percentage_of_total_count = (engineer_count / total_count) * 100 engineer_percentage_of_total_pay = (engineer_total_pay / total_pay) * 100 print(f"Engineers make up {engineer_percentage_of_total_count:.2f} % of the total employees.") print(f"Engineers earn {engineer_percentage_of_total_pay:.2f} % of the total salary.") # 會印出 Engineers make up 3.20 % of the total employees. Engineers earn 4.25 % of the total salary. ```