Basic Data Exploration

Using Pandas to Get Familiar With Your Data

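The first step in any machine learning project is to familiarize yourself with the data. You'll use the pandas library for this. Pandas is the primary tool data scientists use for exploring and manipulating data. Most people abbreviate pandas in their code as pd. We do this with the following command.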


import pandas as pd

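The most important part of the pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel or a table in a SQL database.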

Pandas has powerful methods for most things you'll want to do with this type of data.
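To make the table analogy concrete, here is a minimal sketch that builds a tiny DataFrame by hand. The variable name `homes` and the three rows of values are hypothetical and exist only for illustration; they are not taken from the Melbourne dataset.

```python
import pandas as pd

# Hypothetical toy data: three made-up houses, for illustration only.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 903000, 1330000],
})

# Each dictionary key becomes a column; each list position becomes a row.
print(homes.shape)          # (number of rows, number of columns)
print(list(homes.columns))  # column names, like the header row of a sheet
print(homes)
```

Like a spreadsheet, the DataFrame has named columns and one row per record, which is the shape of data the rest of this lesson works with.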

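As an example, we'll look at data about home prices in Melbourne, Australia. In the hands-on exercises, you will apply the same processes to a new dataset, which has home prices in Iowa.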

The example (Melbourne) data is at the file path ../input/melbourne-housing-snapshot/melb_data.csv.

We load and explore the data with the following code:

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

	Rooms	Price	Distance	Postcode	Bedroom2	Bathroom	Car	Landsize	BuildingArea	YearBuilt	Lattitude	Longtitude	Propertycount
count	13580.000000	1.358000e+04	13580.000000	13580.000000	13580.000000	13580.000000	13518.000000	13580.000000	7130.000000	8205.000000	13580.000000	13580.000000	13580.000000
mean	2.937997	1.075684e+06	10.137776	3105.301915	2.914728	1.534242	1.610075	558.416127	151.967650	1964.684217	-37.809203	144.995216	7454.417378
std	0.955748	6.393107e+05	5.868725	90.676964	0.965921	0.691712	0.962634	3990.669241	541.014538	37.273762	0.079260	0.103916	4378.581772
min	1.000000	8.500000e+04	0.000000	3000.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1196.000000	-38.182550	144.431810	249.000000
25%	2.000000	6.500000e+05	6.100000	3044.000000	2.000000	1.000000	1.000000	177.000000	93.000000	1940.000000	-37.856822	144.929600	4380.000000
50%	3.000000	9.030000e+05	9.200000	3084.000000	3.000000	1.000000	2.000000	440.000000	126.000000	1970.000000	-37.802355	145.000100	6555.000000
75%	3.000000	1.330000e+06	13.000000	3148.000000	3.000000	2.000000	2.000000	651.000000	174.000000	1999.000000	-37.756400	145.058305	10331.000000
max	10.000000	9.000000e+06	48.100000	3977.000000	20.000000	8.000000	10.000000	433014.000000	44515.000000	2018.000000	-37.408530	145.526350	21650.000000

Interpreting Data Description

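The results show 8 numbers for each column in your original dataset. The first number, the count, shows how many rows have non-missing values.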

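Missing values arise for many reasons. For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1 bedroom house. We'll come back to the topic of missing data.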
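In the summary above, BuildingArea has a count of 7130 while Rooms has 13580, which means many rows have no recorded building area. The sketch below shows one way to see the same effect on a small scale. The `toy` DataFrame and its values are hypothetical; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the four rows lack a BuildingArea value.
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4],
    "BuildingArea": [np.nan, 93.0, 126.0, np.nan],
})

total_rows = len(toy)        # all rows, missing or not
non_missing = toy.count()    # per-column count of non-missing values
missing = toy.isna().sum()   # per-column count of missing values

print(total_rows)
print(non_missing)
print(missing)
```

The count reported by describe() matches `toy.count()`, so a count below the total number of rows signals missing values in that column.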

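The second value is the mean, which is the average. Under that, std is the standard deviation, which measures how numerically spread out the values are.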
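As a small worked example, the sketch below computes the mean and standard deviation of eight made-up numbers. The `values` Series is hypothetical. Note that pandas computes the sample standard deviation by default (ddof=1), which is also what describe() reports; passing ddof=0 gives the population version.

```python
import pandas as pd

# Hypothetical values chosen so the arithmetic is easy to check by hand.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()               # sum 40 divided by 8 values
std_sample = values.std()          # default ddof=1: divides by n - 1
std_population = values.std(ddof=0)  # divides by n

print(mean)
print(std_sample)
print(std_population)
```

The squared deviations from the mean sum to 32, so the population standard deviation is the square root of 32/8 and the sample standard deviation is the square root of 32/7.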

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value. The first (smallest) value is the min. If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values. That is the 25% value (pronounced "25th percentile"). The 50th and 75th percentiles are defined analogously, and the max is the largest number.
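The sorting picture can be checked on a short list. This is a minimal sketch using five hypothetical numbers; with pandas' default linear interpolation, the percentiles land exactly on values in the list here, which will not generally be the case for real data.

```python
import pandas as pd

# Hypothetical, deliberately unsorted values.
values = pd.Series([50, 10, 40, 20, 30])

# Sorting makes the positions of the percentiles visible.
sorted_values = values.sort_values().tolist()

# quantile() takes fractions between 0 and 1.
quartiles = values.quantile([0.25, 0.5, 0.75])
summary = values.describe()

print(sorted_values)
print(quartiles)
print(summary[["min", "25%", "50%", "75%", "max"]])
```

The 50% value is the median, and the rows labelled 25%, 50% and 75% in describe() are the same numbers that quantile() returns.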

Your Turn

Get started with your first coding exercise.

Translation Source

Basic Data Exploration
