# Exploring Airbnb Dataset

## Context

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in New York City.

## Content

The following Airbnb activity is included in this New York dataset:

* Listings, including full descriptions and average review score.
* Reviews, including unique id for each reviewer and detailed comments.
* Calendar, including listing id and the price and availability for that day.

## Code

### Extract

```python=
import pandas as pd
import numpy as np

df = pd.read_csv("Data.csv")
# Without parentheses this displays the bound method's repr (which embeds a preview of the frame);
# call df.info() to get the per-column summary instead.
df.info
```

**Cell output:**

```
<bound method DataFrame.info of              id                                              NAME  \
0       1001254                Clean & quiet apt home by the park
1       1002102                             Skylit Midtown Castle
2       1002403               THE VILLAGE OF HARLEM....NEW YORK !
3       1002755                                               NaN
4       1003689  Entire Apt: Spacious Studio/Loft by central park
...         ...                                               ...
102594  6092437                        Spare room in Williamsburg
102595  6092990                     Best Location near Columbia U
102596  6093542                    Comfy, bright room in Brooklyn
102597  6094094                  Big Studio-One Stop from Midtown
102598  6094647                              585 sf Luxury Studio

            host id host_identity_verified    host name neighbourhood group  \
0       80014485718            unconfirmed     Madaline            Brooklyn
1       52335172823               verified        Jenna           Manhattan
2       78829239556                    NaN        Elise           Manhattan
3       85098326012            unconfirmed        Garry            Brooklyn
4       92037596077               verified       Lyndon           Manhattan
...             ...                    ...          ...                 ...
102594  12312296767               verified         Krik            Brooklyn
102595  77864383453            unconfirmed        Mifan           Manhattan
102596  69050334417            unconfirmed        Megan            Brooklyn
102597  11160591270            unconfirmed  Christopher              Queens
102598  68170633372            unconfirmed      Rebecca           Manhattan
...
102596  NaN
102597  NaN
102598  NaN

[102599 rows x 26 columns]>
```

### Transform

```python=+
df = df.replace("nan", np.nan)      # replace() returns a new frame, so assign the result
df.dropna(how="all", inplace=True)  # drop a row only if every column is NaN
df.isnull().sum()                   # count the nulls in each column
```

**Cell output:**

```
id                                     0
NAME                                 250
host id                                0
host_identity_verified               289
host name                            406
neighbourhood group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country code                         131
instant_bookable                     105
cancellation_policy                   76
room type                              0
Construction year                    214
price                                247
service fee                          273
minimum nights                       409
number of reviews                    183
last review                        15893
reviews per month                  15879
review rate number                   326
calculated host listings count       319
availability 365                     448
house_rules                        52131
license                           102597
dtype: int64
```

```python=+
df.drop_duplicates(inplace=True)
# Fix misspelled borough names
df.replace({"brookln": "Brooklyn", "manhatan": "Manhattan"}, inplace=True)
all_neighborhood_group = df["neighbourhood group"].unique()
all_neighborhood_group
```

**Cell output:**

```python
array(['Brooklyn', 'Manhattan', 'Queens', nan, 'Staten Island', 'Bronx'],
      dtype=object)
```

### Exploring the dataset

#### Q1. In the data, there are two values of host_identity_verified. Which value is more common?

```python=
value_counts = df["host_identity_verified"].value_counts()
print(value_counts)
```

**Cell output:**

```
unconfirmed    50944
verified       50825
Name: host_identity_verified, dtype: int64
```

**Explanation:** There are two values for host_identity_verified: "unconfirmed" and "verified." "unconfirmed" appears 50,944 times and "verified" appears 50,825 times, so "unconfirmed" is the more common value, by a margin of only 119 listings.

#### Q2. What are the top 2 neighbourhood groups?
**By counts**

```python=
# By counts
df["neighbourhood group"].value_counts()
```

**Cell output:**

```
Manhattan        43345
Brooklyn         41445
Queens           13120
Bronx             2677
Staten Island      943
Name: neighbourhood group, dtype: int64
```

**By review**

```python=
# By review
# Note: this drops rows in place, so later cells only see listings that have both values.
df.dropna(subset=["review rate number", "number of reviews"], inplace=True)

# Mean review rate per group, weighted by each listing's number of reviews
weighted_mean_review = df.groupby("neighbourhood group").apply(
    lambda x: np.average(x["review rate number"], weights=x["number of reviews"])
)

# All groups sorted from highest to lowest; the first two rows are the top 2
top_2_neighbourhood_groups = weighted_mean_review.sort_values(ascending=False)
top_2_neighbourhood_groups
```

**Cell output:**

```
neighbourhood group
Staten Island    3.351923
Queens           3.286842
Bronx            3.241769
Brooklyn         3.231208
Manhattan        3.217408
dtype: float64
```

**Explanation:** There are five neighbourhood groups in the data: Staten Island, Queens, Bronx, Brooklyn, and Manhattan.

* Sorting the groups by listing count, the top 2 neighbourhood groups are Manhattan and Brooklyn, with 43,345 and 41,445 listings respectively.
* Sorting the groups by average review rate (weighted by each listing's number of reviews), the top 2 neighbourhood groups are Staten Island and Queens, with review rates of approximately 3.35 and 3.29 respectively.

#### Q3. How many room types are in the data and what are their proportions?
```python=
room_type_counts = df["room type"].value_counts().to_frame().reset_index()
room_type_counts.columns = ["Room Type", "Count"]
room_type_counts["Proportion"] = room_type_counts["Count"] / room_type_counts["Count"].sum()
room_type_counts
```

**Cell output:**

| Room Type       | Count | Proportion |
| --------------- | ----- | ---------- |
| Entire home/apt | 53200 | 0.523844   |
| Private room    | 46041 | 0.453351   |
| Shared room     | 2201  | 0.021673   |
| Hotel room      | 115   | 0.001132   |

**Visualization:**

```python=
import matplotlib.pyplot as plt

# Create the pie chart
fig, ax = plt.subplots(figsize=(8, 8))
fig.set_facecolor("white")
wedges, labels, autopcts = ax.pie(
    room_type_counts["Proportion"],
    labels=room_type_counts["Room Type"],
    autopct="%1.2f%%",
    textprops={"color": "black"},
)

# Add the chart title
ax.set_title("Visualization of the Proportion of different Room Types", fontweight="bold")

# Display the chart
plt.show()
```

![](https://i.imgur.com/JxJvU1k.png =400x400)

**Explanation:**

* There are four room types in the data. The pie chart shows that entire homes/apartments and private rooms are the most common types of listings in this dataset, with a combined proportion of over 97%. In contrast, hotel rooms and shared rooms make up a much smaller share of listings, at only about 0.11% and 2.17%, respectively.
* There could be several reasons for this trend:
    * One possible explanation is that many Airbnb users prefer more privacy and space, which they can get by renting an entire home or apartment, or a private room. Shared rooms may be less popular because guests have to share a living space with strangers, which does not appeal to everyone.
    * There are already many other platforms for booking hotel rooms, such as Booking.com, Agoda.com, and the hotels' own official websites. Travelers who have already decided to book a hotel room may therefore not turn to Airbnb.
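As a side note on the proportion calculation in Q3, pandas can also return the shares directly through `value_counts(normalize=True)`. The sketch below is only an illustration under stated assumptions: the `toy` frame and its ten rows are hypothetical stand-ins for the real `room type` column, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the "room type" column of the dataset.
toy = pd.DataFrame(
    {"room type": ["Entire home/apt"] * 5 + ["Private room"] * 4 + ["Shared room"]}
)

# normalize=True returns each category's share of the non-null rows,
# so the manual Count / Count.sum() division is not needed.
proportions = (
    toy["room type"]
    .value_counts(normalize=True)
    .rename_axis("Room Type")          # name the index so it becomes a labelled column
    .reset_index(name="Proportion")    # turn the Series into a two-column DataFrame
)

print(proportions)
```

Naming both columns explicitly also avoids relying on the default column labels that `value_counts()` produces, which differ between pandas versions.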
#### Other findings

```python=
# Neighbourhoods and the neighbourhood groups they belong to
# Filter out rows where the number of reviews is zero or missing
df_filtered = df[df["number of reviews"].notna() & (df["number of reviews"] > 0)].reset_index()

# Mean review rate per neighbourhood, weighted by each listing's number of reviews.
# The result is ordered by the group keys (neighbourhood name), as in the output below.
neighbourhoods = df_filtered.groupby(["neighbourhood", "neighbourhood group"]).apply(
    lambda x: np.average(x["review rate number"], weights=x["number of reviews"])
)

# Convert the Series to a DataFrame with three columns
neighbourhoods = neighbourhoods.reset_index()
neighbourhoods.columns = ['neighbourhood', 'neighbourhood group', 'Review rate']
neighbourhoods
```

**Cell output:**

```
       neighbourhood  neighbourhood group  Review rate
0           Allerton                Bronx     2.809136
1      Arden Heights        Staten Island     3.492857
2           Arrochar        Staten Island     3.223071
3            Arverne               Queens     3.218746
4            Astoria               Queens     3.194131
...              ...                  ...          ...
218  Windsor Terrace             Brooklyn     3.019450
219        Woodhaven               Queens     3.216062
220         Woodlawn                Bronx     3.362140
221          Woodrow        Staten Island     2.000000
222         Woodside               Queens     3.298526

223 rows × 3 columns
```

**Visualization:**

```python=
# Visualization
# Create the scatterplot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data=neighbourhoods, x='neighbourhood group', y='Review rate', color='purple')

# Set axis labels and title
ax.set_xlabel('Neighbourhood Group')
ax.set_ylabel('Review Rate')
ax.set_title('Review Rate by Neighbourhood Group')
plt.show()
```

![](https://i.imgur.com/n42aYvD.png)

**Explanation:**

* This scatterplot shows the mean review rate of each neighbourhood (weighted by number of reviews), plotted against the neighbourhood group it belongs to.
* Most neighbourhoods in the Bronx, Queens, Brooklyn, and Manhattan have ratings between 3.0 and 3.5, while the neighbourhoods in Staten Island have a wider range of ratings.
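The "wider range" reading of the scatterplot can also be checked numerically by summarising the spread of review rates within each neighbourhood group. The snippet below is a minimal sketch, not part of the original analysis: `toy` is a hypothetical five-row stand-in for the `neighbourhoods` frame, using only the Bronx and Staten Island rows visible in the cell output above, so its numbers do not describe the full dataset.

```python
import pandas as pd

# Hypothetical stand-in for the `neighbourhoods` frame: only the Bronx and
# Staten Island rows that are visible in the truncated cell output.
toy = pd.DataFrame(
    {
        "neighbourhood": ["Allerton", "Woodlawn", "Arden Heights", "Arrochar", "Woodrow"],
        "neighbourhood group": ["Bronx", "Bronx", "Staten Island", "Staten Island", "Staten Island"],
        "Review rate": [2.809136, 3.362140, 3.492857, 3.223071, 2.000000],
    }
)

# Per-group minimum, maximum, and number of neighbourhoods
spread = toy.groupby("neighbourhood group")["Review rate"].agg(["min", "max", "count"])

# Range = max - min; a larger value means the group's neighbourhoods are more spread out
spread["range"] = spread["max"] - spread["min"]
spread = spread.sort_values("range", ascending=False)

print(spread)
```

Running the same aggregation on the real `neighbourhoods` frame would show whether Staten Island's spread holds across all of its neighbourhoods. A value such as Woodrow's exact 2.0 may come from very few reviews, so it could be worth checking the review counts behind the extremes as well.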