---
# System prepended metadata

title: Exploring Airbnb Dataset

---

# Exploring Airbnb Dataset
## Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Inside Airbnb initiative, this dataset describes the listing activity of homestays in New York City.

## Content
The following Airbnb activity is included in this New York dataset:
* Listings, including full descriptions and average review score. 
* Reviews, including unique id for each reviewer and detailed comments. 
* Calendar, including listing id and the price and availability for that day.

## Code
### Extract
```python=
import pandas as pd
import numpy as np
df = pd.read_csv("Data.csv")
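df.info  # no parentheses: the cell shows the bound method (a full DataFrame repr); df.info() would print the column summary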
```
**Cell output:**
```
<bound method DataFrame.info of              id                                              NAME  \
0       1001254                Clean & quiet apt home by the park   
1       1002102                             Skylit Midtown Castle   
2       1002403               THE VILLAGE OF HARLEM....NEW YORK !   
3       1002755                                               NaN   
4       1003689  Entire Apt: Spacious Studio/Loft by central park   
...         ...                                               ...   
102594  6092437                        Spare room in Williamsburg   
102595  6092990                     Best Location near Columbia U   
102596  6093542                    Comfy, bright room in Brooklyn   
102597  6094094                  Big Studio-One Stop from Midtown   
102598  6094647                              585 sf Luxury Studio   

            host id host_identity_verified    host name neighbourhood group  \
0       80014485718            unconfirmed     Madaline            Brooklyn   
1       52335172823               verified        Jenna           Manhattan   
2       78829239556                    NaN        Elise           Manhattan   
3       85098326012            unconfirmed        Garry            Brooklyn   
4       92037596077               verified       Lyndon           Manhattan   
...             ...                    ...          ...                 ...   
102594  12312296767               verified         Krik            Brooklyn   
102595  77864383453            unconfirmed        Mifan           Manhattan   
102596  69050334417            unconfirmed        Megan            Brooklyn   
102597  11160591270            unconfirmed  Christopher              Queens   
102598  68170633372            unconfirmed      Rebecca           Manhattan   
...
102596     NaN  
102597     NaN  
102598     NaN  

[102599 rows x 26 columns]>
```
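The output above is the repr of the bound method, because `info` was referenced rather than called. As a minimal sketch of what calling it looks like, the snippet below uses a tiny stand-in frame (`toy` is hypothetical and only borrows a few values from the output above; it is not the real `Data.csv`):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real Data.csv
toy = pd.DataFrame({
    "id": [1001254, 1002102, 1002755],
    "NAME": ["Clean & quiet apt home by the park", "Skylit Midtown Castle", np.nan],
})

# Calling info() with parentheses writes a per-column summary
# (dtype and non-null count); buf= captures it as text.
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

On the full dataset, `df.info()` would list all 26 columns with their dtypes and non-null counts.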
### Transform
```python=+
df = df.replace("nan", np.nan)  # replace() returns a new frame, so assign the result
df.dropna(how="all", inplace=True)  # drop row if every column is NaN
df.isnull().sum()  # check out the nulls in each column
```
**Cell output:**
```
id                                     0
NAME                                 250
host id                                0
host_identity_verified               289
host name                            406
neighbourhood group                   29
neighbourhood                         16
lat                                    8
long                                   8
country                              532
country code                         131
instant_bookable                     105
cancellation_policy                   76
room type                              0
Construction year                    214
price                                247
service fee                          273
minimum nights                       409
number of reviews                    183
last review                        15893
reviews per month                  15879
review rate number                   326
calculated host listings count       319
availability 365                     448
house_rules                        52131
license                           102597
dtype: int64
```
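Raw null counts are easier to compare as a share of rows. The sketch below computes that share on a small hypothetical frame (`toy`, its values, and the 0.9 threshold are all assumptions for illustration), and drops columns that are almost entirely empty, as `license` is in the output above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in frame; values are illustrative only
toy = pd.DataFrame({
    "price": [966.0, 142.0, np.nan, 368.0],
    "license": [np.nan, np.nan, np.nan, np.nan],
    "room type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# Fraction of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)
print(null_share)

# Assumed threshold: treat columns with more than 90% missing values as unusable
mostly_empty = null_share[null_share > 0.9].index.tolist()
cleaned = toy.drop(columns=mostly_empty)
print(cleaned.columns.tolist())
```

Whether to drop or impute a sparse column depends on the question being asked; this only shows one possible rule.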
```python=+
df.drop_duplicates(inplace=True)
df.replace({"brookln" : "Brooklyn", "manhatan" : "Manhattan"}, inplace=True)
all_neighborhood_group = df["neighbourhood group"].unique()
all_neighborhood_group
```
**Cell output:**
```python
array(['Brooklyn', 'Manhattan', 'Queens', nan, 'Staten Island', 'Bronx'],
      dtype=object)
```
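Calling `replace` on the whole DataFrame checks every column for the misspellings. A narrower variant, sketched here on a hypothetical Series, applies the same mapping to the one column that needs it:

```python
import pandas as pd

# Hypothetical sample of the "neighbourhood group" column, including the two misspellings
groups = pd.Series(
    ["Brooklyn", "brookln", "Manhattan", "manhatan", "Queens"],
    name="neighbourhood group",
)

# Mapping of misspellings to canonical names (same pairs as in the cell above)
fixes = {"brookln": "Brooklyn", "manhatan": "Manhattan"}

# Series.replace matches whole values; anything not in the mapping is left unchanged
normalized = groups.replace(fixes)
unique_groups = sorted(normalized.unique())
print(unique_groups)
```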
### Exploring the dataset
#### Q1. In the data, there are two values of host_identity_verified. Which value is more frequent?
```python=
value_counts = df["host_identity_verified"].value_counts()
print(value_counts)
```
**Cell output:**
```
unconfirmed    50944
verified       50825
Name: host_identity_verified, dtype: int64
```
**Explanation:**
There are two values for host_identity_verified: "unconfirmed" and "verified". The count for "unconfirmed" is 50,944 and the count for "verified" is 50,825, so "unconfirmed" is the more frequent value, by a margin of only 119 listings.
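To put the two counts in proportion, the following sketch rebuilds them as a Series (the numbers are copied from the cell output above) and computes each value's share and the absolute gap:

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({"unconfirmed": 50944, "verified": 50825})

# Share of each value among rows where host_identity_verified is not null
shares = counts / counts.sum()
gap = counts["unconfirmed"] - counts["verified"]

print(shares.round(4))
print(gap)
```

On the DataFrame itself, `df["host_identity_verified"].value_counts(normalize=True)` gives the same shares directly.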

#### Q2. What are the top 2 neighbourhood groups?
**By counts**
```python=
# By Counts
df["neighbourhood group"].value_counts()
```
**Cell output:**
```
Manhattan        43345
Brooklyn         41445
Queens           13120
Bronx             2677
Staten Island      943
Name: neighbourhood group, dtype: int64
```
**By review**
```python=
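# By review: mean review rate per group, weighted by each listing's number of reviews
# Note: inplace=True also removes these rows from df for every later cell
df.dropna(subset=["review rate number", "number of reviews"], inplace=True)
weighted_mean_review = df.groupby("neighbourhood group").apply(
    lambda x: np.average(x["review rate number"], weights=x["number of reviews"])
)
# Sorted in descending order, so the first two rows are the top 2 groups
top_2_neighbourhood_groups = weighted_mean_review.sort_values(ascending=False)
top_2_neighbourhood_groups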
```
**Cell output:**
```
neighbourhood group
Staten Island    3.351923
Queens           3.286842
Bronx            3.241769
Brooklyn         3.231208
Manhattan        3.217408
dtype: float64
```
**Explanation:**
There are five neighbourhood groups in the data: Staten Island, Queens, Bronx, Brooklyn, and Manhattan.
* Sorted by number of listings, the top 2 neighbourhood groups are Manhattan and Brooklyn, with 43,345 and 41,445 listings respectively.
* Sorted by mean review rate (weighted by each listing's number of reviews), the top 2 neighbourhood groups are Staten Island and Queens, with review rates of approximately 3.35 and 3.29 respectively.
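The review ranking uses a mean weighted by number of reviews, which can differ noticeably from a plain mean. A small sketch on hypothetical data (the groups "A" and "B" and all values are made up for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical listings: group A has one 5-star listing with 1 review
# and one 1-star listing with 9 reviews
toy = pd.DataFrame({
    "neighbourhood group": ["A", "A", "B", "B"],
    "review rate number": [5.0, 1.0, 3.0, 4.0],
    "number of reviews": [1, 9, 5, 5],
})

def weighted_rate(group):
    """Mean review rate, weighted by each listing's number of reviews."""
    return np.average(group["review rate number"], weights=group["number of reviews"])

grouped = toy.groupby("neighbourhood group")
plain = grouped["review rate number"].mean()
weighted = pd.Series({name: weighted_rate(g) for name, g in grouped})

print(plain)     # A: 3.0, B: 3.5
print(weighted)  # A: 1.4, B: 3.5
```

In group A the heavily reviewed 1-star listing pulls the weighted mean down to 1.4, while the plain mean stays at 3.0. With equal weights, as in group B, the two agree.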

#### Q3. How many room types are in the data and what are their proportions?
```python=
room_type_counts = df["room type"].value_counts().to_frame().reset_index()
room_type_counts.columns = ["Room Type", "Count"]
room_type_counts["Proportion"] = room_type_counts["Count"] / room_type_counts["Count"].sum()
room_type_counts
```
**Cell output:**

| Room Type       | Count | Proportion |
| --------------- | ----- | ---------- |
| Entire home/apt | 53200 | 0.523844   |
| Private room    | 46041 | 0.453351   |
| Shared room     | 2201  | 0.021673   |
| Hotel room      | 115   | 0.001132   |
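As a sanity check, the proportions can be recomputed from the counts in the table. The sketch below hard-codes those counts (copied from the cell output) rather than reading `Data.csv`:

```python
import pandas as pd

# Counts copied from the table above
room_counts = pd.Series({
    "Entire home/apt": 53200,
    "Private room": 46041,
    "Shared room": 2201,
    "Hotel room": 115,
})

total = room_counts.sum()
proportions = room_counts / total

# Combined share of the two most common room types
top_two_share = proportions.iloc[:2].sum()

print(total)
print(proportions.round(6))
print(round(top_two_share, 4))
```

The total is 101,557 rather than the 102,599 rows originally loaded, which is consistent with the duplicate removal and the in-place `dropna` calls in earlier cells having shrunk `df`.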

**Visualization:**
```python=
import matplotlib.pyplot as plt

# Create the pie chart
fig, ax = plt.subplots(figsize=(8, 8))
fig.set_facecolor("white")
wedges, labels, autopcts = ax.pie(
    room_type_counts["Proportion"],
    labels=room_type_counts["Room Type"],
    autopct="%1.2f%%",
    textprops={"color": "black"}
)

# Add the chart title
ax.set_title("Visualization of the Proportion of different Room Types", fontweight="bold")

# Display the chart
plt.show()
```
![](https://i.imgur.com/JxJvU1k.png =400x400)

**Explanation:**
* The pie chart shows that entire homes/apartments and private rooms are the most common listing types in this dataset, with a combined share of over 97%. In contrast, hotel rooms and shared rooms make up a much smaller share of the chart, at only 0.11% and 2.17% respectively.
* There could be several reasons for this trend:
    * One possible explanation is that many Airbnb users prefer to have more privacy and space, which they can get from renting an entire home or apartment or a private room. In contrast, shared rooms may not be as popular because they require guests to share a living space with strangers, which may not be appealing to everyone.
    * There are already many other platforms for travelers to book their hotel rooms, such as Booking.com, Agoda.com, and the official websites of each hotel. Therefore, travelers who have already decided to book hotel rooms may not choose the Airbnb platform.

#### Other findings
```python=
# Top neighbourhoods and the neighbourhood groups they belong to
# Keep only rows with at least one review, so the weights in each group never sum to zero
df_filtered = df[df["number of reviews"].notna() & (df["number of reviews"] > 0)].reset_index()
neighbourhoods = df_filtered.groupby(["neighbourhood", "neighbourhood group"]).apply(lambda x: np.average(x["review rate number"], weights=x["number of reviews"]))
# No sort is applied here, so the result stays in alphabetical (group key) order

# Convert the Series to a DataFrame with two columns
neighbourhoods = neighbourhoods.reset_index()
neighbourhoods.columns = ['neighbourhood', 'neighbourhood group', 'Review rate']
neighbourhoods
```
**Cell output:**
```
	neighbourhood	neighbourhood group	Review rate
0	Allerton	Bronx	2.809136
1	Arden Heights	Staten Island	3.492857
2	Arrochar	Staten Island	3.223071
3	Arverne	Queens	3.218746
4	Astoria	Queens	3.194131
...	...	...	...
218	Windsor Terrace	Brooklyn	3.019450
219	Woodhaven	Queens	3.216062
220	Woodlawn	Bronx	3.362140
221	Woodrow	Staten Island	2.000000
222	Woodside	Queens	3.298526
223 rows × 3 columns
```
**Visualization:**
```python=
# Visualization
# Create the scatterplot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data=neighbourhoods, x='neighbourhood group', y='Review rate', color='purple')

# Set axis labels and title
ax.set_xlabel('Neighbourhood Group')
ax.set_ylabel('Review Rate')
ax.set_title('Review Rate by Neighbourhood Group')

plt.show()
```
![](https://i.imgur.com/n42aYvD.png)

**Explanation:**
* This scatterplot shows the mean review rate of each neighbourhood (weighted by number of reviews), plotted against the neighbourhood group it belongs to. Each point is one neighbourhood.
* Most neighbourhoods in the Bronx, Queens, Brooklyn, and Manhattan have ratings between 3.0 and 3.5, while the neighbourhoods in Staten Island have a wider range of ratings.
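The spread visible in the scatterplot can also be summarised numerically. The sketch below uses only the five Bronx and Staten Island rows shown in the cell output above, so it illustrates the calculation and is not the full result; on the real data the same `groupby` would be applied to the `neighbourhoods` frame:

```python
import pandas as pd

# Five rows copied from the cell output above (not the full 223-row result)
sample = pd.DataFrame({
    "neighbourhood": ["Allerton", "Woodlawn", "Arden Heights", "Arrochar", "Woodrow"],
    "neighbourhood group": ["Bronx", "Bronx", "Staten Island", "Staten Island", "Staten Island"],
    "Review rate": [2.809136, 3.362140, 3.492857, 3.223071, 2.000000],
})

# Minimum, maximum and range of the review rate within each group
spread = sample.groupby("neighbourhood group")["Review rate"].agg(["min", "max"])
spread["range"] = spread["max"] - spread["min"]
widest_group = spread["range"].idxmax()

print(spread)
print(widest_group)
```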