# Exploring Airbnb Dataset
## Context
Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this dataset describes the listing activity of homestays in New York City.
## Content
The following Airbnb activity is included in this New York dataset:
* Listings, including full descriptions and average review score.
* Reviews, including unique id for each reviewer and detailed comments.
* Calendar, including listing id and the price and availability for that day.
## Code
### Extract
```python=
import pandas as pd
import numpy as np
df = pd.read_csv("Data.csv")
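df.info  # no parentheses: the cell displays the bound method's repr (which embeds the frame); call df.info() for the column summary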
```
**Cell output:**
```
<bound method DataFrame.info of id NAME \
0 1001254 Clean & quiet apt home by the park
1 1002102 Skylit Midtown Castle
2 1002403 THE VILLAGE OF HARLEM....NEW YORK !
3 1002755 NaN
4 1003689 Entire Apt: Spacious Studio/Loft by central park
... ... ...
102594 6092437 Spare room in Williamsburg
102595 6092990 Best Location near Columbia U
102596 6093542 Comfy, bright room in Brooklyn
102597 6094094 Big Studio-One Stop from Midtown
102598 6094647 585 sf Luxury Studio
host id host_identity_verified host name neighbourhood group \
0 80014485718 unconfirmed Madaline Brooklyn
1 52335172823 verified Jenna Manhattan
2 78829239556 NaN Elise Manhattan
3 85098326012 unconfirmed Garry Brooklyn
4 92037596077 verified Lyndon Manhattan
... ... ... ... ...
102594 12312296767 verified Krik Brooklyn
102595 77864383453 unconfirmed Mifan Manhattan
102596 69050334417 unconfirmed Megan Brooklyn
102597 11160591270 unconfirmed Christopher Queens
102598 68170633372 unconfirmed Rebecca Manhattan
...
102596 NaN
102597 NaN
102598 NaN
[102599 rows x 26 columns]>
```
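The output above starts with `<bound method DataFrame.info of ...`, which is what a notebook displays when `info` is referenced without being called. The sketch below illustrates the difference on a small made-up frame; the `toy` data is hypothetical and not taken from `Data.csv`.

```python
import io

import pandas as pd

# Hypothetical toy frame, used only to illustrate the behaviour
toy = pd.DataFrame({"id": [1, 2, 3], "NAME": ["a", None, "c"]})

# Referencing the attribute without calling it yields a bound method object
method = toy.info
print(callable(method))

# Calling it writes a per-column summary (dtype, non-null count) to a buffer
buffer = io.StringIO()
toy.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

On the real data, `df.info()` would list all 26 columns with their dtypes and non-null counts.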
### Transform
```python=+
df = df.replace("nan", np.nan) # replace() returns a new frame, so assign it back
df.dropna(how="all", inplace=True) # drop row if every column is NaN
df.isnull().sum() # check out the nulls in each column
```
**Cell output:**
```
id 0
NAME 250
host id 0
host_identity_verified 289
host name 406
neighbourhood group 29
neighbourhood 16
lat 8
long 8
country 532
country code 131
instant_bookable 105
cancellation_policy 76
room type 0
Construction year 214
price 247
service fee 273
minimum nights 409
number of reviews 183
last review 15893
reviews per month 15879
review rate number 326
calculated host listings count 319
availability 365 448
house_rules 52131
license 102597
dtype: int64
```
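Raw null counts are easier to judge as a share of all rows. For instance, `license` is null in 102,597 rows of a frame that has about 102,600 rows, so it carries almost no information. As a rough sketch, `isnull().mean()` gives that share per column; the `toy` frame and the 50% threshold below are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    "price": [100.0, np.nan, 80.0, 120.0],
    "license": [np.nan, np.nan, np.nan, "x"],
})

# The mean of a boolean mask is the fraction of True values, i.e. the null share
missing_share = toy.isnull().mean()

# Columns that are missing in more than half of the rows (threshold is an assumption)
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(missing_share)
print(mostly_missing)
```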
```python=+
df.drop_duplicates(inplace=True)
df.replace({"brookln" : "Brooklyn", "manhatan" : "Manhattan"}, inplace=True)
all_neighborhood_group = df["neighbourhood group"].unique()
all_neighborhood_group
```
**Cell output:**
```python
array(['Brooklyn', 'Manhattan', 'Queens', nan, 'Staten Island', 'Bronx'],
dtype=object)
```
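`DataFrame.replace` with a dictionary looks for those values in every column. A possible variant, sketched here on hypothetical toy data, restricts the correction to the `neighbourhood group` column so that no other column is touched.

```python
import pandas as pd

# Hypothetical toy frame; the misspelled host name is deliberate
toy = pd.DataFrame({
    "neighbourhood group": ["Brooklyn", "brookln", "manhatan", "Queens"],
    "host name": ["Ann", "Bob", "manhatan", "Dee"],
})

fixes = {"brookln": "Brooklyn", "manhatan": "Manhattan"}

# Apply the spelling fixes to one column only
toy["neighbourhood group"] = toy["neighbourhood group"].replace(fixes)

groups = sorted(toy["neighbourhood group"].unique())
print(groups)
print(toy["host name"].tolist())
```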
### Exploring the dataset
#### Q1. The column host_identity_verified takes two values. Which one is more frequent?
```python=
value_counts = df["host_identity_verified"].value_counts()
print(value_counts)
```
**Cell output:**
```
unconfirmed 50944
verified 50825
Name: host_identity_verified, dtype: int64
```
**Explanation:**
The column host_identity_verified takes two values: "unconfirmed" and "verified". "unconfirmed" appears 50,944 times and "verified" 50,825 times, so "unconfirmed" is the more frequent value. Rows where the value is missing are not counted, because `value_counts()` skips NaN by default.
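The two counts are close, which is easier to see as shares of the total. The sketch below only redoes the arithmetic on the two counts printed in the cell output above; the variable names are illustrative.

```python
# Counts copied from the cell output above
counts = {"unconfirmed": 50944, "verified": 50825}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}
margin = counts["unconfirmed"] - counts["verified"]

for label, share in shares.items():
    print(f"{label}: {share:.2%}")
print("margin:", margin)
```

Both values come out at roughly half of the non-missing rows, so the difference between them is small relative to the dataset size.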
#### Q2. What are the top 2 neighbourhood groups?
**By counts**
```python=
# By Counts
df["neighbourhood group"].value_counts()
```
**Cell output:**
```
Manhattan 43345
Brooklyn 41445
Queens 13120
Bronx 2677
Staten Island 943
Name: neighbourhood group, dtype: int64
```
**By review**
```python=
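# By review
# Note: this drops rows in place, so every later cell works on the reduced frame
df.dropna(subset=["review rate number", "number of reviews"], inplace=True)
# Mean review rate per group, weighted by each listing's number of reviews
weighted_mean_review = df.groupby("neighbourhood group").apply(lambda x: np.average(x["review rate number"], weights=x["number of reviews"]))
# Sort all groups from highest to lowest; the first two rows are the top 2
top_2_neighbourhood_groups = weighted_mean_review.sort_values(ascending=False)
top_2_neighbourhood_groups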
```
**Cell output:**
```
neighbourhood group
Staten Island 3.351923
Queens 3.286842
Bronx 3.241769
Brooklyn 3.231208
Manhattan 3.217408
dtype: float64
```
**Explanation:**
There are five neighbourhood groups in the data: Staten Island, Queens, Bronx, Brooklyn, and Manhattan.
* Ranked by number of listings, the top 2 neighbourhood groups are Manhattan and Brooklyn, with 43,345 and 41,445 listings respectively.
* Ranked by average review rate, weighted by each listing's number of reviews, the top 2 neighbourhood groups are Staten Island and Queens, with review rates of approximately 3.35 and 3.29 respectively.
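
The review ranking uses a weighted mean, so a listing with many reviews counts for more than a listing with few. The sketch below shows how that differs from a plain mean; the `toy` frame and its column names are hypothetical, and the weighted mean is computed from group sums rather than with `groupby().apply()`.

```python
import pandas as pd

# Hypothetical toy data: two groups, two listings each
toy = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "rating": [5.0, 3.0, 4.0, 2.0],
    "n_reviews": [1, 3, 2, 2],
})

# Weighted mean = sum(rating * weight) / sum(weight), computed per group
toy["weighted"] = toy["rating"] * toy["n_reviews"]
sums = toy.groupby("group")[["weighted", "n_reviews"]].sum()
weighted_mean = sums["weighted"] / sums["n_reviews"]

# Plain mean for comparison: every listing counts equally
simple_mean = toy.groupby("group")["rating"].mean()

print(weighted_mean)
print(simple_mean)
```

In group A the listing rated 3.0 has three times as many reviews as the listing rated 5.0, so the weighted mean (3.5) sits below the plain mean (4.0).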
#### Q3. How many room types are in the data and what are their proportions?
```python=
room_type_counts = df["room type"].value_counts().to_frame().reset_index()
room_type_counts.columns = ["Room Type", "Count"]
room_type_counts["Proportion"] = room_type_counts["Count"] / room_type_counts["Count"].sum()
room_type_counts
```
**Cell output:**
| Room Type       | Count | Proportion |
| --------------- | ----- | ---------- |
| Entire home/apt | 53200 | 0.523844   |
| Private room    | 46041 | 0.453351   |
| Shared room     | 2201  | 0.021673   |
| Hotel room      | 115   | 0.001132   |
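
The column names produced by `value_counts().to_frame().reset_index()` differ between pandas versions, which is why the cell above renames the columns explicitly. An alternative that names the columns directly might look like the sketch below; the `toy` series is hypothetical.

```python
import pandas as pd

# Hypothetical toy series standing in for df["room type"]
toy = pd.Series(
    ["Private room", "Entire home/apt", "Entire home/apt", "Shared room"],
    name="room type",
)

# Name the index and the count column explicitly instead of relying on defaults
table = toy.value_counts().rename_axis("Room Type").reset_index(name="Count")
table["Proportion"] = table["Count"] / table["Count"].sum()
print(table)
```

`value_counts(normalize=True)` would give the proportions directly if the raw counts are not needed.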
**Visualization:**
```python=
import matplotlib.pyplot as plt
# Create the pie chart
fig, ax = plt.subplots(figsize=(8, 8))
fig.set_facecolor("white")
wedges, labels, autopcts = plt.pie(
    room_type_counts["Proportion"],
    labels=room_type_counts["Room Type"],
    autopct="%1.2f%%",
    textprops={"color": "black"},
)
# Add the chart title
plt.title("Visualization of the Proportion of different Room Types", fontweight="bold")
# Display the chart
plt.show()
```

**Explanation:**
* The pie chart shows that entire homes/apartments and private rooms are the most common listing types in this dataset, with a combined share of over 97%. In contrast, shared rooms and hotel rooms account for only 2.17% and 0.11% of listings, respectively.
* There could be several reasons for this pattern:
    * Many Airbnb users may prefer more privacy and space, which an entire home or apartment, or a private room, provides. Shared rooms require guests to share a living space with strangers, which may not appeal to everyone.
    * Travelers already have many other platforms for booking hotel rooms, such as Booking.com, Agoda.com, and the hotels' own websites. Travelers who have already decided on a hotel room may therefore not turn to Airbnb.
#### Other findings
```python=
# Top neighbourhoods and the neighbourhood groups they belong to
# Keep only listings with at least one review, so the weights never sum to zero
df_filtered = df[df["number of reviews"].notna() & (df["number of reviews"] > 0)].reset_index()
neighbourhoods = df_filtered.groupby(["neighbourhood", "neighbourhood group"]).apply(lambda x: np.average(x["review rate number"], weights=x["number of reviews"]))
# Convert the Series to a DataFrame with three columns (rows stay in alphabetical order)
neighbourhoods = neighbourhoods.reset_index()
neighbourhoods.columns = ['neighbourhood', 'neighbourhood group', 'Review rate']
neighbourhoods
```
**Cell output:**
```
neighbourhood neighbourhood group Review rate
0 Allerton Bronx 2.809136
1 Arden Heights Staten Island 3.492857
2 Arrochar Staten Island 3.223071
3 Arverne Queens 3.218746
4 Astoria Queens 3.194131
... ... ... ...
218 Windsor Terrace Brooklyn 3.019450
219 Woodhaven Queens 3.216062
220 Woodlawn Bronx 3.362140
221 Woodrow Staten Island 2.000000
222 Woodside Queens 3.298526
223 rows × 3 columns
```
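Filtering on `number of reviews > 0` matters because `np.average` raises a `ZeroDivisionError` when the weights sum to zero, which would happen for any neighbourhood whose listings all have zero reviews. The sketch below demonstrates this, together with `weighted_mean`, a hypothetical helper (not part of the original notebook) that returns NaN in that case instead.

```python
import numpy as np


def weighted_mean(values, weights):
    """Return the weighted mean, or NaN when the weights sum to zero."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if weights.sum() == 0:
        return float("nan")
    return float(np.average(values, weights=weights))


# np.average itself refuses weights that sum to zero
try:
    np.average([4.0, 2.0], weights=[0, 0])
    raised = False
except ZeroDivisionError:
    raised = True

print(raised)
print(weighted_mean([4.0, 2.0], [0, 0]))
print(weighted_mean([5.0, 3.0], [1, 3]))
```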
**Visualization:**
```python=
# Visualization
# Create the scatterplot
fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(data=neighbourhoods, x='neighbourhood group', y='Review rate', color='purple')
# Set axis labels and title
ax.set_xlabel('Neighbourhood Group')
ax.set_ylabel('Review Rate')
ax.set_title('Review Rate by Neighbourhood Group')
plt.show()
```

**Explanation:**
* This scatterplot shows the review-weighted mean review rate of each neighbourhood, arranged by the neighbourhood group it belongs to.
* Most neighbourhoods in the Bronx, Queens, Brooklyn, and Manhattan have ratings between 3.0 and 3.5, while neighbourhoods in Staten Island span a wider range.
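
The spread that the scatterplot shows visually can also be put into numbers. One possible approach, sketched on hypothetical toy values rather than the real `neighbourhoods` frame, is to compute the minimum, maximum, and range of the review rate per group.

```python
import pandas as pd

# Hypothetical toy values; the column names mirror the neighbourhoods frame
toy = pd.DataFrame({
    "neighbourhood group": ["Bronx", "Bronx", "Staten Island", "Staten Island"],
    "Review rate": [3.1, 3.4, 2.2, 3.7],
})

# Minimum and maximum review rate per group, plus the distance between them
spread = toy.groupby("neighbourhood group")["Review rate"].agg(["min", "max"])
spread["range"] = spread["max"] - spread["min"]
widest_group = spread["range"].idxmax()

print(spread)
print(widest_group)
```

Running the same aggregation on `neighbourhoods` instead of `toy` would show whether Staten Island has the widest range in the actual data.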