---
title: Project14
tags: teach:MF
---
###### tags: `機器學習與金融科技`
# ML and FinTech: Project by 宋昱奇
> **keywords**: property, Machine Learning
## 1. Motivations
Hotel Booking Cancellation Prediction
* Hotels would benefit from a model that predicts whether a guest will actually show up. Such a model helps a hotel plan staffing and food requirements.
* Some hotels may also use such a model to overbook, offering more rooms than they have in order to increase revenue.

---
## 2. EDA
* Feature descriptions: an explanation of each feature

| Feature name | Explanation |
| --- | --- |
| Hotel | Hotel type (H1 = Resort Hotel, H2 = City Hotel) |
| is_canceled | Whether the booking was canceled (1) or not (0) |
| lead_time | Number of days elapsed between the booking date and the arrival date |
| arrival_date_year | Year of arrival date |
| arrival_date_month | Month of arrival date |
| arrival_date_week_number | Week number of the year for the arrival date |
| arrival_date_day_of_month | Day of the month of the arrival date |
| stays_in_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | Number of adults |
| children | Number of children |
| babies | Number of babies |
| meal | Type of meal booked. BB: Bed & Breakfast; HB: Half board (breakfast and one other meal); FB: Full board |
| country | Country of origin |
| market_segment | Market segment. In the categories, “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | Booking distribution channel. “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | Whether the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | Number of previous bookings that were canceled by the customer prior to the current booking |
| previous_bookings_not_canceled | Number of previous bookings not canceled by the customer prior to the current booking |
| reserved_room_type | Code of the room type reserved |
| assigned_room_type | Code of the room type assigned to the booking |
| booking_changes | Number of changes/amendments made to the booking from the moment the booking was entered on the PMS until the moment of check-in or cancellation | |
| deposit_type | Whether the customer made a deposit to guarantee the booking |
| agent | ID of the travel agency that made the booking | |
| company | ID of the company | |
| days_in_waiting_list | Number of days the booking was in the waiting list before it was confirmed to the customer | |
| customer_type | Type of booking, assuming one of four categories: Contract, Group, Transient, Transient-party |
| adr | Average Daily Rate as defined by dividing the sum of all lodging transactions by the total number of staying nights | |
| required_car_parking_spaces | Number of car parking spaces required by the customer | |
| total_of_special_requests | Number of special requests made by the customer (e.g. twin bed or high floor) | |
| reservation_status | Reservation last status | |
| reservation_status_date | Date at which the last status was set. | |
* Briefly browse the data


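```
import matplotlib.pyplot as plt

# Plot a histogram of every numeric column in the bookings DataFrame `df`
df.hist(figsize=(20, 15))
plt.show()
```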

2.1 Bookings and cancellations by hotel

City Hotel has more bookings and more cancellations than Resort Hotel. The cancellation rate is 27.8% for Resort Hotel and 41.7% for City Hotel.
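The per-hotel figures above come down to a group-by on the hotel type. A minimal sketch of that calculation follows, assuming the columns are named `hotel` and `is_canceled`; the five rows are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy stand-in for the bookings table (made-up rows, not the real dataset).
df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel", "Resort Hotel", "Resort Hotel"],
    "is_canceled": [1, 0, 1, 0, 0],
})

# Count bookings and cancellations per hotel, then derive the cancellation rate.
summary = df.groupby("hotel")["is_canceled"].agg(bookings="count", cancellations="sum")
summary["cancel_rate"] = summary["cancellations"] / summary["bookings"]
print(summary)
```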
2.2 Bookings and cancellation rate by month


2.3 Customer type

2.4 Booking channel

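2.5 Average daily spending by customer type

Average daily spending is higher at City Hotel than at Resort Hotel for every customer type. Among the four customer types, Transient guests spend the most and Group guests the least.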
2.6 Top 10 Country of Origin graph

2.7 Room type and cancellations

Among the 7 most-booked room types, types A and D have higher cancellation rates than the others, with type A reaching 44.5%.

---
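Brief summary
1. City Hotel has far more bookings and a much higher cancellation rate than Resort Hotel.
2. The cancellation rate of new guests is 24% higher than that of repeat guests, so hotels should pay particular attention to the booking and stay experience of new guests.
3. Bookings with the Non Refund deposit type have a cancellation rate as high as 99%.
4. Room types A and D have much higher cancellation rates than the other room types.
5. Hotels should make the most of the peak travel season in July and August each year: they can raise prices moderately while maintaining service quality to earn more profit, and run promotions in the off-season (winter), such as Christmas sales and New Year events, to reduce the vacancy rate.

---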
## 3. Problem formulation
We would like to predict $Y$ from $x=(x_1,x_2,\ldots,x_p)$ with $p=31$, where

$$
Y =
\left\{\begin{array}{ll}1, & \mbox{if the booking is canceled},\\0, & \mbox{if the booking is not canceled.}\end{array}\right.
$$

We use logistic regression as the benchmark method.
For both the training and the test data, we compare models by their accuracy.
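For reference, the benchmark model and the comparison metric can be written out as follows. This is the standard textbook form of logistic regression. The coefficient symbols $\beta_0,\beta_1,\ldots,\beta_p$, the 0.5 decision threshold, and the sample size $n$ are notation assumed here, not taken from the project code.

$$
P(Y=1\mid x)=\frac{1}{1+\exp\left(-\left(\beta_0+\sum_{j=1}^{p}\beta_j x_j\right)\right)},\qquad
\hat{y}=\left\{\begin{array}{ll}1, & P(Y=1\mid x)\ge 0.5,\\0, & \mbox{otherwise}\end{array}\right.
$$

$$
\mbox{accuracy}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\{\hat{y}_i=y_i\}
$$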
## 4. Data analysis

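The reservation status (`reservation_status`) has the highest correlation with cancellation, at 0.92. Because this feature records the final outcome of the booking, keeping it would likely cause the model to overfit, so we drop it.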
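A small sketch of this step is given below. The rows and the status values are made up for illustration, and `lead_time` stands in for the remaining features; only the column names `is_canceled` and `reservation_status` and the variable names `X_model` and `y_model` come from this report.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "is_canceled": [1, 0, 1, 0, 1, 0],
    "lead_time": [120, 10, 45, 60, 200, 5],
    "reservation_status": [
        "Canceled", "Check-Out", "No-Show", "Check-Out", "Canceled", "Check-Out",
    ],
})

# Cross-tabulate status against the label: in these toy rows every status value
# falls under exactly one label, so the status alone reveals the label.
table = pd.crosstab(df["reservation_status"], df["is_canceled"])
print(table)

# Drop the label and the leaking column to build the feature matrix.
X_model = df.drop(columns=["is_canceled", "reservation_status"])
y_model = df["is_canceled"]
```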
We use stratified K-fold cross-validation (K=5) to validate our models.
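```
from sklearn.model_selection import StratifiedKFold

# Stratified K-fold cross-validation: each fold keeps the class ratio of y_model
kfold_cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
for train_index, test_index in kfold_cv.split(X_model, y_model):
    # Select the rows of the current training and validation folds by position
    X_train, X_test = X_model.iloc[train_index], X_model.iloc[test_index]
    y_train, y_test = y_model.iloc[train_index], y_model.iloc[test_index]
```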
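The loop above only produces the splits. One way to turn them into a validation score is to fit a model in each fold and average the fold accuracies. The sketch below assumes synthetic data from `make_classification` as a stand-in for `X_model` and `y_model`, and uses logistic regression with `max_iter=1000` as an illustrative choice; neither is taken from the project code.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the prepared hotel features and labels (assumption).
X_arr, y_arr = make_classification(n_samples=500, n_features=8, random_state=42)
X_model = pd.DataFrame(X_arr)
y_model = pd.Series(y_arr)

kfold_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_index, test_index in kfold_cv.split(X_model, y_model):
    X_train, X_test = X_model.iloc[train_index], X_model.iloc[test_index]
    y_train, y_test = y_model.iloc[train_index], y_model.iloc[test_index]

    # Fit a fresh model on the training folds and score it on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    fold_scores.append(accuracy_score(y_test, model.predict(X_test)))

# The cross-validated accuracy is the mean over the 5 folds.
mean_accuracy = float(np.mean(fold_scores))
print(f"mean CV accuracy: {mean_accuracy:.3f}")
```

Creating the model inside the loop keeps the folds independent, so no fitted state carries over from one fold to the next.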
## 5. Conclusion
**Table 1**

In-sample test: accuracy on the validation data.

| | Logistic (benchmark result) | Random Forest | XGBoost |
| -------- | -------- | -------- | -------- |
| Accuracy | 0.866907 | 0.954030 | 0.945595 |

Out-of-sample test: accuracy with a simple 75%-25% split of the data.

| | Logistic (benchmark result) | Random Forest | XGBoost |
| -------- | -------- | -------- | -------- |
| Accuracy | 0.812152 | 0.887894 | 0.873261 |
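The out-of-sample comparison can be sketched as below. This is an illustration under stated assumptions, not the project's actual pipeline: the data is synthetic (`make_classification`), the model settings are defaults chosen for the example, and XGBoost is left out so the sketch depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hotel features and labels (assumption).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Simple 75%-25% split, stratified so both parts keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model on the 75% part and measure accuracy on the held-out 25%.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```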



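The two tree-based models, Random Forest and XGBoost, perform best on this dataset. Random Forest reaches an out-of-sample accuracy of about 88.8% and an AUC of 0.95. Hyperparameter tuning could improve the models further.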
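One possible way to carry out such tuning is a small grid search scored by AUC. The sketch below is hypothetical: the data is synthetic, and the grid values for `n_estimators` and `max_depth` are illustrative choices rather than settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the hotel features and labels (assumption).
X, y = make_classification(n_samples=400, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical grid; the values are illustrative, not tuned for the hotel data.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # pick the combination with the best cross-validated AUC
    cv=3,
)
search.fit(X_train, y_train)

# AUC on the held-out split, using the predicted probability of class 1.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(search.best_params_, round(test_auc, 3))
```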
---
## Reference
https://www.kaggle.com/jessemostipak/hotel-booking-demand