---
title: Project14
tags: teach:MF
---

###### tags: `機器學習與金融科技`

# ML and FinTech: Project by 宋昱奇

> **keywords**: property, Machine Learning

## 1. Motivations

Hotel Booking Cancellation Prediction

* It would be useful for hotels to have a model that predicts whether a guest will actually show up. Such a prediction can help a hotel plan things like personnel and food requirements. Some hotels may also use such a model to offer more rooms than they have in order to make more money.

---

## 2. EDA

* Feature descriptions: explanations of each feature

| Feature name | Explanation |
| ------------------------------ | ------------------------------------------------------------ |
| hotel | Hotel type (H1 = Resort Hotel, H2 = City Hotel) |
| is_canceled | Whether the booking was canceled (1) or not (0) |
| lead_time | Number of days that elapsed between the booking date and the arrival date |
| arrival_date_year | Year of arrival date |
| arrival_date_month | Month of arrival date |
| arrival_date_week_number | Week number of the year for the arrival date |
| arrival_date_day_of_month | Day of the month of the arrival date |
| stays_in_weekend_nights | Number of weekend nights (Saturday or Sunday) the guest stayed or booked to stay at the hotel |
| stays_in_week_nights | Number of week nights (Monday to Friday) the guest stayed or booked to stay at the hotel |
| adults | Number of adults |
| children | Number of children |
| babies | Number of babies |
| meal | Type of meal booked. BB = Bed & Breakfast; HB = Half board (breakfast and one other meal); FB = Full board |
| country | Country of origin |
| market_segment | Market segment. In the categories, “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| distribution_channel | Booking distribution channel. “TA” means “Travel Agents” and “TO” means “Tour Operators” |
| is_repeated_guest | Whether the booking name was from a repeated guest (1) or not (0) |
| previous_cancellations | Number of previous bookings that were cancelled by the customer prior to the current booking |
| previous_bookings_not_canceled | Number of previous bookings not cancelled by the customer prior to the current booking |
| reserved_room_type | Code of the room type reserved |
| assigned_room_type | Code of the room type assigned to the booking |
| booking_changes | Number of changes/amendments made to the booking from the moment it was entered on the PMS until the moment of check-in or cancellation |
| deposit_type | Whether the customer made a deposit to guarantee the booking |
| agent | ID of the travel agency that made the booking |
| company | ID of the company |
| days_in_waiting_list | Number of days the booking was on the waiting list before it was confirmed to the customer |
| customer_type | Type of booking, one of four categories: Contract, Group, Transient, Transient-party |
| adr | Average Daily Rate, defined as the sum of all lodging transactions divided by the total number of staying nights |
| required_car_parking_spaces | Number of car parking spaces required by the customer |
| total_of_special_requests | Number of special requests made by the customer (e.g. twin bed or high floor) |
| reservation_status | Last status of the reservation |
| reservation_status_date | Date at which the last status was set |

* Briefly browse the data

![](https://i.imgur.com/xFHn7jT.png)

![](https://i.imgur.com/7dc7hK7.png)

```
import matplotlib.pyplot as plt
# Plot a histogram of every numeric column of the DataFrame `df`
df.hist(figsize=(20, 15))
plt.show()
```

![](https://i.imgur.com/ehEuG7C.png)

2.1 Hotel bookings and cancellations

![](https://i.imgur.com/iGK4lCi.png)

City Hotel has more bookings and more cancellations than Resort Hotel. The cancellation rate is 27.8% for Resort Hotel and 41.7% for City Hotel.

2.2 Bookings and cancellation rate by month

![](https://i.imgur.com/1QSi9Rg.png)

![](https://i.imgur.com/V2b4hmX.png)

2.3 Customer type

![](https://i.imgur.com/4XIriQd.png)

2.4 Booking channel

![](https://i.imgur.com/qwZqzn5.png)

2.5 Average daily spending by customer type

![](https://i.imgur.com/jl8Gsty.png)

For every customer type, average daily spending is higher at City Hotel than at Resort Hotel. Among the four customer types, Transient guests spend the most and Group guests the least.

2.6 Top 10 countries of origin

![](https://i.imgur.com/ZceghjN.png)

2.7 Room type and cancellations

![](https://i.imgur.com/uc90SJP.png)

Among the 7 most-booked room types, types A and D have higher cancellation rates than the others; the rate for type A is as high as 44.5%.

---

Brief summary

1. City Hotel's booking volume and cancellation rate are both far higher than Resort Hotel's.
2. The cancellation rate of new guests is 24% higher than that of repeat guests, so hotels should pay particular attention to the booking and stay experience of new guests.
3. Bookings with the Non Refund deposit type have a cancellation rate as high as 99%.
4. Room types A and D have far higher cancellation rates than the other room types.
5. Hotels should make good use of the peak travel season in July and August each year: they can raise prices moderately while maintaining service quality to earn more profit, and run promotions in the low season (winter), such as Christmas sales and New Year events, to reduce the vacancy rate.

---

## 3. Problem formulation

We would like to predict $y$ from $x=(x_1,x_2,\ldots,x_p)$ with $p=31$, where

$$y = \left\{\begin{array}{ll}1,&\mbox{if the booking is canceled},\\0,&\mbox{if the booking is not canceled}\end{array}\right.$$

We use logistic regression as the benchmark method.

On both the training and the test data, we compare models by accuracy.

## 4. Data analysis

![](https://i.imgur.com/4g9zx84.png)

The reservation status (`reservation_status`) has the highest correlation with `is_canceled`, at 0.92. Because keeping it could cause the models to overfit later on, we drop it.

We use stratified K-fold cross-validation (K=5) to validate our models.

```
from sklearn.model_selection import StratifiedKFold

# Stratified K-fold cross-validation: each fold keeps roughly the class balance of y_model
kfold_cv = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
for train_index, test_index in kfold_cv.split(X_model, y_model):
    # Select this fold's rows by position (X_model and y_model are pandas objects)
    X_train, X_test = X_model.iloc[train_index], X_model.iloc[test_index]
    y_train, y_test = y_model.iloc[train_index], y_model.iloc[test_index]
```

## 5. Conclusion

**Table 1** In-sample test

This is the accuracy for validation data.

| | Logistic (Benchmark result) | Random Forest | XGBoost |
| -------- | -------- | -------- | -------- |
| Accuracy | 0.866907 | 0.954030 | 0.945595 |

**Table 2** Out-of-sample test

We use a simple 75%/25% train/test split of the data.

| | Logistic (Benchmark result) | Random Forest | XGBoost |
| -------- | -------- | -------- | -------- |
| Accuracy | 0.812152 | 0.887894 | 0.873261 |

![](https://i.imgur.com/3wfrXkU.png)

![](https://i.imgur.com/2MbHtCg.png)

![](https://i.imgur.com/jv8LMCQ.png)

The two tree-based models, Random Forest and XGBoost, perform best on this dataset. Random Forest reaches an out-of-sample accuracy of 88.8% with an AUC of 0.95. Hyperparameter tuning could improve the models further.

---

## Reference

https://www.kaggle.com/jessemostipak/hotel-booking-demand
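## Appendix: stratified K-fold sketch

To make the validation scheme of Section 4 concrete, the block below is a minimal pure-Python sketch of stratified K-fold splitting followed by a per-fold accuracy computation. It is not scikit-learn's implementation, and its indices will not match those of `StratifiedKFold`. The helper names `stratified_kfold_indices` and `accuracy`, the toy labels `y_toy`, and the majority-class placeholder model are all assumptions made for illustration; none of them come from the project code or data.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs whose folds keep the class proportions.

    Illustrative only: scikit-learn's StratifiedKFold shuffles differently,
    so the exact indices produced here will not match it.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle within each class, then deal the indices round-robin into folds
    # so every fold receives about the same share of each class.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        yield train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical toy labels: 60 kept bookings (0) and 40 cancellations (1).
y_toy = [0] * 60 + [1] * 40

fold_accuracies = []
for train_idx, test_idx in stratified_kfold_indices(y_toy, n_splits=5, seed=42):
    # Placeholder "model": always predict the majority class of the training fold.
    train_labels = [y_toy[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    y_pred = [majority] * len(test_idx)
    fold_accuracies.append(accuracy([y_toy[i] for i in test_idx], y_pred))

mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"per-fold accuracy: {fold_accuracies}, mean: {mean_accuracy:.3f}")
```

Because every fold keeps the 60/40 class ratio, the majority-class placeholder scores the same on each fold. That score is a useful floor: a real classifier such as the logistic regression benchmark should beat it before its accuracy is considered meaningful.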