---
title: "Homework 2"
author: "Hamed Khoojinian, Yaassh Dhamani, Yun Deng, Xuan Ji, Samikshya Mishra, Ashwin Assysh Sharma"
date: "9/29/2019"
output:
pdf_document:
toc: T
toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
[toc]
```{r message = FALSE, warning=FALSE}
library(dplyr)
library(data.table)
library(lubridate)
library(tidyverse)
library(dplyr)
library(naniar)
library(cluster) # for Similarity and PAM
library(Rtsne) # for t-SNE plot
library(ggplot2) # for Visualizations
library(clustMixType) # for k-Prototypes
library(airportr)
```
# Business
## Our Task
There is intense competition in the airline carrier industry, but Sun Country, a small local player, managed to survive a series of threats in the past few decades. To improve its competitiveness, Sun Country came to our data science team hoping to improve its understanding of customer profile, and thus increase its revenue in the long run.
The challenge is that this is a broad topic. To increase the efficacy of our analysis and make our recommendations specific and actionable, we decided to focus on our potential high value customers. Our goal is to increase the quantity of sales to high value customers so that we can gain their loyalty over time. We plan to increase sales to these customers by sending them targetted emails containing offers and prices that are of interest to them.
`transform people who booked discount-first-class and first-class to Ufly members and increase revenues and customer loyalty.`
## Description of Business Problem
Our business problem is what insights could we gain from customer segmentations to help Sun Country promote customer loyalty and increase its profit margin among first-class customers.
## Logic Flow and Proper Metrics
Customers who can afford a first class ticket without a discount are those with higher spending power, and with all other things equal, should be of higher priority in our marketing efforts. Our data population was reduced to only those customers who at some point in the past 2 years bought an undiscounted first class ticket from us.
If these customers could be segmented meaningfully, we could try to find interests or similarities among them that could help our marketing efforts increase sales to these customers. Staff could be alerted at various points along the customer experience to encourage a positive experience for these VIPs. ____#____
``
We chose to focus on customers who booked first-class or discount-first-class because the profit margin and the customer lifetime value are higher for Ufly members who book discount-first-class and first-class than other subsets of customers. Therefore, we will run an initial clustering analysis of customers who booked at least one discount-first-class or first-class flight.
__ Customer segments can tell us about their interests, which will go into the targetted promotional emails, as well as tell us about their shopping habits which __`
# Data Preparation
## Data Overview
```{r}
data <- fread('SunCountry.csv', header = T, stringsAsFactors = T)
data <- as.data.frame(data)
summary(data)
```
There are a few irregularities in our dataset. For example, there are some missing values in `gender`, `age` and `birthdate` column, and noisy data in the `age` column where values are below 0 or above 100. Additionally, we had duplicate trips, and trips where customers did not fly with Sun Country. Detailed data processing procedures are provided below.
## Missing
```{r, eval=FALSE}
# dealing with missing values and outliers
clean1 <- data %>%
drop_na(c("GenderCode","Age","birthdateid")) %>%
filter(Age >=0, Age <= 100) %>%
filter(GenderCode %in% c("F","M"))
# replace UflyReward NAs with -1
clean2$UFlyRewardsNumber <- replace_na(clean4$UFlyRewardsNumber,-1)
```
* We dropped NA value in `gender`, `age`, and `birthdate` because we will not able to create unique passenger IDs with NA’s in these columns.
* We replaced UflyRewards NA values with -1 since we wanted to include these passengers in our clustering analysis, but NAs are not allowed in the k-Prototypes analysis.
## Outliers
* We dropped rows with `age` above 100 or below 0, since the possibility that a passenger’s age is above 100 is very low and it is impossible to have negative ages.
* We removed rows with `gender` other than Female and Male, since no clear definition of ‘U’ gender is given in the data dictionary.
## Data Filtering
```{r, eval=FALSE}
# use only SY flights
clean3 <- clean2 %>%
filter(MarketingAirlineCode=='SY')
# remove duplicates
clean4 <- distinct(clean3, PNRLocatorID, CouponSeqNbr, ServiceStartCity, ServiceEndCity,
PaxName, GenderCode, birthdateid, ServiceStartDate, .keep_all= TRUE)
# filter out PNR's whose SeqNum does not start with 1
df_MinCouponSeqNbr <- clean4 %>%
group_by(PNRLocatorID) %>%
summarize(MinCouponSeqNbr = min(CouponSeqNbr))
clean5 <- left_join(clean4,df_MinCouponSeqNbr,by="PNRLocatorID")
```
* We kept trips flying with Sun Country because we lack email information for customers that did not fly with SCA. Without contact information customer identificaiton for marketing purposes would be difficult.
* We removed duplicate flights by retaining only those rows which had a single occurrence of PNRLocatorID, CouponSeqNbr, PaxName, ServiceStartCity, ServiceEndCity, ServiceStartDate combination.
* We removed PNR locationID whose CouponSeqNbr did not start with 1 because PNR should always start with 1 and they were invalid data.
## Attribute Creation
```{r, eval=FALSE}
# add MaxSeqNumber
df_MaxCouponSeqNbr <- clean5 %>%
group_by(PNRLocatorID) %>%
summarize(MaxCouponSeqNbr = max(CouponSeqNbr))
clean6 <- left_join(clean5,df_MaxCouponSeqNbr,by="PNRLocatorID")
# calculate the fare amount
clean7 <- clean6 %>%
mutate(Amt = case_when((is.na(BaseFareAmt) & is.na(TotalDocAmt)) ~ 0,
TotalDocAmt >= BaseFareAmt ~ TotalDocAmt,
TotalDocAmt < BaseFareAmt ~ BaseFareAmt + TotalDocAmt,
TRUE ~ 0),
AmtPerFlight = Amt/MaxCouponSeqNbr)
# create unique customerID
clean7$PaxID <- paste(clean7$EncryptedName, clean7$birthdateid, clean7$GenderCode)
clean7$PNRCreateDate <- ymd(clean7$PNRCreateDate)
clean7$ServiceStartDate <- ymd(clean7$ServiceStartDate)
# days prebook
clean7$PlannedDays <- clean7$ServiceStartDate - clean7$PNRCreateDate
clean7$BookingChannel <- as.character(clean7$BookingChannel)
# true origin
true_origin <- clean7 %>%
group_by(PNRLocatorID, PaxID) %>%
filter(CouponSeqNbr == min(CouponSeqNbr)) %>%
select(PNRLocatorID, PaxID, CouponSeqNbr, ServiceStartCity)
true_origin$ServiceStartCity <- as.character(true_origin$ServiceStartCity)
# true destination
true_destination <- clean7 %>%
group_by(PNRLocatorID, PaxID) %>%
filter(CouponSeqNbr == max(CouponSeqNbr)) %>%
select(PNRLocatorID, PaxID, CouponSeqNbr, ServiceEndCity)
true_destination$ServiceEndCity <- as.character(true_destination$ServiceEndCity)
clean7 <- clean7 %>%
left_join(true_origin, by = c("PNRLocatorID" = "PNRLocatorID", "PaxID" = "PaxID")) %>%
left_join(true_destination, by = c("PNRLocatorID" = "PNRLocatorID", "PaxID" = "PaxID"))
clean7 <- distinct(clean7)
# south or north
# south and north preparation
# add latitude and flight distance (flight distance may be too computationally expensive)
distinct_start = data.frame(distinct(data, ServiceStartCity, .keep_all = FALSE))
distinct_end = data.frame(distinct(data, ServiceEndCity, .keep_all = FALSE))
# function to calculate latitude
cal_latitude <- function(airport) {
tryCatch(
{latitude = as.numeric(airport_location(airport, input_type = "IATA")[1])},
error=function(x)
{
return(NA)
}
)
}
distinct_start$start_latitude = apply(distinct_start, 1, cal_latitude)
distinct_end$end_latitude = apply(distinct_end, 1, cal_latitude)
clean7 <- clean7 %>%
left_join(distinct_start, by = c("ServiceStartCity.x" = "ServiceStartCity")) %>%
left_join(distinct_end, by = c("ServiceEndCity.x" = "ServiceEndCity")) %>%
drop_na(c("start_latitude","end_latitude")) %>%
mutate(StartCitySouth=ifelse(start_latitude<=38,1,0),
EndCitySouth = ifelse(end_latitude<=38,1,0))
# round trip
clean7 <- mutate(clean7, round_trip_flag = ifelse(clean7$ServiceStartCity.y == clean7$ServiceEndCity.y, 1, 0))
# one-way trip
clean7 <- mutate(clean7, one_way_trip_flag = ifelse(clean7$round_trip_flag == 0, 1, 0))
# seasonality
clean7$ServiceStartQuarter <- quarters(clean7$ServiceStartDate)
# Booking channel -> airport
clean7$BookingChannel <- ifelse(clean7$BookingChannel %in% c("XTM", "FCM", "MSP", "LAN", "UFO", "JFK", "DFW", "LAS", "MCO", "PSP", "MDW", "LAX", "MKE", "HRL", "ANC", "SFO", "RSW", "DCA", "BOS", "SEA", "PHX", "MSN", "GJT", "MIA"), "Airports", clean7$BookingChannel)
#dummy_cols(clean7, select_columns = c("BookingChannel"))
# group size
group_size <- clean7 %>%
group_by(PNRLocatorID) %>%
summarise(GroupSize = n_distinct(PaxID))
clean7 <- clean7 %>%
left_join(group_size, by = "PNRLocatorID")
test <- data.frame(model.matrix(~ clean7$BookingChannel - 1, data = clean7))
clean7 <- cbind(clean7, test)
# add vacation labels
vacation_labels <- read.csv('airport_vacation_label.csv')
vacation_labels <- airport_vacation_label
clean7 <- clean7 %>%
inner_join(vacation_labels,by=c('ServiceEndCity.x'='distinct_destinations')) %>%
rename(vaca_label = vaca, EndCityName = city)
# keep trips with customers who have ever flying with first class
clean7$binaryfirst <- ifelse(clean7$BkdClassOfService =='First Class'| clean7$BkdClassOfService =='Discount First Class', 1,0)
first_class_data <- clean7 %>%
group_by(PaxID) %>%
summarise(firstclassflag=max(binaryfirst)) %>%
filter(firstclassflag == 1)
clean7 <- clean7 %>%
inner_join(first_class_data, by = "PaxID")
clean7 <- clean7 %>%
filter(Age > 18)
sample_data <- clean7
```
* We created new variables to help us understand customer characteristics.
* `PaxId` - We created unique customerID (as PaxID) using `EncryptedName`, `Birthdate`, `Gender`.
* `AmtPerFlight` - Since the total amount of a PNR is shown for each sequence number, we calculated the Amt per flight as the total amount divided by the maximum sequence number per PNR.
* `Start and end city south/north` - Since part of Sun Country’s business is leisure flights, we added the latitude of the start and end cities and categorized as south or north.
* `Days pre-booked` - Days pre-booked is defined as the difference between the service start date and booking date
* `TrueOrigin` (group by PNR & customerID) and `TrueDestination` (for each customer within one PNR) - True origin and destination is the start and end city of the interlining tickets
* `RoundTrip` (for each customer within one PNR) and `One-wayTrip` - It is a round trip if true origin and destination are the same, otherwise one-way trip
* `Seasonality` (service was fulfilled in which quarter) - Seasonality (Q1-Q4) is generated from the service start date
* `BookingChannels` - We created dummy variables for booking channels, and different airport names are replaced by ‘Airport’.
* `GroupSize` - Group size is the number of customers traveling together, which is derived from the number of customers in each PNR number.
* After calculating the group size, we **selected only passengers who flew first-class or discount first-class and excluded all passengers under 18 years old**, since it is very likely that they are dependent on their parents and they cannot be the direct marketing targets.
## Data Aggregation
```{r, eval=FALSE}
customer_df <- sample_data %>%
select(PNRLocatorID,TicketNum,CouponSeqNbr.x,PaxName,EncryptedName,GenderCode,Age,
UflyMemberStatus,CardHolder,StopoverCode,
MaxCouponSeqNbr,AmtPerFlight,MinCouponSeqNbr,PaxID,PlannedDays,round_trip_flag,
one_way_trip_flag,GroupSize,
sample_data.BookingChannelAirports,
sample_data.BookingChannelOutside.Booking,
sample_data.BookingChannelReservations.Booking,
sample_data.BookingChannelSCA.Website.Booking,
sample_data.BookingChannelSY.Vacation,
sample_data.BookingChannelTour.Operator.Portal,
sample_data.BkdClassOfServiceCoach,
sample_data.BkdClassOfServiceDiscount.First.Class,
sample_data.BkdClassOfServiceFirst.Class,
sample_data.ServiceStartQuarterQ1,
sample_data.ServiceStartQuarterQ2,
sample_data.ServiceStartQuarterQ3,
sample_data.ServiceStartQuarterQ4,
StartCitySouth,
EndCitySouth,
vaca_label,
binaryfirst) %>%
rename(CouponSeqNbr = CouponSeqNbr.x,
BookingChannel_Airports = sample_data.BookingChannelAirports,
BookingChannel_Outside_Booking=sample_data.BookingChannelOutside.Booking,
BookingChannel_Reservations_Booking = sample_data.BookingChannelReservations.Booking,
BookingChannel_SCA_Website_Booking = sample_data.BookingChannelSCA.Website.Booking,
BookingChannel_SY_Vacation=sample_data.BookingChannelSY.Vacation,
BookingChannel_TourOperator_Portal=sample_data.BookingChannelTour.Operator.Portal,
BkdClassOfService_Coach=sample_data.BkdClassOfServiceCoach,
BkdClassOfService_Discount_First_Class=sample_data.BkdClassOfServiceDiscount.First.Class,
BkdClassOfService_First_Class=sample_data.BkdClassOfServiceFirst.Class,
ServiceStartQ1 = sample_data.ServiceStartQuarterQ1,
ServiceStartQ2 = sample_data.ServiceStartQuarterQ2,
ServiceStartQ3 = sample_data.ServiceStartQuarterQ3,
ServiceStartQ4 = sample_data.ServiceStartQuarterQ4) %>%
group_by(PaxID) %>%
summarize(GenderCode = first(GenderCode),
Age = first(Age),
CardHolder = first(CardHolder),
UflyMemberStatus = first(UflyMemberStatus),
NumOfLayover = sum(StopoverCode=='O', na.rm=T),
NumOfStopover = sum(StopoverCode=='X', na.rm=T),
SumAmt = sum(AmtPerFlight, na.rm=T),
AvgPlannedDays = mean(PlannedDays, na.rm=T),
CntRoundTrip = sum(round_trip_flag, na.rm=T),
CntOneWayTrip = sum(one_way_trip_flag, na.rm=T),
Cnt_BC_Airport = sum(BookingChannel_Airports, na.rm=T),
Cnt_BC_OutSideBooking = sum(BookingChannel_Outside_Booking, na.rm=T),
Cnt_BC_Reservations_Booking = sum(BookingChannel_Reservations_Booking, na.rm=T),
Cnt_BC_SCA_Website_Booking = sum(BookingChannel_SCA_Website_Booking, na.rm=T),
Cnt_BC_SY_Vacation = sum(BookingChannel_SY_Vacation, na.rm=T),
Cnt_BC_TourOperator_Portal = sum(BookingChannel_TourOperator_Portal, na.rm=T),
CntBkdClass_Coach = sum(BkdClassOfService_Coach, na.rm=T),
CntBkdClass_DiscountFirstClass = sum(BkdClassOfService_Discount_First_Class, na.rm=T),
CntBkdClass_FirstClass = sum(BkdClassOfService_First_Class, na.rm=T),
CntQ1 = sum(ServiceStartQ1, na.rm=T),
CntQ2 = sum(ServiceStartQ2, na.rm=T),
CntQ3 = sum(ServiceStartQ3, na.rm=T),
CntQ4 = sum(ServiceStartQ4, na.rm=T),
AvgGroupSize = mean(GroupSize, na.rm=T),
SumStartCitySouth = sum(StartCitySouth, na.rm=T),
SumEndCitySouth = sum(EndCitySouth,na.rm=T),
Numofvacation = sum(vaca_label,na.rm=T),
numfirclasstra = sum(binaryfirst,na.rm=T))
```
* We aggregated the transactional-level data to customer-level data. We took the first instance of gender, age, and Ufly member status of each customer. We counted the number of layovers, stopovers, amount per flight, round-trips, one-way trips. Also, we counted how many times each passenger booked from each booking channel, in which quarter, and in which classes, from and to a southern city. We averaged the planned days and the group size per passenger.
```{r, eval=FALSE}
normalize <- function(x){
return ((x - min(x))/(max(x) - min(x)))}
customer_df2 <- customer_df1 %>%
select(-PaxID) %>%
mutate(Age = normalize(Age),
NumOfLayover = normalize(NumOfLayover),
NumOfStopover = normalize(NumOfStopover),
SumAmt = normalize(SumAmt),
AvgPlannedDays = normalize(AvgPlannedDays),
CntRoundTrip = normalize(CntRoundTrip),
CntOneWayTrip = normalize(CntOneWayTrip),
Cnt_BC_Airport = normalize(Cnt_BC_Airport),
Cnt_BC_OutSideBooking = normalize(Cnt_BC_OutSideBooking),
Cnt_BC_Reservations_Booking = normalize(Cnt_BC_Reservations_Booking),
Cnt_BC_SCA_Website_Booking = normalize(Cnt_BC_SCA_Website_Booking),
Cnt_BC_SY_Vacation = normalize(Cnt_BC_SY_Vacation),
Cnt_BC_TourOperator_Portal = normalize(Cnt_BC_TourOperator_Portal),
CntBkdClass_Coach = normalize(CntBkdClass_Coach),
CntBkdClass_DiscountFirstClass = normalize(CntBkdClass_DiscountFirstClass),
CntBkdClass_FirstClass = normalize(CntBkdClass_FirstClass),
CntQ1 = normalize(CntQ1),
CntQ2 = normalize(CntQ2),
CntQ3 = normalize(CntQ3),
CntQ4 = normalize(CntQ4),
AvgGroupSize = normalize(AvgGroupSize),
SumStartCitySouth = normalize(SumStartCitySouth),
SumEndCitySouth = normalize(SumEndCitySouth),
Numofvacation = normalize(Numofvacation)
)
```
* After the aggregation, we normalized the numeric data using min-max to avoid the situation where attributes with large scales swamping attributes with small scales.
## Clean Data Summary
```{r}
summary(customer_df2)
```
## Analysis of Customers Clusters
### Description and Rationale for the Chosen Analysis
Given our main focus is on customers taking first-class flight, we want to segment these customers into heterogeneous groups, and look into potential attributes that lead to the diversity in consumer behaviors. Therefore, we performed clustering analysis among all customers who have booked first-class tickets, so that we can develop different marketing strategies tailored towards customers in different clusters.
### Execution and results
We found first-class customers can be classified into 4 groups. Based on our analysis, cluster 4 has the least amount of customers, but customers in this cluster spend more and fly more frequently with Sun Country. A large proportion of customers in cluster 4 are ufly members and they are prone to book their tickets from website of Sun Country.
```{r, echo=False}
library(tidyverse)
setwd("")
clusteredData <- read.csv("ClusteredData1_denormalized.csv")
clusteredData2 <- clusteredData
# replace(clusteredData$UflyMemberStatus,c(""),c("NotAMember"))
# count of membership status by cluster
clusteredData3 <- clusteredData2 %>%
group_by(cluster, UflyMemberStatus) %>%
summarise(n=n())
# replace "" by NotAMember in UflyMemberStatus
clusteredData3$UflyMemberStatus <- as.character(clusteredData3$UflyMemberStatus)
clusteredData3$UflyMemberStatus <- ifelse(clusteredData3$UflyMemberStatus=="","NotAMember",clusteredData3$UflyMemberStatus)
# head(clusteredData3)
# number of people per cluster
clustersum <- clusteredData2 %>%
group_by(cluster) %>%
summarise(total=n())
clusteredData4 <- clusteredData3 %>%
left_join(clustersum, by="cluster") %>%
mutate(percet=n/total)
clusteredData4$cluster <- as.character(clusteredData4$cluster)
ggplot(clusteredData4, aes(x=UflyMemberStatus,y=percet,fill=cluster))+
geom_bar(position="dodge",stat="identity")+
ylab("percentage")+
ggtitle("UflyMember Status across clusters (%)")+
theme(plot.title=element_text(hjust = 0.5,size=15,face="bold"))
# book from sc website
bookings <- clusteredData %>%
select(PaxID,Cnt_BC_Airport,Cnt_BC_OutSideBooking, Cnt_BC_Reservations_Booking, Cnt_BC_SCA_Website_Booking, Cnt_BC_SY_Vacation, Cnt_BC_TourOperator_Portal) %>%
group_by(PaxID) %>%
mutate(totalBooking=Cnt_BC_Airport+Cnt_BC_OutSideBooking+ Cnt_BC_Reservations_Booking+ Cnt_BC_SCA_Website_Booking+ Cnt_BC_SY_Vacation+ Cnt_BC_TourOperator_Portal) %>%
select(PaxID,totalBooking)
clusteredData5 <- clusteredData %>%
left_join(bookings, by="PaxID") %>%
select(cluster,Cnt_BC_SCA_Website_Booking,totalBooking)
clusteredData6 <- clusteredData5 %>%
group_by(cluster) %>%
mutate(totalWebsiteBooking=sum(Cnt_BC_SCA_Website_Booking))%>%
mutate(totalBooking2=sum(totalBooking))%>%
select(cluster,totalWebsiteBooking,totalBooking2)%>%
distinct()%>%
arrange(cluster)%>%
mutate(perct_webBooking=totalWebsiteBooking/totalBooking2)
ggplot(clusteredData6,aes(x=cluster,y=perct_webBooking,fill=cluster))+
geom_bar(position="dodge",stat="identity")+
ggtitle("Booking From Sun Country Website (%)")+
ylab("Percentage")+
theme(plot.title=element_text(hjust = 0.5,size=15,face="bold"))
```
### Interpretation
To understand what factor(s) drive the ‘abnormal’ behavior of customers in cluster 4, we looked into each customer attribute, and found two attributes that distinguish cluster 4 from the rest clusters. The majority of customers in cluster 4 are Ufly members, whereas most customers in cluster 123 did not enroll in UflyRewards program. In addition, a significantly higher percentage (around 80%) of customers in cluster 4 booked their tickets from Sun Country website, as compared to cluster 1,2 and 3 (around 70%).
### Conclusion
It is reasonable for Sun Country to promote enrollment in Ufly rewards Program to encourage first-class customers to fly with Sun Country. We could introduce Ufly-member-only travel package to stimulate the subscription. Details about travel packages will be provided in the next section. In addition, being Ufly members might encourage customers to book their tickets from Sun Country’s website, rather than third parties.
## Targeting the Potential Customer Segment
*Description and Rationale for the Chosen Analysis*
Given that all clusters are lucrative for Sun Country, we could send them emails to persuade them to join the membership. However, since they did not join before, we need to differentiate our membership program and provided benefits that they want. We could differentiate from two perspectives, low price and differentiation in packages, campaigns, and services. Since Sun Country did not follow the low-cost strategy, we chose to recommend what align with their strategy. Thus, we investigated the destinations and seasonality of each cluster to see whether we could target them using campaigns tailored to their traveling patterns.
*Execution and Results (including code)*
We investigated all seasons for all clusters. Since the destinations of cluster 1 and 2 do not vary across seasons, we showed the destinations of the entire span of the data (i.e., two years). For cluster 3 and 4, their destinations are different in summer and winter, and the distribution of destination is more uniform in the spring and the fall. Therefore, we only included the plots for summer and winter.
```{r, echo=False}
# read data
data <- fread('first_class_trip.csv', header = T, stringsAsFactors = T)
customer <- read.csv('ClusteredData1_denormalized.csv', header = T)
# plot where cluster 1, 2, 3, 4's destinations
customer_destination <- customer %>%
inner_join(data, by=c('PaxID'='PaxID'))
```
```{r}
# cluster1
# end city
end1 <- customer_destination %>%
filter(cluster.y==1 & ServiceEndCity.y!='MSP') %>%
select(ServiceEndCity.y) %>%
rename(ServiceEndCity = ServiceEndCity.y) %>%
group_by(ServiceEndCity) %>%
summarize(cnt = n()) %>%
top_n(10, cnt)
ggplot(end1, aes(x=reorder(ServiceEndCity, -cnt), y=cnt)) +
geom_bar(position="dodge", stat="identity") +
ggtitle('End Cities of Cluster1') +
xlab('Service End City') +
ylab('Number of Flights') +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
# cluster2
end2 <- customer_destination %>%
filter(cluster.y==2 & ServiceEndCity.y!='MSP') %>%
select(ServiceEndCity.y) %>%
rename(ServiceEndCity = ServiceEndCity.y) %>%
group_by(ServiceEndCity) %>%
summarize(cnt = n()) %>%
top_n(10, cnt)
ggplot(end2, aes(x=reorder(ServiceEndCity, -cnt), y=cnt)) +
geom_bar(position="dodge", stat="identity") +
ggtitle('End Cities of Cluster2') +
xlab('Service End City') +
ylab('Number of Flights') +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
# cluster3
# Winter
end3_winter <- customer_destination %>%
filter(cluster.y==3 & ServiceEndCity.y!='MSP') %>%
mutate(ServiceStartMonth = month(ServiceStartDate)) %>%
filter(ServiceStartMonth %in% c(11,12,1,2, 3)) %>%
select(ServiceEndCity.y) %>%
rename(ServiceEndCity = ServiceEndCity.y) %>%
group_by(ServiceEndCity) %>%
summarize(cnt = n()) %>%
top_n(10, cnt)
ggplot(end3_winter, aes(x=reorder(ServiceEndCity, -cnt), y=cnt)) +
geom_bar(position="dodge", stat="identity") +
ggtitle('End Cities of Cluster3 from November to March') +
xlab('Service End City') +
ylab('Number of Flights') +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
# Summer
end3_summer <- customer_destination %>%
filter(cluster.y==3 & ServiceEndCity.y!='MSP') %>%
mutate(ServiceStartMonth = month(ServiceStartDate)) %>%
filter(ServiceStartMonth %in% c(6, 7, 8)) %>%
select(ServiceEndCity.y) %>%
rename(ServiceEndCity = ServiceEndCity.y) %>%
group_by(ServiceEndCity) %>%
summarize(cnt = n()) %>%
top_n(10, cnt)
ggplot(end3_summer, aes(x=reorder(ServiceEndCity, -cnt), y=cnt)) +
geom_bar(position="dodge", stat="identity") +
ggtitle('End Cities of Cluster3 from June to August') +
xlab('Service End City') +
ylab('Number of Flights') +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
# cluster4
# Winter
end4_winter <- customer_destination %>%
filter(cluster.y==4 & ServiceEndCity.y!='MSP') %>%
mutate(ServiceStartMonth = month(ServiceStartDate)) %>%
filter(ServiceStartMonth %in% c(11,12,1,2,3)) %>%
select(ServiceEndCity.y) %>%
rename(ServiceEndCity = ServiceEndCity.y) %>%
group_by(ServiceEndCity) %>%
summarize(cnt = n()) %>%
top_n(10, cnt)
ggplot(end4_winter, aes(x=reorder(ServiceEndCity, -cnt), y=cnt)) +
geom_bar(position="dodge", stat="identity") +
ggtitle('End Cities of Cluster4 from November to March') +
xlab('Service End City') +
ylab('Number of Flights') +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
# Summer
end4_summer <- customer_destination %>%
filter(cluster.y==4 & ServiceEndCity.y!='MSP') %>%
mutate(ServiceStartMonth = month(ServiceStartDate)) %>%
filter(ServiceStartMonth %in% c(6, 7, 8)) %>%
select(ServiceEndCity.y) %>%
rename(ServiceEndCity = ServiceEndCity.y) %>%
group_by(ServiceEndCity) %>%
summarize(cnt = n()) %>%
top_n(10, cnt)
ggplot(end4_summer, aes(x=reorder(ServiceEndCity, -cnt), y=cnt)) +
geom_bar(position="dodge", stat="identity") +
ggtitle('End Cities of Cluster4 from June to August') +
xlab('Service End City') +
ylab('Number of Flights') +
theme_bw() +
theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
```
*Interpretation*
On one hand, cluster 1 and 2 travelled to metropolitan regions (e.g., New York, Los Angeles, San Francisco) most frequently in all seasons. On the other hand, cluster 3 and 4 travelled to different types of destinations in different seasons. They travelled to metropolitan areas in the summer and leisure-and-warm destinations in the winter.
*Conclusion*
Given their travel pattern, we propose that Sun Country could collaborate with hotels and agents in different areas in different seasons and design Ufly-member-only travel packages to target different clusters. The choice of city and the number of travel agents and hotels in each city will be subject to the budget of Sun Country.
Cluster 1 and 2 can be targeted by collaboration with hotels and agents in metropolitan areas such as New York, Los Angeles, San Francisco, Washington D.C., Dallas, Lancing, and Boston.
For cluster 3 and 4, Sun Country could collaborate with hotels and travel agents in Orlando, Palm Springs, Las Vegas, Los Angeles, Phoenix, and the City of Harlingen in the winter (from November to March).
In summer (from June to August), Sun Country could target cluster 4 by collaborating with hotels and agents in Los Angeles, San Francisco, New York, Dallas and target cluster 3 by collaborating with hotels and agents in Los Angeles, Seattle, New York, San Francisco, Boston.
# Final Takeaways
## Conclusion
Customers who travel with Sun Country more often are those who enroll in UflyRewards Program
Frequent fliers are prone to book their tickets from Sun Country’s website
Sun Country could launch different travel packages tailored towards customers in each cluster based on season. Package types are shown as below:
Clusters and Season
| Clusters and Season | Destination of Travel Packages |
| -------- | -------- |
|Cluster 1 & 2 (all year round)|Metropolitan areas|
|Cluster 3 & 4 (summer)|Orlando, Palm Springs, Las Vegas, Los Angeles, Phoenix, and the City of Harlingen|
|Cluster 3 (winter)| Los Angeles, Seattle, New York, San Francisco, Boston|
|Cluster 4 (winter)| Los Angeles, San Francisco, New York, Dallas|
## Final Recommendation
We believe Sun Country could greatly increase its profit margin by promoting Ufly membership among (infrequent) first-class customers through the Ufly-member-only travel package. In this way, customers are encouraged to subscribe as Ufly member to enjoy the benefits of the travel packages (e.g. discount on flight tickets, hotels and local guide). Once they are part of the membership, they are likely to travel more frequently with Sun Country, and book their tickets from Sun Country’s website. As a result, Sun Country could increase revenue and customer lifetime value because of the increase in customer loyalty and switching costs, and decrease its costs on commission fee paid to third party websites.
# Appendix