<h1> <div align=center>MILLE LACS GRAND CASINO </div> <br> Direct Marketing Analysis</h1>
> Author : TEAM HEARTS for Carlson School of Management, UMN
[TOC]
<h2> 1. Overview </h2>
This documentation supports the experiential live case completed in Fall 2019 for MLCV as part of the Master of Science in Business Analytics (MSBA) program coursework. It lays out the details of the analysis presented to the MLCV executive team in person on 12/10/19.
The following analysis will allow the team at MLCV to recreate and implement the suggested solutions. The external data gathered is described, along with the locations and methods for finding and recreating these datasets from open government sources. The solutions prescribe coupon marketing changes targeted at customer segments by distance from the casino and by customer age demographics. Secondary recommendations demonstrate methods for scraping and analyzing Google review sentiment from guests. All prescribed solutions are aimed at reversing the recurrent 9% year-over-year decline in headcount. Should you have further questions about this documentation, please contact:
* <b>Kexin Liang</b> or <b>Chia-Hsuan Chou</b>, for the coupon issuance and redemption analysis
* <b>Patrick Hoban</b> or <b>Rekha Mohandass</b>, for the study on demographic shifts and their impact in the Minneapolis region
* <b>Nitin Wahie</b> or <b>Aadarsh Gupta</b>, for the analysis of customer reviews and sentiment
<h2>2. Problem Definition and Introduction</h2>
<h3>Situation </h3>
MLCV owns two key hotel and gaming properties in the heart of MN. They have enjoyed a relatively short albeit prosperous history, hosting 2.3 million guest trips each year.
The Hinckley property opened first in 1991, followed quickly by the Mille Lacs destination in 1992. The properties include a number of amenities beyond the gaming floor and hotel: a spa, swimming pool, hot tub, a newly renovated golf course, shopping and dining on property, a newly updated arcade for the kids, and even a doggy daycare, so the whole family can come along and take part in the fun.
<h3>Background and Context</h3>
Lately, however, despite the marketing team's best efforts, which include sending out direct mail offerings in excess of $600 million each year to their beloved guests and players, guests are disappearing at a rate of 9% a year.
In part this could be due to the massive construction and remodeling efforts going on at both properties for the past few years. Still, given the current downward trend, the team at Mille Lacs has asked us to assist in analyzing the marketing offers and to provide insights on ways to better reach their target guests.
<h3>Problem Statement</h3>
In other words: given the 9% year-over-year headcount reduction the facilities have been experiencing, how can the Casino team better leverage marketing offers and promotions to turn this trend around?
<h2>3. Technical Specifications</h2>
<h3>Environment</h3>
For data security, the analysis was run on a virtual machine set up by the UMN technical staff. Below is the environment information for the virtual machine.
#### Interpreter
```python=
Version : 3.7.3
Version tuple: ('3', '7', '3')
Compiler : MSC v.1915 64 bit (AMD64)
Build : ('default', 'Mar 27 2019 17:13:21')
```
#### Platform
```python=
Normal : Windows-10-10.0.14393-SP0
Aliased : Windows-10-10.0.14393-SP0
Terse : Windows-10
```
#### Operating System and Hardware Info
```python=
uname : uname_result(system='Windows', node='csom-msba-slc6',
        release='10', version='10.0.14393', machine='AMD64',
        processor='Intel64 Family 6 Model 79 Stepping 1, GenuineIntel')
system : Windows
node : csom-msba-slc6
release : 10
version : 10.0.14393
machine : AMD64
processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel
```
#### Executable Architecture
```python=
interpreter : ('64bit', 'WindowsPE')
/bin/ls : ('64bit', '')
```
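These values can be regenerated on any machine with Python's standard `platform` module; a minimal sketch:
```python=
# Regenerate the environment report above with the standard library
import platform

print('Version      :', platform.python_version())
print('Version tuple:', platform.python_version_tuple())
print('Compiler     :', platform.python_compiler())
print('Build        :', platform.python_build())
print('Normal       :', platform.platform())
print('Aliased      :', platform.platform(aliased=True))
print('Terse        :', platform.platform(terse=True))
print('uname        :', platform.uname())
print('interpreter  :', platform.architecture())
```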
<h3>Software and Tools</h3>
We used Python, R, and Tableau in the project. See below for version and software information.
#### Software information
```python=
Python version : 3.7.3 (default, Mar 27 2019, 17:13:21)
[MSC v.1915 64 bit (AMD64)]
Version info : sys.version_info(major=3, minor=7, micro=3,
releaselevel='final', serial=0)
Tableau version: 2019.2.2(20192.19.0718.1543) 64-bit
R version: 3.6.1
```
#### Python Packages
```python=
import pandas as pd
import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os
import re
import numpy as np
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from operator import itemgetter
nltk.download('punkt')
import docx
from docx.shared import Inches
from textblob import TextBlob
import datetime as dt
import seaborn as sns
```
#### R Libraries
```R=
library(tm)
library(wordcloud)
library(SnowballC)
library(devEMF)
library(RWeka)
```
<h2>4. Solution Overview & Methodology</h2>
### Solution Overview
In order to address the decreasing headcount at MLCV Grand Casinos, we provide strategies around marketing offers designed to attract more players to the casino, as well as strategies designed to enhance the customer experience in order to improve customer retention in the long term. The solutions herein can be divided into two parts: distance-based marketing strategies and age-based customer experience improvements. This project also leverages external data that complements the internal analysis and helps better quantify the outcome in terms of the macroeconomic environment and the industry at large.
### Necessary Files
The knowledge transfer folder includes the files listed here, with which MLCV can reproduce the analysis:
* Code
* *sentiment.py*: Contains the script to perform raw data extraction from Google Reviews
* *wordcloud.R*: Contains the script to create weightage-based word clouds
* *Distance Based Analysis.ipynb*: Contains the Python script to perform data extraction, data engineering, and analysis for the distance-based coupon analysis
* Visualizations
* *Dashboard2.twb*: Tableau workbook to visualize the age-based analysis
* *census_data_graphs.twb:* Tableau workbook containing visuals for the demographic shifts in population, age and income in Minnesota
* Presentation
* *Live_case_presentation_slides.pptx*: Presentation walking through the problem and proposed solution, along with the recommendations
* Data
* *census_income_by_county_2010_to_2017.csv*: Income breakdown for each county in the state of Minnesota with different levels and slices of detail for 2010 and 2017 (as provided by the US Census)
* *population_age_by_county.csv*: Age breakdown for each county in the state of Minnesota, along with population
### Methodology
The methodology contains 3 parts:
**1. Demographic analysis using census data:** Use external census data provided by the Minnesota State Demographic Center to analyze changes in population, income, and employment status over the last 10 years.
**2-1. Coupon analysis, distance-based:** Understand spending and the effect of marketed offers across distance groups: 0-45, 45-60, 60-90, and 90-120 mile radii from the casino.
**2-2. Coupon analysis, age-based:** Understand spending and the effect of marketed offers across age groups: 18-29, 30-49, and 50+.
**3. Customer review analysis:** Use Google reviews scraped from the Grand Casino Google page and perform sentiment analysis to understand clients' opinions about the casinos.
<h2>5. Data Engineering</h2>
### Data Extraction and Preparation
There are 5 datasets in total: DimCoupon.csv, CouponGroup.csv, DimPlayer.csv, FactCouponRedeem.csv, and PlayerDay.csv.
The player information and coupon information are in separate tables that need to be combined for further analysis.
#### New dataset coupongroup_player_features.csv
```python=
import pandas as pd
pd.set_option('display.max_columns', None)
coupongroup = pd.read_csv('\\CouponGroup.csv')
dimcoupon = pd.read_csv('\\DimCoupon.csv', encoding = 'ISO-8859-1')
dimplayer = pd.read_csv('\\DimPlayer.csv', low_memory = False)
```
<br>
```python=
dimcoupon_useful = dimcoupon.drop(['Description', 'Type',
'ValueType', 'TypeCategory','SubType',
'TargetPopKey', 'ProgramTypeKey'],
axis = 1)
coupongroup_useful = coupongroup.drop(['CouponGroupID', 'PM_Create_Date',
'PM_Last_Update'], axis = 1)
dimplayer['State'] = dimplayer['State'].str.strip()
dimplayer_useful = dimplayer.drop(['PrimaryPropertyID', 'TierID',
'BirthMonth','DMA Flag', 'Prison Flag', 'Deceased Flag',
'PIQ Match Type', 'NCOA MOVE DATE',
'NCOA Move Type'], axis = 1)
coupongroup_useful.to_csv('coupongroup_useful.csv',index = False)
dimcoupon_useful.to_csv('dimcoupon_useful.csv',index = False)
dimplayer_useful.to_csv('dimplayer_useful.csv',index = False)
```
```python=
coupongroup_player = pd.merge(coupongroup, dimplayer, on = 'PlayerID')
coupongroup_player_features = pd.merge(coupongroup_player, dimcoupon,
on = 'CouponID')
coupongroup_player_features.head(3)
coupongroup_player_features.to_csv('\\coupongroup_player_features.csv',
index=False)
```
#### New dataset player_Dim_Day.csv
```python=
PlayerDay = pd.read_csv('\\PlayerDay.csv', low_memory = False)
DimPlayer = pd.read_csv('\\DimPlayer.csv', low_memory = False)
# Convert Date_Value column into datetime format for the following analysis
PlayerDay['Date_Value'] = PlayerDay['Date_Value'].astype(str)
PlayerDay['Date_Value'] = pd.to_datetime(PlayerDay['Date_Value'])
# Extract year value from time column
PlayerDay['Date_Year'] = PlayerDay['Date_Value'].map(lambda x: x.year)
# Merge playerDay and DimPlayer table for following analysis
player_Dim_Day = pd.merge(PlayerDay, DimPlayer, on='PlayerID', how='left')
```
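A quick check (illustrative; not part of the original script) can confirm the left join behaved as expected:
```python=
# Illustrative sanity checks on the merge above
# A left join keeps every PlayerDay row; rows can only multiply if
# DimPlayer contains duplicate PlayerIDs
assert len(player_Dim_Day) >= len(PlayerDay)
# 'State' comes from DimPlayer, so the NaN share here is the share of
# PlayerDay rows with no matching player record
print(player_Dim_Day['State'].isna().mean())
```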
### External Sources: Census Data
Census data was pulled at the county level for the state of Minnesota for 2010 and 2017. The CSV files can be found on the state-run government websites.
Income Metrics by County:
https://mn.gov/admin/demography/data-by-topic/income-poverty/
Population by Age by County:
https://mn.gov/admin/demography/data-by-topic/age-race-ethnicity/
Which in turn link to the federally available government website data sets:
Income:
https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk
Population by Age:
https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?src=bkmk
By choosing Download on these files, you can pull this data down to your local machine. Additionally, you can choose other years to better suit your specific needs going forward.
Cleaned copies of this same data with unnecessary columns removed can also be found in the file repository provided for this project.
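As a quick sketch of working with the cleaned copies (the file names are from the knowledge transfer folder above, but the column names used here are assumptions; check the headers in the provided files):
```python=
import pandas as pd

# File names as listed in the knowledge transfer folder
income = pd.read_csv('census_income_by_county_2010_to_2017.csv')
population = pd.read_csv('population_age_by_county.csv')

# Hypothetical column names: 'County', 'Population_2010', 'Population_2017'
population['Population_Change'] = (population['Population_2017']
                                   - population['Population_2010'])
print(population[['County', 'Population_Change']]
      .sort_values('Population_Change', ascending=False)
      .head())
```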
<h2>6. Analysis and Visualizations</h2>
#### Tool Setup
The analysis and visualization tools we used were Tableau, Python, and R. Tableau creates visualizations in the form of dashboards and sheets; the creation process is made extremely simple with its 'drag and drop' capabilities. We also include the Python and R code in this part.
We connected Tableau to DimCoupon.csv, CouponGroup.csv, DimPlayer.csv, FactCouponRedeem.csv, and PlayerDay.csv for our analyses. To connect to the data sources, select Data > New Source > Other File and then select the files as specified in each of the sheets.
<h3>Demographic Analysis </h3>
#### Graph 1: Population growth across Minnesota
##### Explanation
We suspected that the headcount decrease is probably not due to changes in coupons alone; it could also be driven by the economic environment. So we pulled census data from the US government for the state of Minnesota, at the county level.
<h5>Steps to Reproduce </h5>
1. From the Dimensions pane, drag State and County to anywhere in the open area of the worksheet or the Details pane. Tableau will generate the latitude and longitude metrics for these
2. Filter County by deselecting "All Counties"
3. From the Measures pane, drag Population Change to the Color pane. Change the aggregation to Sum
4. The final graph should look like the one below

##### Interpretation
Blue represents an increase and orange represents a decrease.
We see that the population around the casino has been declining or flat since 2010. Meanwhile, the population in the Twin Cities area has increased sharply, which you can see in the dark blue part of this map; this area is mostly between 60 and 120 miles from Mille Lacs.
#### Graph 2: Change in Average Household Income across Minnesota
##### Explanation
We pulled state household income at the county level to see the trend of income change during these years. We also pulled additional data about employment.
<h5>Steps to Reproduce </h5>
1. From the Dimensions pane, drag State and County to anywhere in the open area of the worksheet or the Details pane. Tableau will generate the latitude and longitude metrics for these
2. Filter County by deselecting "All Counties"
3. From the Measures pane, drag Household Income to the Color pane. Change the aggregation to Sum
4. The final graph should look like the one below

##### Interpretation
The good news is that average household income across the state is actually slightly up for most counties, though there are some outliers. In the employment data, however, we found a number of people who are no longer looking for jobs in these counties, and this graph doesn't account for them. If we take those people into consideration, mean household income is actually flat or down.
### Coupon Issuance and Redemption Analysis
<h4><b>Graph 1, 2: Distance-based recommendations (I)</b></h4>
<h5><b>Explanations</b></h5>
Graphs 1 and 2 are created from player_Dim_Day.csv, which is the merge of PlayerDay.csv and DimPlayer.csv (the one we generated in the data engineering step). After plotting the distance of players to the casinos, we found that 86% of players come from within a 120-mile radius, so we limit our analysis to players from this region and divide them into distance groups: 0-45, 45-60, 60-90, and 90-120 miles, based on driving time and city/rural cut-off points.
Graph 1 shows the overall player headcounts by distance group; the numbers are cumulative headcounts over 2016-2019.
Graph 2 shows player spending by distance group; the numbers are the average theo per player per visit, calculated across the 2016-2019 data.
##### Steps to reproduce
Data for graphs 1 and 2 is calculated in Python, and the graphs are plotted in Excel.
1. Connect to the data
```python=
# Read the two source tables; they are merged in step 2
PlayerDay = pd.read_csv('\\PlayerDay.csv', low_memory = False)
DimPlayer = pd.read_csv('\\DimPlayer.csv', low_memory = False)
```
2. Convert the date column to the right format and merge the two tables
```python=
# Convert Date_Value column into datetime format for the following analysis
PlayerDay['Date_Value'] = PlayerDay['Date_Value'].astype(str)
PlayerDay['Date_Value'] = pd.to_datetime(PlayerDay['Date_Value'])
# Extract year value from time column
PlayerDay['Date_Year'] = PlayerDay['Date_Value'].map(lambda x: x.year)
player_Dim_Day = pd.merge(PlayerDay, DimPlayer, on='PlayerID', how='left')
```
3. Extract data of players from 120-mile radius
```python=
# Limit the analysis on players living near by
player_dist_less_120 = player_Dim_Day[((player_Dim_Day['DistFromML'] > 0)
& (player_Dim_Day['DistFromML'] <= 120))
& ((player_Dim_Day['DistFromHK'] > 0) &
(player_Dim_Day['DistFromHK'] <= 120))]
```
4. Calculate headcount per distance group and plot using excel
```python=
# Headcount per distance group
player_dist_less_120['PlayerID'].count()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 0) &
(player_dist_less_120['DistFromML'] <= 45)]['PlayerID'].count()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 45) &
(player_dist_less_120['DistFromML'] <= 60)]['PlayerID'].count()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 60) &
(player_dist_less_120['DistFromML'] <= 90)]['PlayerID'].count()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 90) &
(player_dist_less_120['DistFromML'] <= 120)]['PlayerID'].count()
```

5. Calculate spending per group and plot using excel
```python=
# Calculate spending per distance group: <45
player_dist_less_120[(player_dist_less_120['DistFromML'] > 0) &
(player_dist_less_120['DistFromML'] <= 45)]['Actual'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 0) &
(player_dist_less_120['DistFromML'] <=45)]['Theo'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 0) &
(player_dist_less_120['DistFromML'] <=45)]['ADT'].mean()
# Calculate spending per distance group: 45-60
player_dist_less_120[(player_dist_less_120['DistFromML'] > 45) &
(player_dist_less_120['DistFromML'] <= 60)]['Actual'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 45) &
(player_dist_less_120['DistFromML'] <= 60)]['Theo'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 45) &
(player_dist_less_120['DistFromML'] <= 60)]['ADT'].mean()
# Calculate spending per distance group: 60-90
player_dist_less_120[(player_dist_less_120['DistFromML'] > 60) &
(player_dist_less_120['DistFromML'] <= 90)]['Actual'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 60) &
(player_dist_less_120['DistFromML'] <= 90)]['Theo'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 60) &
(player_dist_less_120['DistFromML'] <= 90)]['ADT'].mean()
# Calculate spending per distance group: 90-120
player_dist_less_120[(player_dist_less_120['DistFromML'] > 90) &
(player_dist_less_120['DistFromML'] <= 120)]['Actual'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 90) &
(player_dist_less_120['DistFromML'] <= 120)]['Theo'].mean()
player_dist_less_120[(player_dist_less_120['DistFromML'] > 90) &
(player_dist_less_120['DistFromML'] <= 120)]['ADT'].mean()
# Calculate spending per distance group: beyond 120
player_Dim_Day[player_Dim_Day['DistFromML'] > 120]['Actual'].mean()
player_Dim_Day[player_Dim_Day['DistFromML'] > 120]['Theo'].mean()
player_Dim_Day[player_Dim_Day['DistFromML'] > 120]['ADT'].mean()
```

<h5> Interpretations </h5>
In graph 1, we can see that the majority of players coming from within 120 miles are from within 90 miles: 34% of players come from within a 45-mile radius, 18% come from 45-60 miles, 35% come from 60-90 miles, and only 5% are from 90-120 miles.
In graph 2, we can see that the average theo per player per trip is lower for the 0-45 and beyond-120-mile distance groups, at $102 and $89 respectively. Players from the 45-120 mile range spend more, with $130, $131, and $163 for the 45-60, 60-90, and 90-120 distance groups.
According to the previous demographic analysis, the population in the nearby region (0-60 miles) hasn't been growing, and neither have income and employment. Adding the player distribution shown here, we suggest MLCV maintain the customers within 0-45 and 45-60 miles and focus on players from the 60-90 and 90-120 mile radii in order to grow headcounts, since populations in those regions are growing and residents there earn higher incomes.
<h4><b>Dashboard 1: Distance-based recommendations (II)</b></h4>
<b>Explanations</b>
Dashboard 1 is created from coupongroup_player_features.csv, which is the merge of DimCoupon.csv, CouponGroup.csv, and DimPlayer.csv (the one we generated in the data preparation step).
This dashboard consists of the following sheets:
Sheet 1: Proportion of different types of coupons issued by Mille Lacs, across different distance groups
Sheet 2: Proportion of coupons redeemed, by coupon type, by the customers to the casinos across different distance groups.
##### Steps to reproduce
1. Connect to the data
- Click "Data Sourse" on the button-left conner
- Select "text file" on the left side
- Choose the coupongroup_player_features.csv
2. Create sheet 1 - Issued coupon proportion across distance group (for Mille Lacs Casino)
- Click the small triangle on the "Dist From ML" variable
- Click "Create" and select "Group"

- Select the ranges 0-45, 45-60, 60-90, 90-120, and >120 respectively and create the corresponding distance groups. We take 0-45 as an example here: after you have selected the numbers, click "Group", enter '0-45', and press Return. Repeat the same steps for the other distance groups.


- A new variable "Dist From ML(group)" will appear in the Dimensions part on the left side, drag this variable into filters and select only the groups that we just created

- Drag "Dist From ML(group)" into "Columns"
- Drag "Number of Records" into "Rows"
- Drag "Type Description" into Color card
- Click "Analysis" and then "percentage of", finally click "Column"

- Select all the blue, pink, and orange bars, right-click, then click "Mark Label" and select "Always show"

- Click anywhere blank; the final graph should look like the one below

3. Create sheet 2 - Redeemed coupon proportion across distance group (for Mille Lacs Casino)
- Click "Sheet 1" on the bottom, click right mouse and then click "copy"
- Click right mouse on the bottom area and click paste
- Drag "Status x" into the "Filter", choose "R and click "OK"

- The final graph should look like the one below:

4. Combine Sheet 1 and Sheet 2 into a dashboard to make comparison
- Click "Dashboard" on the bottom
- Drag Sheet 1 and Sheet 2 on the left side to the blank area
- The final dashboard should look like the one below:

<h5> Interpretations </h5>
From the plot in Sheet 1, we can see the composition of the different coupon types that Mille Lacs sends to its customers based on the distance they come from.
Here we can see that there is no distance-based segmentation in the way MLCV sends out its coupons.
From Sheet 2, we can see the composition of how customers redeem their coupons. From this we can see that coupon redemption varies with distance from the casino.
Focusing on our target group of 60-120 miles, it can be observed that these customers prefer hotel coupons more as their distance from the casino increases. A large proportion of customers also redeem cash coupons: 46.24% of customers in the 60-90 distance group prefer hotel coupons and 44.76% of the remaining customers prefer cash coupons. For customers in the 90-120 distance group, 54.86% prefer hotel coupons and 33.75% of the remaining customers prefer cash coupons.
Additionally, we can see that the gift coupons issued by MLCV are hardly used by these customers.
This analysis suggests that MLCV would do better in terms of increasing coupon redemption rates if they increased the number of hotel coupons issued to each distance group accordingly. This would mean a considerable increase in the number of hotel coupons issued by MLCV.
<h5> Value of Impact </h5>
To measure the value added to MLCV by implementing this strategy, we calculated the increase in coupon redemptions and the approximate monetary value associated with it for our target distance group of 60-120 miles; the following analysis is based on the distance to ML and the spending there.
Currently, the number of cash coupons redeemed by this distance group is ```269,824```.
The total number of cash coupons issued to this distance group is ```7,554,378```.
This gives us the redemption rate as
```
(269,824/7,554,378) * 100 = 3.57%
```
Currently, the total number of gift coupons issued = ```5,800,541```
If we were to convert all these gift coupons to cash coupons, the expected number of redemptions at this redemption rate would be
```
(3.57/100) * 5,800,541 = 207,181
```
Since this was calculated with three years of data, the number of redemptions per year would be ```69,060```.
To assign a monetary value to our calculations, we consider the average ADT of customers redeeming cash coupons in this distance group, which is ```$115.70```.
Thus, the total ADT for gift coupons converted to cash coupons is
```
69,060 * $115.7 = $7,990,242
```
The average value of cash redeemed by these customers when they use a cash coupon is ```$26.86```. Thus the total redeemed value for all coupons is ```$1,855,757```
This gives a net profit of
```
($7,990,242 - $1,855,757) = $6,134,485
```
Currently, the number of coupons redeemed as gift coupons is ```5,222```. The average ADT of customers using gift coupons is ```$201```.
This makes their total ADT as
```
(5,222/3) * $201 = $349,874
```
The average value of gifts redeemed by these customers when they use a gift coupon is ```$34.56```. Thus the total redeemed value for all coupons is ```$60,157```.
The net profit currently generated by gift coupons, which would be given up under this replacement, is
```
$349,874 - $60,157 = $289,717
```
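The age-based sections below repeat the same arithmetic, so it can be wrapped in a small helper. Below is a minimal sketch (an illustration, not part of the original analysis scripts) using the rounded inputs quoted above; its output matches the figures in the text only up to rounding:
```python=
# Sketch of the gift-to-cash value-of-impact arithmetic above
def gift_to_cash_profit(cash_redeemed, cash_issued, gift_issued,
                        cash_adt, cash_redeem_value, years=3):
    redemption_rate = cash_redeemed / cash_issued
    # expected redemptions per year if gift coupons were reissued as cash
    redemptions_per_year = redemption_rate * gift_issued / years
    total_adt = redemptions_per_year * cash_adt
    total_redeemed_value = redemptions_per_year * cash_redeem_value
    return total_adt - total_redeemed_value

# Inputs quoted above for the 60-120 mile distance group
profit = gift_to_cash_profit(cash_redeemed=269_824, cash_issued=7_554_378,
                             gift_issued=5_800_541, cash_adt=115.7,
                             cash_redeem_value=26.86)
print(round(profit))  # roughly $6.13M, matching the text up to rounding
```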
<h4> Dashboard 2: Age-based recommendations (I)</h4>
<h5>Explanations</h5>
This dashboard consists of the following sheets:
Sheet 1: Age distribution of the customers at the casino
Sheet 2: Average spend of the customers by age group
<h5> Steps to reproduce </h5>
Create sheet 1:
1. Connect to the data
* Click “Data Source” in the bottom-left corner
* Select “text file” on the left side
* Choose the DimPlayer.csv
2. Age distribution of MLCV's customer base (for Mille Lacs Casino)
* Click the small triangle on the “Birth Year” variable
* Click “Create” and select “Calculated Field”

Do the following calculation and change the variable name. A new variable "Age" will appear in the Measures pane on the left side.

3. Drag and drop the newly created Age variable to Columns and Player ID to rows.
* Click on the triangle next to PlayerID
* Click on Measure in the dropdown
* Choose Count(Distinct)

4. Filter the values of Age
* Click on the triangle next to Age
* Age will get added to the filters segment on the left pane. Click on Edit Filter in the dropdown.
* Choose Range Of Values and enter the range as 17 to 90
* Click on OK

Create sheet 2:
1. Connect to the data
* Click “Data Source” in the bottom-left corner
* Select “text file” on the left side
* Choose the DimPlayer.csv
* Click on Add Source and join PlayerDay using the PlayerID field

2. Create the age variable
* Click the small triangle on the “Birth Year” variable
* Click “Create” and select “Calculated Field” as described earlier
3. Drag and drop the newly created Age variable to Columns and Player ID to rows.
* Click on the triangle next to PlayerID
* Click on Measure in the dropdown
* Choose Average

4. Filter the values of Age
* Click on the triangle next to Age
* Age will get added to the filters segment on the left pane. Click on Edit Filter in the dropdown.
* Choose Range Of Values and enter the range as 17 to 90
* Click on OK
<h5>Interpretations</h5>
The age distribution in Sheet 1 was the primary motivation for our approach to segment customers by age.

Here we can see that there is a large influx of customers in the 50 to 75 age range. However, the number of customers in the younger 20 to 40 age group is considerably smaller.
As this looks anomalous, we compared MLCV's casino data with the age composition of other casinos in the United States to understand whether this is an expected trend.

As can be seen from the image, according to industry trends, casino-goers in the 30-49 age range constitute 38% of the total casino-going population, compared to 25% at MLCV Grand Casinos.
Further, before we decided to target these customers, we wanted to understand whether targeting them was worth the effort. So we plotted the average money spent by these customers, measured as the average total bet placed by customers at Grand Casinos (Sheet 2).

Here we can see that customers in the 40-50 age range spent the most compared to all other ages.
From these analyses, we identified the 30-49 age group as our target group of customers.
<h5> Value of Impact </h5>
To measure the value added to MLCV by implementing this strategy, we calculated the increase in coupon redemptions and the approximate monetary value associated with it for our target age group of 30-49.
Currently, the number of cash coupons redeemed by this age group is ```93,496```.
The total number of cash coupons issued to this age group is ```2,538,287```.
This gives us the redemption rate as
```
= (93,496/2,538,287) * 100 = 3.68%
```
Currently, the total number of gift coupons issued = ```2,065,342```
If we were to convert all these gift coupons to cash coupons, the expected number of redemptions at this redemption rate would be
```
= (3.68/100) * 2,065,342 = 74,352
```
Since this was calculated with three years of data, the number of redemptions per year would total ```24,784```.
To assign a monetary value to our calculations, we consider the average ADT of customers redeeming cash coupons, which is ```$172```.
The total ADT for gift coupons converted to cash coupons is
```
24,784 * 172 = $4,262,848
```
The average value of cash redeemed by these customers when they use a cash coupon is ```$26.86```. Thus, the total redeemed value for all coupons is ```$665,698```.
This gives a net profit of
```
= $4,262,848 - $665,698
= $3,597,150
```
Currently, the number of coupons redeemed as gift coupons is ```2,016```. The average ADT of customers using gift coupons is ```$201```.
This makes their total ADT as
```
= (2,016/3) * $201
= $135,072
```
The average value of gifts redeemed by these customers when they use a gift coupon is ```$34.56```. Thus the total redeemed value for all coupons is ```$23,224```
The net difference in overall profit from replacing gift coupons with cash coupons is
```
= $3,597,150 - $23,224
= $3,573,926
```
<h4> Dashboard 3: Age-based recommendations (II)</h4>
<h5>Explanations</h5>
Dashboard 3 is created from coupongroup_player_features.csv, which is the merge of DimCoupon.csv, CouponGroup.csv, and DimPlayer.csv (the one we generated in the data preparation step).
This dashboard consists of the following sheets:
Sheet 1: Proportion of different types of coupons issued by Mille Lacs, across different age groups
Sheet 2: Proportion of coupons redeemed, by coupon type, by the customers to the casinos for different age groups.
##### Steps to reproduce
1. Connect to the data
- Click "Data Source" on the button-left corner
- Select "text file" on the left side
- Choose the coupongroup_player_features.csv
2. Create sheet 1 - Issued coupon proportion across age group (for Mille Lacs Casino)
- Click the small triangle on the "Birth Year" variable
- Click "Create" and select "Calculated Field"

Do the following calculation and change the variable name.
A new variable "Age" will appear in the Measures pane on the left side

- Click the small triangle on the "Age" variable
- Click "Create" and select "Group"
- Select the ranges 18-29, 30-49, and 50+ respectively and create the corresponding age groups, the same as what we did to get the distance groups in the previous steps (we only consider people who are 18 to 95 years old here, based on domain knowledge)
- A new variable "Age (group)" will appear in the Dimensions pane on the left side; drag this variable into Filters and select only the groups we just created

- Drag "Age(group)" into "Columns"
- Drag "Number of Records" into "Rows"
- Drag "Type Description" into Color card
- Click "Analysis" and then "percentage of", finally click "Column"
- Select all the blue, pink and Orange bars, click the right mouse and then "Mark Label", click "Always show"

- Click anywhere blank; the final graph should look like the one below

3. Create sheet 2 - Redeemed coupon proportion across age group (for Mille Lacs Casino)
- Click "Sheet 1" on the bottom, click right mouse and then click "copy"
- Click right mouse on the bottom area and click paste
- Drag "Status x" into the "Filter", choose "R and click "OK"

- The final graph should look like the one below:

4. Combine Sheet 1 and Sheet 2 into a dashboard to make comparison
- Click "Dashboard" on the bottom
- Drag Sheet 1 and Sheet 2 on the left side to the blank area
- The final dashboard should look like the one below:

<h5>Interpretations</h5>
From the plot in Sheet 1, we can see the composition of the different coupon types that Mille Lacs sends to its customers based on their age.
Here we can see that there is no age-based segmentation in the way MLCV sends out its coupons.
From Sheet 2, we can see the composition of how customers redeem their coupons. From this, it is evident that coupon redemption varies by age.
Focusing on our target group of 30-49, it can be observed that these customers prefer gift and cash coupons. It is also starkly visible that a large number of the gift coupons issued by MLCV are hardly used by these customers.
This analysis suggests that MLCV would do better in terms of increasing coupon redemption rates if they reduced the number of gift coupons issued and allocated the funds to giving out more cash coupons.
<h5> Value of Impact </h5>
To measure the value added to MLCV by implementing this strategy, we calculated the increase in coupon redemptions and the approximate monetary value associated with it for our target age group of 30-49.
Currently, the number of cash coupons redeemed by this age group is ```93,496```.
The total number of cash coupons issued to this age group is ```2,538,287```.
This gives us the redemption rate as
```
(93,496/2,538,287) * 100 = 3.68%
```
Currently, the total number of gift coupons issued = ```2,065,342```
If we were to convert all these gift coupons to cash coupons, the expected number of redemptions at this redemption rate would be
```
(3.68/100) * 2,065,342 = 74,352
```
Since this was calculated with three years of data, the number of redemptions per year would be ```24,784```.
To assign a monetary value to our calculations, we consider the average ADT of customers redeeming cash coupons, which is ```$172```.
Thus, the total ADT for gift coupons converted to cash coupons is
```
24,784 * 172 = $4,262,848
```
The average value of cash redeemed by these customers when they use a cash coupon is ```$26.86```. Thus the total redeemed value for all coupons is ```$665,698```
This gives a net profit of
```
($4,262,848 - $665,698) = $3,597,150
```
Currently, the number of coupons redeemed as gift coupons is ```2,016```. The average ADT of customers using gift coupons is ```$201```.
This makes their total ADT as
```
(2,016/3) * $201 = $135,072
```
The average value of gifts redeemed by these customers when they use a gift coupon is ```$34.56```. Thus the total redeemed value for all coupons is ```$23,224```.
The net difference in overall profit from replacing gift coupons with cash coupons is
```
$3,597,150 - $23,224 = $3,573,926
```
<h3>Customer Review Analysis </h3>
As we build our focus toward a younger target segment, we also wanted to look at what people actually feel about the casinos. According to research conducted by GlobalWebIndex, about 82% of reviews written online are contributed by people below 55 years of age, which is similar to the target segment we want to focus on.

We first looked at the casino's overall rating and its main competitors' ratings to see where we stand in terms of online reviews.

We found that, although by a small margin, Mille Lacs has the lowest rating among its main competitors, viz. Mystic Lake, Treasure Island Casino, and Black Bear (the total number of reviews can be seen next to the ratings). We further found that Mille Lacs has the fewest online reviews, which could be attributed either to the older age profile of its casino-goers or to a low response rate from the review team.

As can be observed from the graph above, the percentage of positive ratings has been decreasing over the course of six months, and the percentage of negative reviews is following an upward trend. This further motivated us to look for the reasons behind Mille Lacs' downhill track in online sentiment.
Therefore, we web-scraped the online reviews, then cleaned and tokenized them for further analysis.
```python=
import nltk
from nltk.corpus import stopwords
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os
import re
import numpy as np
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from operator import itemgetter
nltk.download('punkt')
import docx
from docx.shared import Inches
from textblob import TextBlob
# os.chdir("C:/Users/nitin/Documents/EDA/Live_Case")

###################### Reading Data ################
doc = docx.Document()
doc.add_heading('Meeting Notes - Global Executive Dashboard', 0)
doc.add_heading('Summary', 1)
df = pd.read_csv("reviews.csv", encoding = 'unicode_escape')
# df = open('transcript.txt').read().strip()
df.head()

######## Sentiment Analysis #############
for i in range(len(df)):
    blob = TextBlob(df.loc[i, 'review'])
    df.loc[i, 'sentiment'] = blob.sentiment.polarity
# Polarity scores alone did not give better results; since we have star
# ratings, we can use them directly as an indicator of customer sentiment
for i in range(len(df)):
    if (df.loc[i, 'Rating'] == '4/5' or
            df.loc[i, 'Rating'] == '5/5' or
            df.loc[i, 'Rating'] == '8/1'):
        df.loc[i, 'sentiment_rating'] = 'Positive'
    elif (df.loc[i, 'Rating'] == '3/5' or df.loc[i, 'Rating'] == '2/5' or
            df.loc[i, 'Rating'] == '1/5' or df.loc[i, 'Rating'] == '0/1'):
        df.loc[i, 'sentiment_rating'] = 'Negative'

### NetworkX and fetching important reviews
## Only taking the positive reviews
from nltk.tokenize import sent_tokenize
sentences_pos = []
for s in df[df.sentiment_rating == 'Positive']['review']:
    sentences_pos.append(sent_tokenize(s))
sentences_pos = [y for x in sentences_pos for y in x]  # flatten list
sentences_pos[:5]
## Only taking the negative ones
sentences_neg = []
for s in df[df.sentiment_rating == 'Negative']['review']:
    sentences_neg.append(sent_tokenize(s))
sentences_neg = [y for x in sentences_neg for y in x]  # flatten list
sentences_neg[:5]

# https://nlp.stanford.edu/projects/glove/
# Extract word vectors
os.getcwd()
word_embeddings = {}
f = open('glove.6B.100d.txt', encoding='utf-8')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    word_embeddings[word] = coefs
f.close()
len(word_embeddings)
# We now have word vectors for 400,000 different
# terms stored in the dictionary 'word_embeddings'.

# Text Preprocessing
# remove punctuation, numbers and special characters
clean_sentences_pos = pd.Series(sentences_pos).str.replace("[^a-zA-Z]", " ")
clean_sentences_neg = pd.Series(sentences_neg).str.replace("[^a-zA-Z]", " ")
# make letters lowercase
clean_sentences_pos = [s.lower() for s in clean_sentences_pos]
clean_sentences_neg = [s.lower() for s in clean_sentences_neg]
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# function to remove stopwords
def remove_stopwords(sen):
    sen_new = " ".join([i for i in sen if i not in stop_words])
    return sen_new
# remove stopwords from the sentences
clean_sentences_pos = [remove_stopwords(r.split()) for r in clean_sentences_pos]
clean_sentences_neg = [remove_stopwords(r.split()) for r in clean_sentences_neg]
```
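The excerpt above stops after preprocessing; the GloVe vectors presumably feed the "NetworkX and fetching important reviews" step. Below is a minimal TextRank-style sketch of how that step could work, assuming the `sentences_pos`, `clean_sentences_pos`, and `word_embeddings` variables from the script above (networkx and scikit-learn required); it is an illustration, not the original script.
```python=
# Sketch: rank positive-review sentences TextRank-style using GloVe vectors
import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Average the 100-d GloVe vectors of a sentence's words into one vector
sentence_vectors = []
for s in clean_sentences_pos:
    words = s.split()
    if words:
        v = sum(word_embeddings.get(w, np.zeros(100)) for w in words) / len(words)
    else:
        v = np.zeros(100)
    sentence_vectors.append(v)

# Pairwise cosine similarities drive a PageRank over the sentence graph
sim_mat = cosine_similarity(np.array(sentence_vectors))
np.fill_diagonal(sim_mat, 0)
scores = nx.pagerank(nx.from_numpy_array(sim_mat))

# Print the five most representative positive sentences
ranked = sorted(((scores[i], s) for i, s in enumerate(sentences_pos)), reverse=True)
for score, sentence in ranked[:5]:
    print(round(score, 4), sentence)
```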
We did the initial cleaning in Python (as we needed a Python-based algorithm for one part of the analysis), but we made the word cloud in R, as it was more convenient to implement there.
```{r}
#install.packages('tm')
library(tm)
#install.packages('wordcloud')
library(wordcloud)
#install.packages('SnowballC')
library(SnowballC)
#install.packages('devEMF')
library(devEMF)
#install.packages('RWeka')
library(RWeka)
install.packages("rJava")
#Please change JAVA Home path according to your system
#Sys.setenv(JAVA_HOME='C:\\Program Files\\Java\\jre7')
# for 64-bit version
Sys.setenv(JAVA_HOME='C:/Program Files (x86)/Java/jre1.8.0_231')
# for 32-bit version
install()
library(rJava)
raw_data = read.csv(file.choose(), header = T, stringsAsFactors = T)
claim_data = raw_data
#View(claim_data)
names(claim_data) <- c( "rating", "reviews")
claim_data$rating <- as.numeric(claim_data$rating)
#Provide value of "n" for n-gram tokenization
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1,max = 2))
#Removing records with missing rating
claim_data<-claim_data[claim_data$rating>0,]
#Removing records with missing review text
claim_data<-(claim_data[claim_data$reviews!="",])
#Removing special characters
claim_data$reviews <- sapply(claim_data$reviews,
                             function(row) iconv(row, "latin1", "ASCII", sub=""))
#Creation of corpus from the review text
corpus <- Corpus(VectorSource(claim_data$reviews))
#Data clean up
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removeNumbers)
#corpus= tm_map(corpus, stemDocument)
corpus <- tm_map(corpus,
removeWords,
c("vehicle accident", "insured vehicle", "v", "wc", "w", "will", "went", "work",
"womens","zk", "yr", "ne", "ws", "year","b","fire", "ee","employee","pain","IV",
"iv","cli","str","CLIENT","CLMT","DIV","client","insd","ov","c","re","alleges",
"int","slo","div","leav","leavin","i","l","cv","rear","end","readend",
"readended","alledges",'ice show'))
corpus <- VCorpus(VectorSource(corpus))
newdata <- data.frame(text = unlist(lapply(sapply(corpus, '[', "content"),
paste, collapse="\n")), stringsAsFactors=FALSE)
newdata$value = claim_data$rating
#Creation of term document frequency matrix
#corpus <- VCorpus(VectorSource(corpus))
tdm <-TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
freq <- slam::row_sums(tdm)
words <- names(freq)
#Creating a data frame having words and frequencies
word<-data.frame(words,freq)
#View(word)
word <- subset(word, freq >3)
#Iterate over each word to calculate its count and average rating
occ<-rep(1,nrow(newdata))
for (i in 1:nrow(word))
{
  #Calculating the weight of each word according to the value.
  #Weight will determine the color of the word
  #Mean, Max or Sum function can be used as per the focus area
  word$weight[i]<-mean(newdata$value[which(apply(newdata, 1,
                        function(x) any(grepl(word$words[i], x))))])
  word$total[i]<-sum(newdata$value[which(apply(newdata, 1,
                        function(x) any(grepl(word$words[i], x))))])
  word$occurence[i]<-sum(occ[which(apply(newdata, 1,
                        function(x) any(grepl(word$words[i], x))))])
  if (i%%1==0){
    print(i)
  }
}
#Removing irrelevant words
word <- subset(word, occurence >3)
#Saving the words along with count and average rating
write.csv(word,"word_list.csv")
#Storing the quantile values
quant=quantile(word$weight, c(.25, .50, .75,1))
tone<-word$weight
toneR<-word$weight
toneG<-word$weight
toneB<-word$weight
toneR[tone <= quant[1]] <- 0.16
toneG[tone <= quant[1]] <- 0.75
toneB[tone <= quant[1]] <- 0.23
toneR[tone> quant[1] & tone <=quant[2]] <- 1
toneG[tone> quant[1] & tone <=quant[2]] <- 0.59
toneB[tone> quant[1] & tone <=quant[2]] <-0
toneR[tone> quant[2] & tone <=quant[3]] <- 0.07 #17
toneG[tone> quant[2] & tone <=quant[3]] <- 0.59 #150
toneB[tone> quant[2] & tone <=quant[3]] <-0.84 #214
toneR[tone> quant[3] & tone <=quant[4]] <- 1
toneG[tone> quant[3] & tone <=quant[4]] <- 0.34
toneB[tone> quant[3] & tone <=quant[4]] <- 0.34
#Defining the colour on RGB scale for each quantile
c_tone<-rgb(toneR,toneG,toneB)
quant
#Exporting the word cloud image
png(file="Sample_WordCloud.png")
wordcloud(word$words,scale=c(5,1),
word$freq, colors=c_tone, ordered.colors=TRUE, random.color=FALSE,
max.words=200, random.order=FALSE,rot.per=0.25,use.r.layout=FALSE)
dev.off()
#To be used for legends
print (quant)
```

Here the size of a word represents its frequency of occurrence and the color represents the average rating it is associated with. Rooms, Fun, Staff, Hotel, Buffet, Smoking, Stay, Slots, and Service are the most talked-about topics.

To get a better idea of the sentiment, we looked at the percentage of times each of these highly frequent words is associated with positive and negative sentiment.
We found that Rooms, Buffet, Staff, Smoking, and Slots have the highest negative sentiment associated with their usage.
To understand more about the specific things customers have issues with, we looked at the most common tags within each review.

The common theme we observed across most of the reviews was smoking. Customers mentioned cigarette smoke as a problem on the casino floor, in the rooms (even in non-smoking rooms), and even in the buffet area. Although customers reported other problems as well, smoking seems to be the most prominent of all.
Further, we found that Mille Lacs does not respond to customer feedback in the Google review section, unlike its competitors, where we found that every negative review receives a response to assure customers they are heard.
We recommend that Mille Lacs hire a social media administrator to build a better online presence, which would help the company increase customer engagement and build deeper relationships with its customers.
<h2> 7. Results, Findings and Insights </h2>
<h3>Results and Findings</h3>
Through the aforementioned analyses, the team found that:
<u>Coupon Analysis:</u>
* Headcounts have gone down as a result of a decrease in the frequency of visits by regular customers
* There is a stark difference between the issuance and redemption of coupons overall. This difference becomes clearer when viewed through the lenses of different age groups and distance groups
* Gift coupons are rarely redeemed by any age group or distance group
* People who live farther away redeem a higher proportion of hotel coupons than others
* Cash coupons are the second most favored coupon type among players at the casino
<u>Demographic Analysis:</u>
* Looking into the demographics and economic statistics of the Minneapolis region, it can be seen that the region's population has gradually shifted away from the casino toward the metro area. The population has also gotten older, as seen in the median age
* The median salary in the region also gives a picture of the disproportionate spending power of customers in the region
<u>Review and Feedback Analyses:</u>
* Looking into the Google public reviews of Mille Lacs and other casinos in the region shows that Mille Lacs Grand Casinos are rated the lowest among all of them
* People often complain about smoking and other room-related issues at the casino, while praising the pool and other facilities
<h2> 8. Future Steps </h2>
The potential future work that can be applied from this analysis includes:
* Looking into the gift coupon issuance data to identify what different offerings are provided as gifts and why those coupons are redeemed the least
* Exploring the casino-based hotel study to triangulate the amount of gains achievable per star rating increase in the casino
* Exploring what are the common topics and issues pointed out in the reviews of the competitors in the region as well as some other casinos in the country
* Expanding the analyses to include the impact of employing a social media administrator and the lift in the customer ratings or revenue
<h2> 9. Challenges Faced </h2>
Through the course of this project, the team faced the challenges below and suggests these measures to get around them in future:
1. The size of the dataset poses a big challenge to the number and depth of analyses that can be performed. For best results and ease of analysis,
* Use data in chunks to analyze
* Python:
* Import data as chunks and merge at different partitions (see the sketch after this list)
* Tableau:
* Create data extracts for working with static data to reduce the visualization time
* Use the recommended environment (processable memory and storage) at all times and kill variables not in use to clear physical memory and storage
* Python:
* Use variable explorer to view the variables in memory. Check the size occupied by each variable using
```
from sys import getsizeof
getsizeof(a)
```
* Delete variables not in use
```
del(variable_name)
```
* Tableau:
* Remove worksheets and dashboards not in use to avoid significant loading time
* Turn off automatic updates on views and dashboards to avoid repeated reloads on every action
* R:
* Use gc() as a garbage collection step after in-depth analysis to clear memory
* Use Big Data platforms for analyzing and visualizing data
* Pyspark
* SparkR
* AWS
* Google Cloud Platform
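As a minimal sketch of the chunked-import pattern referenced above (assuming PlayerDay.csv and its Theo column from the earlier scripts; the per-chunk filter is illustrative):
```python=
import pandas as pd

# Read PlayerDay.csv in 500k-row chunks, keeping only rows of interest
# from each chunk so the full file never sits in memory at once
chunks = []
for chunk in pd.read_csv('PlayerDay.csv', chunksize=500_000):
    chunks.append(chunk[chunk['Theo'] > 0])
player_day = pd.concat(chunks, ignore_index=True)
```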
<h2>
10. References
</h2>
<li> eMarketer. (March 21, 2013). Profile of Generation X internet users in the United States in 2012 (in millions) [Graph]. In Statista. Retrieved December 12, 2019, from https://www.statista.com/statistics/298589/us-gen-x-internet-users/ </li>
<li> Harvard Business Review (February 14, 2018). Study: Replying to Customer Reviews Results in Better Ratings. By:
Davide Proserpio and Giorgos Zervas from https://hbr.org/2018/02/study-replying-to-customer-reviews-results-in-better-ratings</li>
<li>Anderson, C. (2012). The impact of social media on lodging performance [Electronic article]. Cornell Hospitality Report, 12(15), 6-11. from https://scholarship.sha.cornell.edu/chrpubs/5/</li>
<li>Google 'Read and Reply to Reviews' https://support.google.com/business/answer/3474050?co=GENIE.Platform%3DAndroid&hl=en</li>
<li>Alliance Data. (February 19, 2019). Share of consumers who use selected methods to communicate with brands in the United States as of November 2018, by generation [Graph]. In Statista. Retrieved December 03, 2019, from https://www.statista.com/statistics/1044250/consumer-communication-methods-brands-generation-us/</li>
<li>Trip Advisor 'Management Response Guidelines', https://www.tripadvisorsupport.com/hc/en-us/articles/200614337-Management-Response-Guidelines</li>
<li>GlobalWebIndex. (December 11, 2017). Percentage of global internet users who post reviews online as of 3rd quarter 2017, by age group [Graph]. In Statista. Retrieved December 12, 2019, from https://www.statista.com/statistics/307050/online-adults-posting-product-or-brand-reviews-on-the-internet-by-generation/</li>
<li>Statista. (October 31, 2018). Share of Americans who played blackjack, poker, roulette, etc. in casinos in the last 12 months in 2018, by age [Graph]. In Statista. Retrieved December 12, 2019, from https://www.statista.com/statistics/227478/table-game-players-blackjack-craps-poker-roulette-etc-usa/</li>
<li>RubinBrown. (April 2, 2019). Total revenue of the gambling market in the United States from 2004 to 2018 (in billion U.S. dollars) [Graph]. In Statista. Retrieved December 12, 2019, from https://www.statista.com/statistics/271583/casino-gaming-market-in-the-us/</li>
<h2> 11. Appendices </h2>
Last but not least, we'd like to dedicate this feat of analytical brilliance to our team mascot and morale booster, Pang Pang, after whom we named ourselves.

Thanks! ~ Team Pang Cake