# STAT 172 Project - GLM Logistic Regression of Highly Rated Amazon Products
Group members: Kendall Starcevich, Olivia Soppa, Munkh-Orgil Tumurchudur
## Code and Meaningful Outputs Below:
## Cleaning Data
#### Importing Data
```sas!
/* IMPORT AND READ DATA */
proc import out = amazon
datafile = "/home/u63558812/amazon_products_sales_data_uncleaned.csv"
dbms = csv replace;
guessingrows = MAX;
run;
/*only keep columns we plan to use */
data amazon;
set amazon;
keep number_of_reviews listed_price 'current/discounted_price'n is_best_seller is_sponsored rating
is_couponed buy_box_availability;
run;
/*check data types*/
proc contents data = amazon;
run;
```
Here are the first 5 rows of the dataset. We are predicting `rating`, but we first have to convert it to a categorical (binary) variable. We also decided to make `listed_price`, `is_best_seller`, `is_couponed`, and `buy_box_availability` binary.
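
A preview like the one described above could be produced with a short `proc print` step. This is only a sketch; it assumes the `amazon` dataset created by the import code and uses the `obs=` dataset option to limit the output to 5 rows.

```sas!
/* Sketch: print the first 5 rows of the imported data */
/* (assumes the amazon dataset created by the import step above) */
proc print data = amazon (obs = 5);
run;
```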

You can also see below from the results of `proc contents` that all of the variables were read in as Char, even though we need some of them to be Num. The code for cleaning is below.

#### Cleaning Code
*NOTE: Some of the cleaning code below was written with the assistance of ChatGPT. The usage is cited in "AI_use_documentation.pdf".*
```sas!
/* CLEANING */
/* ------------ Cleaning number_of_reviews -------------- */
/* Checkout values */
proc freq data = amazon;
tables number_of_reviews;
run;
/* removes comma, makes num */
data amazon;
set amazon;
clean_number_reviews = input(compress(number_of_reviews, ","), best.);
run;
/*confirm that it is num*/
proc contents data = amazon;
run;
/* now called clean_number_reviews */
/*------------ cleaning listed_price --------------- */
/* check for null values*/
proc freq data=amazon;
tables listed_price / missing;
run;
/*make binary*/
data amazon;
set amazon;
if listed_price = "No Discount" then is_discount = "No Discount";
else is_discount = "Discount";
run;
/*now called is_discount*/
/* -------------- Cleaning current/discounted_price -------------- */
/*check for missing*/
proc freq data = amazon;
tables 'current/discounted_price'n / missing;
run;
/*convert to numeric*/
data amazon;
set amazon;
clean_current_discounted_price = input('current/discounted_price'n, best.);
run;
/*drop missing*/
data amazon;
set amazon;
if clean_current_discounted_price ne .;
run;
/* confirm missing were dropped */
proc freq data=amazon;
tables clean_current_discounted_price / missing;
run;
/* confirm it is now numeric */
proc contents data=amazon;
run;
/*now called clean_current_discounted_price */
/*------------------ clean is_best_seller ----------------------*/
/*checkout is_best_seller*/
proc freq data = amazon;
tables is_best_seller;
run;
/*make binary*/
data amazon;
set amazon;
is_best_seller_clean = is_best_seller;
if is_best_seller in ("Amazon's", "Best Seller") then is_best_seller_clean = "Best Seller";
else is_best_seller_clean = "No";
run;
/*now called is_best_seller_clean*/
/*----------------- clean is_sponsored -------------------*/
/*checkout values*/
proc freq data=amazon;
tables is_sponsored / missing;
run;
/*it is already clean with no nulls*/
/*--------------- cleaning rating (y var) -----------------*/
/* convert to numeric */
data amazon;
set amazon;
rating_num = input(scan(rating, 1, ' '), best.);
run;
/* drop na's */
data amazon;
set amazon;
if rating_num ne .;
run;
/* make binary */
data amazon;
set amazon;
if rating_num >= 4.5 then rating_cat = 1;
else rating_cat = 0;
run;
/*now called rating_cat*/
/*---------------------- clean is_couponed ------------------------*/
/*check out values*/
proc freq data = amazon;
tables is_couponed;
run;
/*make binary*/
data amazon;
set amazon;
is_couponed_clean = is_couponed;
if is_couponed = "No Coupon" then is_couponed_clean = "No";
else is_couponed_clean = "Yes";
run;
/*now called is_couponed_clean */
/*----------------- clean buy_box_availability -------------------*/
/*checkout values*/
proc freq data=amazon;
tables buy_box_availability / missing;
run;
/*make binary*/
data amazon;
set amazon;
buy_box_clean = buy_box_availability;
if buy_box_availability = "Add to cart" then buy_box_clean = "Yes";
else buy_box_clean = "No";
run;
/*now called buy_box_clean */
```
Here are the results of cleaning:

You can also see that the variables are the correct datatypes.
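
One way to reproduce this check is sketched below. It assumes the cleaned variable names created in the cleaning code above; the choice to print only 5 rows is ours.

```sas!
/* Sketch: preview only the cleaned columns */
proc print data = amazon (obs = 5);
    var clean_number_reviews is_discount clean_current_discounted_price
        is_best_seller_clean is_sponsored rating_cat
        is_couponed_clean buy_box_clean;
run;

/* Confirm the types: the clean_* numeric columns and rating_cat should be Num */
proc contents data = amazon;
run;
```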

## Pre-Analysis Check for Complete Separation: Output and Complete Analysis
Before making the model, we must examine each of the x-vars to check for complete separation. We use a cross-tabulation for the categorical vars and a scatter plot for the quantitative vars.
```sas!
/* clean_number_reviews */
proc sgplot data = amazon;
scatter x = clean_number_reviews y = rating_cat;
run;
```

We can tell from this plot that all products with more than around 170,000 reviews are highly rated (rating of 4.5 or above).
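
As a numeric complement to the scatter plots, the range of each quantitative variable can be compared across the two rating groups. The sketch below is an optional check rather than part of our original analysis. If the minimum-to-maximum ranges of the two groups overlap, that variable alone does not completely separate the 1s from the 0s.

```sas!
/* Sketch: compare the range of each quantitative x-var by rating group */
proc means data = amazon n min max;
    class rating_cat;
    var clean_number_reviews clean_current_discounted_price;
run;
```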
```sas!
/*is_discount*/
proc freq data = amazon;
tables is_discount*rating_cat;
run;
```

There does not seem to be complete separation because there are no 'empty cells'. The split between highly rated and not highly rated is very similar in both the discount and no discount groups: around 42-43% low rated and 56-57% high rated. This leads us to believe that this variable might not be very helpful in predicting our y.
```sas!
/*clean_current_discounted_price*/
proc sgplot data = amazon;
scatter x = clean_current_discounted_price y = rating_cat;
run;
```

There doesn't seem to be a visible pattern here, and there is no complete separation because the 1s and 0s are spread across the same range of prices.
```sas!
/*is_best_seller_clean*/
proc freq data = amazon;
tables is_best_seller_clean*rating_cat;
run;
```

This shows no complete separation. It also shows that 67% of Best Sellers are highly rated, compared to only 57% of products that are not Best Sellers. This makes sense contextually because 'Best Seller' items are likely popular with customers. However, only 3% of products are labeled as Best Sellers.
```sas!
/*is_sponsored*/
proc freq data = amazon;
tables is_sponsored*rating_cat;
run;
```

This shows no complete separation. 'Organic' makes up 77% of the products. It also shows that Organic items are highly rated 59% of the time, compared to only 52% of Sponsored items.
```sas!
/*is_couponed_clean*/
proc freq data = amazon;
tables is_couponed_clean*rating_cat;
run;
```

There is no complete separation here. 94.4% of the products are not couponed. Of the products that are not couponed, 59% are highly rated, while only 26% of couponed products are highly rated. This will likely be a good predictor.
```sas!
/*buy_box_clean*/
proc freq data = amazon;
tables buy_box_clean*rating_cat;
run;
```

There is no complete separation here. You can see that most products (90.24%) do have the buy_box button. Of these, 59% are highly rated, while only 40% of products that do not have the buy_box button are highly rated.

Therefore, for now **we do not suspect that there is complete separation in any of the x-vars.** We will do a post-analysis check after creating the model to confirm.
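
The five cross-tabulations above can also be requested in a single step. The sketch below is an alternative layout, not the code we ran. It uses the `nocol` and `nopercent` options so that each table shows only frequencies and row percentages, which are the within-group percentages quoted in the discussion above.

```sas!
/* Sketch: all categorical x-vars crossed with rating_cat in one step */
/* nocol and nopercent leave only the counts and the row percentages */
proc freq data = amazon;
    tables (is_discount is_best_seller_clean is_sponsored
            is_couponed_clean buy_box_clean) * rating_cat / nocol nopercent;
run;
```

An empty cell (a frequency of 0) in any of these tables would be the warning sign for separation.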
## Creating the Model
### Code for Model
```sas!
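proc logistic data = amazon;
    /* reference-cell coding for the categorical predictors */
    class is_discount is_best_seller_clean is_sponsored is_couponed_clean buy_box_clean / param = reference;
    /* model the probability that rating_cat = 1 (highly rated) */
    model rating_cat(event = '1') = clean_number_reviews is_discount clean_current_discounted_price
                                    is_best_seller_clean is_sponsored is_couponed_clean buy_box_clean
                                    / clparm = both;
    /* save predicted probabilities (pred) and the linear predictor (linpred) */
    output out = diags predicted = pred xbeta = linpred;
run;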
```
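
For the post-analysis check, one simple option is to summarize the `diags` dataset written by the `output` statement. The sketch below is an assumption about how that check could be done, not a record of what we ran. Predicted probabilities at or extremely close to 0 or 1, or very large absolute values of the linear predictor, would be a reason to look more closely for separation. PROC LOGISTIC also reports detected complete or quasi-complete separation in its Model Convergence Status output.

```sas!
/* Sketch: summarize fitted values from the diags dataset */
/* pred = predicted probability, linpred = linear predictor (log-odds) */
proc means data = diags n min max mean;
    var pred linpred;
run;
```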
### Model Outputs








