STAT 172 Project 1

# STAT 172 Project - GLM Logistic Regression of Highly Rated Amazon Products

Group members: Kendall Starcevich, Olivia Soppa, Munkh-Orgil Tumurchudur

## Code and Meaningful Outputs Below:

## Cleaning Data

#### Importing Data

```sas!
/* IMPORT AND READ DATA */
proc import out = amazon
    datafile = "/home/u63558812/amazon_products_sales_data_uncleaned.csv"
    dbms = csv replace;
    guessingrows = MAX;
run;

/* only keep columns we plan to use */
data amazon;
    set amazon;
    keep number_of_reviews listed_price 'current/discounted_price'n
         is_best_seller is_sponsored rating is_couponed buy_box_availability;
run;

/* check data types */
proc contents data = amazon;
run;
```

Here are the first 5 rows of the dataset. We are predicting 'rating', but will have to make it categorical. We also decided to make listed_price, is_best_seller, is_couponed, and buy_box_availability binary.

![Head of Data](https://hackmd.io/_uploads/rkAhAHnsgx.png)

You can also see below from the results of `proc contents` that all of the variables are Char, even though we want some of them to be Num. The code for cleaning is below.

![Proc Contents](https://hackmd.io/_uploads/H1IV1U3iee.png)

#### Cleaning Code

*NOTE: some of the cleaning code below was written with the assistance of ChatGPT. The usage is cited in “AI_use_documentation.pdf”.*

```sas!
/* CLEANING */

/* ------------ Cleaning number_of_reviews -------------- */
/* Check out values */
proc freq data = amazon;
    tables number_of_reviews;
run;

/* removes comma, makes num */
data amazon;
    set amazon;
    clean_number_reviews = input(compress(number_of_reviews, ","), best.);
run;

/* confirm that it is num */
proc contents data = amazon;
run;
/* now called clean_number_reviews */

/* ------------ cleaning listed_price --------------- */
/* check for null values */
proc freq data = amazon;
    tables listed_price / missing;
run;

/* make binary */
data amazon;
    set amazon;
    if listed_price = "No Discount" then is_discount = "No Discount";
    else is_discount = "Discount";
run;
/* now called is_discount */

/* -------------- Cleaning current/discounted_price -------------- */
/* check for missing */
proc freq data = amazon;
    tables 'current/discounted_price'n / missing;
run;

/* convert to numeric */
data amazon;
    set amazon;
    clean_current_discounted_price = input('current/discounted_price'n, best.);
run;

/* drop missing */
data amazon;
    set amazon;
    if clean_current_discounted_price ne .;
run;

/* confirm missing were dropped */
proc freq data = amazon;
    tables clean_current_discounted_price / missing;
run;

/* confirm it is now numeric */
proc contents data = amazon;
run;
/* now called clean_current_discounted_price */

/* ------------------ clean is_best_seller ---------------------- */
/* check out is_best_seller */
proc freq data = amazon;
    tables is_best_seller;
run;

/* make binary */
data amazon;
    set amazon;
    /* first assignment copies the source column so the new variable inherits its length */
    is_best_seller_clean = is_best_seller;
    if is_best_seller in ("Amazon's", "Best Seller") then is_best_seller_clean = "Best Seller";
    else is_best_seller_clean = "No";
run;
/* now called is_best_seller_clean */

/* ----------------- clean is_sponsored ------------------- */
/* check out values */
proc freq data = amazon;
    tables is_sponsored / missing;
run;
/* it is already clean with no nulls */

/* --------------- cleaning rating (y var) ----------------- */
/* convert to numeric: keep the first space-delimited token of rating */
data amazon;
    set amazon;
    rating_num =
        input(scan(rating, 1, ' '), best.);
run;

/* drop na's */
data amazon;
    set amazon;
    if rating_num ne .;
run;

/* make binary: 1 = highly rated (4.5 or above), 0 = otherwise */
data amazon;
    set amazon;
    if rating_num >= 4.5 then rating_cat = 1;
    else rating_cat = 0;
run;
/* now called rating_cat */

/* ---------------------- clean is_couponed ------------------------ */
/* check out values */
proc freq data = amazon;
    tables is_couponed;
run;

/* make binary */
data amazon;
    set amazon;
    is_couponed_clean = is_couponed;
    if is_couponed = "No Coupon" then is_couponed_clean = "No";
    else is_couponed_clean = "Yes";
run;
/* now called is_couponed_clean */

/* ----------------- clean buy_box_availability ------------------- */
/* check out values */
proc freq data = amazon;
    tables buy_box_availability / missing;
run;

/* make binary */
data amazon;
    set amazon;
    buy_box_clean = buy_box_availability;
    if buy_box_availability = "Add to cart" then buy_box_clean = "Yes";
    else buy_box_clean = "No";
run;
/* now called buy_box_clean */
```

Here are the results of cleaning:

![Clean Head](https://hackmd.io/_uploads/HyM4fUnjlg.png)

You can also see that the variables are the correct datatypes.

![Clean Proc Contents](https://hackmd.io/_uploads/HkvLGLhigx.png)

## Pre-Analysis Check for Complete Separation: Output and Complete Analysis

Before making the model, we must examine each of the x-vars to check for complete separation. The categorical vars are checked with a frequency table, and the quantitative vars with a scatter plot against the response.

```sas!
/* clean_number_reviews */
proc sgplot data = amazon;
    scatter x = clean_number_reviews y = rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.32.55 AM](https://hackmd.io/_uploads/r14HwL3iee.png)

We can tell from this plot that all products with more than around 170,000 reviews are highly rated (rating of 4.5 or above).

```sas!
/* is_discount */
proc freq data = amazon;
    tables is_discount*rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.40.44 AM](https://hackmd.io/_uploads/ryFrKLnogg.png)

There does not seem to be complete separation because there are no 'empty cells'. The split between highly rated and not highly rated is very similar in both the discount and no discount groups: around 42-43% low rated and 56-57% high rated in each. This leads us to believe that this variable might not be too helpful in predicting our y.

```sas!
/* clean_current_discounted_price */
proc sgplot data = amazon;
    scatter x = clean_current_discounted_price y = rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.42.14 AM](https://hackmd.io/_uploads/Bk7_F83slx.png)

There doesn't seem to be a visible pattern here, and there is no complete separation because the 1s and 0s seem evenly distributed across the price range.

```sas!
/* is_best_seller_clean */
proc freq data = amazon;
    tables is_best_seller_clean*rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.46.56 AM](https://hackmd.io/_uploads/Skstq8njll.png)

This shows no complete separation. It also shows that 67% of Best Sellers are highly rated, compared to only 57% of products that are not Best Sellers. This makes sense contextually because 'Best Seller' items are likely popular with customers. However, only 3% of products are labeled as Best Sellers.

```sas!
/* is_sponsored */
proc freq data = amazon;
    tables is_sponsored*rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.49.16 AM](https://hackmd.io/_uploads/HJuzo82jex.png)

This shows no complete separation. 'Organic' makes up 77% of the products. It also shows that Organic items are highly rated 59% of the time, while Sponsored items are highly rated only 52% of the time.

```sas!
/* is_couponed_clean */
proc freq data = amazon;
    tables is_couponed_clean*rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.50.44 AM](https://hackmd.io/_uploads/HkkOiUhoxe.png)

There is no complete separation here. 94.4% of the products are not couponed.
This shows that 59% of products that are not couponed are highly rated, while only 26% of couponed products are highly rated. This will likely be a good predictor.

```sas!
/* buy_box_clean */
proc freq data = amazon;
    tables buy_box_clean*rating_cat;
run;
```

![Screenshot 2025-09-20 at 11.55.27 AM](https://hackmd.io/_uploads/BJpt2Lhilx.png)

There is no complete separation here. You can see most products (90.24%) do have the buy_box button. Of these, 59% are highly rated, while only 40% of products that do not have the buy_box button are highly rated.

Therefore, for now **we do not suspect that there is complete separation in any of the X vars.** We will do a post-analysis check after creating the model to confirm.

## Creating the Model

### Code for Model

```sas!
proc logistic data = amazon;
    /* categorical predictors, reference-cell coding */
    class is_discount is_best_seller_clean is_sponsored
          is_couponed_clean buy_box_clean / param = reference;
    /* model the probability that rating_cat = 1 (highly rated) */
    model rating_cat(event = '1') = clean_number_reviews is_discount
          clean_current_discounted_price is_best_seller_clean
          is_sponsored is_couponed_clean buy_box_clean / clparm = both;
    /* save predicted probabilities and linear predictor for diagnostics */
    output out = diags predicted = pred xbeta = linpred;
run;
```

### Model Outputs

![Screenshot 2025-09-20 at 12.02.49 PM](https://hackmd.io/_uploads/BJHH0Uhoxx.png)

![Screenshot 2025-09-23 at 11.30.33 PM](https://hackmd.io/_uploads/ByEeEgb2xg.png)

![Screenshot 2025-09-23 at 11.31.12 PM](https://hackmd.io/_uploads/rkoMVlW3lx.png)

![Screenshot 2025-09-20 at 12.03.47 PM](https://hackmd.io/_uploads/rJGF0U3oex.png)

![Screenshot 2025-09-20 at 12.04.43 PM](https://hackmd.io/_uploads/HkI30Lhigl.png)

![Screenshot 2025-09-20 at 12.04.51 PM](https://hackmd.io/_uploads/Sy6nALhige.png)

![Screenshot 2025-09-20 at 12.04.59 PM](https://hackmd.io/_uploads/r1LTRU3sel.png)

![Screenshot 2025-09-20 at 12.05.09 PM](https://hackmd.io/_uploads/r1kRA8hoxg.png)

![Screenshot 2025-09-20 at 12.05.17 PM](https://hackmd.io/_uploads/SyDARUnixl.png)
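
For reference, the model specified in the `proc logistic` call can be written out as below. This is only a sketch of the model form: the symbols $p$, $x_1, \dots, x_7$, and $\beta_0, \dots, \beta_7$ are notation introduced here and do not appear in the SAS output. No coefficient values are implied. Which level of each class variable is coded as 1 depends on the reference level SAS selects, since no `ref=` option is set in the code.

$$
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \beta_7 x_7
$$

Here $p$ is the probability that a product is highly rated (`rating_cat` = 1), and the predictors follow the order of the `model` statement:

- $x_1$: `clean_number_reviews` (numeric)
- $x_2$: indicator for the non-reference level of `is_discount`
- $x_3$: `clean_current_discounted_price` (numeric)
- $x_4$: indicator for the non-reference level of `is_best_seller_clean`
- $x_5$: indicator for the non-reference level of `is_sponsored`
- $x_6$: indicator for the non-reference level of `is_couponed_clean`
- $x_7$: indicator for the non-reference level of `buy_box_clean`

The two columns saved by the `output` statement are tied together by the inverse of this link. Writing $\hat{\eta}$ for `linpred` and $\hat{p}$ for `pred`:

$$
\hat{p} = \frac{1}{1 + e^{-\hat{\eta}}}
$$

Under this form, $e^{\beta_j}$ is the multiplicative change in the odds of being highly rated for a one-unit increase in $x_j$ (or for the non-reference level versus the reference level of a class variable), holding the other predictors fixed.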
