# STAT1013: Practical Assignment Part 1: Sharing Your Idea and Data

###### Mark: 35 Points

This semester, you will complete a practical assignment submitted in **TWO** parts. This handout outlines the *first part*, which involves sharing your idea and data.

The primary objective of the assignment is to compare the averages of two population groups. You will collect ==quantitative (ordered) data== *(the response from each subject is a number, which can be 0/1)* from two groups to compare population means. The samples can be *independent* or *paired*. After data collection, you will share and describe each sample using appropriate graphs and summary statistics from Python. You will then conduct a two-sample t-test or a paired t-test, depending on your data type. The entire assignment is worth 100 points.

## **Coming up with an Idea** (10 points)

One of the most challenging aspects of the research process is generating an idea. To make this assignment meaningful, consider topics that interest you or data you have already gathered or can easily access. Remember, you will be collecting quantitative data (not categorical) from two distinct groups. Each group must include at least 30 subjects.

Subjects can be individuals (e.g., comparing GPAs between males and females requires at least 60 individuals) or other entities (e.g., comparing average house prices in Hong Kong and New York City requires at least 60 house prices). Subjects can even be items like food products; if comparing prices at two different stores, you need data on at least 30 products, resulting in 60 prices (30 from each store).

You must collect the same type of quantitative data from each subject, with consistent units. For instance, if comparing food prices at two stores, gather the price of each item; if comparing study hours of male and female students, gather data on hours studied from each student. In these examples, food items and students are subjects, respectively.
If comparing school districts' average teacher salaries, the subjects are school districts. In each of the above examples, the data you collect (price of a food item, number of study hours, teacher salaries) are quantitative (numbers). *If you collect categorical data, you cannot calculate an average and cannot complete this assignment*.

To help you brainstorm, here are some example ideas specific to Hong Kong:

- Do apartments in Hong Kong's Mid-Levels area have higher rental prices than those in Tsim Sha Tsui?
  - Two groups: Apartments in Mid-Levels vs. apartments in Tsim Sha Tsui
  - Response variable: Rental prices
- Do public transportation users in Hong Kong have shorter average commute times than private car users?
  - Two groups: Public transportation users vs. private car users
  - Response variable: Commute times
- Do people who live on Hong Kong Island report higher levels of life satisfaction than those who live in the New Territories?
  - Two groups: Residents of Hong Kong Island vs. residents of the New Territories
  - Response variable: Reported life satisfaction levels
- Do restaurants in Central charge more for a meal than restaurants in Mong Kok?
  - Two groups: Restaurants in Central vs. restaurants in Mong Kok
  - Response variable: Meal prices
- Do people working in the finance sector in Hong Kong have longer working hours than those in the technology sector?
  - Two groups: Finance sector employees vs. technology sector employees
  - Response variable: Working hours
- Do residents of Hong Kong Island have a lower rate of car ownership than residents of the New Territories?
  - Two groups: Residents of Hong Kong Island vs. residents of the New Territories
  - Response variable: Car ownership rate

You don't have to use the above ideas, but they serve as a good starting point; **you should also consider whether you can obtain the corresponding dataset**. Each example compares two samples based on a quantitative response variable.

>[!Tip]
> Starting with real datasets can be beneficial (refer to the links in the following section). Consider which columns you can use to conduct A/B tests.

Answer the following five questions in your write-up that you will submit in **BlackBoard**.

- Background and basic description of the dataset (2 points)
- Hypothesis

1) (1 point) Please share your research idea with us and explain why you have chosen to pursue it.
2) (2 points) Carefully explain the following:
   1. What two groups you are comparing
   2. What you will be measuring (i.e., what your response variable will be)
   3. Is your response variable quantitative rather than categorical?
3) (2 points) Make a prediction about what kind of difference you expect to see between your samples and WHY. When we gather data in statistics in order to compare samples, we often hypothesize that the groups will exhibit some difference. You might anticipate one group's mean to be larger than the other's, or vice versa. If you do not have a specific expectation, you can hypothesize that the two means will simply differ.
4) (1 point) Discuss your data collection methods. For instance, will you obtain data from specific websites, survey friends and classmates, visit different stores to gather information, or focus on the price per ounce instead of sale prices when comparing food items?
5) (2 points) If resources such as time, money, and staff were unlimited, how would you enhance your data collection process as described in Question 4?

<!-- Remember that you must have at least 30 cases (people or objects) in **each** of your groups or samples.
**IMPORTANT:** As you are working on this part of your project, pay careful attention to the directions in your software reference guide (for the statistical software package you have chosen to use in the course). For each software package, attempts were made to explain just how you should go about entering your data. Look carefully at the section in the reference guide about entering data; it would also be helpful for you to look at the sections on conducting paired and two-sample t-tests. ## **Paired OR two-sample** Some of you may be working on projects in which there is a connection between your groups because you are basically measuring the same case or individual twice, or you have matched pairs. For example, you may have chosen to compare the prices of products sold at different stores. You’ve chosen a sample of 30 products, and you have measured the price of the product twice—once at one store and once at another store. Because you have the exact same products in each group, there is reason to believe that the groups are NOT independent of one another. What a certain product costs at one market could affect what it costs at another market. You will, therefore, end up conducting a paired t-test as a part of your project. The software reference guide on paired t-tests will inform you about how to enter your data. Others may be working on project ideas where you are comparing two completely different groups. For example, you might be comparing prices of apartments in Hong Kong and NYC, or you might be comparing males and females on test scores of some kind. You have independent samples if there are completely different cases or individuals in each sample (or group) and there is no reason to believe that cases or individuals in one group will affect the measurements taken from cases or individuals in the other group. When it comes time to analyze your data, you will end up conducting a two-sample t-test. 
The software reference guide on two-sample (or independent samples) t-tests will provide examples about how to enter this kind of data. If you are not sure what kind of data you have or you need help entering the data, please let the TA’s know! -->

## **Prepare your dataset** (15 points)

The second section of the assignment is to collect data, read it into Python, and list the groups you want to compare. A lot of data is available online. If you are looking for something in particular, try searching with keywords such as "data", "csv", or "github", plus other keywords that reflect your interests. Keep in mind that even if you have an IDEAL project topic in mind, you may not be able to find appropriate data in the time you have to work on this assignment.

- GitHub raw CSV datasets
  - https://github.com/prasertcbs/basic-dataset
  - https://github.com/Opensourcefordatascience/Data-sets
- Public datasets
  - https://data.gov.hk/en/
  - https://github.com/awesomedata/awesome-public-datasets
  - https://data.world/datasets/csv
  - https://github.com/curran/data
- This [tutorial](https://machinelearningmastery.com/handle-missing-data-python/) on handling missing data may be helpful if your dataset has missing values.

>[!Note]
> You will receive an additional bonus if you analyze and clean a dataset from [DATA.GOV.HK](https://data.gov.hk/en/).

You should then read your data into Python using the following commands:

```python
import pandas as pd

## (option 1; recommended) load csv data via github link
# Step 1: find the link of csv dataset
# Step 2: copy & paste the link (as a quoted string) into pd.read_csv(...)
df = pd.read_csv("<link_to_github_raw_data>")

## (option 2) load data by downloading the csv
# Step 1: download the dataset (*.csv)
# Step 2: upload the csv file (in your local machine) to colab
# Step 3: copy the path of the csv file
# Step 4: load the data via pd.read_csv(path_to_csv)
df = pd.read_csv("<path_to_csv>")
```

Answer the following questions in your write-up **Jupyter notebook** that you will submit in **BlackBoard**.

1) (2 points) Tell us what groups you want to compare in the dataset.
2) (3 points) Print the first 5 records of each group.
3) (10 points) Any other data description and visualization you want to add.

## **Graphs and Descriptive Statistics** (10 points)

It is essential to graph the data and generate summary statistics to analyze characteristics such as shape, center/location, and variability. Graph each sample separately and produce a distinct table of descriptive statistics for each sample. For instance, when comparing males and females based on survival rates, create separate graphs displaying the survival rate distribution for males and for females, accompanied by separate summary statistics for each group.

This section of the practical assignment, focusing on data visualization and description, will be evaluated as follows:

1. Creation and interpretation of at least one suitable graph (e.g., boxplot, violinplot, barplot) for each group **(4 points)**.
2. Provision of appropriate summary statistics (measures of center and spread) for each group, with a descriptive analysis of the data **(3 points)**.
3. Discussion of similarities and differences between the groups **(3 points)**.
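
As a rough illustration of splitting a dataset into two groups, printing the first records of each, and producing per-group summary statistics, the sketch below uses a small made-up dataset. The column names `store` and `price` and all values are hypothetical, and only five rows per group are used for brevity (your own samples need at least 30 subjects each). It is one possible approach, not the required one.

```python
import pandas as pd

# Hypothetical toy data; the column names "store" and "price" are assumptions.
# A real submission needs at least 30 subjects in each group.
df = pd.DataFrame({
    "store": ["A"] * 5 + ["B"] * 5,
    "price": [10.0, 12.0, 11.0, 13.0, 14.0,
              15.0, 17.0, 16.0, 18.0, 19.0],
})

# Split the data into the two groups to be compared.
group_a = df[df["store"] == "A"]
group_b = df[df["store"] == "B"]

# Print the first 5 records of each group.
print(group_a.head(5))
print(group_b.head(5))

# Separate summary statistics (center and spread) for each group.
summary = df.groupby("store")["price"].agg(
    ["count", "mean", "median", "std", "min", "max"]
)
print(summary)
```

If matplotlib is installed, calling `df.boxplot(column="price", by="store")` on the same DataFrame would draw one boxplot per group, which is one way to cover the graph requirement.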