> The following exercise will give you a rudimentary understanding of analysing and visualizing data using Python. Steps you need to complete will be ==highlighted==, but please follow along closely with the entire document. This is **Part 1** and will continue next week.

## Data

The data set we'll be working with is from the [Cleveland database](https://archive.ics.uci.edu/ml/datasets/heart+Disease) on heart disease. Spend some time exploring the history and documentation on the database before continuing to our download instructions.

==**We've done some initial formatting to reduce the number of attributes from 76 to 14 and made it available to download from [Kaggle](https://www.kaggle.com/datasets/sumaiyatasmeem/heart-disease-classification-dataset).**==

Navigate there first, download the data, and then proceed with the instructions.

## Step 1: Search for Features

Features are different parts of the data. During this step, you'll want to start finding out what you can about the data. One of the most common ways to do this is to create a **data dictionary**
: a dictionary of terms and features specific to the dataset

### Heart Disease Data Dictionary

A data dictionary describes the data you're dealing with. Not all datasets come with them, so this is where you may have to do your own research or ask a **subject matter expert** (someone who knows about the data) for more. In this case, there are a number of terms and codes that you probably won't be familiar with. It's a good idea to save the below definitions to a Python dictionary or to an external file so you can easily look at them later.

The following are the features we have for the dataset (heart disease or no heart disease):

```
1. age - age in years
2. sex - 1=male; 0=female
3. cp - chest pain type
   * 0: Typical angina: chest pain related to decreased blood supply to the heart
   * 1: Atypical angina: chest pain not related to heart
   * 2: Non-anginal pain: typically esophageal spasms (non-heart related)
   * 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
   * *anything above 130-140 is typically a cause for concern*
5. chol - serum cholesterol in mg/dl
   * serum = LDL + HDL + .2 * triglycerides
   * *above 200 is cause for concern*
6. fbs - fasting blood sugar > 120 mg/dl
   * 1 = true; 0 = false
   * *>126 mg/dl signals diabetes*
7. restecg - resting electrocardiographic results
   * 0: nothing to note
   * 1: ST-T Wave abnormality
     * can range from mild symptoms to severe problems
     * signals non-normal heart beat
   * 2: possible or definite left ventricular hypertrophy
     * enlarged heart's main pumping chamber
8. thalach - maximum heart rate achieved
9. exang - exercise induced angina (1=yes; 0=no)
10. oldpeak - ST depression induced by exercise relative to rest
    * looks at stress of heart during exercise
    * unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    * 0: Upsloping - better heart rate with exercise (uncommon)
    * 1: Flatsloping - minimal change (healthy heart)
    * 2: Downsloping - signs of unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy
    * colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)
13. thal - thallium stress result
    * 1,3: normal
    * 6: fixed defect (okay now)
    * 7: reversible defect - no proper blood movement when exercising
14. target - have disease or not (1=yes; 0=no) (the predicted attribute)
```
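For example, here's a minimal sketch of how those definitions could be kept in a Python dictionary (the descriptions are abbreviated from the list above):

```python
# A small data dictionary: column names mapped to shortened descriptions.
data_dictionary = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (0 = typical angina, 3 = asymptomatic)",
    "trestbps": "resting blood pressure (in mm Hg on admission)",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true; 0 = false)",
    "restecg": "resting electrocardiographic results (0, 1 or 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment (0, 1 or 2)",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "thallium stress result (1,3 = normal; 6 = fixed; 7 = reversible)",
    "target": "have disease or not (1 = yes; 0 = no)",
}

# Look up any feature later with a simple key access.
print(data_dictionary["cp"])
```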
> **Note: No personally identifiable information (PII) can be found in the dataset.**

## Step 2: Preparing the Tools

At the start of any project, it's customary to see the required libraries imported in one big chunk. In practice, however, your projects may import libraries as you go. After you've spent a couple of hours working on your problem, you'll probably want to do some tidying up. This is where you may want to consolidate every library you've used at the top of your notebook (like the list below).

The libraries you use will differ from project to project. But there are a few which you'll likely take advantage of during almost every structured data project.

==**Import the following libraries into your notebook using the `import` command and shorthand.**==

* [pandas](https://pandas.pydata.org/) for data analysis.
* [NumPy](https://numpy.org/) for numerical operations.
* [Matplotlib](https://matplotlib.org/)/[Seaborn](https://seaborn.pydata.org/) for plotting or data visualization.

### Load the data

There are many different ways to store data. The typical way of storing **tabular data**, data similar to what you'd see in an Excel file, is in `.csv` format. `.csv` stands for comma separated values.

Pandas has a built-in function to read `.csv` files called `read_csv()`, which takes the file path of your `.csv` file. You'll likely use this a lot.

==**Load `heart-disease.csv` as a `pandas` dataframe and check the *dimensions* of the dataframe.**==

## Step 3: Data Exploration

*Exploratory data analysis or EDA*
: The process of exploring a dataset to become more familiar with it

Once you've imported a dataset, the next step is to explore. There's no set way of doing this. But what you should be trying to do is become more and more familiar with the dataset. Compare different columns to each other and to the target variable. Refer back to your **data dictionary** and remind yourself of what different columns mean. Your goal is to become a subject matter expert on the dataset you're working with.

Since EDA has no real set methodology, the following is a short checklist you might want to walk through:

1. What question(s) are you trying to solve (or prove wrong)?
2. What kind of data do you have and how do you treat different types?
3. What's missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?

One of the quickest and easiest ways to check your data is with the `head()` function. Calling it on any dataframe will print the top 5 rows; `tail()` prints the bottom 5. You can also pass a number to them, like `head(10)`, to show the top 10 rows.

==**Print the first *10 rows* and the last *25 rows* of the dataframe.**==

### Data distribution

From that data, we'll start exploring the columns. Let's start with the column `target`. This column has two types of values (*yes* or *no*, represented by 1 and 0 respectively).

==**Run a function to determine the distribution of the dataset based on the target value. How many positive (1) and negative (0) samples are in our dataframe? Print the result for each.**==

> Hint: use `value_counts()`

### Plotting data

We can plot the target column value counts by calling the `plot()` function and telling it what kind of plot we'd like. In this case, bar is a good option.

==**Plot the value counts with a bar graph by using the built-in `plot()` method. Use the samples that belong to the *target column value YES* and *target column value NO*. Then, run the `info()` method on the dataframe.**==
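Putting the steps so far together, one possible approach might look like the sketch below (it assumes `heart-disease.csv` sits in the same directory as your notebook):

```python
# Import the standard libraries with their conventional shorthands.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data and check its dimensions as (rows, columns).
df = pd.read_csv("heart-disease.csv")
print(df.shape)

# Explore the top and bottom of the dataframe.
print(df.head(10))   # first 10 rows
print(df.tail(25))   # last 25 rows

# Distribution of the target column: counts of positive (1) and negative (0) samples.
print(df["target"].value_counts())

# Bar graph of the two value counts, then a summary of types and missing data.
df["target"].value_counts().plot(kind="bar")
plt.show()
df.info()
```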
> **Tip: Checking for missing data** - The `df.info()` method gives quick insight into the number of missing values you have and the type of data you're working with. All columns in this dataframe should be numerical in nature; take note of whether any column reports fewer non-null entries than the total number of rows.

Another way to get some quick insights is to use `df.describe()`. This will show a range of different metrics about your data, such as **mean**, **max**, and **standard deviation**.

==Use the `describe()` method on the dataframe to show the statistical metrics of the dataset.==

Does anything stand out as strange? Make a note of it.

### Data Preprocessing

The last step before we get to work on our dataset is to clean and fix the data to suit our needs. This stage is called **data preprocessing**
: Finding anomalies, missing values, and noise in a dataset and fixing them

Let's first find *how many* missing values are contained in each column/attribute of the dataset.

==Write code that searches each column for missing values using a combination of the **`isnull()`** (or **`isna()`** for the NaN values) and **`sum()`** methods.==

This will provide you with a list of the number of missing data points for each attribute. Now, we want to pull those rows to fix them.

==Write code that loops through the rows in the dataframe and prints any row along with its contents if that row contains a **missing** or **NaN** value.==

> For each of these code elements, comment on why you chose the method you did.

You should find that the dataset contains both missing and NaN (not a number) values that need to be fixed. Out of the multiple ways we can do that, we'll be looking into two options:

* Removing the whole record from the dataset **OR**
* Replacing a value with the mean, median, or mode

### Cleaning this data

==Perform the following tasks to clean the data== (one possible approach is sketched after the hint below)

1. **Replace the `trestbps` missing values with the average/mean value of the `trestbps` column** - *can be found in the results of the `describe()` method used previously*
2. **Drop the row that has a missing value under column `chol`**
3. **Replace the `thalach` missing values with the median value of the `thalach` column** - *can be found in the results of the `describe()` method used previously*

> **Hint:** do not forget to use `inplace=True` if you want the changes to the dataframe to persist
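Here's a minimal sketch of how the missing-value search and the three cleaning tasks might look. Note that pandas parses both empty cells and literal `NaN` entries as null, so `isnull()` catches both; the dict form of `fillna()` is just one way of honouring the `inplace=True` hint:

```python
# Count how many missing/NaN values each column contains.
print(df.isnull().sum())

# Print every row that contains at least one missing or NaN value.
for index, row in df.iterrows():
    if row.isnull().any():
        print(index, row.to_dict())

# 1. Replace missing trestbps values with the column mean.
df.fillna({"trestbps": df["trestbps"].mean()}, inplace=True)

# 2. Drop the row with a missing chol value.
df.dropna(subset=["chol"], inplace=True)

# 3. Replace missing thalach values with the column median.
df.fillna({"thalach": df["thalach"].median()}, inplace=True)

# Confirm there are no missing values left.
print(df.isnull().sum())
```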
## Step 4: Comparing the Data

Now that we've fixed the dataframe, we can start processing and comparing it. Comparing columns is helpful if you want to start gaining an intuition about how your independent variables interact with your dependent variable.

To compare two columns to each other, you can use the function **`pd.crosstab(column_1, column_2, ...)`**. We'll be comparing a few attributes in this section.

---

### Heart Disease Frequency According to Gender

Let's start by studying heart disease frequency according to gender.

==**Compare the `target` column with the `sex` column using `crosstab()`**==

#### Make the crosstab visual

You can plot the crosstab by using the `plot()` function and passing it a few parameters, such as `kind` (the type of plot you want), `figsize=(width, height)` (how big you want it to be) and `color=[colour_1, colour_2]` (the different colours you'd like to use).

Different metrics are represented best with different kinds of plots. In our case, a bar graph is great; we'll see examples of more later. And with a bit of practice, you'll gain an intuition for which plot to use with different variables.

==**Make a bar chart of the results of the crosstab between `target` and `sex`**==

#### Adding Attributes

This chart looks nice but is pretty bare. You'll need to add attributes to label the results. To add attributes, you call them on `plt` within the same cell as where you create the graph. So your cell will contain the `crosstab()` and `plot()` methods *and* labels such as `plt.title()`, `plt.xlabel()` and more.

==Create another plot based on the last graph you generated and add the following attributes to beautify it== (a combined sketch of this step's plots appears at the end of the section)

* ==Add title as **Heart Disease Frequency for Sex**==
* ==Add xlabel as **0 = No Disease, 1 = Disease**==
* ==Add ylabel as **Amount**==
* ==Add legends for **Female** and **Male**==
* ==*Bonus: keep the labels on the x-axis vertical*==

---

### Age vs Max Heart Rate for Heart Disease

Now, let's try combining a couple of independent variables (**`age`** and **`thalach`**) and comparing them to our target variable, **heart disease**. Since there are so many different values for both `age` and `thalach`, we'll use a **scatter plot**.

==Create a scatter plot that compares `age`, `thalach`, and `target`. Add the necessary labels to the graph (consider the x and y labels specifically, and how to show whether or not someone has heart disease)==

> **What can we infer from this?**
> It seems the younger someone is, the higher their max heart rate (dots are higher on the left of the graph), and the older someone is, the more green dots there are. But this may be because there are more dots altogether on the right side of the graph (older participants).
>
> Both of these are observational of course, but this is what we're trying to do: build an understanding of the data.

Now, let's check the age **distribution**.

==Create a `plot` of a histogram that shows the distribution of the `age` column==

---

### Heart Disease Frequency per Chest Pain Type

Let's try another independent variable, this time **`cp` (chest pain)**. We'll use the same process as we did before with `sex`.

1. ==Find the crosstab between `cp` and `target`==
2. ==Create a crosstab bar plot between `cp` and `target`, including appropriate labels==

> **What can we infer from this?**
> Remember from our data dictionary what the different levels of chest pain are.
>
> ```
> cp - chest pain type
> * 0: Typical angina: chest pain related to decreased blood supply to the heart
> * 1: Atypical angina: chest pain not related to heart
> * 2: Non-anginal pain: typically esophageal spasms (non heart related)
> * 3: Asymptomatic: chest pain not showing signs of disease
> ```
>
> It's interesting that atypical angina (value 1) states it's not related to the heart, but it seems to have a higher ratio of participants with heart disease than not.
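A minimal sketch of how these comparisons might be coded is below. The colour choices are illustrative rather than prescribed (green for disease matches the "green dots" mentioned in the inference above), and the exact labels are up to you:

```python
# Labelled bar chart of heart disease frequency by sex.
pd.crosstab(df["target"], df["sex"]).plot(kind="bar",
                                          figsize=(10, 6),
                                          color=["salmon", "lightblue"])
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0)  # bonus: keep the x-axis labels upright
plt.show()

# Scatter plot of age vs. max heart rate, coloured by target value.
plt.figure(figsize=(10, 6))
plt.scatter(df["age"][df["target"] == 1], df["thalach"][df["target"] == 1], c="green")
plt.scatter(df["age"][df["target"] == 0], df["thalach"][df["target"] == 0], c="blue")
plt.title("Age vs Max Heart Rate for Heart Disease")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"])
plt.show()

# Histogram of the age distribution.
df["age"].plot(kind="hist")
plt.show()

# Crosstab and labelled bar plot for chest pain type vs. target.
print(pd.crosstab(df["cp"], df["target"]))
pd.crosstab(df["cp"], df["target"]).plot(kind="bar", figsize=(10, 6))
plt.title("Heart Disease Frequency per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Amount")
plt.legend(["No Disease", "Disease"])
plt.show()
```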
---

### Correlation Between Independent Variables

Finally, we'll compare all of the independent variables in one hit.

> **Why?**
> Because this may give us an idea of which independent variables may or may not have an impact on our target variable.

We can do this using `df.corr()`, which will create a [**correlation matrix**](https://www.statisticshowto.datasciencecentral.com/correlation-matrix/) for us, in other words, a big table of numbers telling us how related each variable is to the others.

==Run the `df.corr()` method to generate and show the correlation matrix of the various attributes in our dataframe==

Once completed, we then want to visualize the correlation between our independent variables using a **heatmap**.

==Generate a `heatmap()` with `seaborn` to visualize the correlation matrix from the data==

This will produce a heatmap where a higher positive value means a potential positive correlation (increase) and a higher negative value means a potential negative correlation (decrease).

## Save and continue next week

That's all we're going to look at this week.

==**Save your newly processed data and name it `yourname-processed-heart-data` as a `.csv` file.**==

Next week, we'll continue working with this dataset, so make sure you'll be able to find it again!
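A minimal sketch of these final steps (the `annot=True` and colormap settings are just one choice, and you should swap `yourname` for your own name):

```python
# Correlation matrix: how related each variable is to the others.
corr_matrix = df.corr()
print(corr_matrix)

# Visualize the correlation matrix as a seaborn heatmap.
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm")
plt.show()

# Save the processed dataframe to a .csv file for next week.
df.to_csv("yourname-processed-heart-data.csv", index=False)
```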