# EDA for beginners

**Due Date:** 2 April 2025
**Submission Link:** [Google form](https://forms.gle/34bdySwp8gM9r6gw6)
**Grading:** Each task has 10 points max

:::info
:bulb: **NOTE:** After each task you need to write a report.
:::

Example structure:

```
├── hw2.ipynb <- jupyter notebook with solutions
```

## :memo: Data Inspection & Cleaning (20 min)

- Load the "Titanic" dataset into a Pandas DataFrame.
- Display the first few rows and check the dataset’s structure (`df.info()`, `df.describe()`).
- Identify missing values and decide how to handle them (e.g., drop or fill missing values).
- Check for duplicate rows and remove them if necessary.
- Convert categorical variables into a suitable format if needed.

## :memo: Summary Statistics & Initial Insights (20 min)

- Calculate basic summary statistics (mean, median, standard deviation) for numerical columns.
- Identify correlations between numerical features (use `.corr()`).
- Group and analyze specific categories (e.g., average age of survivors vs. non-survivors).
- Answer a few analytical questions, such as:
  - What percentage of passengers survived?
  - How does survival rate vary by gender?
  - Is there a significant difference in survival rate based on passenger class?

## :memo: Data Visualization (20 min)

Create at least three visualizations to explore the data:

- A histogram or boxplot to show the age distribution.
- A bar chart to compare survival rates by gender.
- A scatter plot or heatmap to examine correlations between variables.

Use Matplotlib or Seaborn to generate the plots. For example:

```
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes df is the Titanic DataFrame loaded in the first task,
# with lowercase "sex" and "survived" columns.
sns.countplot(data=df, x="sex", hue="survived")
plt.title("Survival Count by Gender")
plt.show()
```
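
The inspection, cleaning, and summary-statistics tasks could be sketched in the same spirit. The snippet below is only a minimal illustration, not a reference solution: the small DataFrame is hypothetical toy data invented for the example (not real Titanic records), the lowercase column names (`survived`, `pclass`, `sex`, `age`) are assumed to match the plotting example, and filling missing ages with the median is just one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic dataset (NOT real records).
# The last row duplicates the first one, and one age is missing on purpose.
df = pd.DataFrame(
    {
        "survived": [0, 1, 1, 0, 1, 0, 0],
        "pclass": [3, 1, 3, 3, 1, 2, 3],
        "sex": ["male", "female", "female", "male", "female", "male", "male"],
        "age": [22.0, 38.0, 26.0, None, 35.0, 54.0, 22.0],
    }
)

# --- Data inspection ---
print(df.head())          # first few rows
df.info()                 # column dtypes and non-null counts
print(df.describe())      # basic statistics for numerical columns
print(df.isna().sum())    # missing values per column

# --- Cleaning ---
# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)
# One possible strategy (an assumption, not the only option):
# fill missing ages with the median age.
df["age"] = df["age"].fillna(df["age"].median())
# Store the text column as a categorical variable.
df["sex"] = df["sex"].astype("category")

# --- Summary statistics ---
age_stats = df["age"].agg(["mean", "median", "std"])
correlations = df[["survived", "pclass", "age"]].corr()

# --- Analytical questions ---
survival_pct = df["survived"].mean() * 100
survival_by_sex = df.groupby("sex", observed=True)["survived"].mean()
survival_by_class = df.groupby("pclass")["survived"].mean()
mean_age_by_outcome = df.groupby("survived")["age"].mean()

print(age_stats)
print(correlations)
print(f"Survived: {survival_pct:.1f}%")
print(survival_by_sex)
print(survival_by_class)
print(mean_age_by_outcome)
```

With the real dataset, the same calls apply once `df` is loaded from your own file; whether to drop or fill missing values, and why, is the kind of decision worth explaining in the report.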