# Topic 1: Notebooks and Data Basics
This topic combines Python notebook setup with essential data importing and exploration using pandas.
## Table of Contents
- [Topic 1: Notebooks and Data Basics](#topic-1-notebooks-and-data-basics)
- [Table of Contents](#table-of-contents)
- [Expected Knowledge](#expected-knowledge)
- [Learning Resources](#learning-resources)
- [Python Installation](#python-installation)
- [Python Notebooks](#python-notebooks)
- [Data Handling with Pandas](#data-handling-with-pandas)
- [Terms](#terms)
- [Material Summary](#material-summary)
- [Environment Setup](#environment-setup)
- [Data Fundamentals with Pandas](#data-fundamentals-with-pandas)
- [Documentation and Best Practices](#documentation-and-best-practices)
- [Case Study: Complete Data Pipeline](#case-study-complete-data-pipeline)
- [Task Requirements](#task-requirements)
- [Expected Deliverables](#expected-deliverables)
- [Assessment Criteria](#assessment-criteria)
## Expected Knowledge
After this submodule, the student should be able to:
- Set up Python notebooks in VS Code or Google Colab
- Import data using pandas
- Use basic pandas operations (head, shape, info, describe)
- Create simple visualizations
## Learning Resources
### Python Installation
- Highly Suggested: [Python Installation for Windows](https://youtu.be/TNAu6DvB9Ng?si=cV-NjA3YsUMC8MSg)
- Python 3.12 is suggested because the major data science libraries are well supported on it; the very latest Python release can lag behind library compatibility.
### Python Notebooks
- Recommended: [Python Notebooks in VS Code](https://www.youtube.com/watch?v=DA6ZAHBPF1U)
- Optional: [Getting Started with Google Colab](https://www.youtube.com/watch?v=inN8seMm7UI)
### Data Handling with Pandas
- Optional (In-depth): [Python for Data Science - Cognitive Class](https://cognitiveclass.ai/courses/python-for-data-science)
- [Pandas 10 minutes tutorial](https://pandas.pydata.org/docs/user_guide/10min.html)
- [Basic Data Exploration with Pandas](https://www.youtube.com/watch?v=xi0vhXFPegw)
## Terms
- **Jupyter Notebook**: Cell-based interactive environment for running code in chunks
- **DataFrame**: 2D labeled data structure in pandas, like a spreadsheet (see the short example after this list)
- **Pandas**: Python library for data manipulation and analysis
- **Missing Values (NaN)**: Absent or undefined data points
- **CSV**: Comma-Separated Values file format for tabular data
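A tiny, made-up DataFrame makes the last few terms concrete (the names and values here are invented for illustration):
```python
import numpy as np
import pandas as pd

# A small made-up table: labeled columns and rows, like a spreadsheet
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Cara"],
    "age": [34, np.nan, 29],  # np.nan is how pandas represents a missing value
})
print(df)

# Writing it out round-trips through the CSV format
df.to_csv("people.csv", index=False)
```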
## Material Summary
This module covers the foundational skills needed for data science work:
### Environment Setup
**Python Installation Process:**
1. Download Python 3.12 from the official site ([direct link to the 3.12.7 Windows 64-bit installer](https://www.python.org/ftp/python/3.12.7/python-3.12.7-amd64.exe))
2. Run the installer with "Add Python to PATH" option checked
3. Verify installation using command prompt: `python --version`
4. Confirm that the pip package manager is available (it is bundled with modern Python installers)
5. Test pip functionality: `pip --version`
**Notebook Environment Configuration:**
1. **VS Code Setup:**
- Install VS Code from official website
- Install Python extension from VS Code marketplace
- Install Jupyter extension for notebook support
- Create new `.ipynb` file to start working
- Configure Python interpreter path
- Install the necessary libraries using pip:
```bash
pip install pandas numpy matplotlib seaborn jupyter
```
2. **Google Colab Alternative:**
- Access https://colab.research.google.com
- Sign in with Google account
- Create new notebook from interface
- Understand runtime limitations and GPU access
- Learn file upload and download procedures (a sketch follows this list)
- Install additional libraries using `!pip install package_name` in code cells
- For example: `!pip install pandas numpy matplotlib seaborn` (Colab preinstalls these particular libraries, so this is mainly needed for packages outside the default image)
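For the upload/download item above, Colab provides a small helper module; a minimal sketch (the filename is a placeholder):
```python
# These helpers exist only inside Google Colab
from google.colab import files

# Opens a browser file picker; returns {filename: file contents}
uploaded = files.upload()

# Sends a file from the Colab runtime to your local downloads
files.download('results.csv')  # 'results.csv' is a placeholder name
```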
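Whichever environment you choose, a quick sanity check confirms that the core libraries import and records their versions, which is useful for the compatibility issues noted later in this module:
```python
import sys

import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns

# Versions help when debugging compatibility problems later
print(f"Python:     {sys.version.split()[0]}")
print(f"pandas:     {pd.__version__}")
print(f"numpy:      {np.__version__}")
print(f"matplotlib: {matplotlib.__version__}")
print(f"seaborn:    {sns.__version__}")
```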
### Data Fundamentals with Pandas
**Data Loading Workflow:**
1. Import necessary libraries:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```
2. Load CSV files:
```python
# Basic CSV loading
df = pd.read_csv('filename.csv')
```
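Real-world files often need a bit more than the defaults; a sketch of commonly used `read_csv` options (the filename and token values are placeholders to adapt):
```python
# All parameters below are standard pandas options
df = pd.read_csv(
    'filename.csv',
    sep=',',                 # column delimiter; e.g. ';' or '\t' in some exports
    encoding='utf-8',        # try 'latin-1' if special characters raise errors
    na_values=['?', 'N/A'],  # extra tokens to interpret as missing (NaN)
)
```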
**Essential Data Exploration Steps:**
1. **Initial Data Inspection:**
```python
# View first 5 rows
print(df.head())
# View last 5 rows
print(df.tail())
# Get dimensions (rows, columns)
print(f"Dataset shape: {df.shape}")
# List all column names
print(f"Columns: {df.columns.tolist()}")
```
2. **Data Quality Assessment:**
```python
# Check data types and non-null counts
print(df.info())
# Count missing values per column
print(df.isnull().sum())
# Check percentage of missing values
print((df.isnull().sum() / len(df)) * 100)
# Examine data types in detail
print(df.dtypes)
```
3. **Statistical Overview:**
```python
# Generate statistical summary for numerical columns
print(df.describe())
# Frequency of each distinct value in a specific column
print(df['column_name'].value_counts())
# Count unique values per column
print(df.nunique())
# Correlation matrix (numeric_only skips text columns; required in pandas 2.x)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
# Basic visualization
df.hist(figsize=(12, 8))
plt.show()
```
4. **Data Filtering and Selection:**
```python
# Select specific columns
subset = df[['column1', 'column2']]
# Filter rows based on condition
filtered_df = df[df['age'] > 30]
# Multiple conditions
filtered_df = df[(df['age'] > 30) & (df['survived'] == 1)]
# Check unique values
print(df['column_name'].unique())
```
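The quality checks in step 2 usually surface missing values; a sketch of the two standard remedies, dropping and filling (column names are placeholders):
```python
# Dropping rows with any NaN is simple but can discard a lot of data
df_dropped = df.dropna()

# Filling is usually gentler: median for numeric, mode for categorical
df['age'] = df['age'].fillna(df['age'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])
```
The correlation matrix from step 3 is also easier to read as a heatmap; a sketch using seaborn, which was installed during setup:
```python
import matplotlib.pyplot as plt
import seaborn as sns

# annot=True prints the coefficient inside each cell
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation between numeric columns')
plt.show()
```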
### Documentation and Best Practices
**Notebook Organization:**
1. Use markdown cells for explanations and section headers
2. Add comments to code cells explaining logic
3. Use descriptive variable names
4. Organize code in logical sections
5. Include conclusions and insights in markdown
**Troubleshooting Common Issues:**
1. File path problems and working directory confusion (see the sketch after this list)
2. Missing library installations and import errors
3. Data encoding issues with special characters
4. Memory limitations with large datasets
5. Version compatibility between pandas and Python
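A defensive loading pattern touches issues 1, 3, and 4 above; the path and fallback encoding here are assumptions to adapt:
```python
from pathlib import Path

import pandas as pd

csv_path = Path('data') / 'filename.csv'  # hypothetical location

# Issue 1: print where the notebook is actually running from
print(f"Working directory: {Path.cwd()}")
if not csv_path.exists():
    raise FileNotFoundError(f"{csv_path} not found - check the working directory")

# Issue 3: fall back to a permissive encoding if UTF-8 decoding fails
try:
    df = pd.read_csv(csv_path, encoding='utf-8')
except UnicodeDecodeError:
    df = pd.read_csv(csv_path, encoding='latin-1')

# Issue 4: for files too large for memory, process in chunks instead
# for chunk in pd.read_csv(csv_path, chunksize=100_000): ...
```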
This foundation prepares students for more advanced machine learning concepts in subsequent modules.
## Case Study: Complete Data Pipeline
### Task Requirements
**Setup:**
1. Choose VS Code or Google Colab for your environment
2. Create a new Python notebook
**Part A: Data Exploration with Titanic Dataset:**
1. Download the Titanic dataset (`train.csv`) from Kaggle: https://www.kaggle.com/c/titanic/data
2. Load the data using pandas
3. Explore the dataset structure, as sketched below:
- Use `df.head()` to view first 5 rows
- Use `df.shape` to see dimensions
- Use `df.info()` to check data types and missing values
- Use `df.describe()` for statistical summary
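A minimal starter sketch for Part A, assuming Kaggle's `train.csv` sits next to the notebook:
```python
import pandas as pd

# Load the training split downloaded from Kaggle
df = pd.read_csv('train.csv')

print(df.head())             # first 5 rows
print(f"Shape: {df.shape}")  # (rows, columns)
df.info()                    # dtypes and non-null counts (prints directly)
print(df.describe())         # summary statistics for numeric columns

# One finding worth documenting: survival rate by sex
print(df.groupby('Sex')['Survived'].mean())
```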
**Documentation:**
- Use markdown cells to document each step
- Explain what each operation reveals about the data
- Document preprocessing decisions and their rationale
### Expected Deliverables
- Working Python notebook
- Screenshots showing successful notebook setup
- Summary of data exploration findings
## Assessment Criteria
Students should demonstrate:
- [ ] Successful notebook environment setup
- [ ] Correct data import using pandas
- [ ] Proper use of pandas exploration methods (head, shape, info, describe)
- [ ] Basic visualization creation