Data Science (Python)

- Extensive Libraries and Ecosystem:
    - Python has a rich ecosystem of libraries and tools for data science, including NumPy, pandas, Matplotlib, Seaborn, Scikit-learn, TensorFlow, PyTorch, and more.
- Data Visualization:
    - Python provides mature libraries for data visualization, such as Matplotlib and Seaborn. Visualizing data is a crucial part of data science: it helps you see patterns, trends, and insights in your datasets.
- Integration with Big Data Technologies:
    - Python integrates with big data technologies such as Apache Spark (through PySpark), which matters for large-scale data processing and analytics.

![Screenshot 2023-11-28 9.51.08 AM](https://hackmd.io/_uploads/rkRcDTGrT.png)

---

![img_v2_cb8d75e4-f22a-4b2f-af92-34c394e4d5bh](https://hackmd.io/_uploads/S1WwXEXS6.png)

---

![image](https://hackmd.io/_uploads/Syya7VXr6.png)
![image](https://hackmd.io/_uploads/Sy_T7EmH6.png)

---

### Print "Hello World"

```python=
print("Hello World")
```

---

### Variable declaration

```python=
# General form: variable_name = value
age = 25
name = "Alice"
```

---

### Operators

#### Arithmetic Operators

```python=
# Addition
result_add = 5 + 3  # result_add is 8

# Subtraction
result_sub = 7 - 2  # result_sub is 5

# Multiplication
result_mul = 4 * 6  # result_mul is 24

# Division
result_div = 15 / 3  # result_div is 5.0 (float)

# Modulus (remainder)
result_mod = 17 % 5  # result_mod is 2

# Exponentiation
result_exp = 2 ** 3  # result_exp is 8
```

#### Comparison Operators

```python=
# Equal to
is_equal = (10 == 10)  # is_equal is True

# Not equal to
not_equal = (5 != 3)  # not_equal is True

# Greater than
greater_than = (8 > 5)  # greater_than is True

# Less than
less_than = (4 < 7)  # less_than is True

# Greater than or equal to
greater_equal = (6 >= 6)  # greater_equal is True

# Less than or equal to
less_equal = (9 <= 9)  # less_equal is True
```

#### Logical Operators

```python=
# Logical AND
logical_and = True and False  # logical_and is False

# Logical OR
logical_or = True or False  # logical_or is True
# Logical NOT
logical_not = not True  # logical_not is False
```

---

### Data Types

#### Numbers

```python=
# Integer
num_int = 10

# Float
num_float = 3.14
```

#### Strings

```python=
# Single line string
single_line_str = "Hello, World!"

# Multi-line string
multi_line_str = """This is a
multi-line string."""
```

---

### Conditions and Loops

#### if-else

```python=
age = 18

if age >= 18:
    print("You are an adult.")
elif age < 16:
    print("FBI is coming")
else:
    print("You are a minor.")
```

#### For Loop

```python=
numbers = [1, 2, 3, 4, 5]

for num in numbers:
    print(num)
```

#### while loop

```python=
count = 0

while count < 5:
    print(count)
    count += 1
```

---

### Data Structures

#### Lists

```python=
# List declaration
fruits = ['apple', 'banana', 'orange']

# Accessing list items
first_fruit = fruits[0]   # first_fruit is 'apple'
last_fruit = fruits[-1]   # last_fruit is 'orange'

# Modifying list items
fruits[0] = 'grape'  # Now, fruits is ['grape', 'banana', 'orange']

# Adding items to a list
fruits.append('kiwi')      # Now, fruits is ['grape', 'banana', 'orange', 'kiwi']
fruits.insert(1, 'melon')  # Now, fruits is ['grape', 'melon', 'banana', 'orange', 'kiwi']

# Removing items from a list
fruits.remove('banana')       # Now, fruits is ['grape', 'melon', 'orange', 'kiwi']
removed_item = fruits.pop(2)  # Now, fruits is ['grape', 'melon', 'kiwi'], and removed_item is 'orange'

# List slicing
sliced_list = fruits[1:3]        # sliced_list is ['melon', 'kiwi']
every_second_item = fruits[::2]  # every_second_item is ['grape', 'kiwi']

# Count rows and columns, returned as a list.
# Note: .shape belongs to pandas DataFrames and NumPy arrays, not to plain lists.
# Assuming `players` is a DataFrame, inside a function you would write:
#     return list(players.shape)
```

#### Tuples

- Immutable

```python=
# Tuple declaration
fruit_tuple = ("apple", "banana", "orange", "grape")

# Accessing tuple items
first_fruit = fruit_tuple[0]   # first_fruit is 'apple'
last_fruit = fruit_tuple[-1]   # last_fruit is 'grape'

# Tuple slicing
sliced_fruits = fruit_tuple[1:3]  # sliced_fruits is ('banana', 'orange')
```

#### Sets

```python=
# Set declaration
fruits_set1 = {"apple", "banana", "orange"}
fruits_set2 = {"orange", "grape", "kiwi"}

# Adding items to a set
fruits_set1.add("strawberry")

# Removing items from a set
fruits_set2.remove("kiwi")

# Set operations
union_set = fruits_set1.union(fruits_set2)                # Union of sets
intersection_set = fruits_set1.intersection(fruits_set2)  # Intersection of sets
difference_set = fruits_set1.difference(fruits_set2)      # Set difference

print("Union Set:", union_set)
print("Intersection Set:", intersection_set)
print("Difference Set:", difference_set)
```

#### Dictionaries

```python=
# Dictionary declaration
person = {"name": "John", "age": 25, "city": "New York"}

# Accessing dictionary values
name = person["name"]  # name is 'John'

# Modifying dictionary values
person["age"] = 26  # Now, person is {"name": "John", "age": 26, "city": "New York"}

# Adding items to a dictionary
person["gender"] = "Male"  # Now, person is {"name": "John", "age": 26, "city": "New York", "gender": "Male"}

# Removing items from a dictionary
del person["city"]  # Now, person is {"name": "John", "age": 26, "gender": "Male"}

# Getting keys and values from a dictionary
keys = person.keys()      # keys is dict_keys(['name', 'age', 'gender'])
values = person.values()  # values is dict_values(['John', 26, 'Male'])
```

#### List Comprehension

```python=
# Creating a new list from an existing list
numbers = [1, 2, 3, 4, 5]
squared_numbers = [x**2 for x in numbers]

# Conditionally creating a new list from an existing list
even_numbers = [x for x in numbers if x % 2 == 0]

print("Original List:", numbers)
print("Squared Numbers:", squared_numbers)
print("Even Numbers:", even_numbers)
```

#### Dictionary Comprehension

```python=
# Creating a new dictionary from an existing list
numbers = [1, 2, 3, 4, 5]
squared_numbers_dict = {x: x**2 for x in numbers}

# Conditionally creating a new dictionary from an existing dictionary
original_dict = {'a': 1, 'b': 2, 'c': 3}
even_numbers_dict = {key: value for key, value in original_dict.items() if value % 2 == 0}
print("Original Dictionary:", original_dict)
print("Squared Numbers Dictionary:", squared_numbers_dict)
print("Even Numbers Dictionary:", even_numbers_dict)
```

---

### Functions

```python=
# Function definition
def greet(name):
    return f"Hello, {name}!"

# Function call
greeting = greet("Alice")  # greeting is "Hello, Alice!"
```

---

### Error Handling

```python=
# try-except block
try:
    result = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero.")

# try-except-else block: else runs only when no exception was raised
try:
    number = int("42")
except ValueError:
    print("Not a valid integer.")
else:
    print("Parsed:", number)

# try-except-finally block: finally runs whether or not an exception was raised
try:
    value = {"a": 1}["b"]
except KeyError:
    print("Key not found.")
finally:
    print("Cleanup runs either way.")
```

---

## ==Modules and Packages==

```python=
# Importing a module: import module_name
import math

# Importing a specific function from a module: from module_name import function_name
from math import sqrt

# Importing all public names from a module: from module_name import *
from math import *

# Importing a package: import package_name
import urllib

# Importing a specific module from a package: from package_name import module_name
from urllib import parse

# Importing a specific function from a module in a package
from urllib.parse import urljoin
```

---

### ==File Handling==

```python=
# Opening a file: open("filename", "mode"), where mode is e.g. "r", "w" or "a"
file = open("example.txt", "w")

# Writing to a file
file.write("text")

# Closing a file
file.close()

# Reading from a file
file = open("example.txt", "r")
file_contents = file.read()
file.close()
```

---

### ==Virtual Environments==

```bash=
# Creating a virtual environment
python -m venv virtual_environment_name

# Activating a virtual environment (Linux/macOS)
source virtual_environment_name/bin/activate

# Installing packages in a virtual environment
pip install package_name

# Deactivating a virtual environment
deactivate
```

---

### ==Context Managers==

```python=
# Context manager class
class ContextManagerClass:
    def __enter__(self):
        # Set-up code: runs when the with block is entered
        print("Entering")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Clean-up code: runs when the with block exits, even after an error
        print("Exiting")
        return False  # do not suppress exceptions

# Using a context manager with the 'with' statement
with ContextManagerClass() as value:
    print("Inside the with block")
```

---

### Classes and Objects

```python=
# Class declaration
class ClassName:
    def __init__(self, parameter1, parameter2):
        self.parameter1 = parameter1
        self.parameter2 = parameter2

    def method_name(self):
        return f"{self.parameter1}, {self.parameter2}"

# Object creation
object_name = ClassName("argument1", "argument2")

# Accessing object properties
property_value = object_name.parameter1

# Calling object methods
object_name.method_name()
```

---

## Inheritance

```python=
# Parent class
class ParentClass:
    def parent_method(self):
        print("Method defined in the parent")

# Child class
class ChildClass(ParentClass):
    def child_method(self):
        print("Method defined in the child")

# Object creation
object_name = ChildClass()

# Accessing inherited methods
object_name.parent_method()
```

---

## Polymorphism

```python=
# Parent class
class ParentClass:
    def polymorphic_method(self):
        print("Parent version")

# Child class 1
class ChildClass1(ParentClass):
    def polymorphic_method(self):
        print("ChildClass1 version")

# Child class 2
class ChildClass2(ParentClass):
    def polymorphic_method(self):
        print("ChildClass2 version")

# Object creation
object1 = ChildClass1()
object2 = ChildClass2()

# Polymorphic method calls: same method name, different behavior
object1.polymorphic_method()
object2.polymorphic_method()
```

---

## Lambda Functions

```python=
# Lambda function declaration: lambda parameters: expression
lambda_function = lambda p1, p2: p1 + p2

# Lambda function call
result = lambda_function(2, 3)  # result is 5
```

---

## Map, Filter, and Reduce

```python=
numbers = [1, 2, 3, 4]

# Map function: map() returns an iterator, so wrap it in list()
doubled = list(map(lambda x: x * 2, numbers))  # doubled is [2, 4, 6, 8]

# Filter function: keeps the items for which the function returns True
evens = list(filter(lambda x: x % 2 == 0, numbers))  # evens is [2, 4]

# Reduce function
from functools import reduce
total = reduce(lambda a, b: a + b, numbers)  # total is 10
```

---

## Decorators

```python=
# Decorator function
def decorator_function(original_function):
    def wrapper_function(*args, **kwargs):
        # Code before the original function
        print(f"Calling {original_function.__name__}")
        result = original_function(*args, **kwargs)
        # Code after the original function
        print(f"Finished {original_function.__name__}")
        return result
    return wrapper_function

# Applying a decorator to a function
@decorator_function
def original_function(*args, **kwargs):
    return sum(args)
```

---

## Generators

```python=
# Generator function
def generator_function():
    for i in range(10):
        yield i

# Using a generator
for value in generator_function():
    print(value)
```

---

### Threading and Multiprocessing

```python=
import threading
import multiprocessing

def function_name(argument1, argument2):
    # Placeholder task: replace with real work
    print(argument1, argument2)

if __name__ == "__main__":
    # Threading: runs the function in a new thread of the same process
    thread = threading.Thread(target=function_name, args=("thread", 1))
    thread.start()
    thread.join()  # wait for the thread to finish

    # Multiprocessing: runs the function in a separate process
    process = multiprocessing.Process(target=function_name, args=("process", 2))
    process.start()
    process.join()  # wait for the process to finish
```

---

## We will use Google Colab to learn Python

- [Google Colab](https://colab.google)
- [Kaggle Titanic dataset](https://www.kaggle.com/datasets/mahmoudshogaa/titanic-dataset)

---

![Numpy](https://numpy.org/doc/stable/_static/numpylogo.svg)

### Numpy

- Creating and editing multidimensional arrays
- 1D, 2D, 3D arrays
- Arithmetic operations
- Array comparison
- Array manipulation
- Array operations: NumPy provides the ndarray object for efficient array operations and mathematical computation.
- Mathematical functions: a rich set of built-in functions for linear algebra, statistics, Fourier transforms and more, all of which work on multidimensional arrays.
- Random number generation: functions for generating random numbers, useful for simulations and experiments.
- Array indexing and slicing: flexible indexing and slicing make it easy to extract data.
- Broadcasting: operations between arrays of different shapes, which improves both readability and performance.

![Numpy](https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41586-020-2649-2/MediaObjects/41586_2020_2649_Fig1_HTML.png?as=webp)

```python=
# Install NumPy (shell command): pip install numpy

# Create an array with NumPy
import numpy as np
arr = np.array([1, 2, 3, 4, 5])

# Array arithmetic (element-wise)
arr_squared = arr ** 2

# Linear algebra
matrix = np.array([[1, 2], [3, 4]])
det = np.linalg.det(matrix)

# Random number generation
random_array = np.random.rand(5)

# Array indexing and slicing
subset = arr[1:4]
```

[numpy datacamp](https://www.datacamp.com/cheat-sheet/numpy-cheat-sheet-data-analysis-in-python)

[numpy cheat sheet](https://images.datacamp.com/image/upload/v1676302459/Marketing/Blog/Numpy_Cheat_Sheet.pdf)

---

![pandas](https://upload.wikimedia.org/wikipedia/commons/thumb/e/ed/Pandas_logo.svg/512px-Pandas_logo.svg.png)

### Pandas

- Creating and editing series & dataframes
- Series and dataframe structure
- Data sorting
- Applying functions
- Dataframe manipulation
- Data structures: pandas provides two main data structures, Series (one-dimensional) and DataFrame (two-dimensional), for working with tabular data.
- Data cleaning: handles missing data, merges datasets and removes duplicates, leaving the data cleaner.
- Data selection and filtering: intuitive ways to select and filter data so you can quickly extract what you need.
- Statistical analysis: many built-in statistical functions for quickly producing summary statistics and plots.
- Data input and output: reads data from many kinds of sources and saves processed data back to files.

```python=
# Install pandas (shell command): pip install pandas

# Create a DataFrame with pandas
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)

# Data selection and filtering
subset = df[df['Age'] > 30]

# Statistical analysis
mean_age = df['Age'].mean()

# Data input and output
df.to_csv('data.csv', index=False)
```

[pandas datacamp](https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python)

[pandas cheat sheet](https://images.datacamp.com/image/upload/v1676302204/Marketing/Blog/Pandas_Cheat_Sheet.pdf)

---

1. The iloc method: iloc selects data by ==integer position==, that is, it indexes by where the data sits. It uses integer indices to locate rows and columns. The syntax is `df.iloc[row_index, column_index]`, where row_index is the integer position of the row and may be a single integer, a slice, or a list of integers; column_index is the integer position of the column and may likewise be a single integer, a slice, or a list of integers. Example:

```python=
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Select the element in the second row, third column
value = df.iloc[1, 2]  # value is 8
```

2. The loc method: loc selects data by ==label==, that is, it indexes by the labels of the ==rows and columns==. The syntax is `df.loc[row_label, column_label]`, where row_label is the row label and may be a single label, a list of labels, or a label slice; column_label is the column label and may likewise be a single label, a list of labels, or a label slice. Example:

```python=
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])

# Select the rows labelled 'row2' and 'row3' and the columns labelled 'B' and 'C'
subset = df.loc[['row2', 'row3'], ['B', 'C']]
```

Summary: iloc indexes by integer position and suits ==position==-based selection. loc indexes by label and suits ==label==-based selection.

---

## Importing and Exporting

### 1. Reading Data from Internal Files

- a. Reading CSV Files:

```python
import pandas as pd

# Read CSV file
data = pd.read_csv('filename.csv')

# Display the DataFrame
print(data)
```

- b. Reading Excel Files:

```python
import pandas as pd

# Read Excel file
data = pd.read_excel('filename.xlsx', sheet_name='Sheet1')

# Display the DataFrame
print(data)
```

- c. Reading Text Files:

```python=
# Read text file
with open('filename.txt', 'r') as file:
    data = file.read()

# Display the data
print(data)
```

### 2. Reading Data from APIs

* .csv is a common file format for transferring datasets
* Reading data from internal files
* Reading data from APIs

- a. Using requests Library for JSON APIs:

```python=
import requests
import json

# Make a GET request to the API
response = requests.get('https://api.example.com/data')

# Parse JSON data
data = json.loads(response.text)

# Display the data
print(data)
```

- b. Using pandas for API Responses:

```python=
import pandas as pd

# Make a GET request to the API
url = 'https://api.example.com/data'
data = pd.read_json(url)

# Display the DataFrame
print(data)
```

---

## Sorting Data & Applying Functions

- Sorting data in a series or dataframe
- Applying functions to elements stored in a series or dataframe

---

![Plotly](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRxI7EP7lq_j638sTude2jZ7p-ekNpPP_zpGrjnv5KNf9NCFxLJmVVwXnCwOqnDnT6ladA&usqp=CAU)

### Plotly

- Interactive data visualizations
- Basic charts (e.g. Line plots)
- Statistical charts (e.g. Histograms)
- Maps (e.g. Bubble map)
- 3D charts
- Interactive data visualization: Plotly is a library for building interactive visualizations and supports line charts, scatter plots, bar charts and many other chart types.
- Online sharing and embedding: Plotly charts can be shared on Plotly's online platform and embedded in web pages to present data dynamically.
- Rich chart settings: extensive configuration options for custom layouts, themes, labels and more.
- Multiple programming languages: besides Python, Plotly also supports R, JavaScript and other languages.

```python=
# Install Plotly (shell command): pip install plotly

# Draw a scatter plot with Plotly
import plotly.express as px

# Load a sample DataFrame
data = px.data.iris()

# Draw the scatter plot
fig = px.scatter(data, x='sepal_width', y='sepal_length', color='species', size='petal_length')

# Show the chart
fig.show()
```

---

![scipy](https://studyopedia.com/wp-content/uploads/2023/07/scipy.png)

### SciPy

- Advanced mathematical / statistical functions
- Shape manipulation
- Vectorization
- Matrix operations
- Scientific computing toolkit: SciPy is built on top of NumPy and provides many efficient, convenient mathematical functions.
- Optimization and linear algebra: SciPy includes functions for optimization, linear algebra, integration and many other mathematical operations, giving scientists and engineers a broad toolbox.
- Statistical analysis: provides statistical functions (probability distributions, statistical tests) that support statistical analysis and can serve as building blocks in machine learning workflows.
- Signal and image processing: provides functions for signal processing, image processing and computer vision tasks.
- Experimental design and optimization: SciPy offers tools for experimental design and optimization problems, applicable to laboratory research and engineering.

```python=
# Install SciPy (shell command): pip install scipy

# Use SciPy's optimization module
from scipy.optimize import minimize

# Define the objective function to minimize
def objective_function(x):
    return x**2 + 4*x + 4

# Minimize the objective function, starting from x = 0
result = minimize(objective_function, 0)

# Show the result
print(result)
```

---

# smtplib

The smtplib module is a Python tool that implements an SMTP (Simple Mail Transfer Protocol) client session, letting you send mail to any mail server that supports SMTP or ESMTP. A brief overview of the module:

1. SMTP class (smtplib.SMTP):
    - Creates an SMTP connection and supports SMTP and ESMTP operations.
    - Provides connect(), helo(), ehlo() and related methods for connecting and identifying the client to the server.
    - Provides the sendmail() method for sending mail.
2. SMTP_SSL class (smtplib.SMTP_SSL):
    - Similar to the SMTP class, but the connection is encrypted with SSL.
    - Supports SSL directly from the start of the connection, for cases where starttls() is not appropriate.
    - Supported authentication mechanisms include CRAM-MD5, PLAIN and LOGIN.
3. LMTP class (smtplib.LMTP):
    - Supports LMTP (Local Mail Transfer Protocol) connections.
    - Supports both Unix sockets and regular host:port servers.
4. Exceptions:
    - SMTPException is the base class of all the other exceptions.
    - SMTPServerDisconnected is raised when the server disconnects unexpectedly.
    - SMTPResponseException is the base class of all exceptions that carry an SMTP error code.
5. SMTP methods:
    - set_debuglevel() sets the debug output level.
    - connect() connects to the given host and port.
    - helo() and ehlo() identify the client to the server (authentication itself is done with login()).
    - sendmail() sends mail.
    - starttls() switches the SMTP connection to TLS (Transport Layer Security) mode.

```python=
import smtplib

# Set the mail parameters
from_addr = "your_email@gmail.com"
to_addrs = ["recipient1@example.com", "recipient2@example.com"]
subject = "Test Email"
body = "This is a test email from Python."

# Connect to the SMTP server
with smtplib.SMTP("smtp.gmail.com", 587) as server:
    # Enable TLS mode
    server.starttls()

    # Log in to the mail server
    server.login("your_email@gmail.com", "your_password")

    # Build the message
    msg = f"Subject: {subject}\n\n{body}"

    # Send the mail
    server.sendmail(from_addr, to_addrs, msg)

# The connection is closed automatically once the with block ends
```

---

### Working with Spring Boot

```python=
# python_script.py
import requests

def get_user_by_id(user_id):
    api_url = f"http://localhost:8080/api/users/{user_id}"
    response = requests.get(api_url)

    if response.status_code == 200:
        user_data = response.json()
        return user_data
    else:
        return None

# Usage example
user_id = 1
user_data = get_user_by_id(user_id)
print(user_data)
```

---

# Data Manipulation & Visualizations

## Data Visualization (Cont.)

- Common applications of data visualization (Plotly)
    - Multiple Y-Axes
    - Box Plot
    - Log Plot
    - Heat Map
    - Time Series
    - Gantt Chart

## Data Visualization (Cont.)

Adding custom controls to plots

- Sliders
- Timeframe
- Restyling

## Apply data visualizations

## Data Cleaning

Making imported data more usable for analysis

* Restructuring the dataset
* Resolving anomalies
* Resolving null data
* Extrapolating missing data

## Common Biases & Fallacies

The common pitfalls during data analysis

- Regression Toward the Mean
- Correlation ≠ Causation
- Sampling Bias
- Quantitative Fallacy

Apply data cleaning prior to visualization

## Regression

Finding the correlation between variables.

- Direct correlation
- Weak/no correlation
- Inverse correlation

## Modelling Fundamentals

What is the purpose of creating data models?
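
One way to make the regression and modelling ideas above concrete: a model can be as small as a straight line fitted to two variables. The sketch below uses only the standard library and made-up sample data (`hours`, `scores`, `pearson_r` and `fit_line` are illustrative names, not part of any dataset or library used in this course). It computes the Pearson correlation coefficient and a least-squares line, then uses the line to predict a value that was not in the data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sample data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = pearson_r(hours, scores)
slope, intercept = fit_line(hours, scores)

# Use the fitted line as a model to predict the score for 6 hours
predicted = slope * 6 + intercept

print(f"r = {r:.3f}")
print(f"score = {slope:.2f} * hours + {intercept:.2f}")
print(f"predicted score for 6 hours: {predicted:.1f}")
```

A value of `r` near +1 corresponds to a direct correlation, near 0 to weak or no correlation, and near -1 to an inverse correlation. In practice the same quantities are usually obtained from NumPy or pandas rather than written by hand.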
## Machine Learning 101

Basic concepts about building machine learning models:

- Supervised vs unsupervised
- Training data & test data
- Underfitting and overfitting
- Model refinement

## Classification

Using provided data to train a model, and then classify unknown entries based on other features

## Clustering

Using available data to identify the “appropriate” label for the dataset.

## Web Scraping

Using BeautifulSoup to scrape data directly from web pages, without needing to download a dataset made available by third-party providers.

- Example 1: Basic Web Scraping with BeautifulSoup:

```python=
# Import necessary libraries
from bs4 import BeautifulSoup
import requests

# Specify the URL of the website you want to scrape
url = 'https://example.com'

# Make a GET request to the website
response = requests.get(url)

# Parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')

# Extract specific information from the page
title = soup.title.text
paragraphs = soup.find_all('p')

# Print the results
print(f'Title: {title}')
for i, paragraph in enumerate(paragraphs, 1):
    print(f'Paragraph {i}: {paragraph.text}')
```

- Example 2: Scraping Images from a Website:

```python=
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://example.com/images'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find all image tags
img_tags = soup.find_all('img')

# Download and save each image
for i, img_tag in enumerate(img_tags, 1):
    src = img_tag.get('src')
    if not src:  # skip <img> tags without a src attribute
        continue
    img_url = urljoin(url, src)
    img_data = requests.get(img_url).content
    with open(f'img_{i}.jpg', 'wb') as f:
        f.write(img_data)
```

- Example 3: Scraping Data from a Table:

```python=
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/table-data'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table on the page
table = soup.find('table')

# Extract data from the table
for row in table.find_all('tr'):
    columns = row.find_all('td')
    if columns:  # Check if it's not a header row
        data = [column.text.strip() for column in columns]
        print(data)
```
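
To try the row-by-row extraction from Example 3 without a network connection or third-party packages, here is a rough offline sketch. It swaps in the standard library's `html.parser` for BeautifulSoup, so it is a different tool applied to the same idea; `TableRowParser` and `sample_html` are made up for illustration.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td> cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # header rows contain only <th> cells, so they stay empty
                self.rows.append(self._row)
            self._row = None


# Hypothetical HTML standing in for a downloaded page
sample_html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>25</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(sample_html)

for row in parser.rows:
    print(row)
```

As in Example 3, the header row is skipped because it has no `<td>` cells. For real pages BeautifulSoup is usually the more forgiving choice; this version only shows the logic in a form that runs anywhere.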