# Lab 0-1
[TOC]
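Path: `/mlsec/frauddetect/logistic-regression-fraud-detection.ipynb`
## pandas
pandas is a Python data analysis library. It provides two main data structures, Series and DataFrame.
- Series
A one-dimensional labeled array, such as a single column or a time series
- DataFrame
A two-dimensional, table-like structure for structured data
e.g. the contents of a CSV file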
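To make the difference between the two structures concrete, here is a minimal sketch. The column names and values are made up for illustration and are not taken from `payment_fraud.csv`.

```python
import pandas as pd

# A Series is a single labeled column of values.
amounts = pd.Series([12.5, 80.0, 3.2], name="amount")

# A DataFrame is a table: several named columns sharing one index.
# (Hypothetical toy data, not the lab dataset.)
payments = pd.DataFrame({
    "amount": [12.5, 80.0, 3.2],
    "method": ["card", "paypal", "card"],
})

# Selecting one column of a DataFrame gives back a Series.
method_column = payments["method"]
print(type(amounts).__name__)        # Series
print(type(method_column).__name__)  # Series
print(payments.shape)                # (3, 2): 3 rows, 2 columns
```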
## Code:
```python=
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
# Read in the data from the CSV file
df = pd.read_csv('datasets/payment_fraud.csv')
df.sample(15)
```
Import pandas and read the data.
`.sample(n)` returns n randomly selected rows.
```python=
# Convert the categorical feature into dummy variables with one-hot encoding
# Q1: Which column is the categorical feature?
# Q2: Use pd.get_dummies to convert it to one-hot encoding
# Usage: pd.get_dummies(<your data>, columns=[<column name>])
df = pd.get_dummies(df, columns=['paymentMethod'])
```
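`get_dummies()` performs **one-hot encoding**,
which is the most important step in this part.
Some values have no numeric meaning but still carry information that should not be ignored.
**One-hot encoding** turns each category into its own 0/1 column, so the model can do arithmetic on the data.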
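As a small illustration of what one-hot encoding produces, the sketch below runs `pd.get_dummies` on a toy table. The category values `creditcard` and `paypal` are placeholders chosen for the example, not necessarily the values in the lab dataset.

```python
import pandas as pd

# Hypothetical toy data with one categorical column.
toy = pd.DataFrame({
    "amount": [10, 20, 30],
    "paymentMethod": ["creditcard", "paypal", "creditcard"],
})

# Each distinct category becomes its own indicator column,
# and the original text column is removed.
encoded = pd.get_dummies(toy, columns=["paymentMethod"])

print(list(encoded.columns))
# ['amount', 'paymentMethod_creditcard', 'paymentMethod_paypal']

# Cast to int so the indicators print as 0/1 regardless of pandas version.
print(encoded["paymentMethod_paypal"].astype(int).tolist())  # [0, 1, 0]
```

Each row has exactly one indicator set, which is where the name "one-hot" comes from.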
```python=
df.sample(3)
# Split dataset up into train and test sets
# Q3: Use df.drop to drop label and Generate feature data
# Q4: Split data into 2:1 by train_test_split
# Usage: X_train, X_test, y_train, y_test = train_test_split(features data frame, label data frame, test_size=<test_size>, random_state=17)
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('label', axis=1), df['label'], test_size=0.33, random_state=17)
```
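`train_test_split()` splits the data into **train data** and **test data**.
`.drop()` removes the specified rows or columns: `axis=0` drops rows and `axis=1` drops columns (here, the `label` column).
`test_size` is the fraction of the data held out for testing when given as a float; when given as an integer it is the absolute number of test samples.
`random_state` is the random seed. Fixing it guarantees the same split on every run, which is handy for reproducing results.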
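The following sketch shows these parameters on a 12-row toy table. The data is invented for illustration, and the sizes were chosen so that the split counts come out exact.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 12 rows, one feature and one label column.
toy = pd.DataFrame({
    "amount": range(12),
    "label": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})

features = toy.drop("label", axis=1)     # axis=1 drops a column
without_first_row = toy.drop(0, axis=0)  # axis=0 drops the row with index label 0

# Float test_size: a fraction of the rows (0.25 * 12 = 3 test rows).
X_train, X_test, y_train, y_test = train_test_split(
    features, toy["label"], test_size=0.25, random_state=17)
print(len(X_train), len(X_test))  # 9 3

# Integer test_size: an absolute number of test rows (a 2:1 split here).
X_train2, X_test2, _, _ = train_test_split(
    features, toy["label"], test_size=4, random_state=17)
print(len(X_train2), len(X_test2))  # 8 4

# The same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(
    features, toy["label"], test_size=0.25, random_state=17)
print(X_test.index.tolist() == X_test_again.index.tolist())  # True
```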
```python=
#df = pd.get_dummies(df, columns=['paymentMethod'])
# Initialize and train classifier model
# Q5: New LogisticRegression Model and fit the data you have
# Usage: clf = LogisticRegression().fit(feature data frame, label data frame)
clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)
```
Train the model with `LogisticRegression()` and `.fit()`,
then make predictions with `.predict()`.
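For a feel of what `.fit()` and `.predict()` do, here is a minimal sketch on a one-feature toy problem. The data is invented and far simpler than the lab dataset. `predict_proba` is shown in addition to the calls used in the lab, since it exposes the class probabilities behind each prediction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: small values belong to class 0, large values to class 1.
X = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# fit() learns the weights; predict() returns a hard 0/1 class label.
clf = LogisticRegression().fit(X, y)
new_points = np.array([[0.5], [9.5]])
pred = clf.predict(new_points)
print(pred.tolist())  # [0, 1]

# predict_proba() returns one probability per class; each row sums to 1.
proba = clf.predict_proba(new_points)
print(proba.shape)  # (2, 2)
```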
```python=
# Q6: Use predict to test sample
# Make predictions on test set
# Usage: clf.predict(data to test)
# Compare test set predictions with ground truth labels
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```
`accuracy_score()` prints the final accuracy.
`confusion_matrix()` computes the confusion matrix, which shows how the predictions break down by class.
Final result:
```
0.999922738159623
[[12753 0]
[ 1 189]]
```
<style>
span.hidden-xs:after {
content: ' × ML Security' !important;
}
</style>
###### tags: `ML Security`