Exercise Guidance

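<br>
<br>
<br>

# Exercise Guidance

### `AI Fundamentals Training` `Session 3`

----

## TOC

- Exercise guidance
- Today's competition
- Exercise (1): Try exploratory data analysis (EDA)
- Exercise (2): Try data preprocessing
- Exercise (3): Try modeling and evaluation

----

## Motivation for the Exercises

#### :white_check_mark: Learn the basic analysis methods for classification problems
#### :white_check_mark: Learn how to model and validate classification problems

---

<br>
<br>
<br>

# Today's Competition

----

## Overview

https://signate.jp/competitions/107

----

## Evaluation Metric

![](https://i.imgur.com/eK9GP0x.png)

$$
\mathrm{accuracy} = \frac{\text{number of correct predictions}}{\text{number of predictions}} = \frac{16}{20} = 80\%
$$

---

<br>
<br>
<br>

# Exercise (1): <br>Try Exploratory Data Analysis (EDA)

----

## Deliverables

1. Univariate analysis results
    - Create the following report.
      https://hackmd.io/@0gb380gCQNurcjPTCbAG-Q/H1hIu5CSH
2. Multivariate analysis results
    - Create the following report.
      https://hackmd.io/@0gb380gCQNurcjPTCbAG-Q/SkDt2QArH

<br>

:::warning
- Notes
    - Do not worry about colors or formatting.
    - It is enough to have the required numbers and graphs.
    - Try to keep your code as short as you can.
:::

----

## Hints

### Univariate analysis

```Python
# Convert column dtypes (pass the target dtype, e.g. "category")
df.astype()
```

```Python
# Collect the distinct values of an iterable, e.g. set(df[col_name])
set()
```

```Python
# Count missing values per column
df.isna().sum()
```

```Python
# Count rows per category of col_name
df.groupby(col_name).size()
```

```Python
# Bar chart for category counts, histogram for numeric values
df.plot.bar()
df.plot.hist()
```

----

### Multivariate analysis

```Python
# Cross-tabulate counts of x (rows) against y (columns)
df.pivot_table(index=x, columns=y, aggfunc="size")
```

```Python
# Divide each row by its row total to get row-wise proportions
df.div(df.sum(axis=1), axis=0)
```

```Python
# Scatter-plot matrix of the numeric columns
sns.pairplot()
```

```Python
import numpy as np

def zscore(x):
    # Standardize x to mean 0 and standard deviation 1
    mean = np.mean(x)
    std = np.std(x)
    return (x - mean) / std
```

```Python
# Log-transform; adding 1 keeps zero values finite
np.log(df + 1)
```

```Python
# Correlation matrix; replace "xxx" with "pearson", "kendall", or "spearman"
df.corr("xxx")
```

```Python
# Draw a matrix (e.g. a correlation matrix) as a color-coded grid
sns.heatmap()
```

---

<br>
<br>
<br>

# Exercise (2):<br>Try Data Preprocessing

----

## Deliverables

- Processed data
    1. raw (data/clns/train_raw.csv, test_raw.csv)
        - Categorical variables: one-hot encoded
    2. zscore (data/clns/train_zscore.csv, test_zscore.csv)
        - Categorical variables: one-hot encoded
        - Quantitative variables: z-scored (standardized)

----

## Hints

```Python
import pandas as pd

def get_onehot(df):
    # One-hot encode every column of df, dropping the first level of each
    lst_oh = []
    for c in df.columns:
        oh = pd.get_dummies(df[c], drop_first=True, prefix=c)
        lst_oh.append(oh)
    return pd.concat(lst_oh, axis=1)
```

```Python
# Write the processed data to a CSV file
df.to_csv()
```

---

<br>
<br>
<br>

# Exercise (3):<br>Try Modeling and Evaluation

- Algorithm
    - Logistic regression (refer to the breast cancer classification modeling from Session 1)
- Targets
    - Baseline: acc >= 0.85
    - Best: acc >= 0.86

----

## Hints

- Data preprocessing
    - Standardization
    - Log transform
    - Powers
- Tuning
    - Regularization
    - Cross-validation
- Optional: change the algorithm (if you have time to spare)
    - Decision tree
    - Random forest
    - SVM

----

## Grid Search

```Python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10, 100],
    "penalty": ["l1", "l2"],
}
# The liblinear solver supports both the l1 and l2 penalties
grid_search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
grid_search.fit(X, y)
```

----

## Cross-Validation

<img src="https://i.imgur.com/Lf8AFGa.png" width="70%">

----

## Creating the Submission File

```Python
# Assumes the model predicts 1 for ">50K" and 0 for "<=50K"
pred = model.predict(test_X)
submit = pd.DataFrame(
    {"Y": [">50K" if v > 0 else "<=50K" for v in pred]},
    index=test_X.index,  # keep the row IDs without modifying test_X
)
submit.to_csv("xxx", header=False)
```

<style> .reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6, .reveal section, .reveal table, .reveal li, .reveal blockquote, .reveal th, .reveal td, .reveal p { font-family: 'Meiryo UI', 'Source Sans Pro', Helvetica, sans-serif, 'Helvetica Neue', 'Helvetica', 'Arial', 'Hiragino Sans', 'ヒラギノ角ゴシック', YuGothic, 'Yu Gothic'; text-align: left; line-height: 1.6; letter-spacing: normal; text-shadow: none; word-wrap: break-word; color: #444; } .reveal h1, .reveal h2, .reveal h3, .reveal h4, .reveal h5, .reveal h6 {font-weight: bold;} .reveal h1, .reveal h2, .reveal h3 {color: #2980b9;} .reveal th {background: #DDD;} .reveal section img {background:none; border:none; box-shadow:none; max-width: 95%; max-height: 95%;} .reveal blockquote {width: 90%; padding: 0.5vw 3.0vw;} .reveal table {margin: 1.0vw auto;} .reveal code {line-height: 1.2;} .reveal p, .reveal li {padding: 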
0vw; margin: 0vw;} .reveal .box {margin: -0.5vw 1.5vw 2.0vw -1.5vw; padding: 0.5vw 1.5vw 0.5vw 1.5vw; background: #EEE; border-radius: 1.5vw;} /* table design */ .reveal table {background: #f5f5f5;} .reveal th {background: #444; color: #fff;} .reveal td {position: relative; transition: all 300ms;} .reveal tbody:hover td { color: transparent; text-shadow: 0 0 3px #aaa;} .reveal tbody:hover tr:hover td {color: #444; text-shadow: 0 1px 0 #fff;} /* blockquote design */ .reveal blockquote { width: 90%; padding: 0.5vw 0 0.5vw 6.0vw; font-style: italic; background: #f5f5f5; } .reveal blockquote:before{ position: absolute; top: 0.1vw; left: 1vw; content: "\f10d"; font-family: FontAwesome; color: #2980b9; font-size: 3.0vw; } /* font size */ .reveal h1 {font-size: 5.0vw;} .reveal h2 {font-size: 4.0vw;} .reveal h3 {font-size: 2.8vw;} .reveal h4 {font-size: 2.6vw;} .reveal h5 {font-size: 2.4vw;} .reveal h6 {font-size: 2.2vw;} .reveal section, .reveal table, .reveal li, .reveal blockquote, .reveal th, .reveal td, .reveal p {font-size: 2.2vw;} .reveal code {font-size: 1.6vw;} /* new color */ .red {color: #EE6557;} .blue {color: #16A6B6;} /* split slide */ #c {text-align: center; width: 100%; z-index: -10;} #l {left: 31.25%; text-align: left; float: left; width: 50%; z-index: -10;} #r {left: -18.33%; text-align: left; float: left; width: 50%; z-index: -10;} #l2 {left: 31.25%; text-align: left; float: left; width: 50%; height: 50%; z-index: -10;} #r2 {left: -18.33%; text-align: left; float: left; width: 50%; height: 50%; z-index: -10;} </style>  <style> .reveal { background-image: /* copy right */ /* 個人 */ /* url("https://i.imgur.com/mYSeGwZ.png"); */ /* 顧客 */ url("https://i.imgur.com/R5Lgxvx.png"); /* url("https://i.imgur.com/kNd1W6K.png"); header */ background-repeat: /* no-repeat, */ no-repeat; background-position: center 99%; /* ,center 2%; */ background-size: 30% auto; /* ,90% auto; */ } .reveal h1 {padding: 3.0vw 0vw;} @media screen and (max-width: 1024px) { .reveal h2 
{margin: -2.0vw 0 0 0; padding: 0.0vw 0vw 3.0vw 2.0vw; } } @media screen and (min-width: 1025px) and (max-width: 1920px) { .reveal h2 {margin: -1.5vw 0 0 0; padding: 0.0vw 0vw 3.0vw 2.0vw; } } @media screen and (min-width: 1921px) and (max-width: 100000px) { .reveal h2 {margin: -1.0vw 0 0 0; padding: 0.0vw 0vw 3.0vw 2.0vw; } } </style>  <style> .reveal h1 { margin: 0% -100%; padding: 2% 100% 4% 100%; color: #fff; background: #c2e59c; /* fallback for old browsers */ background: linear-gradient(-45deg, #EE7752, #E73C7E, #23A6D5, #23D5AB); background-size: 200% 200%; animation: Gradient 60s ease infinite; } @keyframes Gradient { 0% {background-position: 0% 50%} 50% {background-position: 100% 50%} 100% {background-position: 0% 50%} } .reveal h2 { text-align: center; margin: -5% -50% 2% -50%; padding: 4% 10% 1% 10%; color: #fff; background: #c2e59c; /* fallback for old browsers */ background: -webkit-linear-gradient(to right, #64b3f4, #c2e59c); /* Chrome 10-25, Safari 5.1-6 */ background: linear-gradient(to right, #64b3f4, #c2e59c); /* W3C, IE 10+/ Edge, Firefox 16+, Chrome 26+, Opera 12+, Safari 7+ */ } </style>
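
----

## Supplement: Computing Accuracy

The evaluation metric for the competition is accuracy, the number of correct predictions divided by the number of predictions. The following is a minimal sketch of that arithmetic in plain Python. The function name `accuracy` and the sample labels are hypothetical and chosen only to reproduce the 16-out-of-20 case from the evaluation slide.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: 20 predictions, 16 of them correct
y_true = [">50K"] * 10 + ["<=50K"] * 10
y_pred = [">50K"] * 8 + ["<=50K"] * 2 + ["<=50K"] * 8 + [">50K"] * 2

score = accuracy(y_true, y_pred)
print(score)  # 0.8
```

Comparing this value on held-out data against the targets (0.85 and 0.86) is usually more informative than comparing the score on the training data.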
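
----

## Supplement: Keeping Train and Test Consistent

Exercise (2) asks for one-hot encoded and z-scored versions of both the training and test data. One point that is easy to miss is that both files need the same columns and the same scaling. The sketch below shows one possible way to do this, assuming pandas is available. The helper names, the column names `age` and `workclass`, and the tiny data frames are hypothetical and are not taken from the competition data.

```python
import pandas as pd


def zscore_with_train_stats(train, test, cols):
    """Standardize cols in both frames using the mean/std of train only."""
    train_z, test_z = train.copy(), test.copy()
    for c in cols:
        mean = train[c].mean()
        std = train[c].std(ddof=0)  # population std, the same default as np.std
        train_z[c] = (train[c] - mean) / std
        test_z[c] = (test[c] - mean) / std
    return train_z, test_z


def onehot_aligned(train, test, cols):
    """One-hot encode cols so that test gets exactly the columns of train."""
    train_oh = pd.get_dummies(train, columns=cols, drop_first=True, dtype=int)
    # Encode every level of test, then keep only the columns that train has;
    # levels missing from test become all-zero columns.
    test_oh = pd.get_dummies(test, columns=cols, dtype=int)
    test_oh = test_oh.reindex(columns=train_oh.columns, fill_value=0)
    return train_oh, test_oh


# Hypothetical data: the test set has no "Gov" rows
train = pd.DataFrame({"age": [20, 30, 40], "workclass": ["Gov", "Private", "Self"]})
test = pd.DataFrame({"age": [30, 50], "workclass": ["Private", "Self"]})

train_z, test_z = zscore_with_train_stats(train, test, ["age"])
train_oh, test_oh = onehot_aligned(train_z, test_z, ["workclass"])
print(test_oh)
```

Applying `drop_first=True` to the test data on its own could drop a different level than in the training data when a category is absent, which is why this sketch encodes all test levels first and then aligns them to the training columns.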
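
----

## Supplement: How K-Fold Splitting Works

The cross-validation slide and the `cv=5` argument in the grid search both rely on splitting the data into folds, where each fold serves as validation data once while the remaining folds are used for training. The following is a simplified sketch of that splitting in plain Python. The function name `kfold_indices` is hypothetical, and the sketch omits the shuffling and stratification that scikit-learn's own splitters can provide.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % n_splits) folds receive one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(10, n_splits=5))
for i, (train_idx, valid_idx) in enumerate(folds):
    print(f"fold {i}: train={train_idx} valid={valid_idx}")
```

Averaging the accuracy over the validation folds gives an estimate that depends less on one particular split than a single train/validation division does.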