# Kubeflow Hands-on Notes: Adding a Random Forest Algorithm
## 目錄
[TOC]
## References
:::warning
Recommended reading first:
https://towardsdatascience.com/kubeflow-pipelines-how-to-build-your-first-kubeflow-pipeline-from-scratch-2424227f7e5
:::
https://github.com/FernandoLpz/Kubeflow_Pipelines
https://hub.docker.com/r/fernandolpz/only-tests/tags
https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general
## Architecture
Each block in the diagram represents a component, and each component runs as its own container.

# Random Forest Image
## Create a new folder: randomForest
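A minimal sketch of setting up the folder; the file list matches the four files described in the sections below:

```shell
# Create the component folder next to the other component folders
mkdir -p randomForest
# It will hold: Dockerfile, random_forest.yaml, randomforest.py, requirements.txt
ls randomForest
```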

## Dockerfile
```dockerfile=
FROM python:3.8-slim
WORKDIR /pipelines
COPY requirements.txt /pipelines
RUN pip install -r requirements.txt
COPY randomforest.py /pipelines
```
## random_forest.yaml
```yaml=
name: Random Forest classifier
description: Train a random forest classifier
inputs:
- {name: Data, type: LocalPath, description: 'Path where data is stored.'}
outputs:
- {name: Accuracy, type: Float, description: 'Accuracy metric'}
implementation:
  container:
    image: lightnighttw/kubeflow:random_forest_v4
    command: [
      python, randomforest.py,
      --data,
      {inputPath: Data},
      --accuracy,
      {outputPath: Accuracy},
    ]
```
## randomforest.py
```python=
import json
import argparse
from pathlib import Path

from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier


def _randomforest(args):
    # Open and read the file "data"
    with open(args.data) as data_file:
        data = json.load(data_file)

    # The expected data type is 'dict'; however, since the file
    # was loaded as a JSON object, it is first loaded as a string,
    # so we need to parse that string again to get the dict-type object.
    data = json.loads(data)

    x_train = data['x_train']
    y_train = data['y_train']
    x_test = data['x_test']
    y_test = data['y_test']

    # Initialize and train the model
    model = RandomForestClassifier(n_estimators=100, criterion='gini')
    model.fit(x_train, y_train)

    # Get predictions
    y_pred = model.predict(x_test)

    # Get accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Save output into file
    with open(args.accuracy, 'w') as accuracy_file:
        accuracy_file.write(str(accuracy))


if __name__ == '__main__':
    # Define and parse the command-line arguments
    parser = argparse.ArgumentParser(description='My program description')
    parser.add_argument('--data', type=str)
    parser.add_argument('--accuracy', type=str)
    args = parser.parse_args()

    # Create the directory for the output file (it may or may not already exist).
    Path(args.accuracy).parent.mkdir(parents=True, exist_ok=True)

    _randomforest(args)
```
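The double `json.load` / `json.loads` step above is easy to get wrong when testing locally. A minimal sketch, with made-up toy data, of how a compatible `data` file can be produced and then decoded the same way the script does:

```python
import json

# Toy data in the shape randomforest.py expects (hypothetical values)
data = {
    'x_train': [[0, 0], [1, 1], [0, 1], [1, 0]],
    'y_train': [0, 1, 1, 0],
    'x_test':  [[0, 0], [1, 1]],
    'y_test':  [0, 1],
}

# The upstream component stores the dict as a JSON string *inside* a
# JSON file, hence the double encoding here...
with open('data', 'w') as f:
    json.dump(json.dumps(data), f)

# ...and the matching double decoding in randomforest.py:
with open('data') as f:
    decoded = json.loads(json.load(f))

print(decoded == data)  # → True
```

With scikit-learn installed, `python randomforest.py --data data --accuracy out/accuracy` then smoke-tests the component locally.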
## requirements.txt
:::danger
The `sklearn` package name can no longer be installed with recent versions of pip; use `pip install scikit-learn` instead.
:::
```text=
scikit-learn
```
## Push to Docker Hub
[Install Docker](https://www.docker.com)
`random_forest.yaml` references the image location `lightnighttw/kubeflow:random_forest_v4` (replace this with your own).
[Docker Hub](https://hub.docker.com/)
Sign up for an account if you don't already have one.
```bash=
# Download the four files above into a new folder on your machine
# Log in
docker login -u "<docker-username>"
# Build the image
docker build --platform=linux/amd64 -t <docker-registry-username>/<docker-image-name>:<tag_name> . -f Dockerfile
# Push to Docker Hub
docker push <docker-registry-username>/<docker-image-name>:<tag_name>
# e.g. docker push lightnighttw/kubeflow:random_forest_v4
```
If that doesn't work, use my [image](https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general) instead.
# Building the Pipeline
## pipeline.py
Load the component definitions (YAML) from the previous steps and compile them into a single pipeline YAML file.
```python=
import kfp
from kfp import dsl
from kfp.components import func_to_container_op


@func_to_container_op
def show_results(decision_tree: float, logistic_regression: float, random_forest: float) -> None:
    # Given the outputs from the decision_tree, logistic_regression
    # and random_forest components, the results are shown.
    print(f"Decision tree (accuracy): {decision_tree}")
    print(f"Logistic regression (accuracy): {logistic_regression}")
    print(f"Random forest (accuracy): {random_forest}")


def add_resource_constraints(op: dsl.ContainerOp):
    return op.set_cpu_request("1").set_cpu_limit("2")


@dsl.pipeline(
    name='Three Pipeline',
    description='Applies Decision Tree, Random Forest and Logistic Regression to a classification problem.')
def first_pipeline():
    # Load the yaml manifest for each component
    download = kfp.components.load_component_from_file('download_data/download_data.yaml')
    decision_tree = kfp.components.load_component_from_file('decision_tree/decision_tree.yaml')
    logistic_regression = kfp.components.load_component_from_file('logistic_regression/logistic_regression.yaml')
    random_forest = kfp.components.load_component_from_file('randomForest/random_forest.yaml')

    # Run the download_data task
    download_task = add_resource_constraints(download())

    # Run the "decision_tree", "logistic_regression" and "random_forest"
    # tasks on the output generated by "download_task".
    decision_tree_task = add_resource_constraints(decision_tree(download_task.output))
    logistic_regression_task = add_resource_constraints(logistic_regression(download_task.output))
    random_forest_task = add_resource_constraints(random_forest(download_task.output))

    # Given the outputs from the model tasks, the "show_results"
    # component is called to print the results.
    add_resource_constraints(show_results(decision_tree_task.output, logistic_regression_task.output, random_forest_task.output))


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(first_pipeline, 'three_pipelines.yaml')
    # kfp.Client().create_run_from_pipeline_func(first_pipeline, arguments={})
```
## Compile
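Assuming the kfp SDK is installed and you are in the folder containing `pipeline.py` and the component subfolders, compiling is just running the script (the output filename comes from the `compile` call in `pipeline.py`):

```shell
# One-time: install the Kubeflow Pipelines SDK
pip install kfp
# Compile: writes three_pipelines.yaml next to pipeline.py
python pipeline.py
```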

## Upload the pipeline
If it runs successfully:
