# Kubeflow Hands-on Notes: Adding a Random Forest Algorithm
## 目錄
[TOC]
## References
:::warning
Recommended reading first:
https://towardsdatascience.com/kubeflow-pipelines-how-to-build-your-first-kubeflow-pipeline-from-scratch-2424227f7e5
:::
https://github.com/FernandoLpz/Kubeflow_Pipelines
https://hub.docker.com/r/fernandolpz/only-tests/tags
https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general
## Architecture
Each block in the diagram represents a component, and each component runs as its own container.

# Random Forest Image
## Create a new folder: randomForest
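A minimal sketch of setting up the folder; the file list matches the four files described in the sections below:

```shell
# Create the component folder next to the other component folders
mkdir -p randomForest
# It will hold: Dockerfile, random_forest.yaml, randomforest.py, requirements.txt
ls randomForest
```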

## Dockerfile
```dockerfile=
FROM python:3.8-slim
WORKDIR /pipelines
COPY requirements.txt /pipelines
RUN pip install -r requirements.txt
COPY randomforest.py /pipelines
```
## random_forest.yaml
```yaml=
name: Random Forest classifier
description: Train a random forest classifier
inputs:
- {name: Data, type: LocalPath, description: 'Path where data is stored.'}
outputs:
- {name: Accuracy, type: Float, description: 'Accuracy metric'}
implementation:
  container:
    image: lightnighttw/kubeflow:random_forest_v4
    command: [
      python, randomforest.py,
      --data,
      {inputPath: Data},
      --accuracy,
      {outputPath: Accuracy},
    ]
```
## randomforest.py
```python=
import json
import argparse
from pathlib import Path

from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier


def _randomforest(args):
    # Open and read the file "data"
    with open(args.data) as data_file:
        data = json.load(data_file)

    # The expected data type is 'dict'; however, since the file
    # was loaded as a JSON object, it is first loaded as a string,
    # so we need to parse that string again to get the dict-type object.
    data = json.loads(data)

    x_train = data['x_train']
    y_train = data['y_train']
    x_test = data['x_test']
    y_test = data['y_test']

    # Initialize and train the model
    model = RandomForestClassifier(n_estimators=100, criterion='gini')
    model.fit(x_train, y_train)

    # Get predictions
    y_pred = model.predict(x_test)

    # Get accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Save output into file
    with open(args.accuracy, 'w') as accuracy_file:
        accuracy_file.write(str(accuracy))


if __name__ == '__main__':
    # Define and parse the command-line arguments
    parser = argparse.ArgumentParser(description='My program description')
    parser.add_argument('--data', type=str)
    parser.add_argument('--accuracy', type=str)
    args = parser.parse_args()

    # Create the directory for the output file (it may or may not already exist).
    Path(args.accuracy).parent.mkdir(parents=True, exist_ok=True)

    _randomforest(args)
```
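The double `json.load` / `json.loads` step above is easy to get wrong when testing locally. A minimal sketch, with made-up toy data, of how a compatible `data` file can be produced and then decoded the same way the script does:

```python
import json

# Toy data in the shape randomforest.py expects (hypothetical values)
data = {
    'x_train': [[0, 0], [1, 1], [0, 1], [1, 0]],
    'y_train': [0, 1, 1, 0],
    'x_test':  [[0, 0], [1, 1]],
    'y_test':  [0, 1],
}

# The upstream component stores the dict as a JSON string *inside* a
# JSON file, hence the double encoding here...
with open('data', 'w') as f:
    json.dump(json.dumps(data), f)

# ...and the matching double decoding in randomforest.py:
with open('data') as f:
    decoded = json.loads(json.load(f))

print(decoded == data)  # → True
```

With scikit-learn installed, `python randomforest.py --data data --accuracy out/accuracy` then smoke-tests the component locally.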
## requirements.txt
:::danger
The `sklearn` package name can no longer be installed with recent versions of pip; use `pip install scikit-learn` instead.
:::
```text=
scikit-learn
```
## Push to Docker Hub
[Install Docker](https://www.docker.com)
`random_forest.yaml` references the image location `lightnighttw/kubeflow:random_forest_v4` (replace this with your own).
[Docker Hub](https://hub.docker.com/)
Sign up for an account if you don't already have one.
```bash=
# Download the four files above into a new folder on your machine
# Log in
docker login -u "<docker-username>"
# Build the image
docker build --platform=linux/amd64 -t <docker-registry-username>/<docker-image-name>:<tag_name> . -f Dockerfile
# Push to Docker Hub
docker push <docker-registry-username>/<docker-image-name>:<tag_name>
# e.g. docker push lightnighttw/kubeflow:random_forest_v4
```
If that doesn't work, use my [image](https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general) instead.
# Building the Pipeline
## pipeline.py
Load the component definitions (YAML) from the previous steps and compile them into a single pipeline YAML file.
```python=
import kfp
from kfp import dsl
from kfp.components import func_to_container_op


@func_to_container_op
def show_results(decision_tree: float, logistic_regression: float, random_forest: float) -> None:
    # Given the outputs from the decision_tree, logistic_regression
    # and random_forest components, the results are shown.
    print(f"Decision tree (accuracy): {decision_tree}")
    print(f"Logistic regression (accuracy): {logistic_regression}")
    print(f"Random forest (accuracy): {random_forest}")


def add_resource_constraints(op: dsl.ContainerOp):
    return op.set_cpu_request("1").set_cpu_limit("2")


@dsl.pipeline(
    name='Three Pipeline',
    description='Applies Decision Tree, Random Forest and Logistic Regression to a classification problem.')
def first_pipeline():
    # Load the yaml manifest for each component
    download = kfp.components.load_component_from_file('download_data/download_data.yaml')
    decision_tree = kfp.components.load_component_from_file('decision_tree/decision_tree.yaml')
    logistic_regression = kfp.components.load_component_from_file('logistic_regression/logistic_regression.yaml')
    random_forest = kfp.components.load_component_from_file('randomForest/random_forest.yaml')

    # Run the download_data task
    download_task = add_resource_constraints(download())

    # Run the "decision_tree", "logistic_regression" and "random_forest"
    # tasks on the output generated by "download_task".
    decision_tree_task = add_resource_constraints(decision_tree(download_task.output))
    logistic_regression_task = add_resource_constraints(logistic_regression(download_task.output))
    random_forest_task = add_resource_constraints(random_forest(download_task.output))

    # Given the outputs from the model tasks, the "show_results"
    # component is called to print the results.
    add_resource_constraints(show_results(decision_tree_task.output, logistic_regression_task.output, random_forest_task.output))


if __name__ == '__main__':
    kfp.compiler.Compiler().compile(first_pipeline, 'three_pipelines.yaml')
    # kfp.Client().create_run_from_pipeline_func(first_pipeline, arguments={})
```
## Compile
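Assuming the kfp SDK is installed and you are in the folder containing `pipeline.py` and the component subfolders, compiling is just running the script (the output filename comes from the `compile` call in `pipeline.py`):

```shell
# One-time: install the Kubeflow Pipelines SDK
pip install kfp
# Compile: writes three_pipelines.yaml next to pipeline.py
python pipeline.py
```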

## Upload the pipeline
If it runs successfully:
