# Kubeflow implementation: add Random Forest algorithm

## Table of contents

[TOC]

## Reference

:::warning
https://towardsdatascience.com/kubeflow-pipelines-how-to-build-your-first-kubeflow-pipeline-from-scratch-2424227f7e5
:::

https://github.com/FernandoLpz/Kubeflow_Pipelines

https://hub.docker.com/r/fernandolpz/only-tests/tags

https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general

## Architecture

:::info
Each block represents a component, and each component is a container.
:::

![result](https://i.imgur.com/pql27HJ.png)

# Random Forest Image

## Create a new folder: randomForest

![](https://i.imgur.com/rgxdiwu.png)

## Dockerfile

```dockerfile=
FROM python:3.8-slim
WORKDIR /pipelines
COPY requirements.txt /pipelines
RUN pip install -r requirements.txt
COPY randomforest.py /pipelines
```

## random_forest.yaml

```yaml=
name: Random Forest classifier
description: Train a random forest classifier

inputs:
- {name: Data, type: LocalPath, description: 'Path where data is stored.'}
outputs:
- {name: Accuracy, type: Float, description: 'Accuracy metric'}

implementation:
  container:
    image: lightnighttw/kubeflow:random_forest_v4
    command: [
      python, randomforest.py,
      --data, {inputPath: Data},
      --accuracy, {outputPath: Accuracy},
    ]
```

## randomforest.py

```python=
import json
import argparse
from pathlib import Path

from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

def _randomforest(args):
    # Open and read the "data" file
    with open(args.data) as data_file:
        data = json.load(data_file)

    # The expected data type is 'dict'; however, since the file
    # was loaded as a JSON object, it is first decoded to a string,
    # so we need to load it again from that string in order to get
    # the dict-type object.
    data = json.loads(data)

    x_train = data['x_train']
    y_train = data['y_train']
    x_test = data['x_test']
    y_test = data['y_test']

    # Initialize and train the model
    model = RandomForestClassifier(n_estimators=100, criterion='gini')
    model.fit(x_train, y_train)

    # Get predictions
    y_pred = model.predict(x_test)

    # Get accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Save output into file
    with open(args.accuracy, 'w') as accuracy_file:
        accuracy_file.write(str(accuracy))

if __name__ == '__main__':
    # Define and parse the command-line arguments
    parser = argparse.ArgumentParser(description='My program description')
    parser.add_argument('--data', type=str)
    parser.add_argument('--accuracy', type=str)
    args = parser.parse_args()

    # Create the directory where the output file will be written
    # (the directory may or may not already exist).
    Path(args.accuracy).parent.mkdir(parents=True, exist_ok=True)

    _randomforest(args)
```

## requirements.txt

:::danger
With recent versions of pip, the package name `sklearn` can no longer be installed. You need to use `scikit-learn` instead.
:::

```=
scikit-learn
```

## Upload to Docker Hub

[Install Docker](https://www.docker.com)

In random_forest.yaml, the image location is defined as `lightnighttw/kubeflow:random_forest_v4` (customize this for your own registry).

If you don't have a [Docker Hub](https://hub.docker.com/) account, sign up for one.

```shell=
# Download the above four files into a new folder on your computer.

# Log in to Docker Hub
docker login -u <docker username>

# Build the image
docker build --platform=linux/amd64 -f Dockerfile -t <docker-registry-username>/<docker-image-name>:<tag_name> .

# Push to Docker Hub
docker push <docker-registry-username>/<docker-image-name>:<tag_name>
# e.g.: docker push lightnighttw/kubeflow:random_forest_v4
```

If it doesn't work, use my [image](https://hub.docker.com/repository/docker/lightnighttw/kubeflow/general).

# Create a pipeline

## pipeline.py

:::info
Load the manifests (yaml) of the previous components and compile them into a single pipeline yaml file.
:::

```python=
import kfp
from kfp import dsl
from kfp.components import func_to_container_op

@func_to_container_op
def show_results(decision_tree: float, logistic_regression: float, random_forest: float) -> None:
    # Given the outputs from the decision tree, logistic regression
    # and random forest components, the results are shown.
    print(f"Decision tree (accuracy): {decision_tree}")
    print(f"Logistic regression (accuracy): {logistic_regression}")
    print(f"Random forest (accuracy): {random_forest}")

def add_resource_constraints(op: dsl.ContainerOp):
    return op.set_cpu_request("1").set_cpu_limit("2")

@dsl.pipeline(name='Three Pipeline', description='Applies Decision Tree, Random Forest and Logistic Regression to a classification problem.')
def first_pipeline():
    # Load the yaml manifest for each component
    download = kfp.components.load_component_from_file('download_data/download_data.yaml')
    decision_tree = kfp.components.load_component_from_file('decision_tree/decision_tree.yaml')
    logistic_regression = kfp.components.load_component_from_file('logistic_regression/logistic_regression.yaml')
    random_forest = kfp.components.load_component_from_file('randomForest/random_forest.yaml')

    # Run the download_data task
    download_task = add_resource_constraints(download())

    # Run the "decision_tree", "logistic_regression" and "random_forest"
    # tasks on the output generated by "download_task".
    decision_tree_task = add_resource_constraints(decision_tree(download_task.output))
    logistic_regression_task = add_resource_constraints(logistic_regression(download_task.output))
    random_forest_task = add_resource_constraints(random_forest(download_task.output))

    # Given the outputs from the three tasks, the "show_results"
    # component is called to print the results.
    add_resource_constraints(show_results(decision_tree_task.output, logistic_regression_task.output, random_forest_task.output))

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(first_pipeline, 'three_pipelines.yaml')
    # kfp.Client().create_run_from_pipeline_func(first_pipeline, arguments={})
```

## Compile

Running `python pipeline.py` generates `three_pipelines.yaml`:

![compiler](https://i.imgur.com/EvfJExY.png)

## Upload pipeline

If it succeeds:

![test](https://i.imgur.com/pql27HJ.png)
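If a run fails inside the random forest step, it can help to reproduce the component's logic locally, outside the cluster. The sketch below is a minimal local smoke test: it mimics `randomforest.py` end to end, writing a data file in the double-encoded JSON format the script decodes (a dict serialized to a string, then serialized again), then training and scoring. The Iris toy dataset is only a stand-in here; the real pipeline's data comes from the `download_data` component.

```python
import json
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the download component's output: serialize a toy dataset
# in the double-encoded JSON format that randomforest.py expects.
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data.tolist(), iris.target.tolist(), test_size=0.3, random_state=0
)
with open("data", "w") as f:
    json.dump(json.dumps({
        "x_train": x_train, "y_train": y_train,
        "x_test": x_test, "y_test": y_test,
    }), f)

# Same steps as _randomforest(): load the file, decode twice, train, score.
with open("data") as f:
    data = json.loads(json.load(f))
model = RandomForestClassifier(n_estimators=100, criterion="gini")
model.fit(data["x_train"], data["y_train"])
accuracy = accuracy_score(data["y_test"], model.predict(data["x_test"]))
print("accuracy:", accuracy)
```

If this runs cleanly on your machine but the in-cluster step fails, the problem is more likely in the component yaml or the image than in the training code.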