# Machine Learning & Scikit-learn

###### tags: `Machine Learning` `Python` `Scikit-learn`

---

## Before Modeling

### Split Data
* sklearn.model_selection.train_test_split(X, y, test_size=0.2)

### Data Normalization
* sklearn.preprocessing.StandardScaler()
* sklearn.preprocessing.MinMaxScaler()

### Validate Trained Model
* model.predict()
* sklearn.metrics.mean_squared_error()
* sklearn.metrics.r2_score()
* sklearn.metrics.accuracy_score()
* sklearn.metrics.confusion_matrix()

---

## Regression (Supervised Learning)

### Linear Regression
* sklearn.linear_model.LinearRegression()
    * .fit
    * .coef_

### Polynomial Regression
* sklearn.preprocessing.PolynomialFeatures(degree=n)
    * .fit
    * .transform
    * .fit_transform
* Fit sklearn.linear_model.LinearRegression() on the expanded features, then read .coef_

### Ridge, Lasso, ElasticNet
* sklearn.linear_model.Lasso(alpha=n)
* sklearn.linear_model.Ridge(alpha=n)
* sklearn.linear_model.ElasticNet(alpha=n, l1_ratio=m)

---

## Classification (Supervised Learning)

### Logistic Regression
* sklearn.linear_model.LogisticRegression()

### K-Nearest Neighbors (KNN)
* sklearn.neighbors.KNeighborsClassifier(n_neighbors=n)

### Decision Tree

#### CART
* sklearn.tree.DecisionTreeClassifier(max_depth=m)
* sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=m) ('gini' is the default criterion)

#### ID3
* sklearn.tree.DecisionTreeClassifier(criterion='entropy', max_depth=m)

### Naive Bayes
* sklearn.naive_bayes.GaussianNB()
* sklearn.naive_bayes.MultinomialNB()
    * MultinomialNB requires all feature values to be non-negative
    * Min-Max scaling is a convenient way to satisfy this

### Random Forests
* sklearn.ensemble.RandomForestClassifier(max_depth=m, n_estimators=n)

### Support Vector Machine (SVM)
* sklearn.svm.SVC(kernel='rbf', C=c)
* sklearn.svm.SVC(kernel='poly', C=c)
* sklearn.svm.SVC(kernel='linear', C=c)
* The kernel maps the data into a higher-dimensional space
* C is the penalty (regularization) parameter

---

## Model Selection

### K-fold Model Evaluation
* sklearn.model_selection.cross_val_score(estimator, X, y, cv=n, scoring='accuracy')

### Learning Curve
* sklearn.model_selection.learning_curve(estimator, X, y, cv=n, scoring='accuracy', train_sizes=np.linspace(0.1, 1.0, k))

### Grid Search
* sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring='accuracy', cv=n)

---

## Clustering (Unsupervised Learning)

### K-means
* sklearn.cluster.KMeans(n_clusters=n)
    * .cluster_centers_ (centroids)
* Silhouette score
    * sklearn.metrics.silhouette_score(X, labels)

### DBSCAN
* Density-based spatial clustering of applications with noise
* sklearn.cluster.DBSCAN(eps=r, min_samples=n)

### EM
* Expectation-Maximization
* sklearn.mixture.GaussianMixture(n_components=n)

---

## Dimensionality Reduction

### SVD
* sklearn.decomposition.TruncatedSVD(n_components=n)

### PCA
* sklearn.decomposition.PCA(n_components=n)

### t-SNE
* sklearn.manifold.TSNE(n_components=n)

---

## Ensemble Learning

### Bagging method
* sklearn.ensemble.BaggingClassifier(base_estimator=None, n_estimators=n)
* sklearn.ensemble.BaggingRegressor(base_estimator=None, n_estimators=n)
* Note: base_estimator was renamed to estimator in newer scikit-learn releases

### Boosting method
* sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=n)

### XGBoost
* https://towardsdatascience.com/xgboost-python-example-42777d01001e
* xgboost.XGBRegressor(n_estimators=n, reg_lambda=1, gamma=0, max_depth=3)

### Averaging
* sklearn.ensemble.VotingClassifier(estimators, voting='hard' or 'soft')
* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

---

## Feature Engineering

### Data Exploration

### Feature Cleaning

### Feature Engineering

### Feature Selection
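
---

## Worked Examples

Runnable sketches for the sections above. All data is synthetic or comes from scikit-learn's bundled toy datasets, and every parameter value is illustrative rather than recommended.

### Before Modeling: split and scale

A minimal sketch of the split-then-normalize workflow: hold out a test set with train_test_split, then fit StandardScaler on the training split only so no test-set information leaks into the preprocessing. The data here is random and only for demonstration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features (synthetic)
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```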
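
### Linear Regression

A sketch of fit / coef_ plus the two regression metrics listed under Validate Trained Model, on synthetic one-feature data with a known slope.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
print("coefficients:", model.coef_)      # learned slope(s), should be near 2.0
print("intercept:", model.intercept_)

y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```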
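
### Polynomial Regression

A sketch of the two-step pattern the bullets describe: PolynomialFeatures expands the design matrix, and an ordinary LinearRegression is fitted on the expanded features; .coef_ then lives on the regression model, not on the transformer. The cubic target is made up for the demo.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(150, 1))
y = X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=0.2, size=150)

poly = PolynomialFeatures(degree=3)      # adds x^2 and x^3 (and a bias column)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
print("coefficients:", model.coef_)      # cubic term should dominate
```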
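
### Ridge, Lasso, ElasticNet

The same synthetic data run through the three regularized linear models; alpha sets the regularization strength and l1_ratio mixes the L1 and L2 penalties. With a sparse true coefficient vector, Lasso should drive the spurious coefficients toward zero. The alpha values are illustrative only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, 0.0, -1.0, 0.0]) + rng.normal(scale=0.1, size=100)

for model in (Ridge(alpha=1.0), Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))
```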
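
### Logistic Regression

A sketch of a binary classifier evaluated with accuracy_score and confusion_matrix, using a synthetic problem from make_classification.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))   # rows: true class, columns: predicted class
```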
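
### K-Nearest Neighbors (KNN)

A short sketch on the iris toy dataset; n_neighbors is the main knob, and 5 here is just the common default.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```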
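
### Decision Tree

A sketch comparing the CART-style 'gini' criterion with the ID3-style 'entropy' criterion, both through the same sklearn class, as the bullets above suggest.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    tree.fit(X_train, y_train)
    print(criterion, "accuracy:", tree.score(X_test, y_test))
```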
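
### Naive Bayes

A sketch showing the non-negativity requirement in practice: MultinomialNB gets Min-Max-scaled features, while GaussianNB works on the raw values. The clip call is a small guard for test values that fall below the training minimum after scaling.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = MinMaxScaler().fit(X_train)
X_train_mm = scaler.transform(X_train)
X_test_mm = scaler.transform(X_test).clip(min=0)  # keep features non-negative

print("GaussianNB:", GaussianNB().fit(X_train, y_train).score(X_test, y_test))
print("MultinomialNB:", MultinomialNB().fit(X_train_mm, y_train).score(X_test_mm, y_test))
```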
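
### Random Forests

A short sketch; n_estimators is the number of trees and max_depth limits each tree, with both values chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

rf = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
rf.fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))
```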
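
### Support Vector Machine (SVM)

A sketch trying the three kernels listed above with a fixed, illustrative C. SVMs are sensitive to feature scale, so the data is standardized first.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize using training statistics only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

for kernel in ("linear", "poly", "rbf"):
    svm = SVC(kernel=kernel, C=1.0).fit(X_train_s, y_train)
    print(kernel, "accuracy:", svm.score(X_test_s, y_test))
```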
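
### Model Selection

One sketch covering all three tools from the Model Selection section: k-fold scoring with cross_val_score, a learning curve over growing training sizes, and a grid search over n_neighbors. The grid values are arbitrary examples.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score, learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5, scoring="accuracy")
print("CV accuracy:", scores.mean().round(3))

# Learning curve: accuracy as a function of training-set size.
sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.2, 1.0, 5))
print("train sizes:", sizes)
print("validation accuracy:", val_scores.mean(axis=1).round(3))

# Grid search over n_neighbors.
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]},
                    scoring="accuracy", cv=5)
grid.fit(X, y)
print("best params:", grid.best_params_, "best score:", round(grid.best_score_, 3))
```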
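
### Clustering

One sketch for the three clustering approaches on the same blob data: K-means with its centroids and silhouette score, DBSCAN (where label -1 marks noise), and a Gaussian mixture fitted with EM. The eps and cluster counts are tuned only to this synthetic data.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# K-means: choose k up front; inspect centroids and the silhouette score.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("centroids:\n", km.cluster_centers_.round(2))
print("silhouette:", round(silhouette_score(X, km.labels_), 3))

# DBSCAN: no k; clusters are dense regions, label -1 marks noise points.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print("DBSCAN labels found:", sorted(set(db.labels_)))

# Gaussian mixture fitted with EM; predict assigns the most likely component.
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("GMM labels (first 10):", gm.predict(X)[:10])
```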
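
### Dimensionality Reduction

A sketch projecting the iris features down to 2 components with the linear methods (PCA, TruncatedSVD) and the non-linear t-SNE. Only shapes are printed; in practice these 2-D outputs are usually scatter-plotted.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

X_pca = PCA(n_components=2).fit_transform(X)
X_svd = TruncatedSVD(n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)

print("PCA shape:", X_pca.shape)
print("SVD shape:", X_svd.shape)
print("t-SNE shape:", X_tsne.shape)
```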
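
### Ensemble Learning

A sketch of bagging, boosting, and voting around simple base estimators. The base estimator is passed positionally so the snippet runs across scikit-learn versions (the keyword base_estimator was renamed to estimator); the estimator choices are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagging: many full trees on bootstrap samples.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20, random_state=0)
# Boosting: a sequence of weak learners (depth-1 stumps).
boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0)
# Voting: majority vote over heterogeneous models.
vote = VotingClassifier([("lr", LogisticRegression(max_iter=1000)),
                         ("nb", GaussianNB()),
                         ("dt", DecisionTreeClassifier(max_depth=3))], voting="hard")

for name, clf in (("bagging", bag), ("adaboost", boost), ("voting", vote)):
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```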
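
### XGBoost

A sketch assuming the third-party xgboost package is installed (pip install xgboost); it mirrors the parameters in the bullet above on synthetic regression data. XGBRegressor follows the scikit-learn estimator API, so fit and score work as usual.

```python
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# reg_lambda is the L2 penalty; gamma is the minimum loss reduction to split.
model = xgboost.XGBRegressor(n_estimators=100, reg_lambda=1, gamma=0, max_depth=3)
model.fit(X_train, y_train)
print("R^2 on test:", model.score(X_test, y_test))
```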