# DS - FUV PP

[TOC]

## Overall

### 1. CRISP-DM

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a widely used methodology for solving data-related problems, including data science projects. The process consists of six stages:

- **Business Understanding:** Understand the business problem or objective and define the project's goals and success criteria.
- **Data Understanding:** Collect and explore the relevant data to understand its quality, completeness, and potential biases.
- **Data Preparation:** Clean, transform, and integrate the data to make it suitable for analysis.
- **Modeling:** Select appropriate modeling techniques and build predictive models on the prepared data.
- **Evaluation:** Evaluate the models' performance against the project's success criteria and refine them if necessary.
- **Deployment:** Deploy the final models or solutions and monitor their performance over time.

To execute a data science project with CRISP-DM successfully, a data scientist needs a broad set of skills, including:

- **Domain knowledge:** a deep understanding of the business domain and the problem being solved.
- **Statistical analysis:** proficiency in statistical methods and the ability to choose the right techniques for the given problem.
- **Programming:** proficiency in languages such as Python or R, as well as data manipulation and visualization libraries.
- **Machine learning:** familiarity with a range of machine learning techniques and the ability to apply them to real-world problems.
- **Communication:** the ability to communicate findings and insights effectively to stakeholders and team members.

**Personal experience:** As a data scientist for a retail firm, I used CRISP-DM to analyze consumer behavior data and build a targeted marketing strategy. I worked closely with the marketing team to understand their goals and objectives, and we set out to boost the conversion rate for a specific product category. After gathering and exploring the relevant data, we found that customer demographics and purchase history were the most predictive features. We then cleaned and transformed the data and built predictive models with machine learning techniques such as logistic regression and decision trees. We compared the models using metrics such as accuracy, precision, and recall, and chose the best-performing model for deployment. The final solution delivered targeted marketing campaigns to customers based on their purchase history and demographic data. Throughout the project I drew on my retail domain expertise, statistical analysis, programming, and machine learning skills to create a solution that improved the conversion rate for the target product category, and I presented the results and insights to the marketing team in a clear, straightforward manner so they could understand the campaign's impact.

### 2. Parametric vs. non-parametric models

Parametric and non-parametric models are two different approaches to modeling data in statistical analysis. Parametric models make assumptions about the underlying distribution of the data and estimate the parameters of that distribution.
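For instance, an ordinary least squares fit estimates just two parameters, a slope and an intercept. A minimal sketch, assuming scikit-learn and synthetic data for the study-time/grade example discussed later in this section:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: hours studied vs. final grade, with some noise added.
rng = np.random.default_rng(0)
study_time = rng.uniform(0, 10, size=(200, 1))
grade = 50 + 4 * study_time[:, 0] + rng.normal(0, 5, size=200)

# The fitted parametric model is fully described by two numbers
# (slope and intercept), regardless of how many rows it was trained on.
model = LinearRegression().fit(study_time, grade)
print("estimated slope:", model.coef_[0])
print("estimated intercept:", model.intercept_)
```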
Parametric models are characterized by a fixed set of parameters that does not grow with the size of the dataset. They typically achieve higher predictive accuracy when their assumptions hold, but can perform poorly when those assumptions are violated. Examples of parametric models include:

- **Linear regression:** assumes a linear relationship between the independent and dependent variables.
- **Logistic regression:** assumes a logistic function to predict a binary outcome.
- **Gaussian Naive Bayes:** assumes a Gaussian distribution of the features.

Non-parametric models, on the other hand, do not make assumptions about the underlying distribution of the data. Instead, they learn the mapping between the input features and the output variable directly from the data. They are typically more flexible and can be applied to a wide range of problems, but may require more data to reach the same level of accuracy as parametric models. Examples of non-parametric models include:

- **K-nearest neighbors:** estimates the output value based on the K closest data points in the training set.
- **Decision trees:** builds a tree-like model of decisions and their possible consequences.
- **Random forest:** an ensemble of decision trees that aggregates their predictions.

In general, the choice between parametric and non-parametric models depends on the specific problem, the amount and quality of available data, and the assumptions that can be made about the underlying distribution. Parametric models are useful when the underlying distribution is known or can be reasonably assumed, while non-parametric models are more suitable when it is unknown or cannot be assumed. For example, to predict a student's grade based on study time, a linear regression model assuming a linear relationship between study time and grade would be an appropriate parametric model. To classify images of handwritten digits, however, a non-parametric K-nearest-neighbors model would be preferable, because the relationship between the pixels in the image and the corresponding digit may be too complicated to express parametrically.

### 3. Supervised vs. unsupervised models

Supervised and unsupervised models are the two major types of machine learning algorithms used to analyze and model datasets.

Supervised learning algorithms learn to make predictions or classify new data based on labeled examples in the training data. The input data is accompanied by known output values, and the algorithm learns a mapping function from input variables to output variables. The model is trained on a labeled dataset and then used to make predictions on new, unseen data. Examples of supervised learning algorithms include:

- **Linear regression:** predicts a continuous output variable based on one or more input variables.
- **Logistic regression:** classifies data into one of two categories based on one or more input variables.
- **Support Vector Machines (SVM):** classify data by finding the decision boundary that best separates the two classes.

Unsupervised learning algorithms, on the other hand, do not rely on labeled data for training. These models identify hidden patterns or structures in the data without any prior knowledge of the output, for example by clustering the data based on similarity. Examples of unsupervised learning algorithms include:

- **K-means clustering:** divides a dataset into K clusters based on the similarity of data points.
- **Principal Component Analysis (PCA):** reduces the dimensionality of the data while retaining the most important features.
- **Anomaly detection:** identifies unusual data points or outliers in the dataset.

In general, supervised learning is useful for problems where the output is known and the goal is to learn a mapping from input to output, such as classification and regression. Unsupervised learning is useful for discovering hidden patterns or structures in the data that can inform further analysis or guide other tasks. In a credit card fraud detection system, for example, supervised learning algorithms might learn patterns of fraudulent transactions from labeled samples in the training dataset. In a customer segmentation problem, on the other hand, we might use unsupervised learning algorithms to group customers by their purchasing behavior and uncover common patterns or customer segments.

### 4. Forecasting vs. explanatory inference

Supervised learning algorithms can be used for both forecasting and explanatory inference tasks, but the specific model and approach depend on the goal of the analysis.

Forecasting models predict future outcomes based on historical data. The goal is to identify patterns or trends in the data that can be used to make accurate predictions about future events, and these models are typically evaluated on their predictive accuracy and their ability to generalize to new data. Examples of supervised learning algorithms used for forecasting include:

- Time series models, such as ARIMA and exponential smoothing, which use historical data to forecast future values of a variable over time.
- Random forest regression, which can forecast continuous outcomes such as stock prices or sales volume.
- Recurrent neural networks (RNNs), which learn patterns in sequential data, such as text or speech, to make predictions about future events.

In contrast, explanatory inference models aim to identify causal relationships between variables and explain why certain outcomes occur. The goal is to understand the underlying mechanisms that drive the observed data and to identify the factors that contribute to a particular outcome. Examples of supervised learning algorithms used for explanatory inference include:

- Linear regression, which can quantify the relationship between two continuous variables and the effect of one variable on the other.
- Logistic regression, which can identify the factors that contribute to a binary outcome, such as whether a customer will buy a product.
- Decision trees, which can identify the most important variables and decision points that lead to a particular outcome.

A healthcare setting illustrates both tasks. Suppose we wish to forecast hospital readmissions and identify the factors that influence them. For the forecasting task, we might use a time series model to predict the number of readmissions in the coming months based on historical data. For the explanatory inference task, we might fit a logistic regression model to identify the factors that lead to readmission, such as age, comorbidities, and hospital length of stay. The logistic regression results could help healthcare practitioners determine which patients are at the greatest risk of readmission and identify interventions that could lower that risk.
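For the explanatory-inference side, a minimal sketch of how the fitted coefficients might be inspected, assuming scikit-learn; the readmission data and feature names here are synthetic stand-ins, not a real clinical dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the readmission example: the features follow the
# text above, but the values and effect sizes are made up for illustration.
rng = np.random.default_rng(1)
n = 500
X = pd.DataFrame({
    "age": rng.normal(65, 10, n),
    "num_comorbidities": rng.poisson(2, n),
    "length_of_stay_days": rng.normal(5, 2, n),
})
logit = -8 + 0.05 * X["age"] + 0.6 * X["num_comorbidities"] + 0.3 * X["length_of_stay_days"]
readmitted = rng.random(n) < 1 / (1 + np.exp(-logit))

# For explanatory inference we look at the coefficients (and their odds
# ratios), not only at predictive accuracy. Note that scikit-learn applies
# L2 regularization by default; a statistics package would be used for
# formal inference with standard errors and p-values.
model = LogisticRegression(max_iter=1000).fit(X, readmitted)
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name}: coefficient={coef:.3f}, odds ratio={np.exp(coef):.2f}")
```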
## Techniques

### 1. Reducible vs. non-reducible errors

When building a supervised learning model, two types of error can occur: reducible and non-reducible errors.

Reducible errors can be minimized or eliminated by improving the model itself. They arise from factors such as biased data, overfitting, underfitting, or poor feature selection, and can be reduced through proper data preparation, feature engineering, and model selection. Examples of reducible errors include:

- **Bias in the data:** Suppose we are building a model to predict whether a customer will churn. If the training data contains many more non-churning customers than churning customers, the model may be biased toward predicting non-churners, leading to errors for churners.
- **Overfitting:** Suppose we are building a model to predict housing prices from square footage and number of bedrooms. If we fit an overly flexible model to a small dataset, it may overfit the training data and fail to generalize to new data.
- **Underfitting:** Suppose we are building a model to predict credit card fraud from a few basic features such as transaction amount and location. A simple linear model that does not capture the complexity of the fraud patterns may underfit the data and predict fraud poorly.

Non-reducible errors, also known as irreducible errors, cannot be eliminated by improving the model itself. They arise from factors outside the model's control, such as measurement errors, data collection errors, or random noise in the data, and can only be mitigated by increasing the amount of data or improving its quality. Examples of non-reducible errors include:

- **Measurement errors:** Suppose we are building a model to predict the weather from historical data. If the data was collected with imperfect sensors or instruments, the resulting measurement errors cannot be eliminated by improving the model.
- **Random noise:** Suppose we are building a model to predict stock prices from historical data. Random, unpredictable fluctuations in the market limit how accurately any model can predict prices.

In general, reducible errors are addressed by improving the model itself, while non-reducible errors can only be mitigated by improving the quality and quantity of the data. Understanding the sources of both is important when building and evaluating supervised learning models.

### 2. The bias-variance trade-off

The bias-variance trade-off is a fundamental concept in machine learning that describes the relationship between model complexity, bias, and variance. In general, increasing the complexity of a model reduces bias but increases variance, and vice versa. Bias is the difference between the predicted output of a model and the true output; a model with high bias tends to underfit the data, meaning it is too simplistic to capture the underlying patterns. Variance is the variability of the model's predictions for a given input; a model with high variance tends to overfit the data, meaning it is so complex that it fits the noise in the data rather than the underlying patterns.
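The trade-off can be made concrete with a small sketch, assuming scikit-learn and a synthetic noisy curve (the housing example below makes the same point): a degree-1 fit is rigid (high bias), while a degree-15 fit chases the noise (high variance).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small synthetic dataset: a smooth curve plus noise.
rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=(30, 1))
y = np.sin(2 * x[:, 0]) + rng.normal(0, 0.2, size=30)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)

for degree in (1, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(x_train))
    test_err = mean_squared_error(y_test, model.predict(x_test))
    # Typically the degree-1 model shows sizeable error on both splits (bias),
    # while the degree-15 model shows near-zero training error but a much
    # larger test error (variance).
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```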
The goal of machine learning is to find a model that balances bias and variance to achieve good generalization on new data.

For example, suppose we are building a model to predict housing prices from square footage. We can fit models of varying complexity, from a simple linear regression to a high-degree polynomial regression. A simple linear regression may have high bias but low variance: it may underfit the data and have a large prediction error, but that error will be consistent across different data samples. A high-degree polynomial regression may have low bias but high variance: it may fit the training data well but be overly sensitive to noise and show a high prediction error on new data. To find the best balance between bias and variance, we can use techniques such as cross-validation, regularization, or ensemble methods, which help tune model complexity and improve the model's generalization to new data.

### 3. Overfitting

Overfitting is a common problem in machine learning in which a model is trained so closely to the training data that it learns the noise rather than the underlying patterns. This leads to poor generalization on new, unseen data, which is the ultimate goal of any machine learning model.

To illustrate, consider building a model to predict the presence of breast cancer from features such as age, tumor size, and malignancy score. A simple logistic regression model may fit the data and achieve good accuracy on the training set. However, if we keep increasing the model's complexity, for instance by adding more features or switching to a non-linear model such as a decision tree or a neural network, we may start to overfit: the model memorizes the training data and fits the noise rather than capturing the underlying patterns. As a result, it may achieve near-perfect accuracy on the training set but perform poorly on new, unseen data, meaning it has failed to generalize and is not useful for making accurate predictions.

To prevent overfitting, we can use techniques such as cross-validation, regularization, or early stopping. Cross-validation evaluates the model's performance on held-out data, regularization penalizes complex models to keep them simple, and early stopping halts training before the model begins to overfit. With these techniques, we can find the right balance between model complexity and generalization and build models that make accurate predictions on new, unseen data.

### 4. Cross-validation methods

Cross-validation is a technique used in machine learning to evaluate the performance of a model on new, unseen data. There are several variants; three commonly used methods are k-fold cross-validation, leave-one-out cross-validation, and stratified cross-validation.

**K-fold cross-validation:** K-fold cross-validation divides the data into k equal-sized subsets (folds) and performs k iterations of training and testing. In each iteration, one fold is used as the test set and the remaining k-1 folds are used as the training set, as in the sketch below.
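A minimal sketch of the procedure, assuming scikit-learn; the dataset is synthetic and logistic regression is just a placeholder estimator:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: each of the 5 iterations trains on 4 folds
# and scores the model on the held-out fold.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged estimate of generalization performance
```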
The model's performance is then estimated by averaging the results of the k iterations. For example, with a dataset of 1000 samples and 5-fold cross-validation, the data is divided into 5 subsets of 200 samples each and we perform 5 iterations of training and testing; in each iteration one subset is the test set and the remaining 4 subsets form the training set.

**Leave-one-out cross-validation:** Leave-one-out cross-validation is k-fold cross-validation with k equal to the number of samples in the dataset. In each iteration, a single sample is used as the test set and the remaining samples are used as the training set, and the performance is averaged over all iterations. For example, with a dataset of 100 samples, each iteration trains on 99 samples and tests on the remaining one, for 100 iterations in total.

**Stratified cross-validation:** Stratified cross-validation is used when the dataset is imbalanced, meaning one class is much more prevalent than another. The data is divided into k folds as in k-fold cross-validation, but the class distribution within each fold is kept similar to that of the entire dataset, so that every fold contains a representative sample of each class. For example, with 1000 samples of which 900 belong to class A and 100 to class B, stratified k-fold cross-validation keeps that 9:1 class ratio in every fold.

Overall, the choice of cross-validation method depends on the size of the dataset, the distribution of the samples, and the goals of the analysis. K-fold cross-validation is widely used and offers a good balance between computational cost and accuracy. Leave-one-out cross-validation is more computationally expensive but provides a lower-bias estimate of the model's performance. Stratified cross-validation is useful for imbalanced datasets where each fold should contain a representative sample of each class.

### 5. Data leakage and nested cross-validation

Data leakage occurs when information from outside the training data is used to train a machine learning model. This leads to over-optimistic performance estimates and poor generalization on new, unseen data. Nested cross-validation is a technique used to prevent data leakage and obtain an unbiased estimate of a model's performance; it uses an outer loop and an inner loop of cross-validation.

To illustrate, consider building a model to predict house prices from features such as square footage, number of bedrooms, and location, with a dataset of 1000 samples whose performance we want to evaluate using cross-validation. A common mistake is to perform cross-validation (including hyperparameter tuning) on the entire dataset and then use the same data to train and evaluate the final model. This leaks information, since the model has effectively already seen the test data during cross-validation, and the resulting performance estimate is over-optimistic. Nested cross-validation can be used to prevent this leakage and obtain an unbiased estimate of the model's performance.
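A minimal sketch of the procedure described in detail below, assuming scikit-learn; ridge regression and its `alpha` grid are hypothetical choices of model and hyperparameter:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression data standing in for the house-price example.
X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)

# Inner loop: 3-fold CV (inside GridSearchCV) tunes the hyperparameter
# using only whatever training data it is given.
inner = GridSearchCV(Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=3)

# Outer loop: 5-fold CV evaluates the tuned model on folds that were
# never used for tuning, giving a leakage-free performance estimate.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```

Here `GridSearchCV` plays the role of the inner loop and `cross_val_score` the role of the outer loop.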
In nested cross-validation, an outer loop of cross-validation splits the data into training and test sets, and an inner loop of cross-validation tunes the model's hyperparameters using only the training data. For example, suppose we use 5-fold cross-validation for the outer loop and 3-fold cross-validation for the inner loop. In each iteration of the outer loop, the data is split into 80% training and 20% test sets; 3-fold cross-validation on the training set selects the hyperparameters; and the model is then trained on the entire training set with the best hyperparameters and evaluated on the test set. Because the outer loop always evaluates the model on data it has never seen, nested cross-validation provides an unbiased estimate of the model's performance and helps prevent data leakage, while the inner loop handles hyperparameter tuning and the final model is trained on the entire training set using the best hyperparameters.

### 6. K-means and hierarchical clustering

K-means clustering and hierarchical clustering are two popular unsupervised learning techniques for clustering analysis. Both group data points into clusters based on the similarity or distance between them.

**K-means clustering:** K-means is an iterative algorithm that partitions a dataset into K clusters, where K is specified in advance. It minimizes the sum of squared distances between data points and their assigned cluster centroids: the algorithm randomly initializes K centroids, assigns each data point to the nearest centroid, updates each centroid to the mean of the points assigned to it, and repeats until convergence. Example: suppose we have a dataset of 1000 customers with their age and income information, and we want to group them by similarity in age and income. We can use k-means to partition the data into three clusters, each representing a group of customers with similar age and income.

**Hierarchical clustering:** Hierarchical clustering builds a hierarchy of clusters by recursively merging or splitting them according to a similarity or distance metric. It comes in two forms: agglomerative clustering starts with each data point in its own cluster and repeatedly merges the two closest clusters until all points belong to a single cluster, while divisive clustering starts with all points in one cluster and recursively splits it until each point is its own cluster. Example: suppose we have a dataset of 1000 patients with medical measurements such as blood pressure, cholesterol level, and BMI, and we want to cluster them by medical condition. Hierarchical clustering produces a dendrogram that shows the hierarchy of clusters and the relationships between them, which helps us decide how many clusters to use and which patients belong to each cluster based on the similarity of their medical conditions. A minimal sketch of both techniques follows.
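A sketch assuming scikit-learn, SciPy, and Matplotlib, with synthetic age/income data standing in for the customer example above:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import KMeans

# Synthetic customers drawn from three loose age/income groups.
# (In practice the features would usually be standardized first.)
rng = np.random.default_rng(3)
ages = np.concatenate([rng.normal(25, 3, 100), rng.normal(45, 3, 100), rng.normal(65, 3, 100)])
incomes = np.concatenate([rng.normal(30, 5, 100), rng.normal(70, 5, 100), rng.normal(50, 5, 100)])
X = np.column_stack([ages, incomes])

# K-means: partition into a pre-specified number of clusters (K = 3).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # number of customers in each cluster

# Agglomerative hierarchical clustering: build the full merge hierarchy
# (Ward linkage is just one common choice) and plot it as a dendrogram,
# which helps in choosing how many clusters to keep.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```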