# Wine dataset

## I. Introduction

Wine quality holds great importance for consumers and production industries. To enhance sales, industries increasingly rely on product quality certification. Wine, being a globally consumed beverage, benefits from this certification to enhance its market value. In the past, quality testing was a time-consuming process conducted after production, requiring extensive resources, such as a diverse team of human experts for quality assessment, resulting in high costs. Moreover, the subjective nature of human opinion makes it challenging to identify wine quality based solely on expert input.

The wine quality dataset, publicly available in the UCI Machine Learning Repository [^1], consists of two files, red wine and white wine, focusing on the Portuguese "Vinho Verde" wine variants. This dataset is part of a vast collection extensively used by the machine learning community. The red wine dataset comprises 1599 instances, while the white wine dataset contains 4898 instances. Both datasets share 11 input features and one output feature. The input features are derived from physicochemical tests, while the output feature is based on sensory data and categorized into 11 quality classes ranging from 0 (representing 'very bad') to 10 (representing 'very good'). For the purpose of this report, the analysis will be conducted specifically on the white wine dataset.

## II. Dataset Exploration

### 1. Data Visualization

In order to gain a deeper understanding of the Wine dataset, we will now dive into the process of data visualization. This phase allows us to visualize and interpret the data in a graphical format, providing valuable insights and aiding in the identification of patterns, trends, and potential outliers. One common approach to data visualization is to create various types of plots and charts that represent the distribution and relationships between different variables.

Descriptive statistics, such as histograms and box plots, can provide a visual representation of the distribution of each attribute in the dataset. These visualizations allow us to identify the range, central tendency, and spread of the data for each attribute.

Correlation analysis is another useful technique in exploring the Wine dataset. By calculating the correlation coefficients between pairs of attributes, we can determine the strength and direction of their relationships. Scatter plots can be employed to visually display these relationships, where the position of data points on the plot indicates the values of the two attributes.

Outlier detection is crucial in understanding the data quality and potential anomalies. Through visualization, we can identify any data points that deviate significantly from the majority, indicating potential outliers. These outliers may need to be further examined and potentially addressed during the data preprocessing stage.

Lastly, dimensionality reduction techniques, such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE), can be applied to visualize high-dimensional data in a lower-dimensional space. These techniques help in identifying clusters, patterns, and the separability of different classes within the dataset.

By employing these data visualization techniques, we aim to gain meaningful insights into the Wine dataset, uncover hidden patterns, and address any potential issues that may arise, ultimately paving the way for effective model training and analysis.
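As a starting point for the exploration, the dataset can be loaded directly with pandas. The following is only a minimal sketch, assuming the semicolon-separated CSV hosted on the UCI repository; adjust the path if a local copy is used.

```python
import pandas as pd

# Path of the white wine file on the UCI repository; the file is semicolon-separated.
URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

df = pd.read_csv(URL, sep=";")

print(df.shape)                                    # expected: (4898, 12) -> 11 features + 'quality'
print(df["quality"].value_counts().sort_index())   # distribution of the sensory quality score
```

The class counts already hint at the imbalance that is addressed later with SMOTE.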
#### a. Histograms of Wine Dataset Attributes

Histograms provide a visual representation of the distribution of each attribute in the Wine dataset, allowing us to understand the range, central tendency, and spread of the data for each attribute. A histogram displays the frequency or count of data points within predefined bins or intervals, giving insight into the shape of the distribution and potential outliers.

![](https://hackmd.io/_uploads/ByBOoVyS3.png)

The histograms above present the distribution of each attribute in the Wine dataset. Let us explore some key observations:

:::success
**Fixed Acidity**: Most wines in the dataset have a fixed acidity level ranging from approximately 6 to 8 g/dm³.

**Volatile Acidity**: The distribution of volatile acidity appears to be slightly right-skewed, with most wines having volatile acidity values below 0.5 g/dm³. There might be a few outliers on the higher end of the scale.

**Citric Acid**: The citric acid distribution shows a peak around 0.4 g/dm³, indicating a significant proportion of wines with moderate citric acid content.

**Residual Sugar**: The data is skewed to the right, with most wines containing less than 10 g/dm³ of residual sugar. However, there seem to be some outliers with high sugar content.

**Chlorides**: The majority of wines have chloride levels below 0.1 g/dm³, but there are a few wines with higher chloride content, potentially indicating outliers.

**Free Sulfur Dioxide**: The distribution of free sulfur dioxide shows a peak around 30 mg/dm³, with a gradual decline as the concentration increases. Some outliers might exist at higher sulfur dioxide levels.

**Total Sulfur Dioxide**: Similar to free sulfur dioxide, the total sulfur dioxide distribution exhibits a peak around 125 mg/dm³, with a long tail to the right. Outliers might be present at higher concentrations.

**Density**: The density values cluster around 0.99 g/cm³, with a few outliers on both ends.

**pH**: The pH values are primarily within the range of 3 to 3.5, indicating a slightly acidic nature for the wines.

**Sulphates**: The distribution of sulphates shows a peak around 0.5 g/dm³, indicating a moderate concentration in most wines.

**Alcohol**: The alcohol content of wines appears to be roughly normally distributed, with a peak around 9.5% to 10% vol.
:::
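For reference, histograms like the figure above can be produced in a few lines with pandas and matplotlib. This is only a sketch, reusing the `df` loaded earlier; the bin count is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# One histogram per attribute; 30 bins is an arbitrary choice.
df.hist(bins=30, figsize=(14, 10), edgecolor="black")
plt.suptitle("Distribution of Wine dataset attributes")
plt.tight_layout()
plt.show()
```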
#### b. Box Plots of Wine Dataset Attributes

Box plots provide a visual summary of the distribution, central tendency, and variability of each attribute in the Wine dataset. They allow us to identify potential outliers, understand the spread of the data, and compare the attributes' characteristics.

![](https://hackmd.io/_uploads/SkHG0VyHh.png)

The box plots above present the distribution of each attribute in the Wine dataset.

:::warning
**Fixed Acidity**: The median value (marked by the red line within the box) is approximately 6.8 g/dm³. The distribution appears to be slightly right-skewed, with a few potential outliers on the higher end of the scale.

**Volatile Acidity**: The median value is around 0.26 g/dm³. The distribution shows a significant number of potential outliers, represented by individual points outside the whiskers.

**Citric Acid**: The median value is approximately 0.32 g/dm³. The data is skewed to the right, with a few outliers on the higher end.

**Residual Sugar**: The median value is around 5.2 g/dm³. The distribution is highly skewed to the right, indicating a concentration of wines with low sugar content. There seem to be several outliers with significantly higher sugar levels.

**Chlorides**: The median value is approximately 0.04 g/dm³. The data is slightly right-skewed, with a few potential outliers on the higher end.

**Free Sulfur Dioxide**: The median value is around 34 mg/dm³. The distribution exhibits some outliers on the higher end, but the spread of data appears relatively symmetrical.

**Total Sulfur Dioxide**: The median value is approximately 134 mg/dm³. Similar to free sulfur dioxide, the distribution shows outliers on the higher end, indicating potential data points with higher concentrations.

**Density**: The median value is around 0.99 g/cm³. The data is relatively symmetrical, with no visible outliers.

**pH**: The median value is approximately 3.18. The distribution appears close to symmetric, with no visible outliers.

**Sulphates**: The median value is around 0.47 g/dm³. The distribution shows a few potential outliers on the higher end.

**Alcohol**: The median value is approximately 10.4% vol. The data is relatively symmetrical, with no visible outliers.
:::
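A comparable box-plot grid can be drawn directly from the DataFrame. The sketch below again reuses `df`; giving each attribute its own axis keeps features with very different scales readable.

```python
import matplotlib.pyplot as plt

# One box plot per attribute, each on its own axis so scales do not clash.
df.plot(kind="box", subplots=True, layout=(3, 4), figsize=(14, 10), sharex=False, sharey=False)
plt.suptitle("Box plots of Wine dataset attributes")
plt.tight_layout()
plt.show()
```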
#### c. Correlation Matrix of Wine Dataset Attributes

The correlation matrix provides a visual representation of the pairwise correlations between attributes in the Wine dataset. It helps us understand the relationships and dependencies between different attributes, highlighting potential patterns.

![](https://hackmd.io/_uploads/HJarAVJSh.png)

The correlation matrix plot above shows the correlation coefficients between each pair of attributes in the Wine dataset.

:::info
In chemical interactions, there are various correlations between different substances. For example, fixed acidity is positively correlated with density and total sulfur dioxide. This suggests that higher fixed acidity levels are typically accompanied by proportional increases in density and total sulfur dioxide.

Meanwhile, volatile acidity demonstrates a contrasting relationship with pH and citric acid. It exhibits a strong negative correlation with pH, implying that pH tends to decrease significantly as volatile acidity increases. Furthermore, volatile acidity also shows a moderate negative correlation with citric acid, indicating that higher levels of volatile acidity are generally associated with somewhat lower levels of citric acid.

Citric acid, on the other hand, displays a moderate negative correlation with pH and a weak positive correlation with alcohol. This means that as the concentration of citric acid rises, pH tends to decrease moderately while alcohol increases slightly.

Residual sugar bears a weak positive correlation with density, suggesting that changes in residual sugar are likely to result in minor increases in density. However, there is a weak negative correlation between residual sugar and alcohol, indicating that as residual sugar increases, alcohol levels slightly decrease.

Chlorides exhibit a moderate positive correlation with both density and total sulfur dioxide. This implies that as chloride levels rise, we can expect a corresponding moderate increase in density and total sulfur dioxide.

Free sulfur dioxide shows a weak positive correlation with total sulfur dioxide, suggesting that increases in free sulfur dioxide are generally associated with slight increases in total sulfur dioxide.

Total sulfur dioxide exhibits a moderate positive correlation with both density and chlorides, suggesting that as the amount of total sulfur dioxide increases, there is a proportional rise in density and chlorides.

Density shows a strong negative correlation with alcohol, indicating that as density increases, alcohol levels decrease significantly.

pH shows a complex relationship with citric acid and fixed acidity. It exhibits a moderate negative correlation with citric acid and a weak negative correlation with fixed acidity. This implies that as pH levels rise, there is a moderate decrease in citric acid and a slight decrease in fixed acidity.

Sulphates display a weak positive correlation with alcohol, meaning that as sulphate levels rise, there is a slight corresponding increase in alcohol levels.

Lastly, alcohol exhibits a strong negative correlation with density and a weak positive correlation with citric acid and sulphates. This suggests that as alcohol levels increase, density decreases significantly while citric acid and sulphates increase slightly.
:::

Analyzing the correlation matrix helps us understand the relationships between attributes in the Wine dataset. For example, we observe a strong negative correlation between Volatile Acidity and pH, indicating that wines with higher volatile acidity tend to have lower pH values. Additionally, the negative correlation between Citric Acid and pH suggests that wines with higher citric acid content tend to have slightly lower pH values.
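A heatmap such as the one above can be generated from the pairwise Pearson correlations. The sketch below assumes seaborn is available and that `df` is the DataFrame loaded earlier.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations between all attributes, including 'quality'.
corr = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of Wine dataset attributes")
plt.tight_layout()
plt.show()
```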
#### d. PCA: Dimensionality Reduction

Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of potentially correlated variables into a set of linearly uncorrelated variables called principal components. The transformation is defined so that the first principal component has the largest possible variance, and each succeeding component, in turn, has the highest variance possible under the constraint that it is orthogonal to the preceding components. The principal components are orthogonal because they are the eigenvectors of the covariance matrix, which is symmetric.

PCA is sensitive to the relative scaling of the original variables. Therefore, in the Wine dataset, where measurements are taken in different units (such as g/dm³ for various acids, mg/dm³ for sulfur dioxide, and % vol for alcohol), data standardization is essential before applying PCA. This ensures that each attribute contributes equally to the principal components.

Applying PCA to the Wine dataset serves two primary purposes. First, it allows us to reduce the dimensionality of the data while retaining as much of the variance in the dataset as possible. In simpler terms, it provides a way to condense the information in the many original attributes into fewer new variables (principal components), making it easier to visualize and understand the data. Second, PCA can help us identify potential clusters in the data. For example, if wines of different quality ratings are distinct in terms of their physicochemical attributes, they might appear as separate clusters when the data are plotted according to their principal components.

![](https://hackmd.io/_uploads/ryXsj_Jrn.png)

The scatter plot above visualizes the Wine dataset in the space of the first two principal components. In this 2D space, each point corresponds to a wine sample, and its position is determined by the values of the principal components, which are linear combinations of the original attributes. From the scatter plot, we can observe some degree of clustering in the data. However, it is essential to note that this plot does not account for the 'quality' attribute, our primary variable of interest. Therefore, while the plot suggests some structure in the data, further analysis is required to determine whether this structure correlates with wine quality.

In conclusion, PCA provides a valuable tool for exploring and understanding the Wine dataset. Reducing the dimensionality of the data helps reveal patterns and structures that might not be apparent in the original high-dimensional data. Furthermore, visualizing the data in the space of the principal components offers insights into the relationships between different wines and their attributes, paving the way for further analysis and modeling of wine quality.
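A minimal PCA projection along the lines described above might look as follows, assuming scikit-learn. The features are standardized first, and the scatter plot deliberately ignores 'quality', mirroring the figure.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the 11 physicochemical features so each contributes equally,
# then project onto the first two principal components.
X = df.drop(columns="quality")
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(components[:, 0], components[:, 1], s=5, alpha=0.5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Wine dataset projected onto the first two principal components")
plt.show()
```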
### 2. Summary Statistics

#### a. Central Tendency

The measures of central tendency (mean and median) provide a way to summarize data with a single number representing the center of the data set. The mean is the average value, while the median is the middle value, dividing the data set into two halves. Below are the mean and median for each attribute in the Wine dataset:

| Attribute | Mean | Median |
| -------- | -------- | -------- |
| Fixed Acidity | 6.854788 | 6.8 |
| Volatile Acidity | 0.278241 | 0.26 |
| Citric Acid | 0.334192 | 0.32 |
| Residual Sugar | 6.391415 | 5.2 |
| Chlorides | 0.045772 | 0.043 |
| Free Sulfur Dioxide | 35.308085 | 34 |
| Total Sulfur Dioxide | 138.360657 | 134 |
| Density | 0.994027 | 0.99374 |
| pH | 3.188267 | 3.18 |
| Sulphates | 0.489847 | 0.47 |
| Alcohol | 10.514267 | 10.4 |
| Quality | 5.877909 | 6 |

#### b. Dispersion

The measures of dispersion (standard deviation and interquartile range) provide an overview of the spread or variability of the data. The standard deviation is a measure of the average distance between each data point and the mean. The interquartile range, on the other hand, is the range within which the central 50% of the data values fall. Below is the table of measures of dispersion:

| Attribute | Standard Deviation | Interquartile Range |
| -------- | -------- | -------- |
| Fixed Acidity | 0.843868 | 1 |
| Volatile Acidity | 0.100795 | 0.11 |
| Citric Acid | 0.121020 | 0.12 |
| Residual Sugar | 5.072058 | 8.2 |
| Chlorides | 0.021848 | 0.014 |
| Free Sulfur Dioxide | 17.007137 | 23 |
| Total Sulfur Dioxide | 42.498065 | 59 |
| Density | 0.002991 | 0.004377 |
| pH | 0.151001 | 0.19 |
| Sulphates | 0.114126 | 0.14 |
| Alcohol | 1.23062 | 1.9 |
| Quality | 0.88564 | 1 |
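The central-tendency and dispersion figures in the two tables can be reproduced directly from the DataFrame. A short sketch, assuming the `df` loaded earlier:

```python
# Mean, median, and standard deviation from describe(), plus the interquartile range.
summary = df.describe().T[["mean", "50%", "std"]].rename(columns={"50%": "median"})
summary["iqr"] = df.quantile(0.75) - df.quantile(0.25)
print(summary.round(6))
```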
### 3. Interpretation of Summary Statistics

The summary statistics offer a comprehensive overview of the dataset's attributes, central tendencies, and dispersion measures. We can detect skewness in the data distribution for each attribute by comparing the mean and median. When the mean and median of an attribute are close, as seen in 'Fixed Acidity' and 'Density', it suggests a symmetric distribution of data. Conversely, when the mean and median differ significantly, as in 'Residual Sugar', it indicates a skewed distribution. We also observe that the mean and median values of 'Quality' (5.88 and 6, respectively) are near the middle of the possible quality range (0-10). This finding suggests that the wine samples are, on average, of medium quality.

The measures of dispersion, the standard deviation and the interquartile range (IQR), provide insight into the variability and spread of the data. A high standard deviation, as seen in 'Total Sulfur Dioxide' and 'Residual Sugar', signifies a more extensive spread and indicates that the data points tend to be further from the mean. Conversely, a low standard deviation, as seen in 'Density' and 'Chlorides', suggests that the data points are closer to the mean. The IQR gives us a clearer picture of the central 50% of values, excluding outliers. For instance, the IQR for 'Residual Sugar' is relatively high, indicating significant data spread around the median and further confirming the skewed distribution.

The statistical summary of the dataset provides an essential foundation for our further analysis. It offers a valuable understanding of the underlying distribution and variability of the attributes, which will guide us in the feature selection process for our predictive modeling.

### 4. Feature Selection

The next step in our analysis is feature selection. By utilizing the insights gained from the summary statistics, we aim to select the features that contribute most significantly to our target variable, 'Quality'. Following feature selection, we will implement machine learning classification algorithms, which will form the backbone of our predictive model. This sequential and analytical approach ensures that our model is robust, accurate, and reliable in predicting wine quality.

### 5. Machine Learning Classification Algorithms

After feature selection, we will proceed to implement machine learning algorithms. Based on the literature review and previous studies on wine quality prediction, we have identified three classification algorithms that have demonstrated promising results:

:::warning
* **Support Vector Machine (SVM)**: An algorithm that has proven effective in high-dimensional spaces, including cases where the number of dimensions exceeds the number of samples.
* **Random Forest**: An ensemble learning method that operates by constructing multiple decision trees at training time and outputting the class that is the mode of the classes of the individual trees.
* **k-Nearest Neighbors (k-NN)**: A simple, instance-based learning algorithm that classifies a given observation based on the majority class of its k closest neighbors in the feature space. It is particularly useful when the data points form distinct clusters.
:::

Each of these algorithms has its own strengths and characteristics that make it suitable for our problem. As we proceed with the implementation, we will fine-tune these algorithms by optimizing their hyperparameters to ensure they deliver the best performance on our dataset. The final choice of algorithm will be based on a careful comparison of their performance metrics, such as accuracy, precision, recall, and the F1-score.

### 6. Model Tuning and Validation

Ensuring the generalizability and robustness of our models is critical to avoid overfitting and enhance prediction performance on unseen data. To this end, we will employ two essential techniques: ***Grid Search*** and ***Cross-Validation***.

***Grid Search*** is a hyperparameter tuning technique used to identify our models' optimal hyperparameters. This exhaustive search through a manually specified subset of the hyperparameter space of a learning algorithm provides us with the parameters that yield the most accurate predictions. Each of our chosen algorithms - SVM, Random Forest, and k-NN - comes with its own set of hyperparameters, and tuning them can significantly enhance model performance.

***Cross-Validation***, on the other hand, is a resampling procedure used to evaluate models on a limited data sample. The procedure has a single parameter, k, which refers to the number of groups a given data sample will be split into. We will utilize k-fold cross-validation, randomly partitioning the original sample into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The cross-validation process is then repeated k times, with each subsample used exactly once as the validation data. This method ensures that every observation from the original dataset can appear in both the training and test sets, which helps assess the model's effectiveness and its robustness to overfitting.

Moreover, we plan to streamline the model validation process by implementing a loop over candidate test_size values for our train-test split. While this parameter can be tuned manually, doing so is time-consuming and does not guarantee optimal results. By automating this process, we can focus our efforts on tuning the models' hyperparameters, which are likely to have a greater impact on our models' performance. The correct train-test split is crucial, as it ensures that our models are trained on a sufficiently large dataset and evaluated on an independent test set, enhancing their reliability and predictive power.

### 7. Testing with ML models

We proceed with implementing the machine learning algorithms by importing the relevant packages. We then follow a similar process for each algorithm to train the model, tune the hyperparameters, and evaluate the model performance.

#### 7.1 Preprocessing the Dataset

Before applying the machine learning models, some preprocessing steps are necessary to ensure the best performance of our models. A sketch of these steps follows the list below.

* **Pearson correlation**: Correlation measures how a pair of variables are related. Here, we use the Pearson correlation coefficient, which measures the linear relationship between two variables. This step allows us to identify and select the features that correlate most strongly with our target variable, the quality of the wine.
* **SMOTE (Synthetic Minority Over-sampling Technique)**: Our dataset is imbalanced, meaning that the classes of our target variable are unequally represented. Such an imbalance can produce a biased machine learning model, as it may predict the majority class most of the time. SMOTE is an oversampling method that creates synthetic samples of the minority classes, balancing the dataset. It helps increase the accuracy of the model without overfitting the data.
* **Getting the count of the least represented class**: This step is critical because it determines the number of neighbors to consider when creating synthetic samples with SMOTE. We use the count of the least represented class minus one as the number of nearest neighbors (or 1 if that count is not greater than 1).
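Below is a sketch of this preprocessing, assuming the imbalanced-learn package for SMOTE. The correlation threshold of 0.05 is a hypothetical value chosen for illustration, not necessarily the one used in this analysis.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE

# Keep features whose absolute Pearson correlation with 'quality' exceeds a
# (hypothetical) threshold of 0.05; the threshold used in practice may differ.
corr_with_quality = df.corr()["quality"].drop("quality")
selected = corr_with_quality[corr_with_quality.abs() > 0.05].index.tolist()

X = df[selected]
y = df["quality"]

# Size the SMOTE neighbourhood from the least represented class:
# k_neighbors must be smaller than the number of samples in that class.
minority_count = min(Counter(y).values())
k_neighbors = minority_count - 1 if minority_count > 1 else 1

smote = SMOTE(k_neighbors=k_neighbors, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print(Counter(y_resampled))  # every quality class now has the same count
```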
#### 7.2 Model Training and Evaluation

For each of the machine learning models, we follow these steps (a sketch of the loop appears after the list):

* Split the data into training and testing datasets. We use a list of different test sizes to see how the size of the test set affects the model's performance.
* Standardize the data so that all features have the same scale. This step is crucial because some machine learning algorithms, such as SVM and k-NN, are sensitive to the scale of the features.
* Initialize the machine learning model and set up the parameters for a grid search.
* Implement GridSearchCV to find the best parameters for the model.
* Fit the model to the training data.
* Use the best parameters to make predictions on the test data.
* Evaluate the model's performance using the classification report, which provides key metrics such as precision, recall, and the F1-score.
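The loop can be sketched as follows for Random Forest; SVM and k-NN follow the same pattern with their own estimators and grids. The parameter grid here is purely illustrative, and `X_resampled`/`y_resampled` are assumed to come from the preprocessing sketch above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler

for test_size in [0.2, 0.3]:
    # Split the resampled data, then standardize (fitting the scaler on the training set only).
    X_train, X_test, y_train, y_test = train_test_split(
        X_resampled, y_resampled, test_size=test_size, random_state=42, stratify=y_resampled
    )
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    # Hypothetical parameter grid; the grids used in the report may differ.
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 20]}
    grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
    grid.fit(X_train, y_train)

    y_pred = grid.best_estimator_.predict(X_test)
    print(f"test_size={test_size}, best params={grid.best_params_}")
    print(classification_report(y_test, y_pred, zero_division=0))
```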
#### 7.3 Result Compilation and Comparison

After training and evaluating all models, we compile and compare the results. The comparison helps us identify which model provides the highest prediction accuracy for wine quality in our dataset. Below are the tables for test sizes of 0.2 and 0.3, respectively.

***Test size of 0.2***

| Model | Precision | Recall | F1-Score | Accuracy |
| --- | --- | --- | --- | --- |
| SVM | 0.89 | 0.89 | 0.89 | 0.88 |
| k-NN | 0.87 | 0.87 | 0.88 | 0.88 |
| Random Forest | 0.88 | 0.89 | 0.89 | 0.89 |

***Test size of 0.3***

| Model | Precision | Recall | F1-Score | Accuracy |
| --- | --- | --- | --- | --- |
| SVM | - | - | - | - |
| k-NN | 0.86 | 0.86 | 0.86 | 0.86 |
| Random Forest | 0.87 | 0.87 | 0.87 | 0.87 |

From the results, we observe that the performance metrics (precision, recall, F1-score, and accuracy) of the models degrade slightly when we increase the test size from 0.2 to 0.3. The decline is seen for both the k-NN and Random Forest models. Hence, for this particular dataset, a test size of 0.2 appears better suited. The Random Forest model yielded the best results in both scenarios, proving to be the most suitable model for this dataset, with an accuracy of 0.89 at a test size of 0.2.

![](https://hackmd.io/_uploads/BJJywQsSh.png)

Regarding SVM, although the computation for a test size of 0.3 was not feasible due to computational constraints, the SVM model yielded good results at a test size of 0.2. Given that the performance of k-NN and Random Forest declined with an increase in test size, it can be inferred that the SVM model would follow a similar trend. Hence, the results obtained with the SVM model at a test size of 0.2 should be indicative enough for comparative analysis.

## III. Conclusion

The journey of computational analysis and prediction is seldom straightforward, particularly when handling datasets that are larger and more complex than the often-cited Iris dataset. Our experience has shown that managing a slightly larger dataset, such as this wine quality dataset, comes with unique challenges. The primary difficulty lies in the computational power required. The increased volume of data demands more memory and processing power, and certain operations, particularly those involving machine learning models such as SVM, can become significantly more time-consuming. As our test size increased, the computational demand exceeded our resources, which prevented us from determining the performance of the SVM model for a test size of 0.3.

Another challenge we experienced was data imbalance, a common issue in real-world datasets that can lead to biased models if not addressed correctly. Using SMOTE to balance the classes in our dataset was a crucial step toward ensuring the accuracy and fairness of our models.

Our study has reinforced the importance of careful data preprocessing and feature selection in ensuring the success of subsequent model training. By thoroughly analyzing our data, we identified potential issues and addressed them proactively. This meticulous approach contributed to the strong performance of our models. Feature selection, guided by the initial statistical summary and correlation matrix, allowed us to build models that are more interpretable and less prone to overfitting. These streamlined models, in turn, led to better predictive performance, demonstrating that feature selection is a vital step in the data analysis pipeline.

As we wrapped up our analysis, it became clear that Random Forest outperformed the SVM and k-NN models, achieving the highest accuracy score. Still, every model had strengths and contributed to our understanding of the dataset.

In conclusion, tackling larger datasets presents unique challenges, particularly regarding computational requirements and data imbalance. However, careful preprocessing and feature selection can mitigate these issues, setting the stage for successful model training and validation. While our analysis demonstrated that the Random Forest model was the best fit for our dataset, it is crucial to remember that the optimal model may vary depending on the specific characteristics of each dataset. Therefore, it is essential to approach each data analysis task with an open mind, always ready to adapt to the nuances of the data at hand.

## References

[^1]: Cortez, P., Cerdeira, A., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. *Decision Support Systems*, 47(4), 547-553.