07 Lab [en] - Data visualization, Clustering

# 07 Lab [en] - Data visualization, Clustering

###### tags: `Data Visualization` `clustering` `data mining`

[TOC]

# Introduction - Purpose of the exercise

The purpose of this exercise is to improve your ability to use visual analytics tools in Tableau.

# 1. Clustering

Among the analytical tools offered by Tableau are methods for grouping data, known as cluster analysis. Clustering is a technique for grouping objects with similar properties; each cluster formed in the process is a class of similar objects.

Clustering is a machine learning method that belongs to the class of unsupervised learning methods. Its aim is to find "natural" clusters in a set of objects. Its counterpart in supervised learning is classification.

The clustering problem is to divide the set **P** into **k** clusters such that the function **F** takes a maximum value, where:

* **k** - number of clusters,
* **P** - the set of objects,
* **F** - cluster quality assessment function (objective function).

Clustering splits the data into partitions that give insight into unlabelled data. It gives structure to the data by grouping similar data points.

Clustering algorithms appear in many everyday applications, and cluster analysis is a long-standing staple of machine learning. Clustering has applications in many machine learning tasks: label generation, label validation, dimensionality reduction, semi-supervised learning, reinforcement learning, computer vision, and natural language processing. For a data scientist, cluster analysis is one of the first tools used during exploratory analysis to identify natural partitions in the data.

## Sample clustering problem

A typical clustering algorithm groups examples in unsupervised learning. The k-means algorithm basically does the following:

* Iteratively determines the best k center points (known as centroids).
* Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

![](https://i.imgur.com/k2Vwe30.png)

*A Cartesian plot with several dozen data points.*

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

![](https://i.imgur.com/8w8LeLJ.png)

*The same Cartesian plot as in the previous illustration, except with three centroids added. The previous data points are clustered into three distinct groups, with each group representing the data points closest to a particular centroid.*

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of the dogs in each cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustrations show k-means for examples with only two features (height and width). Note that k-means can group examples across many features.

## Example of clustering - analysing the attractiveness of countries regarding tourism for senior citizens

The aim of the task is to classify countries into groups within which the tourism potential of people aged 65+ is at a similar level.

## 1.1 Source data

The analysis will be carried out on the Tableau built-in set of economic indicators - **World Indicators**.

## 1.2 Identification of variables used for clustering

During the preparation of the analysis, it is the designer/analyst who decides which variables will be used to perform the clustering process. The choice of variables should be logically justified, i.e.
one should look for variables whose values can be decisive for the breakdown under analysis. In our task, I propose the following clustering variables:

* life expectancy of women, life expectancy of men - the longer people in the 65+ group live, the longer they can participate in the tourism market,
* size of the 65+ group (in % of the general population) - the larger it is, the bigger the tourist market addressed to it can be,
* total tourist expenditure per person in a given country - a higher value positively influences the market of tourist services.

## 1.3 Evaluation of clustering results

To evaluate the clustering results, choose the *Describe clusters* option from the pop-up menu.

![](https://i.imgur.com/gCCparp.png)

This option brings up a window with the following information:

* the coordinates of the centres of each cluster,
* the number of points classified in each group,
* **Between-group sum of squares** - a metric quantifying the separation between clusters as a sum of squared distances between each cluster’s center (average value) and the center of the data set, weighted by the number of data points assigned to the cluster. The larger the value, the better the separation between clusters.
* **Within-group sum of squares** - a metric quantifying the cohesion of clusters as a sum of squared distances between the center of each cluster and the individual marks in the cluster. The smaller the value, the more cohesive the clusters.

:::info
In general, the larger the value of the metric **(between-group sum of squares)/(total sum of squares)**, the more successful the analysis. This metric takes values in the range [0, 1]. The ratio gives the proportion of variance explained by the model - larger values typically indicate a better model.
However, you can increase this ratio just by increasing the number of clusters, so it could be misleading if you compare a five-cluster model with a three-cluster model using just this value.
:::

![](https://i.imgur.com/PCIX6SR.png)

## 1.4 Create a group from cluster results

If you drag a cluster to the Data pane, it becomes a group dimension in which the individual members (Cluster 1, Cluster 2, etc.) contain the marks that the cluster algorithm has determined are more similar to each other than they are to other marks.

After you drag a cluster group to the Data pane, you can use it in other worksheets. Drag Clusters from the Marks card to the Data pane to create a Tableau group:

![](https://i.imgur.com/DBSitm3.png)

### 1.4.1 Refit saved clusters

When you save a Clusters field as a group, it is saved with its analytic model. You can use your cluster groups in other worksheets and workbooks; however, they don't refresh automatically. If the underlying data changes, you can use the **Refit** option to refresh and recompute the data for a saved clusters group.

![](https://i.imgur.com/bsc4xVx.png)

# 2. Exercise - Customer Segmentation: Clustering

Try to cluster the customer dataset:

* choose appropriate measures/dimensions for clustering,
* choose the optimal number of clusters,
* describe clusters as groups.

## 2.1 Kaggle

> **Kaggle**, a subsidiary of Google LLC, is an online community of data scientists and machine learning practitioners. Kaggle allows users to find and publish data sets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.
>
> ![](https://i.imgur.com/sEG0wnQ.png)
>
> Kaggle got its start in 2010 by offering machine learning competitions and now also offers a public data platform, a cloud-based workbench for data science, and Artificial Intelligence education. Equity was raised in 2011 valuing the company at $25 million.
> On 8 March 2017, Google announced that they were acquiring Kaggle.

### 2.1.1 Dataset

[Customer Dataset](https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/input) - Customer Personality Analysis is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier to modify products according to the specific needs, behaviors and concerns of different types of customers.

Customer personality analysis helps a business to modify its product based on its target customers from different customer segments. For example, instead of spending money to market a new product to every customer in the company’s database, a company can analyze which customer segment is most likely to buy the product and then market the product only to that particular segment.

## 2.2 Scatter plot

![image](https://hackmd.io/_uploads/BkyeBqrgR.png)
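
When deciding how many clusters to show on the scatter plot, it can help to see how the sum-of-squares metrics from section 1.3 behave as k changes. The following is a minimal pure-Python sketch of k-means on made-up (height, width) points in the spirit of the dog example. It is not Tableau's implementation: the function names, the farthest-point initialization, and the sample data are all assumptions chosen for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def farthest_point_init(points, k):
    """Deterministic start (an assumption of this sketch): take the first
    point, then repeatedly add the point farthest from the chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iterations=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each example goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, labels


def sum_of_squares(points, centroids, labels):
    """Return (between-group, within-group, total) sums of squares."""
    grand_mean = mean_point(points)
    within = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    between = sum(
        labels.count(c) * squared_distance(centroids[c], grand_mean)
        for c in range(len(centroids))
    )
    total = sum(squared_distance(p, grand_mean) for p in points)
    return between, within, total


# Hypothetical (height, width) measurements, invented for this example.
points = [
    (20, 10), (22, 11), (21, 9),
    (40, 20), (42, 22), (41, 19),
    (60, 30), (62, 31), (61, 29),
]

for k in (2, 3, 4):
    centroids, labels = k_means(points, k)
    between, within, total = sum_of_squares(points, centroids, labels)
    print(f"k={k}: between/total = {between / total:.4f}, within = {within:.2f}")
```

On this toy data the ratio is higher for k=3 than for k=2, and it would reach 1 if every point were its own cluster. That is why the ratio alone should not be used to compare models with different numbers of clusters.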