# Customer Segmentation 4 (3 clusters): 31/10/2023

* Number of clusters: **3**
* Segmentation year: **2023**
* Selected features:
    * Customer_Financial_Health
    * Customer_Commercial_Region
    * Customer_Size
    * Customer_Seniority_YearsGroup
    * Customer_WithEmployee_Evolution
    * Customer_MainPartner_Type_Segment
    * updated_customer_segment
    * Customer_Sector_category
    * case_main_type_cat
    * XSelling_NoLegalEntity_cat
    * XSelling_NoProducts_cat
    * Amount_Invoicing_cat
    * cases_nb_cat
    * Digital_Sessions_cat
    * cases_complaint_ratio_cat
    * partner_nb_cat
    * case_mean_working_hours_cat
    * contract_direct_ratio_cat
    * Amount_Invoicing_Per_Person_cat
    * Customer_Size_Evolution%_cat
    * Customer_NoOfCases_Evolution%_cat
    * Customer_NoOfProducts_Evolution%_cat
    * Customer_Invoicing_Evolution_cat
    * Customer_Invoicing_Evolution%_cat
    * Customer_Invoicing_PerPerson_Evolution%_cat
    * Customer_NoOfCases_Evolution_cat
    * Customer_Size_Evolution_cat
    * Customer_NoOfProducts_Evolution_cat
    * Customer_Invoicing_PerPerson_Evolution_cat
    * cluster_id

## Clustering-specific metrics:

### Silhouette Score

* **Definition**: This metric is the mean silhouette coefficient over all samples. For each sample, let a be its average distance to the other members of its own cluster (cohesion) and b its average distance to the members of the nearest cluster to which it doesn't belong (separation). The sample's silhouette coefficient is (b - a) / max(a, b), which ranges from -1 to 1.
* **Metric meaning**: Silhouette Score considers both how close points in the same cluster are to each other and how separated a cluster is from its nearest neighboring cluster.
* **Interpretation**:
    - A score close to 1 implies the sample is well clustered.
    - A score close to 0 implies the sample is on or very close to the decision boundary between two neighboring clusters.
    - A score close to -1 implies the sample is incorrectly clustered.
* **Result**:
    * Silhouette score: **0.09**

### Davies-Bouldin Score

- **Definition**: This metric averages, over all clusters, the similarity of each cluster with its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. Hence, the closer the DB index is to 0, the better.
- **Interpretation**:
    - Lower values indicate better clustering.
    - A lower Davies-Bouldin score relates to a model with better separation between the clusters.
- **Metric meaning**: Davies-Bouldin Score evaluates the ratio of within-cluster to between-cluster distances.
- **Results**:
    - Davies-Bouldin: **2.91**

### Calinski-Harabasz Score (Variance Ratio Criterion)

- **Definition**: This metric is the ratio of between-cluster dispersion to within-cluster dispersion, where dispersion is a sum of squared distances and each term is normalized by its degrees of freedom. Unlike the DB index, the score has no upper bound.
- **Interpretation**:
    - Higher values indicate better clustering.
    - A higher Calinski-Harabasz score relates to a model with denser, better-separated clusters.
- **Metric meaning**: Calinski-Harabasz Score evaluates the ratio of between-cluster to within-cluster dispersion, which is a different construction from the Davies-Bouldin ratio.
- **Results**:
    - Calinski-Harabasz score: **357.2**

## Cluster distribution

| Cluster | Size (number of customers) | Invoicing |
| ------- | --------------------:|:-------------------:|
| 0 | 156.550 (59.2%) | 68.807.162 (26.7%) |
| 1 | 31.459 (11.9%) | 156.197.782 (60.7%) |
| 2 | 76.573 (28.9%) | 32.452.269 (12.6%) |

## SHAP analysis of cluster representatives

### Cluster 0:

* SHAP graph of feature importances: ![](https://hackmd.io/_uploads/HkjRe9RG6.png)
* Heatmaps of the most relevant features:
    * Customer_NoOfCases_Evolution_cat: ![heatmap_nb_cases_evo.png](https://hackmd.io/_uploads/By8RQxWX6.png)
    * Customer_MainPartner_Type_Segment: ![heatmap_main_partner_type.png](https://hackmd.io/_uploads/rkvwOl-m6.png)
    * Amount_Invoicing_Per_Person_cat: ![heatmap_invoice_per_person.png](https://hackmd.io/_uploads/Sk2aFeWma.png)
* SHAPStory analysis using ChatGPT:

The AI model predicted that the customer is part of the cluster with a 100% probability. The most influential positive SHAP values were from the features: 'Customer_MainPartner_Type_Segment', 'updated_customer_segment', 'Amount_Invoicing_cat', 'Customer_NoOfCases_Evolution_cat', 'Customer_Invoicing_PerPerson_Evolution_cat', and 'Amount_Invoicing_Per_Person_cat'. This suggests that the customer's main partner type, updated customer segment, total invoicing, evolution of the number of cases, evolution of invoicing per person, and invoicing per employee significantly contributed to the customer being part of the cluster. On the other hand, the most influential negative SHAP value was from the 'Customer_Seniority_YearsGroup' feature, indicating that the customer's seniority years negatively influenced the prediction. The interaction between these features could be that the customer's financial health, commercial region, and size may have influenced the type of main partner they have, their customer segment, and their invoicing amount. The customer's seniority years could have affected their number of cases and invoicing per person, leading to their classification in the cluster.

*In conclusion, the customer's classification as part of the cluster was primarily due to their main partner type, customer segment, total invoicing, number of cases evolution, invoicing per person evolution, and invoicing per employee, despite their seniority years.*

### Cluster 1:

* SHAP graph of feature importances: ![](https://hackmd.io/_uploads/r1mk-5AMT.png)
* Heatmaps of the most relevant features:
    * updated_customer_segment: ![](https://hackmd.io/_uploads/SJmdNs0Gp.png)
    * cases_nb_cat: ![heatmap_case_nb.png](https://hackmd.io/_uploads/Bybs5gbma.png)
    * Customer_Size: ![](https://hackmd.io/_uploads/BJfldi0Ma.png)
* SHAPStory analysis using ChatGPT:

The AI model predicted with 100% certainty that the customer is part of the cluster, based on the customer's data. The most influential positive SHAP values were for the features 'XSelling_NoLegalEntity_cat', 'cases_nb_cat', and 'case_main_type_cat'. This suggests that the number of different legal entities for cross-selling contracts, the number of cases opened by the customer, and the main category of the customer's cases significantly contributed to the customer being part of the cluster. On the other hand, the most influential negative SHAP values were for 'Customer_Sector_category' and 'Customer_Seniority_YearsGroup'. This indicates that the customer's sector activity and their seniority in years negatively influenced the prediction. The interaction between these features could be that customers with a high number of legal entities for cross-selling contracts and a high number of cases are more likely to be part of the cluster, regardless of their sector activity or seniority.

*In summary, the classification may have occurred due to the customer's high engagement with the company through numerous cases and cross-selling contracts, despite their sector activity and seniority.*

### Cluster 2:

* SHAP graph of feature importances: ![](https://hackmd.io/_uploads/SJp1-90f6.png)
* Heatmaps of the most relevant features:
    * updated_customer_segment: ![](https://hackmd.io/_uploads/SJmdNs0Gp.png)
    * Amount_Invoicing_Per_Person_cat: ![heatmap_invoice_per_person.png](https://hackmd.io/_uploads/Sk2aFeWma.png)
    * Customer_NoOfCases_Evolution_cat: ![heatmap_nb_cases_evo.png](https://hackmd.io/_uploads/By8RQxWX6.png)
* SHAPStory analysis using ChatGPT:

The AI model predicted with a 100% probability that the customer is part of the cluster. The most influential positive SHAP values were the 'updated_customer_segment', 'Amount_Invoicing_cat', and 'Amount_Invoicing_Per_Person_cat'. This suggests that the customer's type, total invoicing, and invoicing per employee significantly contributed to the prediction. The customer is a 'Principal_Entrepreneur' with a total invoicing between '101 - 334' and invoicing per person between '100 - 300'. On the other hand, the most influential negative SHAP value was 'Customer_Financial_Health'. The customer's financial health was 'Unknown', which might have negatively influenced the prediction. The interaction between these features suggests that despite the unknown financial health, the customer's type and invoicing details were strong indicators of their cluster membership.

*In summary, the classification occurred due to the customer's type, total invoicing, and invoicing per employee, despite the unknown financial health.*

## Business Features Analysis of all clusters

* Invoicing groups (proportion): ![](https://hackmd.io/_uploads/Hyhv4jAG6.png)
* Customer segments (proportion): ![](https://hackmd.io/_uploads/SJmdNs0Gp.png)
* Activity sectors (proportion): ![](https://hackmd.io/_uploads/SkYY4oRMa.png)
* Direct/Indirect ratio (proportion): ![](https://hackmd.io/_uploads/BJ6IBsAzp.png)
* Customer size (proportion): ![](https://hackmd.io/_uploads/BJfldi0Ma.png)
* XSelling number of products: ![heatmap_xselling.png](https://hackmd.io/_uploads/BJTfleZm6.png)
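
As a complement to the "Clustering-specific metrics" section, the sketch below shows one way the three scores can be computed from their textbook definitions. It is illustrative only: the toy 2-D points, the Euclidean distance, and the helper names `centroid` and `clustering_metrics` are assumptions, and the report does not state which library or distance was used on the categorical features of the real segmentation. In practice, scikit-learn exposes the same metrics as `silhouette_score`, `davies_bouldin_score` and `calinski_harabasz_score` in `sklearn.metrics`.

```python
import math
from collections import defaultdict


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def clustering_metrics(X, labels):
    """Silhouette, Davies-Bouldin and Calinski-Harabasz scores (needs >= 2 clusters)."""
    clusters = defaultdict(list)
    for x, label in zip(X, labels):
        clusters[label].append(x)
    keys = list(clusters)
    n, k = len(X), len(keys)

    # Silhouette: s = (b - a) / max(a, b) per sample, then averaged.
    coefficients = []
    for x, label in zip(X, labels):
        own = clusters[label]
        if len(own) == 1:
            coefficients.append(0.0)  # usual convention for singleton clusters
            continue
        # dist(x, x) = 0, so summing over the whole cluster and dividing
        # by (size - 1) gives the mean distance to the *other* members.
        a = sum(math.dist(x, y) for y in own) / (len(own) - 1)
        b = min(
            sum(math.dist(x, y) for y in clusters[other]) / len(clusters[other])
            for other in keys
            if other != label
        )
        coefficients.append((b - a) / max(a, b))
    silhouette = sum(coefficients) / n

    centers = {c: centroid(pts) for c, pts in clusters.items()}

    # Davies-Bouldin: mean over clusters of the worst
    # (within-scatter_i + within-scatter_j) / centroid-distance_ij ratio.
    scatter = {
        c: sum(math.dist(p, centers[c]) for p in pts) / len(pts)
        for c, pts in clusters.items()
    }
    davies_bouldin = sum(
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in keys
            if j != i
        )
        for i in keys
    ) / k

    # Calinski-Harabasz: between-cluster over within-cluster dispersion,
    # each normalized by its degrees of freedom.
    overall = centroid(X)
    between = sum(
        len(pts) * math.dist(centers[c], overall) ** 2
        for c, pts in clusters.items()
    )
    within = sum(
        math.dist(p, centers[c]) ** 2 for c, pts in clusters.items() for p in pts
    )
    calinski_harabasz = (between / (k - 1)) / (within / (n - k))

    return {
        "silhouette": silhouette,
        "davies_bouldin": davies_bouldin,
        "calinski_harabasz": calinski_harabasz,
    }


# Toy data (assumption): two tight, well-separated groups of three points.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [0, 0, 0, 1, 1, 1]
metrics = clustering_metrics(X, labels)
print(metrics)
```

On this toy data the groups are compact and far apart, so the silhouette is high, the Davies-Bouldin index is low and the Calinski-Harabasz score is large. That is the opposite pattern from the values reported for this segmentation (0.09, 2.91 and 357.2), keeping in mind that the Calinski-Harabasz score depends on sample size and is only comparable between runs on the same data.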