# Customer Segmentation 6 (3 clusters): 07/11/2023
* Number of clusters: **3**
* Segmentation year: **2023**
* Selected features:
  * Customer_Financial_Health
  * Customer_Commercial_Region
  * Customer_Language
  * Customer_Size
  * Customer_Seniority_YearsGroup
  * Customer_DirectIndirect
  * Customer_MainPartner_Type_Segment
  * contract_distinct_product_group
  * updated_customer_segment
  * Customer_Sector_category
  * XSelling_NoLegalEntity_cat
  * XSelling_NoProducts_cat
  * Customer_Size_Evolution%_cat
  * Customer_NoOfProducts_Evolution_cat
  * cluster_id
## Clustering-specific metrics:
### Silhouette Score
* **Definition**:
This metric is the mean silhouette coefficient over all samples. For each sample, let a be its average distance to the other members of its own cluster (cohesion) and b its average distance to the members of the nearest cluster it does not belong to (separation). The sample's silhouette coefficient is (b - a) / max(a, b), which ranges from -1 to 1.
* **Metric meaning**:
Silhouette Score considers both how close points in the same cluster are to each other and how separated a cluster is from its nearest neighboring cluster.
* **Interpretation**:
- A score close to 1 implies the sample is well clustered.
- A score close to 0 implies the sample is on or very close to the decision boundary between two neighboring clusters.
- A score close to -1 implies the sample is incorrectly clustered.
* **Result**:
* silhouette score: **0.27**
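
The definition above can be sketched in a few lines of standard-library Python. This is only an illustration on made-up toy points, not the code that produced the 0.27 reported here; the function name `mean_silhouette` and the toy data are assumptions. In practice a library routine such as scikit-learn's `sklearn.metrics.silhouette_score` would normally be used.

```python
from math import dist
from statistics import mean

def mean_silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean); assumes at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean(dist(point, q) for q in members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)

# Toy data (hypothetical): two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(toy_points, [0, 0, 1, 1])  # groups match the labels
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # labels cut across the groups
```

With labels that follow the two groups the score is close to 1; with labels that cut across them it turns negative.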
### Davies-Bouldin Score
- **Definition**:
This metric evaluates the average similarity ratio of each cluster with its most similar cluster, where similarity is a ratio of within-cluster distances to between-cluster distances. Hence, the closer to 0 the DB index, the better.
- **Interpretation**:
- Lower values indicate better clustering.
- A lower Davies-Bouldin score relates to a model with better separation between the clusters.
- **Metric meaning**:
The Davies-Bouldin score evaluates the ratio of within-cluster to between-cluster distances.
- **Results**:
- Davies-Bouldin: **7.58**
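
As a rough sketch of the definition above, the index can be computed from cluster centroids and the mean distance of each cluster's points to its centroid. The function name `davies_bouldin` and the toy points are illustrative assumptions, not the pipeline behind the 7.58 reported here; scikit-learn offers `sklearn.metrics.davies_bouldin_score` for real data.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def davies_bouldin(points, labels):
    """Davies-Bouldin index (Euclidean); assumes at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    centers = {k: centroid(v) for k, v in clusters.items()}
    # Scatter: mean distance of a cluster's points to its own centroid
    scatter = {
        k: sum(dist(p, centers[k]) for p in v) / len(v)
        for k, v in clusters.items()
    }
    # For each cluster, keep the similarity to its most similar other cluster
    worst = [
        max(
            (scatter[i] + scatter[j]) / dist(centers[i], centers[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)

# Toy data (hypothetical): two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
db_good = davies_bouldin(toy_points, [0, 0, 1, 1])  # 0.1
db_bad = davies_bouldin(toy_points, [0, 1, 0, 1])   # 10.0
```

A pairwise ratio exceeds 1 whenever the combined scatter of two clusters is larger than the distance between their centroids, which is why the mislabeled toy example scores far above the well-separated one.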
### Calinski-Harabasz Score (Variance Ratio Criterion)
- **Definition**:
This metric is the ratio of between-cluster dispersion to within-cluster dispersion, where dispersion is a sum of squared distances and each term is normalized by its degrees of freedom. The score has no upper bound.
- **Interpretation**:
- Higher values indicate better clustering.
- A higher Calinski-Harabasz score relates to a model with denser, better-separated clusters.
- **Metric meaning**:
The Calinski-Harabasz score evaluates the ratio of between-cluster to within-cluster dispersion. Unlike Davies-Bouldin, it uses squared distances to centroids over all clusters at once rather than comparing each cluster with its most similar neighbor.
- **Results**:
- Calinski-Harabasz score: **634.7**
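
A minimal sketch of the variance ratio, assuming Euclidean distances and the usual normalization by (k - 1) and (n - k) for k clusters and n samples. The function name `calinski_harabasz` and the toy points are illustrative; they are not the code or data behind the 634.7 reported here. scikit-learn provides `sklearn.metrics.calinski_harabasz_score` for real data.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def calinski_harabasz(points, labels):
    """Variance ratio criterion; assumes 2 <= number of clusters < number of points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    n, k = len(points), len(clusters)
    overall = centroid(points)
    # Between-cluster dispersion: size-weighted squared distance of each
    # cluster centroid to the overall centroid
    between = sum(
        len(members) * sq_dist(centroid(members), overall)
        for members in clusters.values()
    )
    # Within-cluster dispersion: squared distance of each point to its centroid
    within = sum(
        sq_dist(p, centroid(members))
        for members in clusters.values()
        for p in members
    )
    return (between / (k - 1)) / (within / (n - k))

# Toy data (hypothetical): two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
ch_good = calinski_harabasz(toy_points, [0, 0, 1, 1])  # 200.0
ch_bad = calinski_harabasz(toy_points, [0, 1, 0, 1])   # 0.02
```

Because the score is unbounded and grows with sample size, it is mainly useful for comparing alternative clusterings of the same data rather than as an absolute quality measure.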
## Cluster distribution
| Cluster | Size (number of customers) | Invoicing |
| ------- | ---------------------:|:-------------------:|
| 0 | 170.514 (64.1%) | 84.816.753 (32.9%) |
| 1 | 58.483 (22.0%) | 96.127.485 (37.3%) |
| 2 | 37.061 (13.9%) | 76.500.229 (29.7%) |
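
The percentages in the table can be re-derived from the raw counts. The sketch below assumes the dots in the table are thousands separators (so `170.514` means 170,514 customers); the variable and function names are illustrative.

```python
# Values copied from the table, reading "." as a thousands separator (assumption).
sizes = {0: 170_514, 1: 58_483, 2: 37_061}
invoicing = {0: 84_816_753, 1: 96_127_485, 2: 76_500_229}

def shares(values):
    """Percentage of the total for each cluster, rounded to one decimal."""
    total = sum(values.values())
    return {k: round(100 * v / total, 1) for k, v in values.items()}

size_share = shares(sizes)
invoicing_share = shares(invoicing)

# Average invoicing per customer in each cluster (unit as in the source table)
per_customer = {k: invoicing[k] / sizes[k] for k in sizes}
```

Under that reading, the shares match the table, and average invoicing per customer rises from roughly 497 in cluster 0 to about 1,644 in cluster 1 and about 2,064 in cluster 2: cluster 0 holds most customers but invoices the least per customer.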
## SHAP analysis of a representative customer per cluster
### Cluster 0:
* SHAP feature importance plot:

* SHAP heatmaps for the most important features:
  * Customer_Language:

  * updated_customer_segment:

  * Customer_Sector_category:

* SHAPStory analysis using ChatGPT:
The AI model predicted with 100% certainty that the customer is part of the cluster. The most influential positive SHAP values were for the features 'Customer_Sector_category', 'Customer_Commercial_Region', and 'Customer_Financial_Health'. This suggests that the customer's sector of activity, commercial region, and financial health were significant contributors to the model's prediction. The customer's sector of activity being 'Public Administration & Services' and their commercial region being 'Liège - Verviers - Eupen - Namur' likely contributed positively to their classification. Their financial health being 'Unknown' also had a positive impact, perhaps indicating that customers with unknown financial health are often part of this cluster.
On the other hand, the 'Customer_Language' feature had the most influential negative SHAP value, indicating that the customer's language, being German, decreased the likelihood of them being part of the cluster.
*In conclusion, the model's prediction was largely influenced by the customer's sector of activity, commercial region, and financial health. Despite the negative contribution of the customer's language, the positive contributions from the other features were strong enough to result in a prediction of the customer being part of the cluster.*
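
How positive and negative SHAP values combine into one prediction can be illustrated with exact Shapley values on a toy additive score. Everything numeric below is hypothetical (the effects, the base value, and the three-feature subset); the report does not state which model or explainer produced its SHAP values, so this is only a sketch of the underlying idea, not the actual analysis.

```python
from itertools import permutations

# Hypothetical per-feature effects on a cluster-membership score (illustrative only).
EFFECTS = {
    "Customer_Sector_category": 0.30,
    "Customer_Commercial_Region": 0.20,
    "Customer_Language": -0.10,
}
BASE_VALUE = 0.50  # hypothetical average model output

def value(present):
    """Toy model output when only the features in `present` are known."""
    return BASE_VALUE + sum(EFFECTS[f] for f in present)

def shapley_values(features, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for feature in order:
            before = value_fn(present)
            present.add(feature)
            totals[feature] += value_fn(present) - before
    return {f: total / len(orderings) for f, total in totals.items()}

phi = shapley_values(list(EFFECTS), value)
prediction = value(set(EFFECTS))
# Additivity: base value plus all contributions reproduces the prediction
reconstructed = BASE_VALUE + sum(phi.values())
```

The negative contribution of `Customer_Language` lowers the score, but the larger positive contributions outweigh it, which mirrors the pattern described in the narrative above.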
### Cluster 1:
* SHAP feature importance plot:

* SHAP heatmaps for the most important features:
  * Customer_DirectIndirect:

  * Customer_Financial_Health:

  * updated_customer_segment:

* SHAPStory analysis using ChatGPT:
The AI model predicted that the customer is part of the cluster with a high probability of 98.48%. The most influential positive SHAP values were the Customer_Commercial_Region, Customer_Language, Customer_DirectIndirect, and contract_distinct_product_group. This suggests that the customer's commercial region being Oost-Vlaanderen, speaking Dutch, having a direct contract, and having a diverse product group significantly contributed to the model's prediction.
On the other hand, the most influential negative SHAP values were the Customer_Financial_Health and Customer_Sector_category. The customer's unknown financial health and their sector being Public Administration & Services negatively influenced the prediction.
There could be an interaction between the customer's commercial region and language, as Dutch is commonly spoken in Oost-Vlaanderen. Similarly, the customer's direct contract and diverse product group could indicate a strong business relationship, which might be common in the cluster.
*In summary, the customer's commercial region, language, contract type, and product diversity were key factors in predicting their cluster membership, despite some negative influence from their unknown financial health and sector.*
### Cluster 2:
* SHAP feature importance plot:

* SHAP heatmaps for the most important features:
  * Customer_MainPartner_Type_Segment:

  * contract_distinct_product_group:

  * Customer_DirectIndirect:

* SHAPStory analysis using ChatGPT:
The AI model predicted with a high probability that the customer is part of a cluster, and this prediction was correct. The most influential positive SHAP values were for the features 'Customer_MainPartner_Type_Segment', 'XSelling_NoLegalEntity_cat', and 'XSelling_NoProducts_cat'. This suggests that the customer's main partner being of 'Accountant 4' type, the presence of different legal entities for cross-selling contracts, and the number of different products for cross-selling contracts significantly contributed to the customer being part of the cluster.
On the other hand, the most influential negative SHAP values were for 'Customer_Seniority_YearsGroup', 'updated_customer_segment', and 'Customer_Sector_category'. This indicates that the unknown seniority of the customer, the customer being an 'Employer', and the customer's sector being 'Other' negatively influenced the prediction.
The interaction between these features could have led to the prediction. For instance, the customer's main partner being of 'Accountant 4' type and the presence of different legal entities for cross-selling contracts could indicate a complex business structure, which is common in certain clusters.
*In conclusion, the classification occurred due to the combination of the customer's main partner type, cross-selling contracts details, seniority, customer type, and sector. These factors, when considered together, led the model to correctly predict the customer's cluster membership.*
## Business Features Analysis of all clusters
* Customer Segment:

* Activity Sectors:

* Direct/Indirect customers:

* Customer Size:

* Cross-selling number of products:
