# Customer Segmentation 6 (4 clusters): 08/11/2023
* Number of clusters: **4**
* Segmentation year: **2023**
* Selected features:
    * Customer_Financial_Health
    * Customer_Commercial_Region
    * Customer_Language
    * Customer_Size
    * Customer_Seniority_YearsGroup
    * Customer_DirectIndirect
    * Customer_MainPartner_Type_Segment
    * contract_distinct_product_group
    * updated_customer_segment
    * Customer_Sector_category
    * XSelling_NoLegalEntity_cat
    * XSelling_NoProducts_cat
    * Customer_Size_Evolution%_cat
    * Customer_NoOfProducts_Evolution_cat
    * cluster_id
## Clustering-specific metrics:
### Silhouette Score
* **Definition**:
This metric is the mean silhouette coefficient over all samples. For each sample, let a be its average distance to the other members of its own cluster (cohesion) and b its average distance to the members of the nearest cluster to which it doesn't belong (separation). The sample's silhouette coefficient is (b - a) / max(a, b), which ranges from -1 to 1.
* **Metric meaning**:
Silhouette Score considers both how close points in the same cluster are to each other and how separated a cluster is from its nearest neighboring cluster.
* **Interpretation**:
- A score close to 1 implies the sample is well clustered.
- A score close to 0 implies the sample is on or very close to the decision boundary between two neighboring clusters.
- A score close to -1 implies the sample is incorrectly clustered.
* **Result**:
* silhouette score: **0.26**
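The definition above can be sketched in a few lines of standard-library Python. This is only an illustration: the toy 2-D points, the Euclidean distance, and the function names are assumptions, since the report does not state which distance or implementation produced the 0.26 score.

```python
from math import dist
from statistics import mean


def silhouette_samples(points, labels):
    """Return the silhouette coefficient s = (b - a) / max(a, b) of each sample."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the sample's own cluster.
        same = [dist(p, q) for j, (q, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        a = mean(same)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean([dist(p, q) for q, lab in zip(points, labels) if lab == other])
            for other in set(labels) - {own}
        )
        scores.append((b - a) / max(a, b))
    return scores


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples."""
    return mean(silhouette_samples(points, labels))


# Toy data (hypothetical): two tight groups that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_score(points, [0, 0, 1, 1])
badly_assigned = silhouette_score(points, [0, 1, 0, 1])
print(f"well separated: {well_separated:.2f}, badly assigned: {badly_assigned:.2f}")
```

With the natural grouping the score is close to 1, and with the deliberately mixed labels it turns negative, which matches the interpretation given above.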
### Davies-Bouldin Score
- **Definition**:
This metric is the average, over all clusters, of each cluster's similarity to its most similar cluster, where similarity is the ratio of within-cluster distances to between-cluster distances. The minimum value is 0, and the closer the DB index is to 0, the better.
- **Interpretation**:
- Lower values indicate better clustering.
- A lower Davies-Bouldin score relates to a model with better separation between the clusters.
- **Metric meaning**:
The Davies-Bouldin Score evaluates the ratio of within-cluster to between-cluster distances.
- **Results**:
- Davies-Bouldin: **13.4**
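As a rough sketch of how this index is assembled, the snippet below computes it on toy data with plain Python. The points, the Euclidean distance, and the helper names are assumptions made for illustration, not the pipeline that produced the figure above.

```python
from math import dist


def centroid(pts):
    """Component-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def davies_bouldin(points, labels):
    """Average, over clusters, of the worst (s_i + s_j) / d(c_i, c_j) ratio."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    cents = {lab: centroid(pts) for lab, pts in clusters.items()}
    # s_i: mean distance of a cluster's members to its own centroid.
    scatter = {lab: sum(dist(p, cents[lab]) for p in pts) / len(pts)
               for lab, pts in clusters.items()}
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Toy data (hypothetical): two tight groups that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
db_good = davies_bouldin(points, [0, 0, 1, 1])
db_bad = davies_bouldin(points, [0, 1, 0, 1])
print(f"compact clusters: {db_good:.2f}, mixed clusters: {db_bad:.2f}")
```

The compact grouping yields a value near 0, while the mixed labelling, whose centroids almost coincide, yields a much larger one.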
### Calinski-Harabasz Score (Variation Ratio Criterion)
- **Definition**:
This metric is the ratio of between-cluster dispersion to within-cluster dispersion, where dispersion is a sum of squared distances to the relevant centroid and each term is normalized by its degrees of freedom (k - 1 and n - k for k clusters and n samples). The score has no upper bound.
- **Interpretation**:
- Higher values indicate better clustering.
- A higher Calinski-Harabasz score relates to a model with dense, well-separated clusters.
- **Metric meaning**:
The Calinski-Harabasz Score also compares between-cluster and within-cluster variation, but differently than Davies-Bouldin: it sums squared distances to centroids over all clusters instead of looking at each cluster's most similar neighbor.
- **Results**:
- Calinski-Harabasz score: **518.7**
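For reference, the Calinski-Harabasz index is the between-cluster dispersion divided by the within-cluster dispersion, each scaled by its degrees of freedom, so higher values are better. A minimal standard-library sketch follows; the toy points and helper names are assumptions and not the code behind the score reported above.

```python
def centroid(pts):
    """Component-wise mean of a list of points."""
    return tuple(sum(coord) / len(pts) for coord in zip(*pts))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def calinski_harabasz(points, labels):
    """(between / (k - 1)) / (within / (n - k)) for k clusters and n samples."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    n, k = len(points), len(clusters)
    overall = centroid(points)
    # Between-cluster dispersion: size-weighted squared distance of each
    # cluster centroid to the overall centroid.
    between = sum(len(pts) * sq_dist(centroid(pts), overall)
                  for pts in clusters.values())
    # Within-cluster dispersion: squared distance of each point to its centroid.
    within = sum(sq_dist(p, centroid(pts))
                 for pts in clusters.values() for p in pts)
    return (between / (k - 1)) / (within / (n - k))


# Toy data (hypothetical): two tight groups that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
ch_good = calinski_harabasz(points, [0, 0, 1, 1])
ch_bad = calinski_harabasz(points, [0, 1, 0, 1])
print(f"compact clusters: {ch_good:.2f}, mixed clusters: {ch_bad:.2f}")
```

Because the score is unbounded and depends on the number of samples, it is mainly useful for comparing alternative clusterings of the same dataset.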
## Cluster distribution
| Cluster | Size (number of customers) | Invoicing |
| ------- | ------------------------:|:-------------------:|
| 0 | 164.888 (62.0%) | 74.501.568 (28.9%) |
| 1 | 49.889 (18.8%) | 90.162.032 (35.0%) |
| 2 | 31.767 (11.9%) | 66.862.404 (26.0%) |
| 3 | 19.514 (7.3%) | 25.918.463 (10.0%) |
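The percentage columns can be recomputed from the raw figures as a quick consistency check. The sketch below copies the counts and invoicing amounts from the table; the dictionary layout and the `shares` helper are illustrative assumptions, and recomputed values may differ slightly from the published ones depending on rounding.

```python
# Raw figures copied from the table (dots in the table are thousands separators).
clusters = {
    0: {"customers": 164_888, "invoicing": 74_501_568},
    1: {"customers": 49_889, "invoicing": 90_162_032},
    2: {"customers": 31_767, "invoicing": 66_862_404},
    3: {"customers": 19_514, "invoicing": 25_918_463},
}


def shares(rows, column):
    """Percentage of the column total contributed by each cluster."""
    total = sum(row[column] for row in rows.values())
    return {cid: 100 * row[column] / total for cid, row in rows.items()}


customer_share = shares(clusters, "customers")
invoicing_share = shares(clusters, "invoicing")

for cid in clusters:
    print(f"cluster {cid}: {customer_share[cid]:.1f}% of customers, "
          f"{invoicing_share[cid]:.1f}% of invoicing")
```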
## SHAP analysis of cluster representatives
### Cluster 0:
* SHAP graph of feature importances:

* Heatmaps of the most important SHAP features:
* Customer_Language:

* updated_customer_segment:

* Customer_Sector_category:

* SHAPStory analysis using ChatGPT:
The AI model predicted with 100% certainty that the customer is part of the cluster, which was confirmed by the actual outcome. The most influential positive SHAP values were for the features 'Customer_Sector_category', 'Customer_Language', and 'updated_customer_segment'. This suggests that the customer's sector of activity being 'Public Administration & Services', their language being French, and their type being 'Principal_Entrepreneur' significantly contributed to the model's prediction. On the other hand, the most influential negative SHAP value was for 'Customer_DirectIndirect', indicating that the customer having at least one direct contract decreased the likelihood of them being part of the cluster.
The model may have identified a pattern where French-speaking customers in the public administration and services sector who are principal entrepreneurs are likely to be part of the cluster. However, having a direct contract could be a characteristic of customers not in the cluster. Therefore, the classification may have occurred due to the customer's sector, language, and type outweighing the negative influence of having a direct contract.
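To make the sign convention in these SHAP stories concrete, the sketch below computes exact Shapley values for a tiny, made-up membership score. Everything in it is hypothetical: the scoring function, its weights, and the three binary features are invented for illustration, and the report does not say which model or explainer was used (the `shap` library usually approximates these values rather than enumerating every feature subset).

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values of `instance` relative to `baseline`."""
    names = list(instance)
    n = len(names)

    def value(subset):
        # Features in `subset` take the instance value, the rest the baseline.
        x = {k: (instance[k] if k in subset else baseline[k]) for k in names}
        return model(x)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


def toy_membership_score(x):
    """Hypothetical score for belonging to a cluster (not the report's model)."""
    return (0.5 + 0.3 * x["language_is_french"]
            + 0.2 * x["sector_is_public"]
            - 0.1 * x["has_direct_contract"])


instance = {"language_is_french": 1, "sector_is_public": 1, "has_direct_contract": 1}
baseline = {"language_is_french": 0, "sector_is_public": 0, "has_direct_contract": 0}
phi = shapley_values(toy_membership_score, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

Positive values push the score towards cluster membership and negative values push it away. The contributions always add up to the difference between the score of the instance and the score of the baseline.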
### Cluster 1:
* SHAP graph of feature importances:

* Heatmaps of the most important SHAP features:
* Customer_DirectIndirect:

* Customer_Financial_Health:

* updated_customer_segment:

* SHAPStory analysis using ChatGPT:
The AI model predicted with 100% certainty that the customer is part of the cluster. The most influential positive SHAP value was the customer's financial health, which was classified as 'Green'. This suggests that the customer's strong financial standing significantly contributed to the model's prediction. The customer's direct type of contract, commercial region, and language also had high positive SHAP values, indicating that these factors also played a crucial role in the prediction.
On the other hand, the evolution of the number of products of the customer had a negative SHAP value, implying that a decrease in the number of products might have slightly reduced the probability of the customer being part of the cluster.
*In summary, the customer's strong financial health, direct type of contract, commercial region, and language were key factors that led the AI model to predict that the customer is part of the cluster. The decrease in the number of products slightly offset these factors, but not enough to change the overall prediction.*
### Cluster 2:
* SHAP graph of feature importances:

* Heatmaps of the most important SHAP features:
* contract_distinct_product_group:

* XSelling_NoLegalEntity_cat:

* Customer_MainPartner_Type_Segment:

* SHAPStory analysis using ChatGPT:
The AI model predicted with a 100% probability that the customer is part of the cluster. The most influential positive SHAP values were the 'Customer_Seniority_YearsGroup' and 'XSelling_NoLegalEntity_cat'. This suggests that the customer's long-standing relationship with the company (10-15 years) and the number of different legal entities for cross-selling contracts significantly contributed to the prediction.
On the other hand, the 'Customer_Size_Evolution%_cat' and 'Customer_NoOfProducts_Evolution_cat' had the least influence on the prediction, indicating that the evolution of the customer's company size and the number of products did not significantly impact the model's decision.
The 'Customer_MainPartner_Type_Segment' and 'contract_distinct_product_group' also had high SHAP values, suggesting that the type of the customer's main partner and the variety of contracts the customer has with the company also played a role in the prediction.
*In conclusion, the model's prediction was primarily influenced by the customer's long-standing relationship with the company, the variety of legal entities for cross-selling contracts, the type of the customer's main partner, and the variety of contracts the customer has with the company. The evolution of the customer's company size and the number of products had minimal impact on the prediction.*
### Cluster 3:
* SHAP graph of feature importances:

* Heatmaps of the most important SHAP features:
* Customer_DirectIndirect:

* Customer_MainPartner_Type_Segment:

* Customer_Seniority_YearsGroup:

* SHAPStory analysis using ChatGPT:
The AI model predicted with a high probability that the customer is part of the cluster. The most influential positive SHAP values were Customer_Size, updated_customer_segment, and Customer_Financial_Health. This suggests that the customer's company size, type, and financial health played a significant role in the model's prediction. The customer's company size was small (1-4), indicating a small business, and the customer's financial health was classified as 'Green', indicating a healthy financial status. The customer's type was classified as 'Employer', which might suggest a certain level of stability and predictability in their behavior.
On the other hand, the least influential features were Customer_Size_Evolution%_cat and Customer_NoOfProducts_Evolution_cat, indicating that the evolution of the customer's company size and the number of products did not significantly impact the prediction.
*In conclusion, the model's prediction was primarily influenced by the customer's company size, type, and financial health, while the evolution of the customer's company size and the number of products had minimal impact.*
## Business Features Analysis of all clusters
* Customer Segment:

* Activity sectors:
