
Customer Segmentation

EDA


We start by taking a look at the distributions of the numerical variables Age, ID and Income.

We notice that ID has a uniform distribution, which makes sense since it is a customer identifier; it is dropped below. The 'Age' variable has a heavy right skew, caused by the variable's lower bound at zero. For K-Means clustering there is no need to normalize the feature, but we may have to do so for other models. The 'Income' feature has the same right-skew problem as 'Age', so we will have to be wary of this depending on the model we select.
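A minimal sketch of this first look, assuming the data lives in a pandas DataFrame loaded from a hypothetical `segmentation_data.csv` with columns 'ID', 'Age' and 'Income' (the names discussed above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file name -- adjust to the actual dataset path.
df = pd.read_csv("segmentation_data.csv")

# Histograms of the numerical variables discussed above.
df[["ID", "Age", "Income"]].hist(bins=30, figsize=(12, 4), layout=(1, 3))
plt.tight_layout()
plt.show()

# Positive skewness confirms the right skew of Age and Income.
print(df[["Age", "Income"]].skew())

# ID is just an identifier, so we drop it before modelling.
df = df.drop(columns=["ID"])
```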


There is a small correlation between the Age and Income features. The slope of the fitted line is very small, which indicates that the correlation between the features is weak. Below we compute the Pearson correlation coefficient, which confirms that the two variables are only weakly correlated.
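A quick sketch of that calculation, continuing with the DataFrame `df` assumed above:

```python
from scipy.stats import pearsonr

# Pearson correlation coefficient between Age and Income, with its p-value.
r, p_value = pearsonr(df["Age"], df["Income"])
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")

# Equivalent pandas one-liner (no p-value):
print(df["Age"].corr(df["Income"]))
```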

General Conclusions about the data


Although there is no missing data, we noticed a few trends that can inform the models and the outcomes we should expect from them.

First, we notice that some numerical features (Age and Income) have right-skewed distributions. We will have to correct this for the model to perform well, since it assumes normality in our features; most likely, a log transform will fix the skew.
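A sketch of that correction with numpy, continuing with the `df` assumed above (log1p is used so zero values do not break the transform):

```python
import numpy as np

# log1p computes log(1 + x), which is safe for zero values.
df["Age_log"] = np.log1p(df["Age"])
df["Income_log"] = np.log1p(df["Income"])

# Skewness should now be much closer to 0.
print(df[["Age_log", "Income_log"]].skew())
```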

Second, about the data itself:

- There is a small correlation between age and income, as expected.
- People in smaller cities have lower income in this dataset.
- Income increases with the value of the 'Occupation' feature.
- Non-singles (married, divorced, widowed or separated) tend to have lower income than single people and, somewhat unexpectedly, also tend to be younger than single people in this dataset.
- Among older customers, males tend to have higher income than females.
- Most unemployed people and most married people in the dataset live in small cities.
- There are more unemployed women than men in the dataset.

[Figures: EDA plots supporting these observations]

Selecting the correct number of clusters using the Elbow Method

[Figure: elbow method plot]
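A minimal sketch of how such an elbow curve can be produced with scikit-learn, assuming the preprocessed features are in `df` (all numeric):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize so no single feature dominates the distance computation.
X = StandardScaler().fit_transform(df)

# Fit K-Means for a range of k and record the inertia (within-cluster SSE).
ks = range(1, 11)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.title("Elbow method")
plt.show()
```

The 'elbow' is the value of k after which the inertia stops dropping sharply; in this project it led to the 6 clusters used below.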

Visualization


We have already clustered the data into 6 distinct groups and applied PCA to reduce the original 7 features to 3. It is important to remember that PCA inherently loses information, so the projections of the data onto the new features X1, X2 and X3 can show some overlapping points; in the original feature space, however, K-Means assigns each point to exactly one cluster, so the boundaries are clearly defined.

[Figure: 3D scatter plot of the 6 clusters on the PCA components X1, X2 and X3]
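A sketch of the clustering and projection behind a plot like the one above, continuing from the standardized matrix `X` in the elbow sketch:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Final model: 6 clusters, as selected with the elbow method.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Project the original features down to 3 components for plotting.
pca = PCA(n_components=3)
X3 = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# 3D scatter of the clusters on the new features X1, X2, X3.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X3[:, 0], X3[:, 1], X3[:, 2], c=labels, s=15)
ax.set_xlabel("X1")
ax.set_ylabel("X2")
ax.set_zlabel("X3")
plt.show()
```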

Decision Tree as a method to interpret clusters


An alternative way to visualize and understand clusters is to use decision trees. We can train a decision tree to predict the label of each cluster we have determined; in doing so, the tree chooses splitting points based on the features we pass to the model. We can then write the cluster descriptions based on how the decision tree splits the data.
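A minimal sketch of that idea, reusing `X` and `labels` from the clustering step (the `max_depth=4` cap is an arbitrary choice to keep the splits readable):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Predict the K-Means cluster label from the original features.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42, stratify=labels
)

# A shallow tree keeps each root-to-leaf path readable as a cluster description.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```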

We will use graphviz as our tree visualization tool. If you haven't installed this library but have matplotlib, there is also an option to plot trees with matplotlib, but the visualization is much clearer with graphviz!
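Both options, sketched for the tree fitted above (`feature_names` is assumed to match the model's input features):

```python
import graphviz
import matplotlib.pyplot as plt
from sklearn.tree import export_graphviz, plot_tree

feature_names = list(df.columns)  # assumed to be the model features

# graphviz rendering (clearer output); writes clustering_tree.png.
dot = export_graphviz(tree, out_file=None, feature_names=feature_names,
                      filled=True, rounded=True)
graphviz.Source(dot).render("clustering_tree", format="png", cleanup=True)

# matplotlib fallback if graphviz is not installed.
plt.figure(figsize=(16, 8))
plot_tree(tree, feature_names=feature_names, filled=True)
plt.show()
```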

Classification report using Confusion Matrix


[Figure: confusion matrix and classification report for the decision tree]
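A sketch of how such a report is produced with scikit-learn, using the held-out split from the tree section:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```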

Visualization of the Clustering Tree


I'm calling the following tree a 'Clustering Tree', as it aids in defining the clustering algorithm's parameters and gives an idea of how to interpret the data from our results. The clustering tree returns the results shown below; it is important to note that the clusters are named in the same order as they were defined in the previous sections.

[Figure: the clustering tree]
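To read the same splits as plain-text rules (each root-to-leaf path describes one cluster), scikit-learn's `export_text` can be applied to the tree fitted earlier:

```python
from sklearn.tree import export_text

# Each printed path from root to leaf is a description of one cluster.
print(export_text(tree, feature_names=feature_names))
```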
