Hierarchical clustering in practice

Customer segmentation:

Problem statement:

To segment customers into different groups based on their shopping trends (Mall Customers dataset).

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

dataset=pd.read_csv("D:\\Raj_DataScience\\Documents\\Mall_Customers.csv")

dataset.shape

(200, 5)

dataset.head()

   CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40

We will retain only two of these five columns: we can drop the CustomerID, Gender, and Age columns and keep only Annual Income (k$) and Spending Score (1-100). The spending score signifies how much a person spends in the mall on a scale of 1 to 100, with 100 being the highest spender.

data=dataset.iloc[:,3:5].values
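
Equivalently, the two columns can be selected by name rather than by position, which is a bit more robust if the column order ever changes; a minimal sketch, assuming the headers shown by head() above:

# Same two columns, selected by name instead of position (assumes the headers printed above).
data = dataset[['Annual Income (k$)', 'Spending Score (1-100)']].values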

Next, we need to decide how many clusters we want our data to be split into. We will again use the scipy library to create a dendrogram for our dataset.

import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10,7))

plt.title("Customer Dendrograms")

dend=shc.dendrogram(shc.linkage(data,method='ward'))

from sklearn.cluster import AgglomerativeClustering

cluster=AgglomerativeClustering(n_clusters=5,affinity='euclidean',linkage='ward')

cluster.fit_predict(data)

array([4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3,
       4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 1,
       4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2,
       1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2,
       0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
       0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
       0, 2], dtype=int64)
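
One caveat on the call above: newer scikit-learn releases replaced the affinity parameter with metric (affinity has since been removed entirely), and with linkage='ward' the distance must be Euclidean in any case. A minimal sketch of the equivalent call on recent versions:

# Equivalent call for newer scikit-learn versions, where `affinity` was replaced by `metric`.
# With ward linkage Euclidean distance is required, so the argument can also simply be omitted.
cluster = AgglomerativeClustering(n_clusters=5, metric='euclidean', linkage='ward')
cluster.fit_predict(data)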

In the above code, we import the hierarchy class of the scipy.cluster library as shc. The hierarchy class has a dendrogram method, which takes the value returned by the linkage method of the same class. The linkage method takes the dataset and the method used to measure distances between clusters as parameters. We use 'ward' since, at each step, it merges the pair of clusters that gives the smallest increase in within-cluster variance.
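
The same tree can also be cut into a chosen number of flat clusters directly in scipy, which is a handy cross-check on the sklearn result above; a minimal sketch reusing the ward linkage (the label numbering will differ from sklearn's):

from scipy.cluster.hierarchy import linkage, fcluster

# Build the ward linkage matrix once and cut the tree into 5 flat clusters.
Z = linkage(data, method='ward')
labels_scipy = fcluster(Z, t=5, criterion='maxclust')  # labels run from 1 to 5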

If we draw a horizontal line through the longest vertical distance that is not crossed by any horizontal line, it cuts the dendrogram into five clusters. Five is therefore the value we passed as n_clusters to the AgglomerativeClustering class of the sklearn.cluster library above.
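
To see where that cut falls, a horizontal line can be drawn on the dendrogram at a height that crosses exactly five vertical branches; a minimal sketch, where the cut height is an assumed value read off the plot rather than computed:

plt.figure(figsize=(10, 7))
plt.title("Customer Dendrograms")
dend = shc.dendrogram(shc.linkage(data, method='ward'))
plt.axhline(y=150, color='black', linestyle='--')  # assumed cut height; adjust to your dendrogram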

You can see the cluster labels for all of the data points in the output shown above. Since we have five clusters, we have five labels, i.e. 0 to 4.

Let's plot the clusters to see how our data has actually been clustered.

plt.figure(figsize=(10,7))

plt.scatter(data[:,0],data[:,1],c=cluster.labels_,cmap='rainbow')

You can see the data points in the form of five clusters.
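
Since the interpretation below depends on which axis is income and which is spending score, it helps to label the axes; a minimal sketch of the same plot with labels:

plt.figure(figsize=(10, 7))
plt.scatter(data[:, 0], data[:, 1], c=cluster.labels_, cmap='rainbow')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')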

The data points at the bottom right belong to customers with high incomes but low spending scores. These are the customers who spend their money carefully.

Similarly, the customers at the top right (green) are those with high incomes and high spending scores. These are the customers that companies target.

The customers in the middle (blue) are the ones with average incomes and average spending scores. The largest number of customers falls into this category. Companies can also target these customers, given how numerous they are.
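
These readings can be backed with numbers by averaging income and spending score per cluster; a minimal sketch reusing the fitted labels (column names follow the head() output above):

# Average income and spending score per cluster, plus cluster sizes,
# to check the descriptions of the segments given above.
segments = pd.DataFrame(data, columns=['Annual Income (k$)', 'Spending Score (1-100)'])
segments['cluster'] = cluster.labels_
print(segments.groupby('cluster').mean())
print(segments['cluster'].value_counts())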

