Hierarchical in practice

Customer segmentation:

Problem statement:

To segment customers into groups based on their shopping trends (Mall Customers data set).

import pandas as pd

import matplotlib.pyplot as plt

%matplotlib inline

dataset=pd.read_csv("D:\\Raj_DataScience\\Documents\\Mall_Customers.csv")

dataset.shape

(200,5)

dataset.head()

	CustomerID	Gender	Age	Annual Income (k$)	Spending Score (1-100)
0	1	Male	19	15	39
1	2	Male	21	15	81
2	3	Female	20	16	6
3	4	Female	23	16	77
4	5	Female	31	17	40

We will retain only two of these five columns. We drop the CustomerID, Gender and Age columns and keep only Annual Income and Spending Score. The Spending Score column signifies how often a person spends money in the mall on a scale of 1 to 100, with 100 being the highest spender.

data=dataset.iloc[:,3:5].values
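Positional slicing with iloc depends on the column order of the file. As a minimal sketch, the same selection can be made by column name. The small frame `sample` below is rebuilt by hand from the five rows printed by dataset.head(), purely for illustration.

```python
import pandas as pd

# The five rows printed by dataset.head(), rebuilt by hand for illustration.
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Age": [19, 21, 20, 23, 31],
    "Annual Income (k$)": [15, 15, 16, 16, 17],
    "Spending Score (1-100)": [39, 81, 6, 77, 40],
})

# Select the two features by name instead of by position.
features = ["Annual Income (k$)", "Spending Score (1-100)"]
sample_data = sample[features].values

print(sample_data.shape)
```

Selecting by name keeps working if the columns are ever reordered, whereas iloc[:,3:5] would silently pick up different columns.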

Next, we need to know how many clusters our data should be split into. We will again use the scipy library to create the dendrogram for our data set.

import scipy.cluster.hierarchy as shc

plt.figure(figsize=(10,7))

plt.title("Customer Dendrograms")

dend=shc.dendrogram(shc.linkage(data,method='ward'))

from sklearn.cluster import AgglomerativeClustering

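# Ward linkage always uses Euclidean distance, so no distance argument is needed

# (the 'affinity' argument has been removed in recent scikit-learn releases)

cluster=AgglomerativeClustering(n_clusters=5,linkage='ward')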

cluster.fit_predict(data)

array([4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3,
       4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 1,
       4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2,
       1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2,
       0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
       0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
       0, 2], dtype=int64)

In the dendrogram code, we import the hierarchy module of the scipy.cluster package as shc. The module has a dendrogram function, which takes the value returned by the linkage function of the same module. The linkage function takes the data set and the linkage method as parameters. We use 'ward' as the method since, at each step, it merges the pair of clusters that gives the smallest increase in within-cluster variance.
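To make the role of linkage more concrete, here is a small sketch on six made-up points (not the mall data). Each row of the returned matrix records one merge: the two clusters joined, the distance at which they were joined, and the size of the new cluster.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Six made-up 2-D points forming two obvious groups (illustrative only).
points = np.array([[1, 1], [1, 2], [2, 1],
                   [10, 10], [10, 11], [11, 10]], dtype=float)

# Ward linkage: at each step, merge the pair of clusters that gives the
# smallest increase in total within-cluster variance.
Z = shc.linkage(points, method="ward")

# One row per merge: [cluster_a, cluster_b, merge_distance, new_cluster_size]
print(Z.shape)     # n - 1 merges for n points, 4 columns
print(Z[-1, 3])    # size of the cluster created by the final merge
```

The third column holds the merge distances, which are the heights at which the dendrogram draws its horizontal lines.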

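If we draw a horizontal line through the longest vertical stretch of the dendrogram that no horizontal merge line crosses, it cuts five vertical lines, so we get five clusters. To assign each customer to one of these clusters, we use the AgglomerativeClustering class of the sklearn.cluster library with n_clusters=5.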
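The "longest vertical stretch" rule can also be applied numerically. The sketch below uses made-up points with three well-separated groups (not the mall data): it looks for the largest jump between consecutive merge distances in the linkage matrix and cuts the tree there. The helper name `clusters_from_largest_gap` is hypothetical, and the rule is a heuristic rather than a guaranteed way to pick the number of clusters.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

def clusters_from_largest_gap(Z):
    """Suggest a cluster count from the largest jump in merge distances."""
    heights = Z[:, 2]            # merge distances, in increasing order
    gaps = np.diff(heights)      # vertical stretch between consecutive merges
    i = int(np.argmax(gaps))     # largest gap lies between merge i and i + 1
    n = Z.shape[0] + 1           # number of original points
    # Cutting inside that gap keeps merges 0..i, leaving n - (i + 1) clusters.
    return n - (i + 1)

# Made-up data: three tight groups far apart (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [50, 0], [0, 50]])
points = np.vstack([c + rng.normal(scale=1.0, size=(20, 2)) for c in centers])

Z = shc.linkage(points, method="ward")
k = clusters_from_largest_gap(Z)

# Cut the tree so that at most k flat clusters remain.
labels = shc.fcluster(Z, t=k, criterion="maxclust")
print(k, len(set(labels)))
```

On real data the largest gap is often less clear-cut, so it is worth comparing the suggestion with the dendrogram plot before settling on a value.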

The array returned by fit_predict contains the cluster label of every data point. Since we have five clusters, there are five labels in the output, i.e. 0 to 4.

Let's plot the clusters to see how our data has actually been clustered.

plt.figure(figsize=(10,7))

plt.scatter(data[:,0],data[:,1],c=cluster.labels_,cmap='rainbow')

You can see the data points grouped into five clusters.

The data points at the bottom right belong to customers with high incomes but low spending. These are the customers who spend their money carefully.

Similarly, the customers at the top right (green) have high incomes and high spending. These are the customers that companies target.

The customers in the middle (blue) are the ones with average income and average spending. The largest number of customers belongs to this category. Companies can also target these customers, given that there are so many of them.
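A reading of the plot like this can be checked by averaging each feature per cluster label. The sketch below does so on made-up (income, spending) points scattered around five centers. The data, the column names `income` and `spending`, and the centers are assumptions for illustration, not the mall data set.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Made-up (income, spending) points around five centers; not the mall data.
rng = np.random.default_rng(42)
centers = [(25, 20), (25, 80), (55, 50), (90, 15), (90, 85)]
points = np.vstack([rng.normal(loc=c, scale=3.0, size=(30, 2)) for c in centers])

labels = AgglomerativeClustering(n_clusters=5, linkage="ward").fit_predict(points)

frame = pd.DataFrame(points, columns=["income", "spending"])
frame["cluster"] = labels

# Mean income, mean spending and number of customers in every cluster.
summary = frame.groupby("cluster").agg(
    mean_income=("income", "mean"),
    mean_spending=("spending", "mean"),
    customers=("income", "size"),
)
print(summary.round(1))
```

Applied to the real data, a table like this shows which label corresponds to the high-income, low-spending group and which cluster is the largest, without relying on the colours in the plot.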

Data science with_Raj
