K-Means clustering explanation

Suppose we have the following set of two-dimensional data points, named D:

D = {(5,3),(10,15),(15,12),(24,10),(30,45),(85,70),(71,80),(60,78),(55,52),(80,91)}

We want to divide this data into two clusters (K = 2, i.e., C1 and C2) based on the similarity between the data points.

The first step is to initialize the centroids of both clusters; here the first two data points are used:

C1 = (5, 3)

C2 = (10, 15)

Iteration 1

| S. No. | Data Point | Euclidean distance from C1 = (5, 3) | Euclidean distance from C2 = (10, 15) | Assigned Cluster |
|--------|------------|-------------------------------------|---------------------------------------|------------------|
| 1  | (5, 3)   | 0      | 13     | C1 |
| 2  | (10, 15) | 13     | 0      | C2 |
| 3  | (15, 12) | 13.45  | 5.83   | C2 |
| 4  | (24, 10) | 20.24  | 14.86  | C2 |
| 5  | (30, 45) | 48.87  | 36     | C2 |
| 6  | (85, 70) | 104.35 | 93     | C2 |
| 7  | (71, 80) | 101.41 | 89     | C2 |
| 8  | (60, 78) | 93     | 80     | C2 |
| 9  | (55, 52) | 70     | 58     | C2 |
| 10 | (80, 91) | 115.52 | 103.32 | C2 |

Clustering is done on the basis of Euclidean distance: for each point we compare its distance to both centroids and assign it to the nearer one. Look at the third data point, (15, 12): it is 13.45 units from C1 but only 5.83 units from C2, so it is assigned to C2.
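The assignment step just described can be sketched in Python (a minimal illustration with names of my own choosing, not a library implementation):

```python
import math

# Dataset D and the initial centroids from the example above
D = [(5, 3), (10, 15), (15, 12), (24, 10), (30, 45),
     (85, 70), (71, 80), (60, 78), (55, 52), (80, 91)]
c1, c2 = (5, 3), (10, 15)

def euclidean(p, q):
    """Straight-line distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Assign each point to the nearest centroid
assignments = []
for p in D:
    d1, d2 = euclidean(p, c1), euclidean(p, c2)
    assignments.append("C1" if d1 <= d2 else "C2")

print(assignments)  # point 3, (15, 12), goes to C2 since 5.83 < 13.45
```

Ties (d1 == d2) are broken in favor of C1 here; any consistent rule works.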

After assigning data points to the corresponding clusters, the next step is to calculate the new centroid values. These values are calculated by finding the means of the coordinates of the data points that belong to a particular cluster.

For cluster C1 there is currently only one point, (5, 3), so the mean of the coordinates is unchanged and the new centroid for C1 remains (5, 3).

For C2 there are currently 9 data points, so the new centroid C2 = (C2(x), C2(y)) is computed as:

C2(x) = (10+15+24+30+85+71+60+55+80)/9 = 47.77

C2(y) = (15+12+10+45+70+80+78+52+91)/9 = 50.33

The updated centroid value for C2 is now (47.77, 50.33).
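The update step is just a coordinate-wise mean; a small Python sketch (helper name is my own):

```python
def mean_centroid(points):
    """Coordinate-wise mean of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

# The nine points assigned to C2 in iteration 1
c2_points = [(10, 15), (15, 12), (24, 10), (30, 45), (85, 70),
             (71, 80), (60, 78), (55, 52), (80, 91)]

new_c2 = mean_centroid(c2_points)
print((round(new_c2[0], 2), round(new_c2[1], 2)))  # (47.78, 50.33)
```

Note that 430/9 = 47.777…, which the text truncates to 47.77 while rounding gives 47.78; the tiny difference does not change any cluster assignment.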

For the next iteration, the new centroid values C1 and C2 will be used and the whole process will be repeated.

Iteration 2

| S. No. | Data Point | Euclidean distance from C1 = (5, 3) | Euclidean distance from C2 = (47.77, 50.33) | Assigned Cluster |
|--------|------------|-------------------------------------|---------------------------------------------|------------------|
| 1  | (5, 3)   | 0      | 63.79 | C1 |
| 2  | (10, 15) | 13     | 51.71 | C1 |
| 3  | (15, 12) | 13.45  | 50.71 | C1 |
| 4  | (24, 10) | 20.24  | 46.81 | C1 |
| 5  | (30, 45) | 48.87  | 18.55 | C2 |
| 6  | (85, 70) | 104.35 | 42.10 | C2 |
| 7  | (71, 80) | 101.41 | 37.68 | C2 |
| 8  | (60, 78) | 93     | 30.25 | C2 |
| 9  | (55, 52) | 70     | 7.42  | C2 |
| 10 | (80, 91) | 115.52 | 51.89 | C2 |


Points 1–4 now belong to C1:

C1(x) = (5+10+15+24)/4 = 13.5

C1(y) = (3+15+12+10)/4 = 10.0

The updated C1 is (13.5, 10.0).

Points 5–10 belong to C2:

C2(x) = (30+85+71+60+55+80)/6 = 63.5

C2(y) = (45+70+80+78+52+91)/6 = 69.33

The updated C2 is (63.5, 69.33).

Iteration 3

| S. No. | Data Point | Euclidean distance from C1 = (13.5, 10) | Euclidean distance from C2 = (63.5, 69.33) | Assigned Cluster |
|--------|------------|-----------------------------------------|--------------------------------------------|------------------|
| 1  | (5, 3)   | 11.01  | 88.44 | C1 |
| 2  | (10, 15) | 6.10   | 76.24 | C1 |
| 3  | (15, 12) | 2.5    | 75.09 | C1 |
| 4  | (24, 10) | 10.5   | 71.27 | C1 |
| 5  | (30, 45) | 38.69  | 41.40 | C1 |
| 6  | (85, 70) | 93.33  | 21.51 | C2 |
| 7  | (71, 80) | 90.58  | 13.04 | C2 |
| 8  | (60, 78) | 82.37  | 9.34  | C2 |
| 9  | (55, 52) | 59.04  | 19.30 | C2 |
| 10 | (80, 91) | 104.80 | 27.23 | C2 |


C1(x) = (5+10+15+24+30)/5 = 16.8

C1(y) = (3+15+12+10+45)/5 = 17.0

The updated C1 is (16.8, 17.0).

C2(x) = (85+71+60+55+80)/5 = 70.2

C2(y) = (70+80+78+52+91)/5 = 74.2

The updated C2 is (70.2, 74.2).

Iteration 4

| S. No. | Data Point | Euclidean distance from C1 = (16.8, 17.0) | Euclidean distance from C2 = (70.2, 74.2) | Assigned Cluster |
|--------|------------|-------------------------------------------|-------------------------------------------|------------------|
| 1  | (5, 3)   | 18.30 | 96.54 | C1 |
| 2  | (10, 15) | 7.08  | 84.43 | C1 |
| 3  | (15, 12) | 5.31  | 83.16 | C1 |
| 4  | (24, 10) | 10.04 | 79.09 | C1 |
| 5  | (30, 45) | 30.95 | 49.68 | C1 |
| 6  | (85, 70) | 86.37 | 15.38 | C2 |
| 7  | (71, 80) | 83.10 | 5.85  | C2 |
| 8  | (60, 78) | 74.74 | 10.88 | C2 |
| 9  | (55, 52) | 51.80 | 26.90 | C2 |
| 10 | (80, 91) | 97.31 | 19.44 | C2 |

C1(x) = (5+10+15+24+30)/5 = 16.8

C1(y) = (3+15+12+10+45)/5 = 17.0

The updated C1 is (16.8, 17.0).

C2(x) = (85+71+60+55+80)/5 = 70.2

C2(y) = (70+80+78+52+91)/5 = 74.2

The updated C2 is (70.2, 74.2).

At the end of the fourth iteration, the updated values of C1 and C2 are the same as they were at the end of the third iteration, and no point changes cluster. The algorithm has converged, so no further iterations are needed.
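Putting the assignment and update steps together, the whole procedure is a loop that stops once the centroids stop moving. A minimal two-cluster sketch of the example above (illustrative only; real libraries such as scikit-learn handle any K and use smarter initialization):

```python
import math

D = [(5, 3), (10, 15), (15, 12), (24, 10), (30, 45),
     (85, 70), (71, 80), (60, 78), (55, 52), (80, 91)]

def kmeans_2(data, c1, c2, max_iter=100):
    """Two-cluster k-means; returns converged centroids and the clusters."""
    for _ in range(max_iter):
        # Assignment step: each point goes to the nearest centroid
        cluster1 = [p for p in data
                    if math.dist(p, c1) <= math.dist(p, c2)]
        cluster2 = [p for p in data if p not in cluster1]
        # Update step: coordinate-wise mean of each cluster
        new_c1 = (sum(x for x, _ in cluster1) / len(cluster1),
                  sum(y for _, y in cluster1) / len(cluster1))
        new_c2 = (sum(x for x, _ in cluster2) / len(cluster2),
                  sum(y for _, y in cluster2) / len(cluster2))
        if (new_c1, new_c2) == (c1, c2):   # centroids unchanged: converged
            break
        c1, c2 = new_c1, new_c2
    return c1, c2, cluster1, cluster2

c1, c2, cl1, cl2 = kmeans_2(D, (5, 3), (10, 15))
print(c1, c2)  # (16.8, 17.0) (70.2, 74.2)
```

Starting from the same initial centroids (5, 3) and (10, 15), this reproduces the hand computation: the loop settles on C1 = (16.8, 17.0) with points 1–5 and C2 = (70.2, 74.2) with points 6–10.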

Disadvantages:

  • The biggest disadvantage is that k-means requires you to specify the number of clusters (k) in advance.
  • k-means is sensitive to outliers, and different initial centroids (or a different ordering of the data) can produce different final clusters.
  • k-means only finds a local optimum of the within-cluster distances, so it is usually run several times with different random initializations.
  • The entire dataset must typically be held in memory during clustering, which can be costly for large datasets.

