K-Means


K-Means is one of the most popular unsupervised machine learning algorithms used for clustering data into distinct groups. The algorithm partitions a dataset into k clusters, where each data point belongs to the cluster with the nearest mean. It is widely used for data analysis, pattern recognition, and feature engineering.

How K-Means Works

The K-Means algorithm follows an iterative process that alternates between assigning data points to clusters and updating the cluster centroids (a minimal code sketch of these steps follows the list):

  1. Initialize k cluster centroids randomly or using a heuristic method (e.g., K-Means++ initialization).
  2. Assign each data point to the cluster whose centroid is nearest (based on a distance metric such as Euclidean distance).
  3. Recalculate the centroids of each cluster as the mean of all data points assigned to that cluster.
  4. Repeat steps 2 and 3 until the centroids converge or the maximum number of iterations is reached.
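As a concrete illustration of these four steps, here is a minimal NumPy sketch of the standard iteration (Lloyd's algorithm). The function name kmeans and the random-point initialization are illustrative choices, not a reference implementation:

import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Minimal K-Means sketch; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct random points
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels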

Example

Using scikit-learn, K-Means groups a small two-dimensional dataset as follows:

from sklearn.cluster import KMeans
import numpy as np

# Example dataset
data = np.array([[1, 2], [1, 4], [1, 0],
                 [10, 2], [10, 4], [10, 0]])

# Apply K-Means (n_init=10 reruns the algorithm from different
# initializations and keeps the best result)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(data)

# Results: the two centers converge to the group means, [1, 2] and [10, 2]
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

Parameters of K-Means

  • Number of Clusters (k): The number of clusters to create. This parameter must be chosen carefully, as it strongly affects the quality of the resulting clustering.
  • Centroid Initialization: The initial positions of the centroids can affect convergence and results. K-Means++ is a commonly used initialization method to improve performance.
  • Distance Metric: The measure used to compute the distance between points and centroids, typically Euclidean distance.
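In scikit-learn, these options map onto the KMeans constructor, with the caveat that scikit-learn's implementation always uses Euclidean distance (the metric is not a parameter there). The values below are illustrative, not recommended settings:

from sklearn.cluster import KMeans

# n_clusters: the number of clusters k
# init: centroid initialization strategy ("k-means++" or "random")
# n_init: number of reruns from different initializations; best result kept
# max_iter: cap on assignment/update iterations per run
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300)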

Advantages

  • Simple and easy to implement.
  • Scales well to large datasets.
  • Produces good clusterings when the clusters are compact and well separated.

Limitations

  • Sensitive to the initial placement of centroids, which can cause convergence to a suboptimal local optimum (demonstrated in the sketch after this list).
  • Assumes spherical clusters and equal cluster sizes, which may not always reflect the data distribution.
  • Requires the number of clusters (k) to be specified beforehand.
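The initialization sensitivity is easy to reproduce. The sketch below (using made-up blob data) runs a single random initialization under several seeds and compares the resulting WCSS (scikit-learn's inertia_); differing values indicate runs that landed in different local optima:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs; a poor initialization can merge two of them
data = np.vstack([rng.normal(c, 0.5, size=(50, 2))
                  for c in ([0, 0], [5, 0], [2.5, 4])])

# n_init=1 disables reruns, so each seed shows one initialization's outcome
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1,
                random_state=seed).fit(data)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")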

Applications

K-Means is used across various domains for clustering and grouping tasks:

  • Image Segmentation: Partitioning an image into meaningful segments or regions.
  • Customer Segmentation: Grouping customers based on purchasing behavior or demographics.
  • Anomaly Detection: Identifying unusual patterns or outliers in datasets.
  • Document Clustering: Grouping documents based on similarity in text content.
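As a concrete example of the first item, image segmentation by color quantization treats each pixel as a point in RGB space and clusters the pixels into k representative colors. The sketch below assumes the image is already loaded as an (H, W, 3) uint8 NumPy array (e.g. via Pillow); quantize_colors is an illustrative helper, not a library function:

import numpy as np
from sklearn.cluster import KMeans

def quantize_colors(image, k=8):
    """Cluster pixels in RGB space and repaint each with its cluster center."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(np.float64)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)
    quantized = km.cluster_centers_[km.labels_]  # one center color per pixel
    return quantized.reshape(h, w, 3).astype(np.uint8)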

Variants of K-Means

Several variations of the K-Means algorithm have been developed to address its limitations:

  • K-Means++: Spreads the initial centroids apart to speed up convergence and improve clustering quality.
  • Mini-Batch K-Means: Processes subsets of data to handle large-scale datasets efficiently.
  • Fuzzy C-Means: Allows data points to belong to multiple clusters with varying degrees of membership.
  • Bisecting K-Means: Iteratively splits clusters for hierarchical clustering.
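Of these, Mini-Batch K-Means is available directly in scikit-learn as MiniBatchKMeans; a brief sketch on synthetic stand-in data:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic data standing in for a large dataset
data = np.random.default_rng(0).normal(size=(100_000, 10))

# Each iteration fits a small random batch instead of the full dataset,
# trading a little clustering quality for a large speedup
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=3, random_state=0)
mbk.fit(data)
print(mbk.cluster_centers_.shape)  # (5, 10)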

Choosing the Number of Clusters

Selecting the appropriate value of k is crucial. Common methods include:

  • Elbow Method: Plot the within-cluster sum of squares (WCSS) against k. The optimal k is at the "elbow point," where the rate of decrease slows significantly (see the sketch after this list).
  • Silhouette Analysis: Measures how similar data points are within a cluster compared to other clusters.
  • Gap Statistic: Compares the total within-cluster variation for different values of k with that expected under null reference data.
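Both the elbow method and silhouette analysis are straightforward to sketch with scikit-learn, where inertia_ holds the fitted model's WCSS; the blob data below is illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three synthetic blobs, so a good choice should be k=3
data = np.vstack([rng.normal(c, 0.7, size=(60, 2))
                  for c in ([0, 0], [6, 0], [3, 5])])

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    # inertia_ is the WCSS; silhouette is in [-1, 1], higher is better
    sil = silhouette_score(data, km.labels_)
    print(f"k={k}  WCSS={km.inertia_:.1f}  silhouette={sil:.2f}")
# Expect the WCSS curve to bend (the "elbow") and the silhouette to peak
# near k=3, matching the three blobs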

Related Concepts and See Also