K-Medoids is an unsupervised clustering algorithm that resembles the K-Means technique. It differs from K-Means mainly in how it selects the clusters’ centres: K-Means takes the average of a cluster’s points as its centre (which may or may not be one of the data points), while K-Medoids always picks an actual data point of the cluster as its centre (also known as an ‘exemplar’ or ‘medoid’). K-Medoids also differs in this respect from the K-Medians algorithm, which is the same as K-Means except that it chooses the medians (instead of means) of the clusters as centres.
We have already covered the basics of the K-Means clustering method in our previous article. Refer to it before proceeding if you are unfamiliar with the K-Means algorithm.
Working of the K-Medoids approach
The steps followed by the K-Medoids algorithm for clustering are as follows:
- Randomly choose ‘k’ points from the input data (‘k’ is the number of clusters to be formed). The suitability of a chosen value of k can be assessed using methods such as the silhouette method.
- Each data point gets assigned to the cluster to which its nearest medoid belongs.
- For each data point of cluster i, its distance from all other points in that cluster is computed and summed. The point of the ith cluster for which this sum of distances is minimal is assigned as the medoid of that cluster.
- Steps (2) and (3) are repeated until convergence is reached, i.e., the medoids stop moving. A minimal sketch of these steps appears below.
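For concreteness, the steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the alternating assign/update loop (the function name and defaults are our own), not the optimized routine used by libraries:

import numpy as np

def k_medoids(X, k, max_iter=100, seed=None):
    """Minimal K-Medoids sketch following the four steps above."""
    rng = np.random.default_rng(seed)
    #Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    #Step 1: randomly choose k points as the initial medoids
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        #Step 2: assign each point to its nearest medoid
        labels = np.argmin(dists[:, medoids], axis=1)
        #Step 3: within each cluster, pick the point whose summed
        #distance to the other cluster members is minimal
        new_medoids = medoids.copy()
        for i in range(k):
            members = np.where(labels == i)[0]
            if len(members):
                within = dists[np.ix_(members, members)].sum(axis=1)
                new_medoids[i] = members[np.argmin(within)]
        #Step 4: stop once the medoids no longer move
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(dists[:, medoids], axis=1)

Libraries such as scikit-learn-extra (used later in this article) implement more refined variants, but the alternating structure is the same.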
Complexity of K-Medoids algorithm
The complexity of the K-Medoids algorithm comes to O(N²CT), where N, C and T denote the number of data points, the number of clusters and the number of iterations respectively. With the same notation, the complexity of the K-Means algorithm is O(NCT).
Advantages of the technique
The mean of a set of data points is a measure that is highly affected by extreme values. So in the K-Means algorithm, if the data contains outliers, the centroid may get pulled towards them and shift to a wrong position, resulting in incorrect clustering. A medoid in the K-Medoids algorithm, on the contrary, is the most central element of the cluster, i.e., the point whose total distance from the other points is minimal. Since medoids are not influenced by extreme values, the K-Medoids algorithm is more robust to outliers and noise than the K-Means algorithm. The following figure explains how the positions of the mean and the medoid can vary in the presence of an outlier; the small numeric example below makes the same point.
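A quick numeric illustration of this difference (the toy values are our own):

import numpy as np

points = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  #100.0 is an outlier

#The mean is dragged towards the outlier
print(points.mean())  #22.0

#The medoid is the point with the smallest summed distance to all others
sums = np.abs(points[:, None] - points[None, :]).sum(axis=1)
print(points[np.argmin(sums)])  #3.0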
Besides, the K-Medoids algorithm can be used with an arbitrarily chosen dissimilarity measure (e.g. cosine distance) or any distance metric, unlike K-Means, which usually needs the Euclidean distance metric to arrive at efficient solutions, as sketched below.
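For example, scikit-learn-extra’s KMedoids (used later in this article) accepts a precomputed dissimilarity matrix, so any pairwise measure can be plugged in. A minimal sketch, assuming metric="precomputed" as documented in scikit-learn-extra (the toy data is our own):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn_extra.cluster import KMedoids

X = np.random.rand(20, 5)  #toy data: 20 points, 5 features

#Precompute cosine distances and hand the matrix to KMedoids
D = cosine_distances(X)
model = KMedoids(n_clusters=3, metric="precomputed").fit(D)
print(model.labels_)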
The K-Medoids algorithm is found useful for practical applications such as face recognition. The medoid can correspond to a typical photo of the individual whose face is to be recognized. If the K-Means algorithm were used instead, the centroid could be a blurred image with mixed features from several photos of the individual, making the face recognition task more difficult.
Practical Implementation
Here’s a demonstration of implementing the K-Medoids algorithm on a dataset containing 8×8-pixel images of handwritten digits. The task is to divide the data points into 10 clusters (for classes 0-9) using K-Medoids. The dataset used is a copy of the test set of the original dataset available on the UCI ML Repository. The code here has been implemented in Google Colab using Python 3.7.10 and scikit-learn-extra 0.1.0b2. Step-wise explanation of the code is as follows:
- Install the scikit-learn-extra Python module, an extension of scikit-learn for more advanced algorithms that are not available in scikit-learn itself.
!pip install scikit-learn-extra
- Import required libraries and modules.
import numpy as np
import matplotlib.pyplot as plt
from sklearn_extra.cluster import KMedoids
#Import the digits dataset available in the sklearn.datasets package
from sklearn.datasets import load_digits
"""
Instead of using all 64 attributes of the dataset, we use Principal
Component Analysis (PCA) to reduce the dimensions of the feature set
such that most of the useful information is covered.
"""
from sklearn.decomposition import PCA
"""
Import the module for standardizing the dataset, i.e. rescaling the data
such that it has a mean of 0 and a standard deviation of 1.
"""
from sklearn.preprocessing import scale
- Prepare the input data
#Load the digits dataset
dataset = load_digits()
#Standardize the data
digit_data = scale(dataset.data)
"""
Compute the number of output classes, i.e. the number of digits for
which we have the data (here 10 (0-9)).
"""
num_digits = len(np.unique(dataset.target))
- Reduce the dimensions of the data using PCA.
red_data = PCA(n_components=2).fit_transform(digit_data)
"""
PCA constructs new components by linear combinations of the original
features. The 'n_components' parameter denotes the number of newly
formed components to be considered. The fit_transform() method fits
the PCA model and performs dimensionality reduction on digit_data.
"""
- Plot the decision boundaries for each cluster, assigning a different colour to each for differentiation. First, create a mesh grid over the reduced data on which the boundaries will be drawn.
h = 0.02  #step size of the mesh
#Minimum and maximum x-coordinates
xmin, xmax = red_data[:, 0].min() - 1, red_data[:, 0].max() + 1
#Minimum and maximum y-coordinates
ymin, ymax = red_data[:, 1].min() - 1, red_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(xmin, xmax, h), np.arange(ymin, ymax, h))
- Define an array of K-Medoids variants to be used. We have used three different distance metrics (Manhattan distance, Euclidean distance and cosine dissimilarity/distance) for computing the distance of each data point from every other data point while selecting the medoid. Visit this page to know about the distance metrics used in detail; a quick comparison of the three on a toy pair of points is shown below.
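For intuition, here is how the three metrics compare on a single pair of points (the toy values are our own):

import numpy as np
from sklearn.metrics.pairwise import (
    manhattan_distances, euclidean_distances, cosine_distances)

a = np.array([[1.0, 2.0]])
b = np.array([[4.0, 6.0]])

print(manhattan_distances(a, b))  #|1-4| + |2-6| = 7
print(euclidean_distances(a, b))  #sqrt(3^2 + 4^2) = 5
print(cosine_distances(a, b))     #1 - cosine of the angle between a and b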
The parameters we have specified in the KMedoids() method have the following significance:
- metric – distance metric to be used (default: ‘euclidean’)
- n_clusters – number of clusters to be formed and hence the number of medoids (one per cluster) (default value: 8)
- init – method used for medoid initialization (‘heuristic’ here). With ‘heuristic’ initialization, each data point’s distance from all other points is computed and summed; the n_clusters points with the smallest such sums are chosen as the initial medoids.
- max_iter – maximum number of the algorithm’s iterations to be performed when fitting the data
The KMedoids() estimator of scikit-learn-extra by default uses the PAM (Partitioning Around Medoids) algorithm for finding the medoids.
models = [
    (
        KMedoids(metric="manhattan", n_clusters=num_digits,
                 init="heuristic", max_iter=2),
        "Manhattan metric",
    ),
    (
        KMedoids(metric="euclidean", n_clusters=num_digits,
                 init="heuristic", max_iter=2),
        "Euclidean metric",
    ),
    (
        KMedoids(metric="cosine", n_clusters=num_digits,
                 init="heuristic", max_iter=2),
        "Cosine metric",
    ),
]
- Initialize the number of rows and columns of the plot for plotting subplots of each of the three metrics’ results.
#number of rows = integer(ceiling(number of model variants/2))
num_rows = int(np.ceil(len(models) / 2.0))
#number of columns
num_cols = 2
- Fit each of the model variants to the data and plot the resultant clustering.
#Clear the current figure first (if any)
plt.clf()
#Initialize dimensions of the plot
plt.figure(figsize=(15, 10))
"""
The 'models' array defined in step (6) contains three tuples, each
having a model variant's parameters and its descriptive text. We
iterate through each of the tuples, fit the data to the model and
plot the results.
"""
for i, (model, description) in enumerate(models):
    #Fit the model to the reduced data
    model.fit(red_data)
    #Predict the labels for points in the mesh
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    #Put the result into a color plot
    Z = Z.reshape(xx.shape)
    #Subplot for the ith model variant
    plt.subplot(num_cols, num_rows, i + 1)
    #Display the subplot
    plt.imshow(
        Z,  #data to be plotted
        interpolation="nearest",
        #bounding box coordinates (left, right, bottom, top)
        extent=(xx.min(), xx.max(), yy.min(), yy.max()),
        cmap=plt.cm.Paired,  #colormap
        aspect="auto",  #aspect ratio of the axes
        origin="lower",  #set origin as the lower left corner of the axes
    )
    plt.plot(red_data[:, 0], red_data[:, 1], "k.", markersize=2, alpha=0.3)
    #Plot the medoids as white cross marks
    centroids = model.cluster_centers_
    plt.scatter(
        centroids[:, 0],
        centroids[:, 1],
        marker="x",
        s=169,  #marker's size (points^2)
        linewidths=3,  #width of the marker lines
        color="w",  #white colour for the medoid markings
        zorder=10,  #drawing order
    )
    #descriptive text of the tuple becomes the title of the subplot
    plt.title(description)
    plt.xlim(xmin, xmax)  #limits of x-coordinates
    plt.ylim(ymin, ymax)  #limits of y-coordinates
    plt.xticks(())
    plt.yticks(())
#Upper title of the whole plot
plt.suptitle(
    #Text to be displayed
    "K-Medoids algorithm implemented with different metrics\n\n",
    fontsize=20,  #size of the fonts
)
plt.show()
Output:
- Code source: scikit-learn-extra documentation
- Google Colab notebook of the above implementation