Clustering is a widely-used Machine Learning (ML) technique. Clustering is an Unsupervised ML algorithm that is built to learn patterns from input data without any training, besides being able of processing data with high dimensions. This makes clustering the method of choice to solve a wide range and variety of ML problems.
Since clustering is widely used, for Data Scientists and ML Engineer’s it is critical to understand how to practically build clustering algorithms even though many of us have a high-level understanding of clustering. Let us understand the approach to build a clustering algorithm from scratch.
What is Clustering and how does it work?
Clustering is finding groups of objects (data) such that objects in the same group will be similar (related) to one another and different from (unrelated to) objects in other groups.
Clustering works on the concept of Similarity/Dissimilarity between data points. The higher similarity between data points, the more likely these data points will belong to the same cluster and higher the dissimilarity between data points, the more likely these data points will be kept out of the same cluster.
Similarity is the numerical measure of how alike two data objects are. Similarity will be higher when objects are more alike. Dissimilarity is the numerical measure of how different two data objects. Dissimilarity is lower when objects are more alike.
We create a ‘Dissimilarity Matrix’ (also called Distance Matrix) as an input to a clustering algorithm, where the dissimilarity matrix gives algorithm the notion of dissimilarity between objects. We build a dissimilarity matrix for each attribute of data considered for clustering and then combine the dissimilarity matrix for each data attribute to form an overall dissimilarity matrix. The dissimilarity matrix is an NxN square matrix where N is the number of data points considered for clustering and each element of the NxN square matrix gives dissimilarity between two objects.
Building Clustering Algorithm
Building a clustering algorithm involve the following:
- Selection of most suited clustering techniques and algorithms to solve the problem. This step needs close collaboration among SMEs, business users, data scientists, and ML engineers. Based on inputs and data study, a possible list of algorithms (one or more) is selected for modeling and development along with tuning parameters are decided (to give algorithm more flexibility for tuning and learning from SME).
- The selection of data attributes for the formulation of the dissimilarity matrix and methodology for the formation of the dissimilarity matrix (discussed later).
- Building algorithms and doing the Design of experiments to select the best-suited algorithm and algorithm parameters for implementation.
- Implementation of algorithm and fine-tuning of parameters as required.
Building a Dissimilarity matrix:
There are different approaches to build a dissimilarity matrix, here we consider building a dissimilarity matrix containing the distance (called Distance Matrix) between data objects (another alternative approach is to feed in coordinate points and let the algorithm compute distance). Let us consider a group of N data objects to be clustered based on three data attributes of each data object. The steps for building a Distance matrix are:
Build a Distance matrix for individual data attributes. Here we build three individual distance matrices (one for each attribute) containing distance between data objects calculated for each attribute. The data is always scaled between [0,1] using one of the standard normalization methods such as Min-Max Scalar. Here is how the distance matrix for an attribute looks like.
Properties of Distance Matrix:
- Distance Matrix is NxN square matrix (N – number of objects in clustering space)
- Matrix is symmetric with diagonal as zero (zero diagonal as distance of an object from itself is zero)
- For categorical data, distance between two points = 0, if both are same; =1 otherwise
- For numeric/ordered data, distance between two points = difference between scaled attribute values of two points.
Build Complete Distance matrix. Here we build a complete distance matrix combining distance matrix of individual attributes forming the input for clustering algorithm.
Complete distance matrix = (element-wise sum of individual attribute level matrix)/3;
Generalized Complete distance matrix = (element-wise sum of individual attribute level matrix)/M, where M is the number of attribute level matrix formed.
Considerations for the selection of clustering algorithms:
Before the selection of a clustering algorithm, the following considerations need to be evaluated to identify the right clustering algorithms for the given problem.
- Partition criteria: Single Level vs hierarchical portioning
- Separation of clusters: Exclusive (one data point belongs to only one class) vs non-exclusive (one data point can belong to more than one class)
- Similarity measures: Distance-based vs Connectivity-based
- Clustering space: Full space (used when low dimension data is processed) vs Subspace (used when high dimension data is processed, where only subspace can be processed and interesting clustering can be formed)
- Attributes processing: Ability to deal with different types of attributes: Numerical, Categorical, Text, Media, a combination of data types in inputs
- Discovery of clusters: Ability to form a predefined number of clusters or an arbitrary number of clusters
- Ability to deal with noise in data
- Scalability to deal with huge volumes of data, high dimensionality, incremental, or streaming data.
- Ability to deal with constraints on user preference and domain requirements.
Application of Clustering
There are broadly two applications of clustering.
As an ML tool to get insight into data. Like building Recommendation Systems or Customer segmentation by clustering like-minded users or similar products, Social network analysis, Biological data analysis like Gene/Protein sequence analysis, etc.
As a pre-processing or intermediate step for other classes of algorithms. Like some Pattern-mining algorithms use clustering to group patterns mined and select most representative patterns instead of selecting entire patterns mined.
Building ML algorithm is teamwork with a team consisting of SMEs, users, data scientists, and ML engineers, each playing their part for success. The article gives steps to build a clustering algorithm, this can be used as reference material while attempting to build your algorithm.
About the author
May 7, 2020
Gireesh Sreedhar KP
“Gireesh is a part of the projects run in collaboration with IIT Madras for developing AI solutions and algorithms. His interest includes Data Science, Machine Learning, Financial markets, and Geo-politics. He believes that he is competing against himself to become better than who he was yesterday. He aspires to become a well-recognized subject matter expert in the field of Artificial Intelligence.“