
## Clustering

Clustering algorithms are unsupervised learning algorithms aimed at grouping similar items together.

Cluster analysis is broadly used in applications such as…

• exploratory data analysis
• market research
• customer segmentation
• topic discovery

… and in every task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.

## What Does Similarity Mean?

Unsupervised grouping of similar (?) items… We need to explicitly define a similarity function to measure the similarity between two objects.

## Similarity-Based Clustering

Explicitly define a function to measure (dis)similarity between two objects. This is usually a metric, thus satisfying:

• symmetry: $$d(\textbf{x},\textbf{y})=d(\textbf{y},\textbf{x})$$
• separation: $$d(\textbf{x},\textbf{y})=0 \iff \textbf{x}=\textbf{y}$$
• triangle inequality: $$d(\textbf{x},\textbf{z}) \leq d(\textbf{x},\textbf{y}) + d(\textbf{y},\textbf{z})$$

The choice of the similarity function is crucial.

While the cosine similarity is quite popular, the Euclidean distance is usually not well-suited for text clustering, unless term vectors are properly normalized.
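
As a quick illustration, here is a minimal Python/NumPy sketch (the term-count vectors are invented): two documents with the same topic mix but very different lengths are maximally similar under the cosine, yet far apart under the raw Euclidean distance.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two term vectors."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Two documents with the same topic mix but very different lengths:
# doc_b has exactly ten times the term counts of doc_a.
doc_a = np.array([2.0, 1.0, 0.0, 3.0])
doc_b = np.array([20.0, 10.0, 0.0, 30.0])

print(cosine_similarity(doc_a, doc_b))  # 1.0: identical direction
print(np.linalg.norm(doc_a - doc_b))    # ~33.67: length dominates

# After L2-normalization, the Euclidean distance agrees with the cosine.
a_hat = doc_a / np.linalg.norm(doc_a)
b_hat = doc_b / np.linalg.norm(doc_b)
print(np.linalg.norm(a_hat - b_hat))    # 0.0: same unit vector
```

Indeed, for L2-normalized vectors $$\lVert \textbf{x}-\textbf{y} \rVert^2 = 2\,(1-\cos(\textbf{x},\textbf{y}))$$, so after normalization the two measures induce the same ranking.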

## K-Means Clustering

K-Means is based on Euclidean distances between data points: it partitions the observations into $$k$$ sets $$S = \{S_1, \dots, S_k\}$$ so as to minimize the within-cluster sum of squares:

$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{\textbf{x} \in S_i} \left\lVert \textbf{x} - \boldsymbol{\mu}_i \right\rVert^2$$

where $$\boldsymbol{\mu}_i$$ is the centroid (mean) of the points in $$S_i$$.

## Algorithm

• Represent text objects as term vectors
• Start with $$k$$ randomly selected points and assume they are the centroids of $$k$$ clusters (initialization points)
• Assign every point to the cluster whose centroid is closest to the point
• Re-compute the centroid for each cluster based on the newly assigned points in the cluster
• Repeat this process until the algorithm converges, i.e. when the assignments no longer change (see the sketch below)
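
A compact sketch of these steps in Python/NumPy (a didactic implementation with invented names and toy data, not a substitute for library versions such as scikit-learn's `KMeans`):

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Didactic K-Means following the steps listed above."""
    rng = np.random.default_rng(seed)
    # Initialization: k randomly selected points serve as initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Convergence: stop when the assignments no longer change.
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Update step: re-compute each centroid from its newly assigned points.
        for i in range(k):
            if np.any(assignments == i):
                centroids[i] = points[assignments == i].mean(axis=0)
    return assignments, centroids

# Toy usage: four 2-dimensional "term vectors" (invented) and k = 2.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels, centers = k_means(X, k=2)
```

## Agglomerative Hierarchical Clustering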

Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.

Agglomerative Hierarchical Clustering is a bottom-up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

The results of hierarchical clustering are usually presented as a dendrogram.

## Algorithm

• Represent text objects as term vectors and define a similarity function (e.g. Euclidean distance, cosine similarity, etc.)
• Treat each point as a separate cluster
• Identify the two clusters that are closest together
• Merge the two most similar clusters
• Iterate until some stopping criterion is met (number of clusters, similarity threshold, etc.).

How to identify the two clusters that are closest together? We need a measure of group similarity, i.e. distance between sets of observations (linkage criterion). Some common choices:

• Single Linkage: the similarity between two clusters is the greatest similarity (minimum distance) between any pairs from these two clusters.
• Complete Linkage: the similarity between two clusters is the smallest similarity (maximum distance) between any pairs from these two clusters.
• Average Linkage: the similarity between two clusters is the average similarity/distance over all pairs of points, one from each cluster.
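
As a sketch, SciPy's hierarchical-clustering routines expose exactly these three criteria (the toy vectors are invented; `dendrogram(Z)` from the same module would draw the dendrogram mentioned above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy term vectors (invented for illustration).
X = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.9]])

for method in ("single", "complete", "average"):
    # linkage() builds the full merge tree bottom-up.
    Z = linkage(X, method=method, metric="euclidean")
    # Stopping criterion: cut the tree into a fixed number of clusters.
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```

## Direct Evaluation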

How close are the system-generated clusters to the expected, manually labelled, clusters?

• Given a test set, manually label the documents to create the ideal cluster result, and map each system cluster to a class, e.g. the class which is most frequent in the cluster
• The problem becomes a supervised learning task, i.e. a classification problem
• Apply the standard evaluation metrics used for classification tasks

### Example: Purity

• Assign each document to the class which is most frequent in its cluster.
• Count the number of correctly assigned documents and divide by the total number of documents.
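
A direct translation of these two steps into Python (the cluster assignments and gold labels below are invented for illustration):

```python
from collections import Counter

def purity(clusters, labels):
    """Fraction of documents assigned to the majority class of their cluster."""
    correct = 0
    for c in set(clusters):
        class_counts = Counter(l for cl, l in zip(clusters, labels) if cl == c)
        correct += class_counts.most_common(1)[0][1]  # majority class count
    return correct / len(labels)

# Toy example: 6 documents, 2 system clusters, gold classes A/B.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["A", "A", "B", "B", "B", "A"]
print(purity(clusters, labels))  # (2 + 2) / 6 ≈ 0.67
```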

### Example: Rand Index

• Create a list of all possible pairs of documents.
• Label each pair as (1) the two documents are similar or (0) the two documents are not similar.
• Count the number of times two similar documents are in the same cluster or two dissimilar documents are in different clusters. Divide by the total number of pairs.
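
And a corresponding sketch for the Rand Index (same invented toy data as above):

```python
from itertools import combinations

def rand_index(clusters, labels):
    """Agreement between clustering and gold classes over all document pairs."""
    agreements, pairs = 0, 0
    for (c1, l1), (c2, l2) in combinations(zip(clusters, labels), 2):
        pairs += 1
        # Agreement: similar documents together, or dissimilar documents apart.
        if (c1 == c2) == (l1 == l2):
            agreements += 1
    return agreements / pairs

clusters = [0, 0, 0, 1, 1, 1]
labels = ["A", "A", "B", "B", "B", "A"]
print(rand_index(clusters, labels))  # 7 agreements out of 15 pairs ≈ 0.47
```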

## Indirect Evaluation

How useful are the clustering results for the intended application?

• Create a test set for the intended application to quantify the performance of any (machine learning) system for this application
• Choose a benchmark system to compare with
• Add a clustering algorithm to the benchmark system
• Compare the performance of the clustering-enhanced system with the benchmark, in terms of any performance measure for the application

## Take Home Concepts

• Clustering is the process of grouping similar entities together
• Clustering is an unsupervised machine learning technique
• The choice of the similarity function is crucial
• How K-Means works
• How Agglomerative Hierarchical Clustering works
• Evaluation of clustering can be done both directly and indirectly