Thursday, December 10, 2020

An Impossibility Theorem For Clustering

Clustering is one of the most commonly employed tasks in data analysis. The objective of clustering is to group a given set of data points based on some distance metric such that similar points end up together in a cluster whereas dissimilar points end up in different clusters. There are myriad algorithms for clustering; some of the common ones are k-means, agglomerative clustering, DBSCAN and spectral clustering.

Even though these techniques are commonly used and taught, there is hardly any discussion of their limitations. Jon Kleinberg changed this with his paper titled "An Impossibility Theorem for Clustering", wherein he formally defined the limits of clustering algorithms in terms of three desirable properties and showed that no clustering algorithm can satisfy more than two of the three. This post is an attempt to summarise and explain the paper while skipping the proofs of the theorem(s).

Before someone complains: much of the prose is copied verbatim from the paper. The language used in the paper is very simple and there was no point in rephrasing it.


The Impossibility Theorem:

Before talking about the theorem, let's define some things concretely:

Data (S):

We refer to the data being clustered as the set S of n points.

Distance Function (d): 

A distance function is any function d : S × S → ℝ such that for distinct i, j ∈ S, we have d(i, j) ≥ 0, d(i, j) = 0 if and only if i = j, and d(i, j) = d(j, i).

Clustering:

A clustering function is a function ƒ that takes a distance function d on S and returns a partition Γ of S; the sets in Γ are the clusters.
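
To make these definitions concrete, here is a minimal Python sketch (the names S, dist and cluster_r are my own, not from the paper): a toy distance function stored as a dictionary over pairs, and a simple clustering function that merges points within a fixed radius and returns the resulting partition Γ.

from itertools import combinations

# Toy data: S is just the set of indices 0..4, laid out on a line.
S = list(range(5))
coords = {0: 0.0, 1: 1.0, 2: 1.5, 3: 9.0, 4: 10.0}

# A distance function d : S x S -> R, stored for unordered pairs.
# It is non-negative, zero only for i = j, and symmetric by construction.
d = {(i, j): abs(coords[i] - coords[j]) for i, j in combinations(S, 2)}

def dist(d, i, j):
    # Symmetric lookup with d(i, i) = 0.
    if i == j:
        return 0.0
    return d[(i, j)] if (i, j) in d else d[(j, i)]

def cluster_r(d, S, r):
    # A clustering function f: link points whose distance is <= r and
    # return the connected components as a partition Gamma of S.
    parent = {i: i for i in S}
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(S, 2):
        if dist(d, i, j) <= r:
            parent[find(i)] = find(j)
    groups = {}
    for i in S:
        groups.setdefault(find(i), set()).add(i)
    return frozenset(frozenset(g) for g in groups.values())

print(cluster_r(d, S, r=2.0))   # the partition {{0, 1, 2}, {3, 4}}

Representing the output as a set of frozensets makes it easy to compare two clusterings for equality, which is what the three properties below are all about.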

The paper then defines three properties: scale-invariance, richness and consistency, and states the impossibility theorem in terms of them. Let's look at the definitions of these properties.

Scale-invariance:

If d is a distance function, then α·d is the distance function in which the distance between i and j is α·d(i, j). A clustering function ƒ is scale-invariant if ƒ(d) = ƒ(α·d) for every distance function d and every α > 0. What this basically means is that changing the scale (or the unit) of the distance function should not change the output of the clustering.
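
As a quick empirical check of what the property demands (a sketch, not the paper's formalism): the function below cuts a single-linkage dendrogram at a fraction of the maximum pairwise distance, so its threshold rescales along with d, and ƒ(α·d) matches ƒ(d) for any α > 0. The 0.2 fraction and the 1-D toy data are arbitrary choices of mine.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Same toy points as before, as a full distance matrix.
points = np.array([0.0, 1.0, 1.5, 9.0, 10.0])
D = np.abs(points[:, None] - points[None, :])

def f(D, frac=0.2):
    # Single linkage, cut at frac * (maximum pairwise distance).
    # Because the cut height rescales with D, the output should be
    # unaffected by multiplying every distance by a constant.
    Z = linkage(squareform(D), method="single")
    return fcluster(Z, t=frac * D.max(), criterion="distance")

base = f(D)
for alpha in (0.5, 3.0, 100.0):
    assert np.array_equal(f(alpha * D), base)   # f(alpha * d) == f(d)
print(base)   # labels like [1 1 1 2 2]: {0, 1, 2} and {3, 4}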

Richness:

This simply means that every clustering of S should be achievable. Mathematically, Range(ƒ) equals the set of all partitions of S. In other words, suppose we are given only the names of the points (i.e. the indices in S) but not the distances between them. Richness requires that for any desired partition Γ, it should be possible to construct a distance function d on S for which ƒ(d) = Γ.
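
Richness is easy to demonstrate constructively (a sketch; the distances 1 and 10 are arbitrary values I picked): given any target partition Γ, place points in the same block close together and points in different blocks far apart. A threshold-style clustering function like cluster_r above, with the threshold anywhere between the two values, then returns exactly Γ.

from itertools import combinations

def distance_for_partition(S, Gamma, near=1.0, far=10.0):
    # Build a distance function that realizes the partition Gamma:
    # points in the same block sit at distance `near`, points in
    # different blocks at distance `far`.
    block_of = {i: b for b, block in enumerate(Gamma) for i in block}
    return {(i, j): near if block_of[i] == block_of[j] else far
            for i, j in combinations(S, 2)}

S = list(range(5))
Gamma = [{0, 3}, {1}, {2, 4}]        # any desired partition of S
d = distance_for_partition(S, Gamma)
print(d[(0, 3)], d[(0, 1)])          # 1.0 10.0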

Consistency:

A clustering function is said to be consistent if, when we shrink the distances between points within a cluster and increase the distances between points in different clusters, the clustering does not change. Mathematically, let Γ = ƒ(d) be the clustering of S obtained using the distance function d, and let d' be another distance function on S such that for all i, j in the same cluster of Γ we have d'(i, j) ≤ d(i, j), and for all i, j in different clusters of Γ we have d'(i, j) ≥ d(i, j); such a d' is called a Γ-transformation of d. Then ƒ is consistent if ƒ(d') = Γ for every Γ-transformation d' of d, i.e. we get the same clustering from both distance functions.
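
Here is a small sketch of what a Γ-transformation looks like (the shrink and stretch factors 0.5 and 2.0 are arbitrary): within-cluster distances are reduced, between-cluster distances are increased, and a consistent clustering function must return the same Γ for the transformed distances.

from itertools import combinations

def gamma_transform(d, Gamma, shrink=0.5, stretch=2.0):
    # Return d', a Gamma-transformation of d: distances within a block
    # of Gamma shrink, distances across blocks grow.
    block_of = {i: b for b, block in enumerate(Gamma) for i in block}
    return {(i, j): v * (shrink if block_of[i] == block_of[j] else stretch)
            for (i, j), v in d.items()}

S = list(range(5))
coords = {0: 0.0, 1: 1.0, 2: 1.5, 3: 9.0, 4: 10.0}
d = {(i, j): abs(coords[i] - coords[j]) for i, j in combinations(S, 2)}
Gamma = [{0, 1, 2}, {3, 4}]                  # suppose f(d) == Gamma

d_prime = gamma_transform(d, Gamma)
print(d[(0, 1)], d_prime[(0, 1)])            # 1.0 0.5  (within a cluster, shrunk)
print(d[(0, 3)], d_prime[(0, 3)])            # 9.0 18.0 (across clusters, stretched)
# Consistency demands f(d_prime) == Gamma as well.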

Impossibility Theorem: 

For each n ≥ 2, there is no clustering function ƒ that satisfies scale-invariance, richness and consistency.


Understanding with Examples:

The paper mathematically proves the theorem in full generality, and also shows how it applies in the context of some common clustering algorithms. Skipping the proof, let's see directly how it plays out for common clustering models.

Single Linkage Clustering: 

Single linkage is a form of hierarchical clustering where we start by treating every point as its own cluster and iteratively merge the closest clusters. The algorithm stops based on a stopping condition. The implication of the theorem is that no matter which stopping condition we choose, the resulting clustering function can satisfy at most two of the three properties defined above. Following are three commonly used stopping conditions for single linkage; let's see which property each of them gives up (a short scipy sketch follows the three conditions).

k-cluster stopping condition: we stop merging as soon as we are down to k clusters. This gives up the richness property, because only partitions into exactly k clusters can ever be produced.

distance-r stopping condition: we merge two clusters only if the distance between them is at most some fixed r. This does not satisfy scale-invariance: if we obtain a clustering using a distance function d and then scale the distances by α, the clustering can change, because clusters that previously merged may no longer be within r of each other, while clusters whose distance used to exceed r may now fall below it.

scale-α stopping condition: let ρ denote the maximum pairwise distance under the distance function d1; we merge two clusters only if the distance between them is at most αρ. For any α < 1, this condition does not satisfy consistency. To see why, suppose d2 is a Γ-transformation of d1 in which the distances to one far-away cluster are increased enormously. The maximum pairwise distance, and with it the merging threshold αρ, grows as well, so two other clusters whose mutual distance was increased only slightly can now fall below the new threshold and get merged. The clustering obtained from d2 therefore need not be the same as the one obtained from d1.
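
The stopping conditions map directly onto how we cut a single-linkage dendrogram. A sketch using scipy on the same made-up 1-D data (the helper names are mine): k_clusters implements the k-cluster condition and is unaffected by rescaling, while distance_r with a fixed threshold changes its answer when the data is scaled.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

points = np.array([0.0, 1.0, 1.5, 9.0, 10.0])
D = np.abs(points[:, None] - points[None, :])

def single_linkage(D):
    return linkage(squareform(D), method="single")

def k_clusters(D, k):
    # k-cluster stopping condition: stop merging once k clusters remain.
    return fcluster(single_linkage(D), t=k, criterion="maxclust")

def distance_r(D, r):
    # distance-r stopping condition: only merge clusters closer than r.
    return fcluster(single_linkage(D), t=r, criterion="distance")

print(k_clusters(D, 2))          # {0, 1, 2} and {3, 4}
print(k_clusters(10 * D, 2))     # same clusters: scale does not matter
print(distance_r(D, 2.0))        # {0, 1, 2} and {3, 4}
print(distance_r(10 * D, 2.0))   # every point on its own: scale-invariance fails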

Centroid Based Clustering:

Centroid-based clustering refers to the commonly used k-means and k-median algorithms, where we fix the number of clusters k in advance, select k points as centroids, assign every point to its nearest centroid, and iteratively refine the centroids. Since k is fixed, these algorithms clearly fail the richness property. But the paper also proves that they fail the consistency property; I'm not reproducing that proof for the sake of brevity.
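
For a bit of intuition on the richness failure, a trivial sketch with scikit-learn's KMeans (k = 2 is an arbitrary choice, and KMeans works on point coordinates rather than an abstract distance function, but the point about the fixed k carries over): whatever data we feed it, the output is always a partition into exactly two clusters, so, for example, the all-singletons partition can never come out.

import numpy as np
from sklearn.cluster import KMeans

# Five 1-D points; with n_clusters fixed at 2 the output always has
# exactly two blocks, so partitions with any other number of clusters
# (e.g. every point on its own) are unreachable: richness fails.
X = np.array([[0.0], [1.0], [1.5], [9.0], [10.0]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels, "->", len(set(labels)), "clusters")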


Final Thoughts: 

In practice we know that different clustering algorithms may produce different results on the same data. This paper sheds light on the issue in an organized manner and highlights the trade-offs involved in choosing different algorithms and their parameters. These results can help one decide which properties of the clustering output matter most and design the model accordingly; e.g., if richness is important (we want every possible clustering to be reachable) but consistency is not, we know which algorithms and stopping conditions to reach for.





