Clustering Project? That’s CUTE.

Four reasons NOT to be excited by clustering algorithms.

Riley Howsden
12 min read · Mar 11, 2021

There are two kinds of people in the world: those who think there are two kinds of people in the world, and those who do not.

Clustering often starts as an innocent act; a product manager, for example, is determined to discover who their product’s users are. However, it can evolve into malicious segregation: surely there must be a discrete number of user personas that we can reference for future product decisions. Often this latent structure is barely present, but the incentive to stereotype remains strong. After all, evaluating users on a continuum is harder to internalize and, therefore, more work.

Those who have worked with me know the vendetta I hold against clustering algorithms, not because I believe the methods themselves are flawed, but because there is a never-ending flow of shallow project suggestions centered on these solutions. This is likely due to the ease of application for the most common techniques, which allows beginners to implement them, and their perceived interpretability, which preys on the human tendency to categorize things into buckets. Unfortunately, this translates into clustering methods falling victim to extensive misuse.

The following is a set of issues, laid out in the form of the acronym CUTE, to help one remember why clustering methods are more lackluster than they appear:

  • Credibility — rarely is any evaluation done as to whether the clusters are significant; results are often accepted blindly.
  • Underwhelming — while they are literally name-dropped by every analyst, product manager, and marketing director, they are rarely positioned and constructed in a way that creates impact.
  • Temporary — the values they generate are not stationary; the clusters assigned for one period will likely not match those for another, even when the two time ranges are adjacent.
  • Exaggeration — attributing things to groups often misrepresents the similarities between members of the same group and amplifies the differences across groups.

We will examine these points thoroughly in the following sections and, at a high level, explain how to be a sharper skeptic of each. My hope is that after reading this, everyone can determine whether a clustering solution has been compromised and to what extent; bonus points for those who are newly convinced that the default response to any inquiry, problem, or conversation proposing a clustering method as the primary, and perhaps only, solution should be the phrase: “That’s CUTE.” If you already despise these techniques as much as I do, feel free to stop here and avoid triggering any previous trauma. However, if you are a masochist like me, or are just interested in seeing stick figures drawn by a complete amateur, read on!

Image by Author

A quick note: most of this post will focus on a generic and overused clustering case, attempting to group similar, unlabelled users, consumers, or players into the same bucket so a label can be attached. However, the shortcomings mentioned here apply to most scenarios, regardless of what is being clustered: products, locations, or anything else.

Credibility

Any dataset can be sent through a clustering algorithm, and the outcome will be that each item is assigned to a single group. Unfortunately, nothing prevents a person from taking a set of truly unique individuals and forcing them to conform to a discrete number of identities. This problem is further exacerbated when the out-of-the-box metrics for assessing these clusters’ validity are not widely understood or are simply never presented. For supervised algorithms, it is unheard of to see overwhelming confidence in the outputs without first evaluating the proposed model’s performance. Why, then, is it so rare to see people concerned about the credibility of clusters?

The main issue is that unsupervised problems such as clustering more readily allow for limited accountability; it is harder for a layman to point out why something seems wrong if there are no actual observations to fall back on. The double-edged sword lies in the fact that not only are the most common clustering algorithms trivial to implement, but they often appear to be immediately successful to the untrained eye. This is the ultimate bait for an analyst with dreams of quickly transitioning to more ML-focused projects, or a “storyteller” looking to weave a narrative that will win them some quick points with stakeholders.

Image by Author

That said, one might speculate that a disproportionate number of clustering projects are suggested by, and therefore carried out by, those with limited depth in the methods. For example, a search of introductory articles online for a simple clustering algorithm, such as KMeans, will yield an alarming number of results, many of them the output of a novice. To make matters worse, the public datasets often referenced in these posts, such as the legendary iris dataset, paint an overly optimistic view of the results that clustering analysis can yield. Finally, as expected, most of these articles do not include any reasoning as to why the determined clusters should or should not be trusted, further pushing the narrative that clustering “just works.”

The critical question that should be asked in these situations is: “Did we actually find some structure in our users, or do we only want to believe that structure exists?” Effectively, we need a way to determine whether our clusters are at least better than random. At a high level, this can be done by randomly permuting each feature within itself across our dataset. From there, the clustering task can be carried out the same as before; what we will be surprised to find is that the outcome often looks just as good on the surface as it does for the real data! If we tried hard enough, or barely at all, we could also convince stakeholders that these random clusters are meaningful. Of course, there are metrics we can compare between the two outcomes to measure any significance beyond random, but for now, the strategy above can serve as a simple sniff test. After all, our null hypothesis is that a latent structure does not exist, not the other way around.
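To make that sniff test concrete, here is a minimal sketch in Python, assuming scikit-learn is available and that `X` is a user-feature matrix (the random data below is only a stand-in). The idea is simply to compare a clustering metric on the real data against the same metric on column-permuted copies.

```python
# Sketch of the permutation "sniff test": cluster the real data, then cluster
# copies where each feature has been shuffled independently (destroying any
# joint structure), and compare the scores.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def permute_columns(X, rng):
    """Shuffle each column on its own, preserving marginals but killing structure."""
    X_perm = X.copy()
    for j in range(X_perm.shape[1]):
        X_perm[:, j] = rng.permutation(X_perm[:, j])
    return X_perm

def cluster_score(X, k=4, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return silhouette_score(X, labels)

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 6))  # stand-in for a real user-feature matrix

real_score = cluster_score(X)
null_scores = [cluster_score(permute_columns(X, rng)) for _ in range(20)]

# If the real score does not clearly exceed the null scores, the "structure"
# we found may be no better than random.
print(f"real: {real_score:.3f}  null: {np.mean(null_scores):.3f} +/- {np.std(null_scores):.3f}")
```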

Underwhelming

Even if clusters are credible, it is often likely that the information they provide will not impact the business. Since clustering is a computational exercise, it has no reason to search for results that will automatically be useful. This raises the question: if the chance of deriving any benefit from generic clustering methods is low, why do most companies focus on them so heavily in the first place?

The same reasons mentioned in the previous section play a part here, but there are some additional considerations. First, it should be clear that most companies looking to survive in the 21st century seek ways to better understand their consumers so they can serve them appropriately. Historically, one of the oldest organizations within a company, marketing, has relied on segmentations of its users, mainly by demographics. In many cases, this was a rudimentary exercise, splitting users into buckets based on heuristics selected by humans. For example, one segmentation might split out users by an arbitrary age cutoff, gender, and a basic indicator, such as whether they have kids. At the end of the day, these attributes were used to generate profiles, such as “soccer moms.” More importantly, the size of these groups could be used to determine where advertising should be allocated.

Image by Author

What does this have to do with clustering methods? If it wasn’t apparent already, they are often seen as a modern-day equivalent to segmentation, so much so that there is constant confusion between the two concepts. However, using a clustering technique to build useful “segmentations” typically requires some manual intervention, somewhat defeating the point. The original question still remains: why is everyone over-indexing on clustering?

Over the past decade, the thirst for ML and data-empowered solutions has grown immensely. In the wake of this explosion, the demand for leaders with relevant experience has far outpaced what the market can supply. For most older companies, a lack of expertise in these areas is typical. To launch a healthy analytics organization, they face a lose-lose situation: internally promote someone who has limited exposure to the area, or attempt to hire externally with a high chance of acquiring a charlatan. Most appear to choose the former, and many departments are currently overseen by someone with a marketing or similar business background who previously held some form of internal seniority. From here, it isn’t hard to connect the dots as to why innovation within the ML space at older companies begins and ends at clustering. The truth is, they just don’t know what else to do, and, in that situation, it is best to stick with what feels familiar: this hip new form of segmentation!

Image by Author

To give a concrete example of how a clustering technique can be underwhelming, take the problem of customer personalization: the goal is to serve each user content that is most likely to resonate with them. This problem sits comfortably in the realm of recommendation systems, and many methods can be used, ranging from the simple, such as collaborative filtering, to the complex, such as deep factorization machines. I’ve had the fortune, or misfortune, of interviewing hundreds of candidates in a session where the underlying case study has been primed for a generic recommender system solution. Occasionally, candidates will frame the problem in a traditional supervised format that will suffice, but most concerning is the sheer number of individuals who propose clustering to generate recommendations. Instead of attempting to map recommendations one-to-one, the usual suggestion is to group users into a small number of buckets and then either hand-pick the same content for every person in a group based on a heuristic or average the owned items across the group to create a ranking. There is some irony in calling this personalization, as large sets of users are treated as if they were precisely the same. It should be clear how this is underwhelming for a user, yet it still tends to be a typical personalization strategy for companies.
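For the skeptics, here is a rough sketch of that cluster-then-average pattern, with fake data and illustrative names throughout, just to make the limitation concrete: every user assigned to a cluster receives an identical “personalized” list.

```python
# Sketch of cluster-based "personalization": bucket users, then rank items by
# average ownership within each bucket. Everyone in a bucket gets the same list.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Fake user-item ownership matrix (users x items); 1 means the user owns the item.
interactions = (rng.random((500, 50)) < 0.1).astype(float)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(interactions)

recommendations = {}
for c in range(4):
    # Average ownership within the cluster, then rank items by it.
    item_popularity = interactions[labels == c].mean(axis=0)
    recommendations[c] = np.argsort(item_popularity)[::-1][:10]

# Every user in cluster 0 receives exactly the same "personalized" top-10.
print(recommendations[0])
```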

Unfortunately, we cannot magically massage a clustering algorithm into being useful for every task we want to succeed at. The best thing we can do is take a less optimistic stance towards clustering, brainstorm other ways to formulate the problem so it can be solved by other techniques, or opt out of a project entirely if it relies heavily on a clustering method. After all, our goal is to know our users, not lazily bucket them; there are often better solutions, but it will come down to how much you or your stakeholders care.

Temporary

Although clustering techniques are useful for providing a one-time breakdown of the player population, they struggle to bring value to live, long-term systems. While drift is an issue for all machine learning models, the problem is further exacerbated in the unsupervised setting, where a clear label is unknown. This means that, over time, the location of the groups users are assigned to will move, and in the worst case, the identity of a group will shift entirely. In a supervised setting, the target distribution may move, but the underlying event that defines the target, such as a purchase, will not change. For unsupervised problems, therefore, there is a need to continually reassess what each cluster actually represents. An abnormal amount of drift over a short period may be further proof that the clusters’ credibility, as discussed above, should be heavily questioned.

Image by Author

It is essential to point out that while most clustering scenarios exhibit this behavior, problematic drift is not guaranteed. For example, if you’re tracking users on a platform that has been around for several years, some sort of consistency across them should be expected. This can be determined by training multiple clustering models on subsets of data across time and comparing the output metrics, noting any inconsistencies. If it can be confirmed that the metrics are stable, then clustering may prove valuable long-term, but if this is not the case, no amount of band-aiding will miraculously bring clusters to life.
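One hedged way to run that stability check, assuming a dictionary of per-period feature matrices with rows aligned to the same users, is to fit a separate model per period and compare how consistently users are grouped; the adjusted Rand index between adjacent periods is one reasonable consistency metric.

```python
# Sketch of a drift check: fit one clustering model per time window on the same
# users and compare adjacent windows' assignments.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def fit_labels(X, k=4, seed=0):
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)

# Assumed input: one feature matrix per period, rows aligned to the same users.
rng = np.random.default_rng(1)
monthly_features = {m: rng.normal(size=(1000, 6)) for m in ["2021-01", "2021-02", "2021-03"]}

periods = sorted(monthly_features)
labels = {m: fit_labels(monthly_features[m]) for m in periods}

# An adjusted Rand index near 1 means adjacent periods agree on who belongs
# together; values near 0 suggest the clusters are not stable through time.
for prev, curr in zip(periods, periods[1:]):
    print(prev, "->", curr, round(adjusted_rand_score(labels[prev], labels[curr]), 3))
```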

Exaggeration

When our minds have to deal with large amounts of messy data, we naturally want to find a way to structure it, so we bucket ideas and individuals into groups for simplification. This mental shortcut is also at fault when we assess clusters. It is easy to forget that individuals within a constructed group can vary widely; still, our minds push us to exaggerate how similar they are. Instead of seriously evaluating the clusters, analysts and marketers just move on to the next step: creating a presentable story to tell about them. Personas are conceived to be believable to stakeholders, but they likely emerged from a shallow comparison of average values between the clusters. Sadly, the act of constructing these personas only strengthens the exaggeration, as now there is a tangible description that we can anchor to.

Image by Author

On the flip side, there is exaggeration across groups as well. Once users are split out, the differences between one group and the next are assumed to be much larger than they actually are. This false amplification can have severe consequences and is ever-present in social and political groups. For example, studies attest that people associated with opposing political parties often overestimate the extremity of each other’s beliefs, and the assignment of a political identity to an individual only perpetuates this pervasive feedback loop.

Who do you think cares more about social equality? The economy? Climate change? Gun ownership? Some of these issues may feel clear-cut, but in reality, ideologies exist on a spectrum. Often, there is a greater overlap among political beliefs than one would expect, but unfortunately, the group average masks any individuality. For example, let’s say we randomly pulled a conservative and a liberal off the street to determine who cares about a particular issue more. What is the chance that, on a given topic, the conservative cares more than the liberal (or vice versa)? The construction of these polarized personas entices us to exaggerate the differences and answer with something outrageous, like 90%; in truth, that number is almost always closer to 50% than most of us would like to believe.

Image by Author

This happens due to a fixation on averages. Notice that for both scenarios above, the averages for the solid and dotted lines are equivalent, but the distributions paint entirely different pictures. The perception driven by clustering is that these distributions are heavily peaked at those averages and that little to no overlap occurs. In reality, there is often heavy overlap, suggesting that if we were to choose an individual from each group and compare their “amount of caring” on a given topic, we would frequently get the opposite of what we expect. To fight exaggeration, we need to look beyond averages to the spread of the distributions, continuously asking the question: “How likely is it that two users from different clusters are more similar than two users from the same cluster?”
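One way to put a rough number on that question is a simple sampling sketch, assuming `X` and `labels` come from an earlier clustering step; a result near 0.5 means the split tells us very little about individuals.

```python
# Sketch: estimate how often a random cross-cluster pair of users is *closer*
# (more similar) than a random within-cluster pair.
import numpy as np

def p_cross_closer_than_within(X, labels, n_samples=20000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    within, cross = [], []
    for _ in range(n_samples):
        i, j = rng.integers(n, size=2)
        if i == j:
            continue
        d = float(np.linalg.norm(X[i] - X[j]))  # Euclidean distance as similarity proxy
        (within if labels[i] == labels[j] else cross).append(d)
    within, cross = np.array(within), np.array(cross)
    k = min(len(within), len(cross))
    # Fraction of random cross-cluster distances smaller than random
    # within-cluster distances; ~0.5 means the clusters overlap heavily.
    return float(np.mean(rng.choice(cross, k) < rng.choice(within, k)))
```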

Another way to alleviate exaggeration is to switch to a clustering method that accommodates softer outputs. For example, if four clusters are expected, each user would be assigned a weight for each, such as [.10, .20, .30, .40]. In contrast, a hard classification forces a decision, so the weight vector would have a form like [0, 0, 0, 1]. Note that it is still possible to assign a given user to their cluster of highest weight in the soft case, but now we have additional information to reason about how good that fit might be. If the values are uniform, such as [.25, .25, .25, .25], our output is no better than random, and if our weights are consistently peaked on one value, such as [.95, .05, 0, 0], then we can have confidence that some structure is present.
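As one possible route, a Gaussian mixture model provides exactly this kind of soft weight via `predict_proba`; the sketch below uses random stand-in data and is only illustrative.

```python
# Sketch of soft clustering: each user gets a weight per cluster rather than a
# single hard label, so we can see how confident the assignment actually is.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))  # stand-in for a real user-feature matrix

gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
weights = gmm.predict_proba(X)        # shape (n_users, 4); each row sums to 1

hard_label = weights.argmax(axis=1)   # a hard assignment is still recoverable
confidence = weights.max(axis=1)      # ~0.25 means no better than random for 4 clusters

print(weights[0].round(2), hard_label[0], round(float(confidence[0]), 2))
```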

That was CUTE!

To summarize the things to keep in mind when using a clustering algorithm:

  • Don’t
  • At the very least, always be skeptical about the outputs of a clustering algorithm; test credibility by comparing metrics between the real dataset and a randomly permuted version of itself.
  • Search for more impactful solutions, or even better, switch to a more impactful problem than the one that has been suggested; ML is full of awesomeness, not just clustering algorithms.
  • Have a plan for evaluating and dealing with staleness due to drift; recognize that clusters will become remnants of their former selves.
  • Steer clear of detailed personas; if they must be used, do not hide information that reveals how blurred the boundaries between groups really are — ideally, use algorithms with soft outputs instead of rigid classifications.

Finally, awareness is vital; inform others about CUTE! This will help end the abuse of clustering algorithms once and for all. After all, in the time it took for you to read this blog, an average of 3.8 people committed to a doomed clustering project — your support is needed to save the lives of future victims!

Written by Riley Howsden

“Half-Stack” machine learning propagandist in the gaming industry — a critic of all data presented deceitfully, unless it contains a meme, then it must be true.
