Overpowered — Part One: Discovering Unbalanced Champions in League of Legends
A data-empowered framework for identifying and ranking broken characters in hero-based games.
If you’re here for a current list of broken champions in League of Legends, I am afraid you have fallen victim to clickbait. However, if you’re looking for a lasting framework that leverages data to assess the relative strength of champions, you have arrived at the right place. Over the past few decades, large-scale service games built around many heroes or champions have become popular. Some prominent examples:
- VALORANT — 20 Agents
- Apex Legends — 23 Legends
- Overwatch 2 — 35 Characters
- Dota 2 — 123 Heroes
- League of Legends — 162 Champions
One of the challenges in managing these games is keeping up with the ever-expanding difficulty of character balance. With each new character, minor changes to the game can have more significant rippling effects. Since it is impossible to cover every balance component of these systems in a single post, this article will explore the critical entry point: how to identify broken champions promptly and confidently. With this step, it will be easier for the design team to focus their energy on the right areas.
A “Magic Quadrant” Framework
First, it is essential to call out that player perception of a champion’s balance is vital, even if it is rooted in subjectivity. If the community believes that a champion is overpowered, there will be more pressure on the development team to address those “issues.” We could encapsulate the problem space into two dimensions, an estimate of the champion’s true power and a measure of the perception of that power. To further simplify, we can plot these axes on a “Magic Quadrant” visualization:
There is little to say about the champions that lie along the upward diagonal axis; the further away they are from the center of the chart, the more a nerf or buff is needed. Fortunately, these changes should avoid contention among the game’s community, as the expectation and reality are aligned.
The Sleeper Zone
However, the other two quadrants are where things get interesting. In the top left are champions who are truly overpowered but also perceived as underpowered. I like to label this the “Sleeper Zone,” an area for champions who may become unexpectedly successful after being ignored. A player may be interested in leveraging champions within this quadrant to take advantage of this misaligned perception to catch opponents off guard.
How should we approach balancing champions in this area? Once again, there should be little contention in nerfing a champion perceived as underpowered. Unfortunately, the champion’s current audience will be frustrated, but this will always be the case regardless of the champion’s location in the Magic Quadrant; removing the potential for a champion to be game-breaking in the future should be a more significant development win.
The Spicy Zone
Finally, we have arrived at the most contentious quadrant, which I will label the “Spicy Zone,” where champions are genuinely underpowered but draw many complaints from players about being “broken.” Ironically, many champions tend to occupy this location; after all, every lost game must be due to a broken champion on the opposing team, right? In any case, what should we do for the most prominent offenders in this zone? If you directly buff the champion, you’re likely to provoke rage in the community. If the champion is already popular enough, which is often the case in this quadrant, perhaps changes aren’t needed. Or there could be design changes to items, objectives, or other champions that achieve a similar effect less conspicuously than a direct buff. Either way, we still want to monitor the perception of the champion’s strength after these changes.
While we could speculate on more design decisions that may bring misbalanced champions back into the fold, we first need to figure out how to approximate each champion’s values across these aspects, allowing us to identify which champions need further assessment. The rest of this article will focus on the data frameworks to do that successfully.
The “Perceived Power” Axis
We’ll first tackle the perception component of the framework. To understand this axis, we need to incorporate feedback directly from our players on their perceptions. What is the best way to collect and quantify this information? A potential candidate is to use a MaxDiff survey, which asks players to compare a set of champions and choose the pair that reflects the maximum difference in strength; players make two selections from each group, the most and least overpowered, as below:
If we carry out this process with enough players and ask them to compare multiple different sets, we can stack rank the champions against each other effectively.
One difficulty in this process worth highlighting arises when the character pool is sizeable. In most MaxDiff studies, we want every participant to see each item around three times. This is reasonable for a game such as VALORANT, which has only 20 agents, since that amounts to 60 item exposures (20 × 3). We could split those into ten question sets of six agents each.
However, for League of Legends, asking a player to compare 160+ champions in this manner is unrealistic, and survey fatigue is likely to kick in. Fortunately, the process will still work if a player does not see each champion three times (or even once), provided our player sample size is adequate. We can aggregate the results across players as long as we have randomly compiled the comparison sets within each survey. A count analysis then ranks the champions by a simple score, which we compute using the formula below:
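One common form of this count score, consistent with the −1 to 1 range described next, divides a champion’s net “most OP” picks by the number of times it appeared in a comparison set:

$$\text{score}(c) \;=\; \frac{\#\,\text{picked most OP}(c) \;-\; \#\,\text{picked least OP}(c)}{\#\,\text{times shown}(c)}$$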
This process will yield values between 1 (OP) and -1 (Weak), which are relatively intuitive. Note that this formula is not tethered to a single player, so our assessment is still accurate. However, in situations where a player can cover all sets, we may use Bayesian hierarchical modeling to get a better read on the champion utilities.
One potential extension to our MaxDiff analysis is to weight the probability of selecting a champion for comparison by its play rate. The intuition here is twofold. First, it is better to have more data on champions who are more likely to impact overall balance. Second, we expect players to name frequently played champions as overpowered more often than their niche counterparts. This intuition warrants further exploration of how we weight these perception scores by champion popularity.
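As a minimal sketch of what play-rate-weighted set construction could look like (the champion names and play rates below are purely illustrative, not real data):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical champion pool with illustrative play rates.
champions = ["Ahri", "Garen", "Jinx", "Rell", "Taric", "Ornn", "Thresh", "Soraka"]
play_rates = np.array([0.18, 0.15, 0.20, 0.05, 0.04, 0.08, 0.22, 0.08])

# Draw a six-champion MaxDiff comparison set, with popular champions
# more likely to be included.
weights = play_rates / play_rates.sum()
comparison_set = rng.choice(champions, size=6, replace=False, p=weights)
print(comparison_set)
```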
The “Actual Power” Axis
Now that we have a range of player perceptions, we can shift to the other axis. Oh, but this one is easy, right? We’ll just look at all the games for a champion, calculate their aggregate win rate, and then order them based on that statistic; done.
Why might this not be sufficient? What potential issues could arise that make this calculation more complicated than it seems? If we took all of the win rate data for all time, we could stack rank the champions pretty well, but we need to be more timely, as we only care about data that reflect the current game state. Due to this timeliness, we will only seek data from the most recent patch, which could hinder our confidence in our calculated win rate percentages. Let me illustrate.
How would you stack rank the power of the three champions above? Which ones should we flag for being currently overpowered? The smaller game counts mess up our intuition about naïve ordering from before. Our framework needs a way to digest both the win rate and the total number of games that back it.
Lazy Approach
The easiest solution would be to ignore champions that have yet to meet a certain number of games. We could argue that they aren’t currently affecting the state of the game. How will we decide this? What number sounds good to you? 420 games? 869 games? Oh, it needs to be a nice clean round number? Why don’t we go with 500 games?
In this scenario, we would ignore the win rates for Soraka top and Taric jungle. However, we’ve only kicked the can further down the road. Our system now looks like this:
Who is the most potent champion of these three? We still may be willing to shed the information provided by the game count and claim that Ornn is the strongest since he has the highest win rate; however, if we had chosen 1,000 games as our threshold, or Ornn had one less game, he wouldn’t have been a contender. The selection of this threshold makes the framework feel uncomfortably arbitrary. To make matters worse, if we further split the data by position, game mode, and rank, it is likely we won’t be able to assess the majority of these combinations until deep into the patch, which blocks our ability to make timely balance decisions. We can’t rely on a simple game-count cutoff, but how else can we embrace that information?
“Normalizing” by Game Counts
It should be intuitive that the higher the game count on a champion, the more accurate the win rate calculated from those games; in contrast, aggregate win rates backed by fewer games will be far less reliable. We need a way to “normalize” against the game count. One idea is to build a confidence interval (CI) for each win rate. There are multiple ways to do this, the most common being the Normal approximation interval, also known as the Wald interval:
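In standard notation, with observed win rate $\hat{p}$ over $n$ games and critical value $z$ (1.96 at 95% confidence), the Wald interval is:

$$\hat{p} \;\pm\; z\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}}$$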
You have likely seen this at some point if you have taken a statistics course. Because of its simplicity, it is the most widespread method for calculating binomial CIs, but statisticians have heavily contested the technique. A few more precise yet less common intervals include the Wilson score, Jeffreys, Clopper-Pearson, and Agresti-Coull. While the theoretical differences between these are interesting, we’ll hand-wave the details for now, as our approach will be the same regardless. In our example, we will use a more accurate method, the Wilson score interval, which we can calculate in Python via statsmodels; a simple online calculator also exists. Here is a table of the CIs at 95%:
These CIs represent where the parameter’s value, the “actual” win rate, is most likely to reside. For example, referencing Taric jungle, there is a 95% chance that his actual win rate is between 54.24% and 72.73%. That statement carries some additional statistical nuances, but we’ll bypass them to focus on our main task.
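As a quick sketch, that interval can be reproduced with statsmodels’ proportion_confint, using Taric jungle’s toy numbers (64 wins over 100 games, the same figures that appear in the bootstrap example later):

```python
from statsmodels.stats.proportion import proportion_confint

# Taric jungle's toy numbers: 64 wins over 100 games.
wins, games = 64, 100

# Wald ("normal") and Wilson score intervals at 95% confidence.
wald = proportion_confint(wins, games, alpha=0.05, method="normal")
wilson = proportion_confint(wins, games, alpha=0.05, method="wilson")

print(f"Wald:   [{wald[0]:.2%}, {wald[1]:.2%}]")      # roughly [54.6%, 73.4%]
print(f"Wilson: [{wilson[0]:.2%}, {wilson[1]:.2%}]")  # roughly [54.2%, 72.7%]
```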
Now that we have constructed these ranges, we can use them for a better, although still not perfect, comparison of win rates across champions. We do this by taking what we will call the “pessimistic” win rate for each champion, the lower bound of the CI, and comparing those. Here is the ranking table output:
Even in this toy example, notice how many discrepancies there are. Ornn’s ranking varies greatly depending on the approach: the lazy heuristic crowns him the strongest, while the CI approach leaves him looking weak compared to the other champions. Thresh, who had far more games than any other champion, jumps up the rankings because his “pessimistic” win rate sits much closer to his observed rate.
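As a minimal sketch of how that ranking could be produced: the counts below are hypothetical stand-ins (only Taric’s 64-in-100 comes from the toy example), chosen so that Ornn has the highest raw win rate on few games while Thresh has a lower rate on far more games.

```python
from statsmodels.stats.proportion import proportion_confint

# Hypothetical win/game counts; only Taric's figures match the toy example.
records = {
    "Taric jungle": (64, 100),
    "Ornn top": (270, 500),
    "Thresh support": (10_500, 20_000),
}

rows = []
for champ, (wins, games) in records.items():
    lower, _ = proportion_confint(wins, games, alpha=0.05, method="wilson")
    rows.append((champ, wins / games, lower))

# Rank by the "pessimistic" (lower-bound) win rate rather than the raw rate.
for champ, raw, lower in sorted(rows, key=lambda r: r[2], reverse=True):
    print(f"{champ:15}  raw {raw:6.1%}   pessimistic {lower:6.1%}")
```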
The most significant advantage of the CI approach is that it does away with the earlier arbitrary game-count cutoff and allows us to assess win rates at any time after a patch. Of course, the CIs will be much wider with fewer games, but we no longer have to “wait for the data.” If a balance issue presents itself on day one, it can be addressed right away under this framework.
While this approach provides a better platform for comparing champions, it does have a disadvantage: the “pessimistic” win rate will often sit well below the observed win rate, and representing a champion by this number alone could confuse designers or stakeholders. Still, if we align our understanding around it, we can make better statements based on these intervals, for example: “Whenever a champion has a lower-bound win rate of 54% or greater, we are confident there is a balance issue that we should address.”
Independence (or not)
Our solution seems adequate, but it relies on a critical assumption of independence that our data violate. While a few different mechanisms create dependence in this setting, the most prominent, and the only one we will focus on, is multiple games played on a champion by the same player.
To illustrate this, let’s say we had a single weighted coin and flipped it N times. The process is binomial, and we can build our CIs following one of the methods above; we can then estimate the “win” weight of the coin, which can be anywhere between 0 and 100%. However, a single coin does not represent our case. Instead, we have multiple coins whose weights may be different. On top of that, there are additional weights on how often we flip each coin!
Why does this matter? The variance in this setting is different from the simple binomial case, and a binomial model will always overestimate it. The standard statistical term for this is underdispersion. The good news is that the range of our CIs is smaller than we initially anticipated, but we still need to find a new way to calculate them.
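One way to see this, as a sketch that treats each player’s game count and win rate as fixed: the total number of wins is a sum of independent binomials, and its variance never exceeds that of a single pooled binomial.

$$\mathrm{Var}\Big(\sum_i W_i\Big) = \sum_i n_i\,p_i\,(1-p_i) \;\le\; N\,\bar{p}\,(1-\bar{p}), \qquad W_i \sim \mathrm{Binomial}(n_i, p_i),\;\; N = \sum_i n_i,\;\; \bar{p} = \frac{1}{N}\sum_i n_i\,p_i$$

The inequality holds because the game-weighted average of $p_i^2$ is at least $\bar{p}^2$ (Jensen’s inequality), so treating every game as a flip of the same pooled coin overstates the spread.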
Bootstrapping Percentile Confidence Intervals
A strong contender for computing CIs is bootstrapping. The basic idea here is that we can infer information about a distribution by resampling from the data. While there are some situations where this is an invalid approach, such as distributions lacking finite variance, in most cases, it can handle a complicated distribution that would otherwise be difficult to model.
Let’s build out some toy numbers for the Taric jungle example above. The distribution could look like this:
Note that the same people played more than one of the games we recorded on Taric. Instead of 100 players, there are only 17, and each player has a different win rate on Taric. How would we bootstrap this scenario to build CIs on Taric’s mean win rate? We first sample a Taric player from a multinomial distribution, weighted by their game count. Once a player is selected, we sample from a binomial distribution whose win probability is conditional on that player. If we do this 100 times and calculate the mean, we have one potential outcome for the mean of this distribution. If we repeat the process 10,000 times and order all those sample means by value, we can approximate a 95% confidence interval by selecting the 250th and 9,750th items. We can cobble together the process in Python as follows:
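Here is a minimal sketch of that resampling loop. The per-player game counts and win rates below are illustrative placeholders rather than the exact toy table, chosen only so that they total 100 games at an aggregate 64% win rate; the resulting interval will therefore differ slightly from the numbers quoted next.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy Taric-jungle roster: 17 players, 100 games, 64 wins overall.
# These per-player values are illustrative placeholders, not the article's table.
games_per_player = [20, 15, 12, 10, 8, 7, 6, 5, 4, 3, 3, 2, 1, 1, 1, 1, 1]
wins_per_player = [14, 10, 9, 6, 5, 5, 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0]

total_games = sum(games_per_player)                       # 100
games_pct = [g / total_games for g in games_per_player]   # player sampling weights
win_pct = [w / g for w, g in zip(wins_per_player, games_per_player)]

n_boot = 10_000
boot_means = np.empty(n_boot)
for b in range(n_boot):
    # Allocate the 100 resampled games across players, weighted by how often
    # each player actually played Taric (equivalent to drawing a weighted
    # player for every game).
    game_counts = rng.multinomial(total_games, games_pct)
    # Each player's resampled games are won at that player's own win rate.
    wins = rng.binomial(game_counts, win_pct).sum()
    boot_means[b] = wins / total_games

# 95% percentile interval: roughly the 250th and 9,750th ordered sample means.
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"Bootstrapped 95% CI: [{lower:.0%}, {upper:.0%}]")
```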
For our hypothetical scenario, this yields a CI for Taric jungle of [56%, 74%], slightly different from our Wilson score CI of [54.24%, 72.73%]; the process produced a tighter range and nudged both bounds upward. If we wanted to reinstate independence and use a single pooled win probability for our binomial draws, we could calculate a CI for direct comparison to our Wilson score CI. The previous Python script only requires a minor adjustment:
# Ignore player-level sampling and use the single aggregate win rate.
games_pct = [1]
win_pct = [0.64]
The CI outcome is [54%, 73%], closely aligned with our previous Wilson score CI. One thought-provoking observation about our bootstrapped CIs is that they are discrete. After all, the binomial process is discrete, and the support for the outcomes is limited to {0, 1/n, 2/n, … 1}. Since the Wilson score intervals take on “impossible” values, does this mean they are flawed? Not necessarily. Some intriguing oscillatory phenomena occur when using closed-form expressions for CIs, but the approximations are still reasonable. I will hold back on the lengthy details, but if I have piqued your interest, check out this paper.
Game Weighting vs. Player Weighting
Before we wrap up, one last thing to contemplate is whether weighting by games is the most relevant for balance decisions. We may be less worried about a few players being able to exploit a specific champion and care more about the general population. In this case, we might change our sampling procedure to ignore weighting by game. In our Taric example, instead of sampling from the 100 games, we could sample uniformly from the 17 players. As expected, this will drive the average win rate downwards (to 47.5% here), but it may better represent our balance objectives. The modification of the above code:
# Only change sampling on the player selection, not the win rate %.
num_of_players = 17
games_pct = [1/num_of_players]*num_of_players
The CI due to player weighting is now [41%, 60%], much lower than anything we have observed. However, be careful; this shouldn’t be compared to our previous CIs as it represents a different statistic.
In reality, both of these numbers are likely valuable, so displaying them side by side or combining them into a single formula would be a worthwhile endeavor. As the number of games on a champion increases, the distinction matters less, as the two values should converge. However, since champion experience is a significant factor in win rates, we expect the player-weighted win rate to always sit below the aggregate win rate.
Summary
While we have an initial solution, we have only scratched the surface of what is possible. Some additional nuances to explore:
- More extensive incorporation of how player experience impacts win rates; this would require a complete statistical model instead of a single CI calculation.
- Trending direction of champion win rates immediately after a patch; do they start optimistic or pessimistic in comparison to their long-run rate — many factors at play could suggest either direction.
- Treatment of new champions, an extreme case of the previous bullet point; much higher variance in estimating their long-term trends.
- Using previous patch information; there is value in treating this data as an informative prior instead of wiping the slate entirely clean each time — leveraging it has tremendous upsides.
- Dependence amongst games; certain champions are more likely to be picked against each other, and others are more likely to be played together.
- Win rate parity is not necessarily optimal; we expect a natural waxing and waning in champion strengths to support a healthy and evolving meta — if every champion has a 50% win rate, it might limit player exploration (in reality, this statement is a pipe dream for systems with many variables, and appears to be a shield for poor or non-existent balance frameworks)
While it is clear this journey isn’t complete, these remaining nuances are no excuse to keep accepting the status quo; they merely serve as an invitation to keep iterating on and improving our methodology. The less-than-optimal solutions we explored have already put us in a much better state than where we started! Stay tuned for the next iteration!
Thanks: I wanted to thank Katrina Weil for helping me better grasp the MaxDiff analysis for the “Perceived Power” section; I attribute most of the thoughts within that section to her!
Disclaimer: While I have worked at Riot Games in the past, I have not worked directly with the designers on champion balance, so I have little context regarding how they make (or don’t make) balance decisions.