Preliminary

Mobile Legends: Bang Bang is a MOBA (multiplayer online battle arena) game for Android and iOS mobile devices, developed by Shanghai Moonton. The game was originally released in Asia on 11 June 2016.

In the game there are two opposing teams of five players each. Players choose the character they will play before the game starts. As of now there are around 90 champions (characters) to choose from. Each character is unique and may be used for different purposes depending on their skills and abilities. In that way one can distinguish mages, assassins, fighters, supports, tanks and marksmen. The main task is to destroy the enemy’s defence towers and ultimately conquer their base.

The game has been getting more and more attention in Poland for a couple of years now. The graph below presents interest over time for the Google queries “Mobile Legends” and “MOBA” in Poland. As you can see, around 2017 there was a huge increase in the popularity of Mobile Legends, while interest in MOBA games in general had been falling gradually over the past 5 years. However, in March and April 2020 both experienced a rapid renaissance. We can probably associate it, at least in part, with the lockdown caused by the COVID-19 outbreak.

As stated above, there are several types of characters in the game, so we will try to check whether this is reflected in the data or whether the characters are labelled artificially. In order to do so we are going to use Principal Component Analysis to reduce dimensionality and then a hierarchical algorithm to cluster the characters. Although the labels are known, this analysis may be helpful for:

  1. choosing an alternative character if the one you want to play with is unavailable

  2. discovering underlying forces generating skills

  3. keeping the characters’ skill sets balanced.

Data

First we have to collect the data. As there is no official site with data on the champions’ characteristics, we will scrape it from the Mobile Legends wiki. Let’s check the robots.txt file before we start.

library(robotstxt)  # provides paths_allowed()
paths_allowed("https://mobile-legends.fandom.com/wiki/Mobile_Legends_Wiki")

The command above returns TRUE. That’s nice - we are allowed to scrape their data. For that purpose we will combine the rvest package with the SelectorGadget widget. The whole scraping/wrangling code is provided in a separate Rmd file in the GitHub repository.

Let’s have a look at what our data looks like. In the table below you can find all characters in alphabetical order.

One important remark: although the list below presents all currently playable characters, we will treat it as a sample, since the character set is constantly being updated with new characters - in that way statistical inference can be justified.

Id Hero Movement_speed Magic_resistance Mana HP_regen_rate Physical_attack Armor Health_points Attack_speed Mana_regen_rate Role
1 Akai 260 10 422 42 115 24 2769 0.8500 12 Tank
2 Aldous 260 10 405 45 129 22 2718 0.8360 18 Fighter
3 Alice 240 10 493 36 114 21 2573 0.8000 18 Mage
4 Alpha 260 10 453 39 121 20 2646 0.9160 16 Fighter
5 Alucard 260 10 0 39 123 21 2821 0.9000 0 Fighter
6 Angela 240 10 515 34 115 15 2421 0.7920 18 Support
7 Argus 260 10 0 40 124 21 2628 0.9160 0 Fighter
8 Atlas 240 10 440 42 135 0 2819 0.7860 15 Tank
9 Aurora 245 10 500 34 105 17 2441 0.8000 23 Mage
10 Badang 255 10 0 40 119 23 2708 0.9080 0 Fighter
11 Balmond 260 10 0 47 119 25 2836 0.8500 0 Fighter
12 Bane 260 10 433 42 117 23 2659 0.8500 12 Fighter
13 Belerick 250 10 450 62 110 20 3109 0.8100 12 Tank
14 Bruno 240 10 439 30 128 17 2522 0.8500 15 Marksman
15 Carmilla 197 13 477 45 118 10 2378 NA 34 Support
16 Cecilion 265 15 574 32 165 23 2425 NA 26 Mage
17 Change 240 10 505 34 115 16 2301 0.8080 21 Mage
18 Chou 260 10 0 39 121 23 2708 0.8840 0 Fighter
19 Claude 240 10 450 40 137 14 2370 0.8260 15 Marksman
20 Clint 240 10 450 36 115 20 2530 0.8420 15 Marksman
21 Cyclops 240 10 500 38 112 18 2521 0.8000 20 Mage
22 Diggie 250 10 490 36 115 18 2351 0.8000 20 Support
23 Dyrroth 266 10 0 41 117 19 2758 0.9160 0 Fighter
24 Esmeralda 240 10 502 36 114 21 2573 0.8000 20 Mage
25 Estes 240 10 545 36 115 13 2161 0.8000 18 Support
26 Eudora 250 10 468 38 112 19 2524 0.8000 16 Mage
27 Fanny 265 10 0 33 126 17 2526 0.8940 0 Assassin
28 Faramis 260 10 0 39 222 36 3700 0.9400 19 Support
29 Franco 260 10 440 46 116 25 2709 0.8260 10 Tank
30 Freya 260 10 462 49 109 22 2801 0.8760 14 Fighter
31 Gatotkaca 260 10 440 42 120 20 2709 0.8180 12 Tank
32 Gord 240 10 570 32 110 13 2478 0.7720 25 Mage
33 Granger 240 10 0 27 125 15 2490 0.8180 0 Marksman
34 Grock 260 10 430 42 135 21 2819 0.8100 42 Tank
35 Guinevere 260 10 0 39 126 18 2528 0.9160 0 Fighter
36 Gusion 260 10 469 39 119 18 2578 0.8920 16 Assassin
37 Hanabi 245 10 390 30 115 17 2510 0.8500 15 Marksman
38 Hanzo 260 10 0 35 118 17 2594 0.8700 0 Assassin
39 Harith 240 10 490 36 114 19 2701 0.8400 18 Mage
40 Harley 240 10 490 36 114 19 2501 0.8480 18 Mage
41 Hayabusa 260 10 0 37 117 17 2629 0.8540 0 Assassin
42 Helcurt 255 10 440 35 121 17 2559 0.8700 16 Assassin
43 Hilda 260 10 0 42 123 24 2709 0.8420 0 Fighter
44 Hylos 260 10 430 42 105 17 3309 0.8360 12 Tank
45 Irithel 260 10 438 35 110 17 2540 0.8260 15 Marksman
46 Jawhead 255 10 430 39 119 24 2778 0.9000 16 Fighter
47 Johnson 255 10 0 42 112 27 2809 0.8260 12 Tank
48 Kadita 240 10 495 34 105 18 2491 0.8000 18 Mage
49 Kagura 240 10 519 35 118 19 2556 0.8160 21 Mage
50 Kaja 270 10 400 52 120 30 2609 0.8420 12 Fighter
51 Karina 260 10 431 39 121 20 2633 0.9000 16 Assassin
52 Karrie 240 10 440 40 112 17 2498 0.8396 15 Marksman
53 Khufra 255 0 460 47 117 19 2709 0.7860 15 Tank
54 Kimmy 245 10 100 40 104 22 2450 0.8260 0 Marksman
55 Lancelot 260 10 450 35 124 16 2549 0.8700 16 Assassin
56 Lapu-Lapu 260 10 0 35 119 21 2628 0.9000 16 Fighter
57 Layla 240 10 424 27 130 15 2500 0.8500 14 Marksman
58 Leomord 240 10 0 35 128 25 2738 0.8440 0 Fighter
59 Lesley 240 10 0 36 115 14 2490 0.8260 0 Marksman
60 Ling 260 10 0 39 119 18 2578 0.8920 0 Assassin
61 Lolita 260 10 480 48 115 27 2679 0.7860 12 Tank
62 Lunox 240 10 540 34 115 15 2521 0.8080 23 Mage
63 Lylia 245 10 500 34 113 17 2501 0.8080 19 Mage
64 Martis 260 10 405 35 128 25 2738 0.8680 16 Fighter
65 Masha 312 10 101 19 NA 12 1948 NA 0 Fighter
66 Minotaur 260 10 0 44 123 23 2759 0.7300 0 Tank
67 Minsitthar 260 10 380 37 121 23 2698 0.8520 16 Fighter
68 Miya 240 10 445 30 129 17 2524 0.8500 15 Marksman
69 Moskov 240 10 420 32 125 16 2530 0.8140 15 Marksman
70 Nana 250 10 510 34 115 17 2501 0.8640 18 Mage
71 Natalia 260 10 486 35 121 18 2589 0.9020 16 Assassin
72 Odette 240 10 495 34 105 18 2491 0.8000 23 Mage
73 Pharsa 240 10 490 34 109 15 2421 0.7900 18 Mage
74 Rafaela 245 10 545 36 117 15 2441 0.7920 23 Support
75 Roger 240 10 450 36 128 22 2730 0.8420 15 Fighter
76 Ruby 260 10 430 30 114 23 2859 0.8580 14 Fighter
77 Saber 260 10 443 35 118 17 2599 0.8700 16 Assassin
78 Selena 240 10 490 34 110 15 2401 0.8040 18 Assassin
79 Silvanna 255 10 430 39 126 22 2828 0.9160 16 Fighter
80 Sun 260 10 400 41 114 23 2758 0.9160 16 Fighter
81 Terizla 255 10 0 54 129 19 2728 0.8200 0 Fighter
82 Thamuz 255 10 0 39 123 24 2758 0.8600 0 Fighter
83 Tigreal 260 10 450 42 112 25 2890 0.8260 12 Tank
84 Uranus 260 10 455 32 115 20 2689 0.8340 12 Tank
85 Vale 250 10 490 34 115 15 2401 0.8000 21 Mage
86 Valir 245 10 495 34 105 18 2516 0.8000 18 Mage
87 Vexana 245 10 490 38 112 17 2421 0.8000 20 Mage
88 Wanwan 240 0 424 27 100 0 2540 0.8260 14 Marksman
89 X.Borg 260 10 0 39 117 25 1138 0.8680 0 Fighter
90 Yi_Sun-Shin 240 10 438 36 110 18 2520 0.8000 15 Marksman
91 Zhask 240 10 490 34 107 15 2401 0.8000 20 Mage
92 Zilong 265 10 405 35 123 25 2689 0.9640 16 Fighter

One important thing we should be interested in is the variability of the champions’ characteristics, because if there is little or no variability, even the most sophisticated analysis would be pointless. Below you can see the coefficient of variation (in %).

Movement_speed Magic_resistance Mana HP_regen_rate Physical_attack Armor Health_points Attack_speed Mana_regen_rate
5.04 16.19 59.37 15.99 11.76 26.35 10.18 5.18 63.65

The coefficients of variation for mana and mana regeneration are around 60% (59.4% and 63.7%) - that is exactly what we were looking for! Health points regeneration and armor vary by about 16% and 26% respectively - not that much, but also fine. Although the coefficient for magic resistance is at the level of 16%, the value of that ability is the same for almost every character, so we will drop that variable from further analysis anyway. In any case we will have to check the data for outliers, as some of those values might be inflated by just one or two observations. The rest of the variables vary only a little (between about 5% and 12%).
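For readers who want to reproduce the table above, here is a minimal sketch of how the coefficient of variation could be computed. It assumes the sample standard deviation (the original code may differ), and the helper name `cv`, the column names and the six-row data frame are illustrative only. The rows are copied from the table, so the resulting values will not match the full-data figures.

```r
# Coefficient of variation in %, assuming the sample standard deviation.
# `cv` is a hypothetical helper, not taken from the original analysis.
cv <- function(x) 100 * sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# A few rows copied from the hero table (not the full data set).
heroes <- data.frame(
  hero      = c("Akai", "Alice", "Alucard", "Angela", "Aurora", "Balmond"),
  mana      = c(422, 493, 0, 515, 500, 0),
  hp_regen  = c(42, 36, 39, 34, 34, 47),
  armor     = c(24, 21, 21, 15, 17, 25),
  magic_res = c(10, 10, 10, 10, 10, 10),
  stringsAsFactors = FALSE
)

# Apply the helper to every numeric column.
cv_by_variable <- sapply(heroes[, -1], cv)
print(round(cv_by_variable, 2))
```

A constant column such as magic resistance in this subset gets a coefficient of 0, which is the kind of variable the analysis drops.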

Now let’s look at some possible relationships and check the distributions of the variables.

We can see some relationships - e.g. mana vs. mana regeneration, health points vs. armor and many more - we will investigate them soon.

The density functions for movement speed, mana and mana regeneration seem to be bimodal - a clear sign that there are some subpopulations in our “sample”, so it is reasonable to proceed with cluster analysis.

There are some outliers - note a champion whose health point regeneration is about twice the sample mean. We can also see a champion whose health points and attack points are extremely high. For the sake of the analysis we will remove both of them from our “sample” so that they do not affect the clustering results in a significant way. Let’s find out who they are.

Id Hero Movement_speed Magic_resistance Mana HP_regen_rate Physical_attack Armor Health_points Attack_speed Mana_regen_rate Role
13 Belerick 250 10 450 62 110 20 3109 0.81 12 Tank
28 Faramis 260 10 0 39 222 36 3700 0.94 19 Support
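The outliers above were spotted by eye on the plots. As an alternative, here is a hedged sketch of a numerical check based on a robust z-score (distance from the median in units of the scaled MAD). This is a swapped-in technique, not the method used in the analysis. The threshold of 3, the helper name `robust_z` and the ten-row subset (copied from the table) are assumptions made for illustration.

```r
# Ten rows copied from the hero table; column names are illustrative.
heroes <- data.frame(
  hero     = c("Akai", "Alice", "Alucard", "Angela", "Aurora",
               "Balmond", "Belerick", "Bruno", "Faramis", "Franco"),
  hp_regen = c(42, 36, 39, 34, 34, 47, 62, 30, 39, 46),
  p_atk    = c(115, 114, 123, 115, 105, 119, 110, 128, 222, 116),
  hp       = c(2769, 2573, 2821, 2421, 2441, 2836, 3109, 2522, 3700, 2709),
  stringsAsFactors = FALSE
)

# Robust z-score: (x - median) / MAD, where mad() already applies
# the usual 1.4826 consistency constant.
robust_z <- function(x) (x - median(x)) / mad(x)

z <- sapply(heroes[, -1], robust_z)

# Flag a hero if any variable is more than 3 robust z-scores from the median.
# The threshold 3 is an assumption, not a value from the original analysis.
is_outlier <- apply(abs(z) > 3, 1, any)
flagged <- heroes$hero[is_outlier]
print(flagged)
```

On a different subset or on the full data the flagged set may change, so this is a screening aid rather than a replacement for looking at the plots.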

The last thing we can do is to check the correlations and their significance - just to get a general view, since Simpson’s paradox might be present.

Principal Component Analysis

Dealing with high-dimensional data might be challenging and can lead to several problems. However, in most cases it is possible to reduce the number of dimensions while retaining most of the information stored in the data. One of the most widely used methods that allows us to do so is Principal Component Analysis. So what we basically want to do is to project our data matrix onto some reduced feature space using a linear transformation while retaining as much information as possible. And that is exactly what PCA does!

What does the math look like?

Let’s assume we have a centered data matrix \(X\) consisting of \(m\) observations (rows) and \(n\) variables (columns), so \(X \in \mathbb{R}^{m \times n}\). We want to find a linear transformation \(U\) that transforms \(X\) as follows: \[Z = XU, \text{ where } Z \in \mathbb{R}^{m \times d}, U \in \mathbb{R}^{n \times d} \text{ and } d<n.\] At the same time we want to make sure we minimize the information loss. We can think of the variance-covariance matrix as a representation of the information in our data. For the original data it is \[\Sigma = \frac{1}{m}X^TX, \text{ where } \Sigma \in \mathbb{R}^{n\times n},\] and for the transformed data it is \(\Sigma_Z = \frac{1}{m}Z^TZ\). Keeping that in mind, searching for our transformation becomes the following optimisation problem: \[\max_{U}\Sigma_Z=\max_U\frac{1}{m}(XU)^T(XU) = \max_U\frac{1}{m}U^TX^TXU=\max_UU^T\Sigma U, \text{ where } U^TU = I.\] For \(d>1\) the quantity being maximized is the total variance, i.e. the trace of \(U^T\Sigma U\). Note that we have to add the normalization condition to make sure all of the vectors have unit magnitude, because otherwise the problem would have no solution, as the objective has no upper bound. One possible way to solve such problems is the method of Lagrange multipliers.

First we construct the Lagrangian as follows: \[F(U,\lambda)=U^T\Sigma U + \lambda(I-U^TU).\]

Then we differentiate it with respect to \(U\) and set the result equal to 0, as the derivative vanishes at an extremum: \[\frac{dF}{dU}=2\Sigma U-2\lambda U=0.\]

We can rewrite it as \[\Sigma U=\lambda U.\]

The latter is indeed the eigenvector equation, so what we do is diagonalize the variance-covariance matrix (eigendecomposition) to obtain the eigenvectors and the corresponding eigenvalues \[\Sigma = U \Lambda U^{-1}.\]

Then we can sort the pairs of eigenvectors and eigenvalues in descending order of eigenvalue and choose the top \(d\) pairs. In that way we come up with a set of \(d\) eigenvectors that retains the following share of the total variance: \[\frac{\sum_{i=1}^{d} \lambda_i}{\sum_{i=1}^{n} \lambda_i}.\]

The transformation \(U\) we are looking for is composed of the selected eigenvectors \[U = [u_1, ..., u_d].\]
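The derivation above can be sketched in a few lines of base R. This is only an illustration on simulated toy data (the matrix `X`, the seed and the choice \(d = 2\) are assumptions, not part of the analysis). Note that `cov()` divides by \(m-1\) rather than \(m\), which rescales the eigenvalues but leaves the eigenvectors unchanged.

```r
set.seed(1)
# Toy data: m = 50 observations (rows), n = 3 variables (columns).
X <- matrix(rnorm(150), nrow = 50, ncol = 3)
X[, 2] <- X[, 1] + 0.3 * X[, 2]   # make two variables correlated

Xs    <- scale(X)                 # center and standardize each column
Sigma <- cov(Xs)                  # n x n covariance (here: correlation) matrix
eig   <- eigen(Sigma)             # eigenvalues are returned in decreasing order

d <- 2
U <- eig$vectors[, 1:d]           # n x d matrix of the top d eigenvectors
Z <- Xs %*% U                     # m x d matrix of projected data (scores)

# Share of total variance retained by the top d components.
explained <- sum(eig$values[1:d]) / sum(eig$values)
print(round(explained, 3))

# Cross-check against prcomp(); component signs may differ, hence abs().
p <- prcomp(X, center = TRUE, scale. = TRUE)
print(all.equal(abs(unname(p$rotation[, 1:d])), abs(U)))
```

In practice `prcomp()` (which works via the singular value decomposition) is usually preferred for numerical stability. The explicit eigendecomposition is shown here only because it mirrors the formulas.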

Back to our analysis

First let’s determine the relevant principal components using standardized data. As the scree plot would not tell us much, we should probably choose the number of components based on the eigenvalue rule of thumb. Each of the top three components has an eigenvalue greater than 1, i.e. “contains more information than a single variable”.

Component Eigenvalue Variance_percent Cumulative_variance_percent
PC1 3.19 39.84 39.84
PC2 1.40 17.54 57.38
PC3 1.02 12.73 70.11
PC4 0.83 10.38 80.49
PC5 0.74 9.21 89.70
PC6 0.45 5.62 95.32
PC7 0.25 3.13 98.45
PC8 0.12 1.55 100.00

As you can see in the table above, they account for about 70.1% of the data variability. That is not as much as we expected, but it’s fine. We dropped 5 of the 8 dimensions and still managed to retain over 70% of the variance.

Let’s now have a look at the PCA loadings so we can think of some reasonable interpretations.
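The eigenvalue rule of thumb is easy to check directly from the eigenvalues reported in the table. The sketch below uses the rounded values from the table, so the percentages may differ from the reported ones in the second decimal place. The variable names are illustrative.

```r
# Eigenvalues copied (rounded) from the table above.
eigenvalues <- c(PC1 = 3.19, PC2 = 1.40, PC3 = 1.02, PC4 = 0.83,
                 PC5 = 0.74, PC6 = 0.45, PC7 = 0.25, PC8 = 0.12)

# Percent of variance per component and the running total.
variance_pct   <- 100 * eigenvalues / sum(eigenvalues)
cumulative_pct <- cumsum(variance_pct)

# Rule of thumb: keep the components whose eigenvalue exceeds 1.
keep <- names(eigenvalues)[eigenvalues > 1]
print(keep)
print(round(cumulative_pct[length(keep)], 1))
```

With standardized data the eigenvalues sum to the number of variables (here 8), which is why an eigenvalue above 1 can be read as “more information than a single variable”.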

Original loadings
Variable PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
MV_SPD -0.45 0.22 -0.06 0.28 0 0.44 -0.67 0.1
MANA 0.41 0.47 -0.23 0.17 0.1 0.11 0.15 0.7
HP_RGN -0.33 0.43 0.36 -0.31 0.27 0.43 0.46 -0.11
P_ATK -0.23 -0.14 -0.62 -0.49 0.53 -0.1 -0.09 0.1
P_DFN -0.36 0.33 0.2 0.32 0.32 -0.72 0 0.05
HP -0.22 0.43 -0.25 -0.4 -0.7 -0.24 0 0
ATK_SPD -0.36 -0.14 -0.49 0.53 -0.14 0.14 0.54 -0.1
MANA_RGN 0.4 0.46 -0.3 0.14 0.19 0.02 -0.11 -0.68
Rotated loadings
Variable PC1 PC2 PC3
MV_SPD -0.71 -0.27 -0.4
MANA 0.14 0.91 0.22
HP_RGN -0.83 -0.15 0.12
P_ATK -0.01 -0.09 -0.76
P_DFN -0.75 -0.2 -0.07
HP -0.57 0.23 -0.34
ATK_SPD -0.18 -0.28 -0.75
MANA_RGN 0.15 0.92 0.14

As it is hard to interpret the principal components in that framework, we may want to rotate the whole system to obtain more intuitive interpretations. For that purpose we take the top 3 principal components and use the orthogonal VARIMAX rotation. We do not change the coordinate system - we rotate the orthogonal basis to align with those coordinates. In that way we maximize the variance of the squared loadings, so each variable loads strongly on as few components as possible. The two tables above show the loadings before and after rotation respectively.

The most obvious interpretation is that of PC2. The loadings on mana and mana regeneration are very high, so the underlying force here is magic.

PC1 has relatively high loadings (in absolute value) on health points regeneration and armor, so we would lean towards some kind of durability interpretation.

PC3 is driven mostly by attack points and attack speed so one can interpret it as readiness to fight.
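The rotation step can be sketched with `varimax()` from base R’s stats package. The eigenvectors and eigenvalues below are copied (rounded) from the tables above. Scaling the eigenvectors by the square roots of the eigenvalues to obtain loadings is an assumption about how the rotated table was produced, and the order and signs of the rotated components may differ from the ones reported.

```r
# First three eigenvectors, copied from the "Original loadings" table.
vars <- c("MV_SPD", "MANA", "HP_RGN", "P_ATK", "P_DFN", "HP", "ATK_SPD", "MANA_RGN")
eigvec <- matrix(c(
  -0.45,  0.22, -0.06,
   0.41,  0.47, -0.23,
  -0.33,  0.43,  0.36,
  -0.23, -0.14, -0.62,
  -0.36,  0.33,  0.20,
  -0.22,  0.43, -0.25,
  -0.36, -0.14, -0.49,
   0.40,  0.46, -0.30
), ncol = 3, byrow = TRUE, dimnames = list(vars, c("PC1", "PC2", "PC3")))
eigenvalues <- c(3.19, 1.40, 1.02)

# Assumption: loadings = eigenvectors scaled by sqrt(eigenvalue).
loadings_unrotated <- sweep(eigvec, 2, sqrt(eigenvalues), `*`)

# Orthogonal varimax rotation (stats::varimax).
rot     <- varimax(loadings_unrotated)
rotated <- unclass(rot$loadings)
print(round(rotated, 2))
```

Because the rotation is orthogonal, each variable’s communality (the row sum of squared loadings) is the same before and after rotation. Only the way the variance is split between components changes.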

Now let’s check whether it is possible to distinguish some clusters just by looking at the rotated score plots. We can clearly see two clusters, or maybe three… The next section will help us to understand what is going on.

Clusters

Now that we have reduced the dimensionality we can proceed to the most exciting part of our analysis - distinguishing clusters. In this part we will use a hierarchical algorithm to see whether there is an underlying data structure to discover.

First let’s compute the distances between observations. For that purpose we will use the Minkowski metric of order 2, i.e. Euclidean distance. Below you can see a visualization of the resulting distance matrix.

There are two types of hierarchical clustering methods in general. In the first one, every data point starts as a separate cluster, and then we merge the closest clusters with one another, based on the chosen distance metric and linkage criterion (see below), until all data points are in one cluster. Because of that we often refer to this approach as agglomerative or bottom-up clustering. The second type works the opposite way - at the beginning all of the data points are in one big cluster, and then we split them until every cluster consists of just one data point. This approach is called divisive or top-down clustering.

Another issue is the choice of the data points between which we calculate the distance when comparing two clusters. Here, too, there are several possibilities. The most widely used ones are single linkage, complete linkage, average linkage and Ward’s method. In the single linkage approach we merge the two clusters whose closest data points are nearest to each other. Complete linkage works the opposite way - the distance between two clusters is the maximal distance between their data points, and the pair of clusters with the smallest such distance is merged. The average method is a compromise between these two approaches. Ward’s method is a bit different from the previous ones: at each step we merge the clusters for which the increase in within-group variance is smallest.

In our analysis we will use the agglomerative algorithm with the Euclidean metric and Ward’s linkage method.
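The chosen procedure can be sketched with base R’s `dist()`, `hclust()` and `cutree()`. For illustration the sketch clusters a handful of tanks and mages (rows copied from the hero table) on standardized raw statistics, whereas the actual analysis clusters the rotated component scores. The column names and the choice of `"ward.D2"` over `"ward.D"` are assumptions.

```r
# Eight heroes copied from the hero table (tanks and mages only).
heroes <- data.frame(
  hero     = c("Akai", "Franco", "Gatotkaca", "Tigreal",
               "Alice", "Aurora", "Cyclops", "Gord"),
  role     = c("Tank", "Tank", "Tank", "Tank",
               "Mage", "Mage", "Mage", "Mage"),
  mv_spd   = c(260, 260, 260, 260, 240, 245, 240, 240),
  mana     = c(422, 440, 440, 450, 493, 500, 500, 570),
  hp_regen = c(42, 46, 42, 42, 36, 34, 38, 32),
  armor    = c(24, 25, 20, 25, 21, 17, 18, 13),
  hp       = c(2769, 2709, 2709, 2890, 2573, 2441, 2521, 2478),
  stringsAsFactors = FALSE
)

# Standardize the variables so that no single scale dominates the distance.
X <- scale(heroes[, c("mv_spd", "mana", "hp_regen", "armor", "hp")])
rownames(X) <- heroes$hero

d  <- dist(X, method = "euclidean")    # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")    # assumption: "ward.D2", not "ward.D"
cluster <- cutree(hc, k = 2)           # cut the tree into two clusters

# Compare the clusters with the known roles.
print(table(role = heroes$role, cluster = cluster))
```

Calling `plot(hc)` on the result draws the dendrogram, and swapping `method` for `"single"`, `"complete"` or `"average"` gives the other linkages discussed above.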

One huge advantage of hierarchical clustering methods over, for instance, the k-means clustering algorithm is that we can get valuable insight into the data structure by looking at the so-called dendrogram. Domain knowledge is very helpful at this point, as it makes it much easier to work out the number of clusters and their possible names. Below you can see a dendrogram based on the methodology we chose, presenting 4 different clusters.

Although the choice of the threshold (i.e. how many clusters we want to distinguish) is arbitrary, we can look at some metrics that reflect something similar to goodness of fit in the clustering framework. For instance, we can compute the so-called silhouette width for every observation to see how similar that observation is to the cluster it was assigned to. If we average all the computed silhouette widths for different numbers of clusters and plot the result, we might get a nice insight into what is going on in the data.

As you can see above, the silhouette plots for different linkage methods indicate different numbers of clusters. The plots for the complete linkage and average linkage methods suggest there are around seven to nine different clusters in our data. On the other hand, the plot for the single linkage approach indicates there are just two distinct groups. Which one should we trust? That’s a tough question that nobody can answer up front, but let’s focus on the plot for Ward’s method. It suggests there are somewhere between 4 and 7 clusters, so let’s first examine the data structure for the roughest division, as it is of most interest to us.
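To make the silhouette idea concrete, here is a minimal base R sketch of the average silhouette width for different numbers of clusters. The helper `avg_silhouette` is hypothetical and written out by hand (the `cluster` package offers a ready-made `silhouette()` function), and the two-blob toy data are simulated purely for illustration.

```r
# Average silhouette width for a given distance object and cluster labels.
# For each point: a = mean distance to its own cluster,
# b = mean distance to the nearest other cluster, s = (b - a) / max(a, b).
avg_silhouette <- function(d, cluster) {
  D   <- as.matrix(d)
  ids <- sort(unique(cluster))
  s   <- numeric(nrow(D))
  for (i in seq_len(nrow(D))) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) next                    # singleton cluster: width 0
    a <- sum(D[i, own]) / (sum(own) - 1)       # D[i, i] = 0, so divide by n - 1
    b <- min(sapply(setdiff(ids, cluster[i]),
                    function(k) mean(D[i, cluster == k])))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

set.seed(42)
# Toy data: two well-separated groups of 20 points in two dimensions.
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

d  <- dist(X)
hc <- hclust(d, method = "ward.D2")

# Average silhouette width for k = 2, ..., 6 clusters.
widths <- sapply(2:6, function(k) avg_silhouette(d, cutree(hc, k)))
names(widths) <- 2:6
print(round(widths, 3))

best_k <- as.integer(names(which.max(widths)))
print(best_k)
```

Repeating the loop with other `method` values in `hclust()` gives one curve per linkage, which is how the plots discussed above can be compared.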

It seems quite reasonable to distinguish four clusters. On the right-hand side you can see a radar plot with the average skills of the characters within each cluster, rescaled by dividing by the highest value. What we can observe:

The blue cluster is the most durable one - lots of defense points, well-developed health points regeneration and also highly unmagical. These are probably tanks.

On the other hand, the red one is characterized by strong magical skills such as mana and mana regeneration, and at the same time is very weak physically. Those are most likely mages and/or other champions with some magical skills.

The third cluster - the yellow one - is the most balanced, but has the most attack points. Here we have characters we could probably fight with on the front line.

The last group - the green one - is the most peculiar. Those are the most unmagical characters, with quite a lot of attack speed and attack points. Here we have some kind of assassins or something similar.

Such a division into four clusters is quite satisfying, so we are going to stay with it, as more clusters do not give much more insight into the data structure. There are some subgroups, for instance in the green cluster, but we don’t find them interesting enough to show here.

Of course we actually knew the characters’ labels the whole time, but the purpose of the study was to see what the data says about the different groups of characters, not to classify them based on their characteristics. Anyway, it might be interesting to see which types of characters the clusters consist of. Below you can see a plot answering that question.

As we can see, the cluster to which we attributed durability consists mainly of tanks and some (probably) strong fighters. The magical cluster we distinguished (cluster 3) captures all of the mages and all of the support characters. Probably support champions’ abilities are also quite “magical”. Clusters 2 and 4 capture most of the fighters and assassins, as we thought earlier, so those are the characters you would play to fight on the front line or to launch a blitz attack, respectively.

Conclusions

As we discovered, there are 3 underlying forces driving the characters’ skills: durability, magic and readiness to fight.

There are 4 clusters of characters you can play with: magical ones, durable ones and two suitable for fighting - either in a normal or in a sneaky way.

Different types of characters might be useful for the same purpose.

There are some outliers in the data - either they are mistakes, some weird unbalanced characters or maybe there is just something about them we don’t know ( ͡° ͜ʖ ͡°).