Battle Royale Gaming Analysis
In recent years, battle royale formats have become a popular option in online gaming. These games typically pit roughly 100 players against each other, and the goal is to be the last player alive. Each match is played on a fixed map with hard boundaries, such as an island. The map contains a variety of starting locations, ranging from small isolated buildings to large sprawling cities and industrial or military complexes. Once players land, they search for weapons, armor, and other items that will help them win the match. Better items are generally found in large cities, so most players choose to start there. As the match progresses, players are forced toward each other as the boundary of a randomly generated safe zone shrinks; anyone outside the safe zone slowly loses health, and if they stay outside it, they are eliminated. Vehicles littered throughout the map let players travel from city to city to find items, or reach the safe zone more quickly. Matches can be played with teams of anywhere from one to four players.
Since players choose where to begin, and the object is not to kill other players but to be the last one surviving, there is a strategic choice to make. Should players land in the middle of a city, where they must immediately fight other players but have a better chance of finding powerful items? Or is it more effective to start in a remote location, find a sufficient weapon, let everyone in the large cities fight each other, and then face the few remaining players at the end of the match?
Our models attempt to answer this question using data from Kaggle covering a few thousand matches. For each player in a given match, the data records how many people they killed, how far they walked, how far they traveled in a vehicle, their overall rank within the match, how much damage they inflicted on other players, how long they survived, and several other factors.
The first step in the process was to explore the data in its raw form. The data contains nominal and interval attributes. To compare different strategies, we converted some of the interval data to binary data. Specifically, the ranking attribute listed the place in which each player finished the match, with 100 being the first person killed and 1 being the winner. From this we created a binary win attribute and a binary top 10 attribute; the new win column was 1 if placement was 1, and 0 otherwise. We also had a few suspicions about the data that we would attempt to confirm later using some additional SAS nodes. The first issue was going to be the collinearity between the placement field and whether the player won the match, so we would need to remove placement from all the models. There would likely be issues with a few other data points as well. First, there was a column recording how long a player survived. The player with the longest survival time came in first or second, so that value depends on the other choices a player makes during the match; while it looks like an excellent predictor, it is really another target variable, which we are not trying to estimate in this model. The other attributes that might be an issue were the distances a player walked or rode. Players who died immediately could not walk anywhere, so these were partly dependent on where the player chose to start. To address this, we added columns for the average distance traveled per minute, calculated as the distance divided by the total time survived (which is in seconds), multiplied by 60. We now had the average distance a player walked and rode in a vehicle per minute played, and we would review these data points in the next step.
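Before moving into SAS, the same feature engineering can be sketched in a few lines of pandas. This is a minimal illustration only; the column and file names are hypothetical stand-ins, not the exact Kaggle schema:

```python
import pandas as pd

# Hypothetical column names ("placement", "walk_distance_m", "ride_distance_m",
# "survival_time_s") and file name; the actual export may differ.
df = pd.read_excel("pubg_matches.xls")

# Binary targets derived from final placement (1 = winner, 100 = first out).
df["win"] = (df["placement"] == 1).astype(int)
df["top10"] = (df["placement"] <= 10).astype(int)

# Average distance per minute: distance divided by survival time in seconds,
# multiplied by 60. Clip survival time so players who died instantly do not
# cause a division by zero.
minutes = df["survival_time_s"].clip(lower=1) / 60.0
df["walk_m_per_min"] = df["walk_distance_m"] / minutes
df["ride_m_per_min"] = df["ride_distance_m"] / minutes
```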
Next, we imported the xls file into SAS Enterprise Miner by copying it over to the client machine and adding an import node. We left all the default options so that we could use this as training data. For the variables, we planned to exclude player survival time from the models, as indicated earlier, but we did not drop the attribute at this stage, because we wanted to recreate some of the Excel calculations in a transformation node. We also made sure that our win variable was set as a binary target.
We then used a StatExplore node, a Variable Clustering node, and a MultiPlot node to get a sense of the distributions and any collinearity. The variable clustering plot was interesting but contained no real surprises; it can be found in the appendix. It broke the data into four clusters. Cluster 1 contained attributes about how much damage a player did to other players: player kills, total damage inflicted, and the number of times a player assisted a teammate in a kill. The next cluster contained data about the type of game played, including whether it was a team game and how many total people were in the match; this can be up to 100, but is usually lower due to matchmaking algorithms. The final two clusters were made up of the distance a player walked along with our average distance walked per minute, and the distance a player rode along with the average distance ridden per minute.
The StatExplore results showed the variable worth, which held no real surprises. Player survival time had the highest worth, and we knew we would need to drop that variable from any model, since it was dependent on the outcome. The results also showed that our data was very skewed, with only one input variable having skewness within -1 to 1; the MultiPlot output, seen in the appendix, confirmed the same. One additional finding was that three different data sets were combined here, visible in the party size variable, which shows three distinct distributions. While party size is not specifically a predictor of winning, it might play into strategy differences between game types, where the optimal strategy may differ depending on whether you are playing alone, with a team of two, or with a team of four. We may be able to see this through interactions in the regression models or specific branches in the decision tree. It might also make sense to run three separate analyses when trying to determine the optimal strategy for a specific game type; since we are not, we will not filter out any data or run separate models.
One issue we may need to account for is class imbalance: there are far more data points where the win and top 10 attributes are 0, since only one team wins each match. The StatExplore node shows the distribution below. If there were a cost/profit model associated with playing the game and winning, we could employ it here. For example, to determine whether it is worth entering a tournament that costs $X to enter and has a prize pool of $Y, we could use the profit/loss functionality to rank the models; a rough sketch of that idea follows the table.
Data Role  Variable  Role    Level  Count  Percent
TRAIN      win       TARGET  0      97229  97.229
TRAIN      win       TARGET  1       2771   2.771
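As an illustration of that profit/loss ranking, here is how expected payoff could be computed in Python; the entry fee and prize values are hypothetical placeholders for $X and $Y:

```python
ENTRY_FEE = 10.0  # hypothetical $X
PRIZE = 500.0     # hypothetical $Y

def tournament_profit(y_true, y_pred):
    """Total profit if we enter a tournament only when the model predicts a win."""
    profit = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1:       # model says enter the tournament
            profit -= ENTRY_FEE  # we pay the entry fee either way
            if actual == 1:      # the player actually won
                profit += PRIZE
    return profit

# e.g. rank candidate models by tournament_profit(validation_wins, predictions)
```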
Even though we had already completed this task in Excel, we added a transformation node to gain experience transforming data within SAS. Here we recreated the calculations for the average distance walked and the average distance rode per minute, building two new formulas in the formula builder dialog, shown in the appendix. Finally, to avoid overfitting our model, we added a data partition node and split the data into 75% training and 25% validation. With all this preparation complete, we could begin our models. The one thing we would need to do for each remaining node was to set the columns that were calculated in Excel, along with the player survival time column, to Use = No in each of our models.
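In scikit-learn terms, the 75/25 partition could look like the sketch below, continuing from the pandas frame above. Stratifying on the rare win target keeps the roughly 2.8% win rate similar in both partitions (a choice made here for the sketch, not necessarily what the SAS node's defaults do):

```python
from sklearn.model_selection import train_test_split

# 75% training / 25% validation, stratified on the imbalanced win target.
train_df, valid_df = train_test_split(
    df, test_size=0.25, stratify=df["win"], random_state=42
)
```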
The first step in creating a decision tree was to run a maximal tree to get an idea of the optimal number of leaves before the model began to overtrain. For this, we created a decision tree node and linked it to the output of the partition node. Then we clicked the interactive dropdown in the decision tree properties, which gave us a diagram with a root node that we could right-click to select the 'train tree' option. This produced an initial misclassification rate of 0.0275, which seemed quite good; however, the model showed evidence of overtraining (see the appendix), as the misclassification rate for the validation set rose while the training set's rate kept dropping beyond 32 leaves. Next, we created four decision trees, each with a different maximum number of levels, and ran them all. The tree with a maximum of 7 levels had a misclassification rate of 0.026928, slightly better than the previous tree, and an optimal number of leaves of 32 (see the appendix). The tree with a maximum of 6 levels had a misclassification rate of 0.027164, slightly worse than the 7-level tree, but an optimal number of leaves under 20 (see the appendix). The trees with maximums of 4 and 5 levels both had an optimal number of leaves of 5 and a misclassification rate of 0.027536 (see the appendix). We also ran a model comparison on all of the decision trees; unsurprisingly, it chose the 7-level tree, which had the lowest misclassification rate. Those results can be found below, followed by a code sketch of the depth sweep:
Selected  Model   Model              Valid:          Train:         Train:          Valid:
Model     Node    Description        Misclass. Rate  Avg. Sq. Err.  Misclass. Rate  Avg. Sq. Err.
Y         Tree3   DTree 7 level      0.026928        0.022524       0.026707        0.022639
          Tree    DTree 6 Level      0.027164        0.023479       0.026923        0.023513
          Tree5   Decision Tree (5)  0.027516        0.021574       0.027584        0.021563
          Tree2   DTree 5 Level      0.027536        0.024980       0.027345        0.024897
          Tree4   DTree 4 Level      0.027536        0.024980       0.027345        0.024897
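For readers outside SAS, the depth sweep can be approximated in scikit-learn, where max_depth plays the role of the maximum-levels property. The feature list is hypothetical and continues the earlier sketches:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature names carried over from the earlier sketches.
features = ["player_kills", "walk_distance_m", "ride_distance_m",
            "walk_m_per_min", "ride_m_per_min"]
X_train, y_train = train_df[features], train_df["win"]
X_valid, y_valid = valid_df[features], valid_df["win"]

for depth in (4, 5, 6, 7):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    misclass = 1 - tree.score(X_valid, y_valid)
    print(f"max_depth={depth}: validation misclassification {misclass:.6f}")
```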
For the tree with the best misclassification rate, the biggest factor was a kill count above or below 4.5. With 4 kills or fewer, players had a 2% chance of winning, while players with 5 or more kills had a 98% chance of winning. The players with the lowest chance of winning were those who killed fewer than 4.5 players, walked between 1681 and 2378 meters, and rode less than 213 meters per minute; they had a 0.3% chance of winning. Players who killed between 4.5 and 12.5 people, walked more than 2935 meters, and walked less than 100 meters per minute had the biggest chance of winning, at 94.6%.
It is also worth reviewing a simpler decision tree, since its misclassification rate is not much higher and its decision points are much simpler. Here we see that player kills are the main decision points that matter. Under 4.5 kills, players have a 2% chance of winning. Between 4.5 and 8.5 kills, those chances rise to 20.9%. From 8.5 to 12.5 kills a player has a 45% chance, and over 12.5 kills a player has a 70% chance of winning. This does not contradict the more complicated assessment; it validates that kills are the most important factor, with other nuances factoring into whether or not a player is likely to win a game.
Next, a regression model was fit. We first ran the full model to see how well the variables determine a winning outcome. After the initial run, there were a few promising things to note. The misclassification rate was an astonishingly low ~0.024 for both the training and validation sets. Also important are the odds ratios for each of the variables. Despite kills not being the primary goal of the match, the odds ratio for number of player kills was 1.607, and in second place, the number of assists had an odds ratio of 1.501. This lends real credence to the 'aggressive' style of play. Variables that were significant in the decision tree were deemed to have minimal to no impact here: distance walked, distance driven, and the per-minute distances all had odds ratios between 0.9 and 1.0. This somewhat undercuts the conservative strategy of gameplay, in which laying low and waiting until later in the match is supposed to be the more successful route to a win. As an added sanity check, a stepwise regression was also run to see whether the model could be further improved; it produced identical odds ratios, p-values, and misclassification rates. This conflict between our first two models forced us to continue the analysis. The misclassification rates of the regression models, along with a code sketch of the odds-ratio calculation, can be found below:
Selected  Model   Model        Valid:          Train:         Train:          Valid:
Model     Node    Description  Misclass. Rate  Avg. Sq. Err.  Misclass. Rate  Avg. Sq. Err.
Y         Reg     Stepwise     0.024080        0.016928       0.024161        0.016946
          Reg2    Full         0.024080        0.016928       0.024161        0.016946
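A comparable full logistic model can be sketched as follows, with odds ratios recovered as exp(coefficient); ratios above 1 (kills, assists) raise the odds of winning, while ratios just below 1 (the distance variables) barely move them:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit on the same hypothetical feature set used in the tree sketch.
logit = LogisticRegression(max_iter=1000)
logit.fit(X_train, y_train)

# An odds ratio is exp(coefficient) for each input variable.
for name, ratio in zip(features, np.exp(logit.coef_[0])):
    print(f"{name}: odds ratio {ratio:.3f}")

print(f"validation misclassification {1 - logit.score(X_valid, y_valid):.6f}")
```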
For the neural networks, we created several different models, but first pulled in a variable selection node to determine which variables the networks would use. We connected its output to the inputs of two models, with the hidden layers set to 2 and 3 respectively. The model with 2 hidden layers had 15 weights, and the 3-layer network had 22. The three-layer model had a slightly better misclassification rate, and its ROC index was less than 0.0001 better. A code sketch of the comparison follows the table:
Selected  Model    Model               Valid:          Train:         Train:          Valid:
Model     Node     Description         Misclass. Rate  Avg. Sq. Err.  Misclass. Rate  Avg. Sq. Err.
Y         Neural   NN 3 hidden Layers  0.027128        0.022716       0.027159        0.022741
          Neural2  NN 2 hidden layers  0.027204        0.022720       0.027175        0.022749
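A rough scikit-learn analogue of the two networks is below; the layer widths are hypothetical, since the SAS node reports only the layer counts and total weights:

```python
from sklearn.neural_network import MLPClassifier

# Two vs. three hidden layers, with hypothetical widths of 4 units each.
for layers in ((4, 4), (4, 4, 4)):
    nn = MLPClassifier(hidden_layer_sizes=layers, max_iter=2000, random_state=42)
    nn.fit(X_train, y_train)
    misclass = 1 - nn.score(X_valid, y_valid)
    print(f"{len(layers)} hidden layers: validation misclassification {misclass:.6f}")
```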
Up to this point, we had been creating different types of neural networks, regressions, and decision trees, and for each grouping, we connected the models to a model comparison node that found the best of that type. To compare the 'best' models across groups, we took the outputs of those three model comparisons and connected them to the input of one final model comparison node. The results of that comparison, covering our best neural network, best regression model, and best decision tree, can be found below, with a code sketch of the ROC comparison after the table:
Selected  Model   Model               Valid:          Train:         Train:          Valid:
Model     Node    Description         Misclass. Rate  Avg. Sq. Err.  Misclass. Rate  Avg. Sq. Err.
Y         Reg     Stepwise            0.024080        0.016928       0.024161        0.016946
          Tree3   DTree 7 level       0.026928        0.022524       0.026707        0.022639
          Neural  NN 3 hidden Layers  0.027128        0.022716       0.027159        0.022741
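SAS's ROC index corresponds to the area under the ROC curve, so the head-to-head comparison can be sketched like this, reusing the last-fitted models from the loops above (which happen to be the 7-level tree and the 3-layer network):

```python
from sklearn.metrics import roc_auc_score

for name, model in (("regression", logit), ("decision tree", tree), ("neural net", nn)):
    probs = model.predict_proba(X_valid)[:, 1]  # probability of a win
    print(f"{name}: ROC index {roc_auc_score(y_valid, probs):.3f}")
```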
The regression model came in first with the lowest misclassification rate of 0.024080. The ROC results, found in the appendix, show that the regression model was considerably better: its ROC index was 0.966, while the decision tree and the neural network had 0.874 and 0.895 respectively. However, on further reflection, the model with the best misclassification rate is not the best model for our purposes. The decision tree is the model that gamers want. Even though it does not have the best misclassification rate, it is the best model for this situation and for our audience: everyone from the casual gamer up to the worldwide competitive eSports professional can use a model like this as a guide to "what gives me the best chances to win?". In the end, there is no clear-cut answer to "how do I win?", only "what are my best chances?". In competitive video games, it is impossible to predict how an opponent on the other side of the world will react to a given situation. The only thing a player can do is practice several different strategies and perfect their ability to react to all the situations that can occur during a match.
The path to winning a match seems to involve a balance of getting into fights and traveling around the map, with less average travel per minute being better. This could mean that players who start near the center of the map fare better, since they have less distance to cover on average. Also, since the end of the match always comes down to two teams fighting each other, it makes sense that players who actively engage other teams (players with more kills) would win: they have more practice fighting than teams that avoided conflict the entire match. In addition, players who win battles during a match can scavenge the items their opponents collected. Each time you kill someone, you have effectively searched the locations where they chose to start, increasing your chances of finding the items that help win the match.
In our last step, we pulled another data set from Kaggle with the same attributes and scored our model. We did this by creating another import node, this time with the role set to score, and fed that file, along with the output of our model comparison node, into a score node. Reviewing the results, we saw that the champion model was very good at predicting wins, even with all the issues we had with our data. The only times the model was wrong were false positives, where it predicted a win for a player who did not win; in those cases, the players still finished no worse than 8th place.
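In code, scoring the fresh file amounts to recreating the derived columns and predicting with the champion model; the file name and the placement check below are hypothetical:

```python
score_df = pd.read_excel("pubg_score.xls")  # hypothetical score file

# Recreate the per-minute columns exactly as in training.
minutes = score_df["survival_time_s"].clip(lower=1) / 60.0
score_df["walk_m_per_min"] = score_df["walk_distance_m"] / minutes
score_df["ride_m_per_min"] = score_df["ride_distance_m"] / minutes

score_df["predicted_win"] = logit.predict(score_df[features])

# Inspect the false positives: predicted winners who did not actually win.
false_pos = score_df[(score_df["predicted_win"] == 1) & (score_df["placement"] != 1)]
print(false_pos["placement"].describe())
```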
Final Model:
APPENDIX
Variable Clustering Plot
Variable Worth
Variables by win
Formulas
Top 10 Maximal Subtree Assessment Table
Train Node Default Subtree Assessment Plot
Misclassification rate: 0.027516
Subtree Assessment Plot, max 7 levels
Misclassification rate: 0.026928
Subtree Assessment Plot, max 6 levels
Misclassification rate: 0.027164
Subtree Assessment Plot, max 5 levels
Misclassification rate: 0.027536
Subtree Assessment Plot, max 4 levels
Misclassification rate: 0.027536
Regression model odds ratios and confusion matrix
Neural Network 3-layer weights
Neural network 2-layer weights
Neural Network Model Comparison
Final Model ROC Curve