
Predicting Whether a Player Will Be an NBA Starter

Updated: Jan 31, 2023



Abstract—Every year, players, coaching staffs, and sports analysts such as those at ESPN try to take advantage of the abundance of available data to predict outcomes of interest. In this paper, I propose two models for predicting whether a player will be a starter in the NBA, based on statistical analysis of player performance data provided by the official NBA website and Basketball Reference. The two models differ slightly in accuracy, but both are effective at predicting whether a player will start in the NBA. The most successful models are those that include performance data such as points per game and total rebounds per game.


Keywords—NBA, Logistic Regression, Stepwise Regression, Neural Network, MATLAB.

1. Introduction

A. NBA Player Performance Data


Every year the official National Basketball Association (NBA) website [1] and Basketball Reference [2] release performance data for individual players, such as points per game and assists per game. The data are rich in measures; however, they do not tell us how well the players will do in the future. This paper aims to use the data collected by the NBA to assess and predict whether or not a player will become a starter in the NBA. Knowing which variables reliably predict whether a player will be a starter can help players better prepare themselves, and can also help fans and sports analysts such as those at ESPN better predict who will start for each team. This research is especially useful for NBA scouts, because they can better predict from a player’s current statistics whether a traded or drafted player can become a starter, so that the team benefits from the trade.


B. Research Question

The goal of this research project is to assess which factors affect whether or not a player will start in the NBA. This goal is summarized in the following research question: What are the most significant predictors of whether or not a player will be a starter in the NBA, based on the data supplied by the official NBA website?


C. Origins of the NBA, Importance of Games Started as a Performance Statistic

The National Basketball Association (NBA) is a men’s professional basketball league in North America with 30 teams (29 in the United States and 1 in Canada) [3]. Throughout the season, statistics are recorded for each player, including points per game, assists per game, and rebounds per game. One of these statistics is Games Started in a season. NBA players value starting as an accomplishment because starters are usually the best players on the team. Games Started is also an important indicator for the MVP award and the All-NBA Teams [4]. A high percentage of games started is therefore evidence of a player’s standing, and players strive to start more games. The results of this research will help players identify the aspects they need to work on in order to become starters, and will help NBA teams and scouts judge whether a player will be a starter and positively influence the team when traded.


2. Data

A. Preprocessing

Data collected from the official NBA website are stored in a single Excel spreadsheet. Most variables are continuous, with some discrete categorical variables.


B. Collection and Interpretation

The data used in this research project come from the official NBA website. They cover the 2018-2019 NBA season, the most recent complete season, since the 2019-2020 season has not finished yet. Traditionally, there are almost 30 player performance statistics. For this research paper, the player performance statistics considered are listed below:

  • Games Played (G) – Number of games the player played in an NBA season.

  • Games Started (GS) – Number of games the player played as a starter in an NBA season.

  • Age – Age of the player during the 2018-2019 NBA season.

  • Points Per Game (PTS) – Average (mean) number of points a player scores in a game. This is calculated by dividing the total number of points the player scored in a season by the number of games played by the player.

  • Assists Per Game (AST) – Average (mean) number of assists a player delivers in a game. This is calculated by dividing the total number of assists the player delivered in a season by the number of games played by the player.

  • Total Rebounds Per Game (TRB) – Average (mean) number of rebounds a player gets in a game. This is calculated by dividing the total number of rebounds the player had in a season by the number of games played by the player.
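The per-game definitions above can be sketched in a few lines of Python (used here only for illustration; the paper itself works in Excel and MATLAB). The record layout, field names, and numbers below are hypothetical and are not taken from the data set.

```python
from dataclasses import dataclass


@dataclass
class SeasonTotals:
    """Season totals for one player (hypothetical record layout)."""
    games_played: int
    total_points: int
    total_assists: int
    total_rebounds: int


def per_game(total: float, games_played: int) -> float:
    """Return total / games played; defined as 0.0 when no games were played."""
    if games_played <= 0:
        return 0.0
    return total / games_played


def per_game_stats(totals: SeasonTotals) -> dict:
    """Compute PTS, AST and TRB as season totals divided by games played."""
    return {
        "PTS": per_game(totals.total_points, totals.games_played),
        "AST": per_game(totals.total_assists, totals.games_played),
        "TRB": per_game(totals.total_rebounds, totals.games_played),
    }


# Illustrative numbers only, not a real player.
example = SeasonTotals(games_played=80, total_points=1600,
                       total_assists=400, total_rebounds=600)
stats = per_game_stats(example)
```

Returning 0.0 for a player with no games is a design choice of this sketch; dropping such rows would be an equally reasonable convention.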

i. Analysis

The analysis of this data will be conducted using stepwise logistic regression and a neural network. The variables are all continuous, so Games Started is discretized into a binary target (see the Methodology section) for the stepwise logistic regression. For a more detailed description of the data analysis, refer to the Methodology section.


ii. Target Variables

To answer the main research question, the target variable to predict is whether or not a player will be an NBA starter, based on the other available variables. A more detailed description of how this is achieved is presented in the Methodology section.


iii. Potential Indicators

Background research has shown that whether a player starts is related to the player’s basketball ability. To predict whether a player will be an NBA starter, a thorough analysis of all the variables listed above will be conducted; complete details are provided in the Methodology section.


3. Methodology


This section outlines the theory behind the methods used to answer the research question stated above. Methods to be used for the analysis are explained below.


A. Statistical Tests


i. Stepwise Logistic Regression

The stepwise logistic regression model will be built using stepwise regression in MATLAB. It is used to predict the probability that a player starts more than 80% of the games he plays. Players are classified as “true starters” (starting more than 80% of their total games played) or “bench players” (starting 80% or fewer of their total games played). Logistic regression is well suited to this task because it models the probability of a binary outcome. This method will help to predict a player’s chance of being a “true starter” based on his current statistics.
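As a minimal sketch of the labeling rule and of how a logistic model turns statistics into a probability, the following Python snippet may help. It is an illustration and not the MATLAB workflow used in the paper; the function names and the coefficient arguments are hypothetical placeholders.

```python
import math


def is_true_starter(games_started: int, games_played: int) -> bool:
    """Label a player a "true starter" if he started more than 80% of his games.

    Integer arithmetic avoids floating-point surprises at exactly 80%.
    """
    if games_played <= 0:
        return False
    return games_started * 100 > 80 * games_played


def sigmoid(z: float) -> float:
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def logistic_probability(features, intercept, coefficients) -> float:
    """Probability of the positive class for one player.

    `intercept` and `coefficients` are placeholders; real values would come
    from a fitted model.
    """
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return sigmoid(z)
```

For example, a player with 65 starts in 80 games is labeled a true starter, while one with exactly 64 starts in 80 games (80%) is not.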


Confusion matrices, a Receiver Operating Characteristic (ROC) plot, and the Area Under the ROC Curve (AUC) will also be generated to compute the accuracy, true positive rate, and false positive rate.
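The evaluation measures named above can be computed as in the following Python sketch. It is only an illustration: the labels and scores are made up, the 0.5 decision threshold is an assumption, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores higher than a randomly chosen negative) rather than by plotting the curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Accuracy, true positive rate and false positive rate from the counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

counts = confusion_counts(y_true, y_pred)
summary = rates(*counts)
area = auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of players, which is acceptable for a single season of data.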


Only variables that are reliable predictors in the stepwise logistic regression model for predicting whether a player will be a starter will be used in the neural network for further testing.


ii. Neural Network

The variables selected by the stepwise logistic regression model are then used to train a neural network in MATLAB, to see which model most effectively predicts whether or not a player will be an NBA starter. The neural network uses the scaled conjugate gradient backpropagation training function with 10 hidden layers. A neural network can capture nonlinear relationships between the inputs and the target, which may help to produce a better model.
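To make the idea of a feed-forward classifier concrete, here is a small Python sketch of a forward pass only. It is not the MATLAB network used in the paper: the layer sizes passed to `init_network` are a hypothetical placeholder, the tanh and sigmoid activations are assumptions, the weights are random rather than trained, and scaled conjugate gradient training is not implemented here.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def init_network(layer_sizes, seed=0):
    """Create random weights for consecutive layers, e.g. [2, 10, 1].

    Each layer is a (weights, biases) pair; weights has one row per unit.
    """
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, inputs):
    """Propagate one input vector; tanh in hidden layers, sigmoid at the output."""
    activations = list(inputs)
    last = len(layers) - 1
    for index, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        if index == last:
            activations = [sigmoid(v) for v in z]
        else:
            activations = [math.tanh(v) for v in z]
    return activations[0]


# Two inputs, one hypothetical hidden layer of 10 units, one output.
network = init_network([2, 10, 1], seed=0)
probability = forward(network, [0.5, -0.2])
```

With untrained weights the output is meaningless as a prediction; the sketch only shows how two inputs flow through the layers to a value between 0 and 1.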


Confusion matrices, an ROC plot, and the AUC will also be generated for the neural network. Its accuracy and AUC will be computed and compared with those of the stepwise logistic regression model to decide which model better predicts whether a player will become an NBA starter.


B. Addressing Research Question


Below I discuss how the methods above will be used to address the main research question of the project, restated here: What are the most significant predictors of whether or not a player will be a starter in the NBA, based on the data supplied by the official NBA website?

This question will be addressed using the stepwise logistic regression model, because it is the model that decides which variables are significant predictors and which are not. The neural network model will use only the significant predictors from the stepwise logistic regression model, to see whether it can produce a more effective model.
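The variable selection idea can be sketched as a forward stepwise procedure: start with an intercept-only model and repeatedly add the candidate that most improves the log-likelihood, as long as the improvement passes a likelihood-ratio test. The Python sketch below is a simplified stand-in, not MATLAB's stepwise routine: it never removes variables, fits by plain gradient ascent, assumes non-constant features, and uses synthetic data with hypothetical feature names.

```python
import math

CHI2_1DF_95 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05


def _sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def standardize(col):
    """Rescale a non-constant column to mean 0 and standard deviation 1."""
    mean = sum(col) / len(col)
    sd = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
    return [(v - mean) / sd for v in col]


def fit_log_likelihood(columns, y, iters=3000, lr=1.0):
    """Fit logistic regression by gradient ascent; return the log-likelihood."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    w = [0.0] * (len(columns) + 1)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for row, target in zip(design, y):
            err = target - _sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    ll = 0.0
    for row, target in zip(design, y):
        p = _sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        ll += target * math.log(p) + (1 - target) * math.log(1.0 - p)
    return ll


def forward_stepwise(features, y):
    """Return feature names in order of entry (forward selection only)."""
    cols = {name: standardize(vals) for name, vals in features.items()}
    selected = []
    current_ll = fit_log_likelihood([], y)
    while True:
        best_name, best_ll = None, current_ll
        for name in cols:
            if name in selected:
                continue
            trial = [cols[s] for s in selected] + [cols[name]]
            ll = fit_log_likelihood(trial, y)
            if ll > best_ll:
                best_name, best_ll = name, ll
        # Stop when no candidate passes the likelihood-ratio test.
        if best_name is None or 2.0 * (best_ll - current_ll) < CHI2_1DF_95:
            return selected
        selected.append(best_name)
        current_ll = best_ll


# Synthetic data: "signal" drives the label, "noise" is unrelated to it.
n = 40
signal = [float(i) for i in range(n)]
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
labels = [1 if i >= 20 else 0 for i in range(n)]
for i in (16, 17):
    labels[i] = 1
for i in (22, 23):
    labels[i] = 0

chosen = forward_stepwise({"signal": signal, "noise": noise}, labels)
```

On this synthetic data only the informative feature enters the model, which mirrors how a stepwise procedure can keep some statistics and drop others.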

4. Results

A. Stepwise Logistic Regression Model


A logistic regression model is built using stepwise regression. The model is shown in Figure 1. The variables x1 and x2 in Figure 1 correspond to points per game and total rebounds per game, respectively.



Figure 1. Stepwise Logistic Regression Model.
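For reference, a logistic regression model with these two predictors has the general form below. The symbols \beta_0, \beta_1, and \beta_2 are placeholders for the fitted intercept and coefficients, whose values appear only in Figure 1 and are not reproduced here; \hat{p} denotes the predicted probability that a player is a “true starter”.

```latex
\[
\hat{p} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}},
\qquad
\log\frac{\hat{p}}{1 - \hat{p}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2
\]
```

Here x_1 is points per game and x_2 is total rebounds per game. A positive coefficient means that a higher value of that statistic raises the predicted probability of starting.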


Figure 2 shows the ROC Plot for the stepwise logistic regression model.



Figure 2. ROC Plot for Stepwise Logistic Regression Model.

B. Neural Network Model

A neural network model is built and shown in Figure 3.




Figure 3. Neural Network Model.

The neural network model takes points per game and total rebounds per game as its two inputs, because these are the two most reliable predictors of whether a player will be an NBA starter according to the stepwise logistic regression model (Figure 1). Figure 4, Figure 5, and Figure 6 show the ROC plot, the confusion matrix, and the best validation performance for the neural network model, respectively.




Figure 4. A Randomly Selected ROC Plot for Neural Network Model.

Figure 5. A Randomly Selected Confusion Matrix for Neural Network Model.

Figure 6. A Randomly Selected Best Validation Performance for Neural Network Model.


C. Comparison between Generated Models

A comparison between the two models is necessary to identify which one more reliably predicts whether a player will be an NBA starter. Table I shows the comparison of accuracy and AUC for the two generated models.

Table I. Comparison of Accuracy and Area Under ROC for Generated Models

Since the neural network produces different results every time it is trained, its accuracy and area under the ROC curve are calculated by taking 10 random samples of the neural network results and averaging them.
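The averaging step can be sketched in Python as follows. The function `stand_in_run` is a hypothetical placeholder that returns deterministic dummy metrics; in practice it would train the network once with the given seed and return the measured accuracy and AUC. The numbers it produces are not results from this study.

```python
from statistics import mean


def average_over_runs(train_and_evaluate, n_runs=10):
    """Call the training function once per seed and average its metrics."""
    results = [train_and_evaluate(seed) for seed in range(n_runs)]
    return {
        "accuracy": mean(r["accuracy"] for r in results),
        "auc": mean(r["auc"] for r in results),
    }


def stand_in_run(seed):
    """Hypothetical stand-in for one training run (dummy values only)."""
    return {"accuracy": 0.5 + seed / 100, "auc": 0.6 + seed / 100}


averaged = average_over_runs(stand_in_run, n_runs=10)
```

Averaging over several runs reduces the influence of any single random initialization on the reported figures.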



5. Discussion

A. Regression Analysis

i. Stepwise Logistic Regression Model

Figure 1 shows that the two most reliable predictors of whether a player will be an NBA starter are points per game and total rebounds per game. This conclusion makes sense in today’s NBA: of the four candidate variables (age, points per game, assists per game, and total rebounds per game), points and assists are highly correlated with each other, so assists add little information once points per game is in the model. Normally, the starting point guard of a team has to be good at both play-making (assisting) and scoring. The same is not necessarily true of rebounding, which therefore contributes separate information. Age also varies widely among starters: some players are still starting at 35 years old (LeBron James), while others have started since their rookie season (Zion Williamson), which makes age a poor predictor of starting. Therefore, the two variables that the stepwise logistic regression model selects are reasonable.


This prediction model’s effectiveness is supported by Figure 2, which shows the ROC curve for the stepwise logistic regression model. From Table I, its AUC value is 0.9342, which indicates that points per game and total rebounds per game are reliable predictors of whether a player will be an NBA starter.


ii. Neural Network Model

Figure 4 shows the ROC curve for the neural network model. As with the stepwise logistic regression model, the area under the curve is much greater than 0.5, the value expected from a classifier that guesses purely by chance. Figure 5 further supports the model’s effectiveness by showing that its accuracy is around 86.5% (although the accuracy of the neural network model may differ each time it is trained, it should not vary much between runs). Figure 6 shows that the best validation performance (the validation error) is close to 0, which is also a good sign for the neural network model. Taken together, the figures indicate that the neural network model is a reliable predictor of whether or not a player will be a starter in the NBA.


B. Model Comparison

Since both models have a discrete categorical target variable (whether or not the player’s number of games started is greater than 80% of his total games played), the two model performance measures worth comparing are accuracy (from the confusion matrix) and Area Under the ROC Curve, as shown in Table I. Table I shows that the stepwise logistic regression model is the more reliable of the two, with slightly higher accuracy and AUC. However, the differences between the two models are small: the difference in accuracy is only 0.4831% and the difference in AUC is only 0.00131. Both models are therefore effective at predicting whether a player will be an NBA starter (both have high accuracy and AUC), but if a single model must be chosen, the stepwise logistic regression model is the better choice for predicting whether or not a player will be an NBA starter from his points per game and total rebounds per game.


However, the neural network may achieve higher accuracy if it is made larger or trained for longer, since this research uses only 10 hidden layers. Changing the training function and increasing the number of hidden layers may increase the training time, but the results may also be more accurate.


6. Conclusion

To conclude, I answer the research question based on the results and discussion. According to the stepwise logistic regression model, the two most significant predictors of whether or not a player will be a starter in the NBA, based on the data supplied by the official NBA website, are points per game and total rebounds per game.


Of the two models described above, the stepwise logistic regression model is the more effective at predicting whether or not a player will be an NBA starter. However, the neural network model has similar accuracy and area under the ROC curve, so it could also be an effective model to use.


7. Limitations

A total of 11% of all players in the NBA were omitted from this study due to missing data.


Another limitation is that this model is based on the current state and trends of the NBA. For example, the current NBA places a lot of emphasis on three-point shots. The model assumes that this emphasis will stay constant in the future. If it changes, the points players score may change as well, which would affect the predictions because points per game is one of the inputs to the model.

8. Future Research

Future research could add more variables to the model. One promising variable is salary, because players with higher salaries are normally the better players on a team. Another interesting approach would be to use an older data set: in the past, players did not pay as much attention to three-point shots, which might have influenced the points scored by each player. It would be interesting to see whether points per game is still a significant predictor of whether a player will start in that setting. Finally, by combining research on past statistics with this research, a new model could be proposed that is robust to future changes in the emphasis on three-point shots.

9. References

  1. https://stats.nba.com.

  2. https://www.basketball-reference.com/.

  3. “National Basketball Association,” April 2020.

  4. Z. Cui, “What Are The Factors that Affect The All NBA Teams,” pp. 64–69, November 2019.

  5. https://www.espn.com.

  6. J. Perricone, I. Shaw, and W. Swiechowicz, “Predicting Results for Professional Basketball Using NBA API Data,” Stanford University.









