Where is NBA analytics headed? The buzzwords you will hear more and more over the next few years are “artificial intelligence”, “machine learning” and “deep learning”.
What the hell does that mean?
Well, in short, it’s letting computers learn and think for themselves, coming up with predictions without a human giving them explicit instructions.
Remember the Terminator movies? That crazy computer program called "Skynet" was thinking for itself and was going to blow up the world… that’s kind of what I’m talking about here. Maybe a better example is the self-driving cars we are going to see more of on our roads in the next few years, all of which rely on these buzzwords.
What I have put together below is an example of using machine learning in the NBA: a model that thinks for itself and automatically puts together a high-level team scout for each NBA team.
There are so many stats out there right now to evaluate teams, where do you even start? And how do you know if what you are looking at is meaningful and matters? Well, the intention of this model is to give coaches and scouts that starting point, pointing them in the right direction to come up with a game plan and to self-evaluate what is working against their own team.
Teams can obviously also use this automatic scout to help confirm what their own scouts and coaches are thinking.
The model doesn’t have any bias like a human, it doesn’t like certain teams or players, and it doesn’t have any preconceived ideas on what it thinks a team does well.
What this model doesn’t do is go into the nitty-gritty detail you would see in an in-depth team scouting report, like how to defend a pick and roll or what offensive sets a team runs. That is what coaches or scouts would do after the auto scouting model has pointed them in the right direction.
What we end up with is a high-level scouting report that ranks the top 5 things a team does well when they win and the top 5 ways teams beat them, all based on more than 50 different traditional and advanced statistics.
One catch with machine learning models is that they need to learn. A model needs as much data as possible to get an understanding of what is important before it can make accurate predictions.
My model has taken all of last season’s data to learn how teams win and lose games. If you are starting off a new season, the results you get from the model are obviously not going to be as accurate until it gets enough games under its belt to learn from. That’s not to say you won’t get predictions, just that they will become more accurate as the season progresses.
As machine learning is a very technical area, I will go into more detail at the end of this post for any nerds that are interested in how I put this all together.
For everyone else that just wants to see the results, you can have a play with the interactive table below.
From looking at these results you can see straight away that the auto scout confirms some common perceptions about teams.
What could a team then do with these results? Well, let’s look at Dallas, for example, as if they were self-evaluating how teams beat them.
We can see the number one way teams beat Dallas is through made 3-pointers. This tells me Dallas doesn’t defend the 3-point shot well; I’m assuming teams didn’t just get lucky all season and make a lot of 3s against them.
Coaches could then dive into this deeper themselves and pinpoint the reasons why: are they being exposed by drive-and-kick? Ball movement? Do they have defensive principles that are hurting their help-and-recover on defense? Are they over-helping?
The auto scout isn’t sophisticated enough yet to pinpoint the exact reason for all the made 3s, but what it does do is point coaches and scouts in the right direction, concentrating their efforts on things that matter rather than getting bamboozled by all the stats and analysis out there.
The next question you might have is: yeah, but how can I trust this thing? Is it spitting out rubbish or is it accurate? Well, from the results I am getting, every team is slightly different in terms of the model’s accuracy, but what I have found is that the model rarely gets the outcome of a game wrong, and very rarely is its predicted margin off by more than 5 points.
The graph below shows the predictions the model has made for a number of OKC Thunder games last season. You can see most predictions are within 5 points of the actual result.
The model takes about 50 different statistics from a game and then makes a prediction on what the outcome of that game was.
Now that we know the model is accurate at predicting the outcome of a game, we can trust that it knows which statistics matter most in determining that outcome, since it makes its predictions from exactly that data.
OK, so now for the super nerdy stuff. Look away now if you are not interested, but I’ll try to keep it as simple as possible and explain how this all got put together. You will want some knowledge of Python and machine learning to have any hope of following this.
Here is a high-level process flow, and you can see that the model was developed predominantly using the Python programming language.
I’ll now go through the steps in a bit more detail, with every step having a nerdy explanation and then what it means in plain English.
1. To gather the data, I used my own modified version of the freely available nba_py API to scrape the data from the NBA website. It just needed a little cleanup of a few endpoints, as the NBA.com stats website changes pretty frequently.
In simple terms, I grabbed all meaningful stats from the NBA website.
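As a rough sketch of what this scraping step involves, here is the kind of request a scraper assembles for one of the NBA stats endpoints. The endpoint and parameter names here are illustrative of the style of the stats API, not the exact internals of my modified nba_py code, which handles many more endpoints and the headers the site expects.

```python
# Hypothetical sketch of assembling a stats.nba.com request, similar in
# spirit to what an nba_py-style scraper does. No network call is made here;
# a real scraper would pass these to requests.get() with the right headers.
def build_gamelog_request(team_id, season):
    """Build the endpoint URL and query parameters for a team game-log pull."""
    base = "https://stats.nba.com/stats/teamgamelog"
    params = {
        "TeamID": team_id,
        "Season": season,          # e.g. "2016-17"
        "SeasonType": "Regular Season",
    }
    return base, params

url, params = build_gamelog_request(1610612742, "2016-17")
```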
2. The data comes into Python as a pandas DataFrame, which I then exported to a CSV file and played around with in Excel and SQL Server, picking and choosing a set of traditional and advanced box score stats to use as a starting point.
In simple terms, I got all of the data and stored it in a Microsoft Excel-like file that I could play around with.
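A minimal sketch of that step, using a made-up handful of per-game stats (the column names are just examples in the NBA stats style; the real DataFrame has 50+ columns):

```python
import pandas as pd

# Toy per-game team stats; the real DataFrame comes back from the scraper
# with 50+ traditional and advanced box score columns.
games = pd.DataFrame({
    "TEAM": ["DAL", "OKC", "DAL"],
    "PTS": [102, 110, 95],
    "FG3M": [12, 9, 7],
    "PLUS_MINUS": [5, -3, -8],
})

# Export to CSV so it can be explored in Excel or loaded into SQL Server.
games.to_csv("team_game_stats.csv", index=False)
```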
3. With the data in a reasonable format, it’s time to begin the machine learning process in Python, which involves a number of structured steps. You need to work through these before you have a robust model that starts spitting out useful predictions.
In simple terms, there are a few steps you need to work your way through when trying to apply machine learning.
4. I first load in all the statistics and start evaluating the data using visualizations such as Matplotlib plots, histograms and a correlation matrix. These tools help you understand the data a little more: what’s important, what’s not, and which algorithm might suit best. This step really raises more questions than answers, but it helps you later on in applying the right algorithm.
In simple terms, it’s good to understand the data you are looking at a bit further. Some pretty pictures can help you identify if there is rubbish data for example that you can delete to simplify things.
Below is an example of a correlation matrix, which helps determine how strongly variables relate to each other. This is useful for telling us when two variables are too highly correlated, which can have adverse effects on regression algorithms. The bright yellow and almost black boxes show high correlation and are where I would remove statistics, simplifying the data set.
In simple terms, if you see any really light or really dark boxes, they are good candidates for data that can be deleted, as you don’t need both variables.
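To make the idea concrete, here is a small sketch of flagging highly correlated pairs with pandas. The data is a toy stand-in (FGM is built directly from FGA, so the two are strongly correlated); the real workflow does this on the full stats table and plots the matrix as a heatmap.

```python
import numpy as np
import pandas as pd

# Toy data: FGM is derived from FGA, so the pair is strongly correlated;
# TOV is independent noise.
rng = np.random.default_rng(0)
fga = rng.normal(85, 5, 200)
stats = pd.DataFrame({
    "FGA": fga,
    "FGM": 0.46 * fga + rng.normal(0, 0.5, 200),
    "TOV": rng.normal(14, 3, 200),
})

corr = stats.corr()

# Pairs above the threshold are candidates for dropping one of the two columns.
threshold = 0.9
redundant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
```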
5. Now that I have a data set without so many statistics, I run it through a number of different algorithms to test the performance and accuracy of their predictions. You never quite know which algorithm will work best, so it’s really a scattergun approach to begin with.
What I do know is that my data has the look and feel of a regression problem, as I ultimately want to predict the outcome of a game in terms of the winning or losing margin.
I therefore ran the data set through a mix of linear and nonlinear algorithms, using a technique called k-fold cross-validation to break up the data. What this does is split the data set into a portion that an algorithm can train itself on and a held-out test portion that it then tries to make predictions on.
Next, I ran the data set through algorithms called ensembles, which in most cases can give a better result.
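A sketch of that spot-checking step with scikit-learn, on synthetic stand-in data (in the real workflow, X is the per-game stats and y is the winning or losing margin). It compares a couple of linear models, a couple of nonlinear ones, and two ensembles under the same k-fold split; the exact model list here is illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in: 300 "games", 10 "stats", margin mostly linear in the stats.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 2, 300)

models = {
    "LR": LinearRegression(),
    "LASSO": Lasso(alpha=0.1),
    "KNN": KNeighborsRegressor(),
    "CART": DecisionTreeRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=50, random_state=0),
    "GBM": GradientBoostingRegressor(random_state=0),
}

# 10-fold split: train on 9 folds, predict on the held-out fold, rotate.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = {
    name: cross_val_score(m, X, y, cv=kfold,
                          scoring="neg_mean_absolute_error").mean()
    for name, m in models.items()
}
```

The scores are negative mean absolute errors (closer to zero is better), which makes it easy to rank the candidates before tuning anything.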
I also applied what are called data pipelines, which basically stop information from the test set leaking into the training process. This prevents the model from having an early peek at unseen data and getting to know it.
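A sketch of what that looks like in scikit-learn: putting the preprocessing inside a `Pipeline` means the scaler is re-fitted on the training folds only within each cross-validation split, so the held-out fold never influences it. Again the data here is a synthetic stand-in.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + rng.normal(0, 1, 200)

# The scaler is fitted inside each CV fold, on the training folds only,
# so nothing about the held-out fold leaks into the preprocessing.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", Lasso(alpha=0.1)),
])
scores = cross_val_score(pipe, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=7))
```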
In simple terms, I worked out which model would make the most accurate predictions.
6. The final model I came up with uses the Lasso regression algorithm. I won’t dive into how it works here, as it would confuse us all; Google it if you are interested.
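For the curious, a tiny sketch of what makes Lasso attractive here: its L1 penalty shrinks the coefficients of unhelpful statistics all the way to zero, which doubles as feature selection. The data is a toy stand-in where only the first two of five "stats" actually drive the margin.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Toy data: only the first two columns actually drive y.
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 300)

# The L1 penalty (alpha) pushes irrelevant coefficients toward exactly zero.
model = Lasso(alpha=0.2).fit(X, y)
```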
As I mentioned earlier, the model is making pretty good predictions on the outcome of games and I now want to know which statistics have the biggest impact on the outcome of a game. What the model brings back here would form the auto scout.
I used a machine learning technique called recursive feature elimination (RFE), which basically ranks the statistics the model found to be the biggest factors in teams winning or losing.
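A sketch of RFE with scikit-learn, again on synthetic stand-in data: it repeatedly fits the model, drops the weakest feature, and refits until only the requested number remain. The `ranking_` attribute then gives each feature's rank, with 1 meaning it survived to the end.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso

# Toy data: only columns 0 and 3 drive the outcome.
rng = np.random.default_rng(5)
X = rng.normal(size=(300, 6))
y = 4 * X[:, 0] + 2 * X[:, 3] + rng.normal(0, 0.5, 300)

# RFE fits, drops the weakest feature, and refits until 2 features remain.
selector = RFE(Lasso(alpha=0.05), n_features_to_select=2).fit(X, y)
ranking = selector.ranking_  # 1 = kept / most important
```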
And that's about it!
If you are still with me at this point, well done! All I will say from here is that machine learning will be a powerful tool for NBA teams moving forward, and I hope this post gave you some insight into its value and what is involved in making it happen.