Introduction

American football is a sport that has been played in the United States since 1869, and it has been intertwined with the country’s culture for over a century. For the purposes of this project, I will refer to American football simply as football. The sport is celebrated and enjoyed every year at all levels, from middle school to the professional leagues, and it is the most popular sport in the United States. The National Football League (NFL), the most popular American professional league, has the highest average attendance of any professional sports league in the world. Its championship game, the Super Bowl, is one of the most watched club sporting events in the world.

There are a total of 32 teams in the NFL. The teams are divided into two conferences, the American Football Conference (AFC) and the National Football Conference (NFC). Each conference has 16 teams, and the teams in each conference are fixed from year to year. The season consists of a 3-week preseason and an 18-week regular season, followed by the playoffs and finally the championship game, the Super Bowl. The teams

In this project, I will use machine learning techniques to predict which team is most likely to win the NFL Super Bowl 2023.

Process Breakdown

Several steps must be completed before a statistical or machine learning model can be built. The first step is to determine which variables (features) should be used; I will then collect the data containing those variables. The data then needs to be prepared in a usable format. Next, I will perform Exploratory Data Analysis (EDA) to gain insights from the data. With those insights, I will have enough knowledge to choose and apply a machine learning model. I will split the data into a training set and a testing set so that the model can be evaluated. Finally, I will summarize the model’s findings and discuss future improvements.
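As a rough illustration of the train/test step described above, the sketch below holds out whole seasons for testing. Splitting by season is an assumption of this example, not a choice stated in the project, and the row layout, the `split_by_season` helper and the placeholder team names are all hypothetical.

```python
def split_by_season(rows, test_seasons):
    """Split team-season rows into training and testing lists by season."""
    train = [row for row in rows if row["season"] not in test_seasons]
    test = [row for row in rows if row["season"] in test_seasons]
    return train, test


# Hypothetical placeholder rows: one dictionary per team per season.
rows = [
    {"season": season, "team": team}
    for season in range(2010, 2022)   # seasons 2010 through 2021
    for team in ("Team A", "Team B")  # made-up team names
]

# Hold out the two most recent seasons as the testing set.
train, test = split_by_season(rows, test_seasons={2020, 2021})
print(len(train), len(test))
```

Holding out complete seasons keeps every team from a given year on the same side of the split, which is one simple way to avoid testing on a season the model has already seen.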

Selecting appropriate features

A football team has several sub-units that work together to win games: the offense, the defense, and special teams. Hence, it was crucial to consider statistics from every aspect of a team. Based on my research, I selected the statistics that I considered important for judging a team’s performance over a season. The statistics required for this project are:

Team Statistics:

1. Point Differential – This number is the sum of the point differences (points scored minus points allowed) across all of a team’s games. For example, if a team wins game 1 with a score of 21-17, its PD is 4. If the same team then loses game 2 with a score of 20-21, its new PD is 4 + (-1) = 3.

2. Winning percentage – This is a percentage metric that indicates how often the team wins. If a team’s winning percentage is 50%, the team wins 50% of the games it plays.

3. Average ball possession/Game – This is a percentage metric of a team’s average ball possession throughout the season. For example, a team with a possession percentage/game of 60% held the ball, on average, for 60% of the total game duration over the season.

Offensive Statistics:

4. Percentage of offensive drives ending in a score – This is a percentage metric that describes the scoring success of the offensive team.

5. Percentage of offensive drives ending in the Red Zone – This is a percentage metric that gives insight into the performance of the offensive team. It measures the ratio of the offensive drives that end in the Red Zone to all the offensive drives the team has made in the season. The Red Zone is the 20-yard area adjacent to the End Zone, which is where the offensive team is trying to get the ball.

6. Offensive turnover percentage – This is a percentage metric that measures the ratio of the offensive drives that end in a turnover, by any method, to the total offensive drives made by the team in the season. The lower this statistic, the better the team’s chances of winning.

7. Average Yards/Game – This is a numeric metric that measures the average number of yards per game the offensive team gains while moving towards the End Zone.

8. Total Points scored/Game – This is a numeric metric that gives the average number of points the team scores per game throughout the season.

Defensive Statistics:

9. Average Points allowed by the defense/Game – This is a numeric metric of the average points per game scored by the opposing team while the defense was defending the End Zone.

10. Average Yards allowed by defense/Game – This is a numeric metric that shows how many yards per game the opposing offense was able to gain while the defense was defending the End Zone.

Special Teams Statistics:

11. Field Goals made per season – This is a numeric metric of the field goals made by a team over the whole season.

12. Field Goal success percentage – This is a percentage metric that measures the ratio of field goals made to field goals attempted.

13. Percentage of punts that are within 20 yards – This is another percentage metric, giving the ratio of punts that land within 20 yards of the End Zone to the total number of punts made by the punter.

Playoff Statistics:

14. Playoff win percentage – Since not all teams make the playoffs, this metric is quite important. Seven teams from each conference qualify for the playoffs (six before the 2020 season). This percentage is the ratio of the games a team wins in the playoffs to the total number of games it plays in the playoffs.
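To make the definitions above concrete, here is a minimal sketch of how the point differential, the winning percentage, and the ratio-style percentages could be computed. The function names and the `(points_scored, points_allowed)` tuple layout are assumptions of this example, ties are simply treated as non-wins, and the field goal counts are made-up numbers used only for illustration.

```python
def point_differential(games):
    """Sum of (points scored - points allowed) over all games."""
    return sum(scored - allowed for scored, allowed in games)


def winning_percentage(games):
    """Percentage of games won; ties count as non-wins in this sketch."""
    wins = sum(1 for scored, allowed in games if scored > allowed)
    return 100.0 * wins / len(games)


def percentage(part, whole):
    """Generic ratio for the drive, field goal and punt percentages."""
    return 100.0 * part / whole if whole else 0.0


# The two-game example from the Point Differential definition:
# a 21-17 win followed by a 20-21 loss.
games = [(21, 17), (20, 21)]

pd_value = point_differential(games)  # 4 + (-1) = 3
win_pct = winning_percentage(games)   # one win in two games = 50.0
fg_pct = percentage(27, 30)           # hypothetical: 27 made of 30 attempts

print(pd_value, win_pct, fg_pct)
```

The same `percentage` helper covers the drive-based metrics as well, for example drives ending in a turnover divided by total drives.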

Labels to predict

The purpose of the project is to predict whether a team will win the NFL Super Bowl. Hence, the labels will be 1 and 0, for winning and not winning the Super Bowl, respectively.
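The labeling rule above can be sketched as follows. The `add_labels` helper, the row layout and the `champions` mapping are hypothetical names introduced for this example; the only real data used is that the Los Angeles Rams beat the Cincinnati Bengals in the Super Bowl that concluded the 2021 season.

```python
def add_labels(rows, champions):
    """Return copies of rows with 'label' set to 1 for the champion, else 0."""
    return [
        {**row, "label": int(champions.get(row["season"]) == row["team"])}
        for row in rows
    ]


# One row per team per season; the feature columns are omitted for brevity.
rows = [
    {"season": 2021, "team": "Los Angeles Rams"},
    {"season": 2021, "team": "Cincinnati Bengals"},
]

# Maps each season to the team that won that season's Super Bowl.
champions = {2021: "Los Angeles Rams"}

labeled = add_labels(rows, champions)
for row in labeled:
    print(row["team"], row["label"])
```

Because only one of the 32 teams earns a label of 1 in any season, the two classes are heavily imbalanced, which is worth keeping in mind when evaluating a model.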

Data Collection

With the features and labels for this project defined, I searched the web for the data. My goal was to find all the features mentioned in section 4 for all 32 teams from 2010 to the present. I found the data I needed on three websites: www.pro-football-reference.com, covers.com and teamrankings.com. Fortunately, covers.com and teamrankings.com allowed me to scrape the data directly from their pages using a URL. In contrast, extracting data from www.pro-football-reference.com was extremely tedious: the data was downloaded as Excel files and then converted to CSV. For each year, I downloaded 5 tables from www.pro-football-reference.com, 2 from covers.com and 1 from teamrankings.com. I collected the data for the years 2010-2021. The data for 2022 was collected after the completion of week 13 of the regular season. I will continue collecting 2022 data until the last week of January so that the model can be applied to the latest data at any time.
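Since each season yields eight separate tables (5 + 2 + 1), they have to be combined into one row per team at some point. The sketch below shows one possible way to do that with CSV text keyed on a team column. The column names (`Team`, `YardsPerGame`, `PointsAllowedPerGame`), the helper functions and all values are made up for illustration and do not reflect the actual layout of the downloaded files.

```python
import csv
import io


def read_table(text):
    """Parse CSV text into a mapping of team name to that team's columns."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["Team"]: {key: value for key, value in row.items() if key != "Team"}
        for row in reader
    }


def merge_tables(tables):
    """Combine several per-team tables into a single mapping keyed by team."""
    merged = {}
    for table in tables:
        for team, stats in table.items():
            merged.setdefault(team, {}).update(stats)
    return merged


# Made-up stand-ins for two of the tables collected for one season.
offense_csv = "Team,YardsPerGame\nTeam A,350.5\nTeam B,310.0\n"
defense_csv = "Team,PointsAllowedPerGame\nTeam A,20.1\nTeam B,24.3\n"

merged = merge_tables([read_table(offense_csv), read_table(defense_csv)])
print(merged["Team A"])
```

The values stay as strings here; converting them to numbers would be part of the data preparation step.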