Reading Multiple CSVs into Merged Python3 DataFrame

The purpose of this script is to load the .csv files containing trading data into Python 3 and take a first look at them.

##Import libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

#Read the train data
Train_data=pd.read_csv('../input/jane-street-market-prediction/train.csv')

#Read the test data
Test_data=pd.read_csv('../input/jane-street-market-prediction/example_test.csv')

#Read the features
features=pd.read_csv('../input/jane-street-market-prediction/features.csv')

#Exploring the Data
#Return the first five rows of the training dataset
Train_data.head()

#First I get a quick glimpse of the weight column and its distribution
Train_data['weight'].describe()

#Here I am creating a new column called 'return' where return is the weight * resp
Train_data['return']=Train_data['weight']*Train_data['resp']

Train_data['return'].describe()

Train_data['return'].head()

import matplotlib.pyplot as plt
plt.scatter(x='date', y='return', data=Train_data)
plt.show()
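
The scatter plot shows one point per trade. As a complementary view, the weighted returns can be summed within each date. The following is a minimal sketch on hypothetical toy rows (the values are invented for illustration and are not taken from train.csv); only the column names `date`, `weight`, and `resp` come from the script above.

```python
import pandas as pd

# Hypothetical toy rows standing in for train.csv
toy = pd.DataFrame({
    "date":   [0, 0, 1, 1],
    "weight": [1.0, 0.0, 2.0, 0.5],
    "resp":   [0.01, 0.05, -0.02, 0.04],
})

# Same definition as in the script: return = weight * resp
toy["return"] = toy["weight"] * toy["resp"]

# Sum the weighted returns within each date
daily_return = toy.groupby("date")["return"].sum()
print(daily_return)
```

Rows with a weight of zero contribute nothing to the daily total, which is one reason to inspect the weight distribution first.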

NFL Big Data Bowl

In this project, we analyzed the blitz plays in the NFL in the 2018 season. I collaborated with Sabri Rafi and Liyi Liang.

Introduction

We are recent college graduates with a passion for football. Welcome to our NFL Big Data Bowl 2021 presentation.

A blitz is defined here as a play in which more than four pass rushers bring pressure to sack the quarterback or to disrupt the play.
The blitz is often used in high-pressure situations to give the quarterback less time to decide whether to pass or run the ball.

Some plays in football are more critical in determining the outcome of the game than others. Situations where
a blitz occurs are considered to be among those moments. Because a blitz is so intensive, NFL teams may find
useful insights by analyzing the specific instances where blitzes occur and seeing which variables contribute to overall defensive success.

In this project, we analyze relationships between the blitz and other features to identify the situations in which blitzes are most successful. This information could be valuable to NFL teams, which may be able to apply it in live game situations based on the features we have identified.

#Import pandas and numpy for data analysis
import pandas as pd
import numpy as np

#Import Seaborn and Matplotlib for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

#Import Sklearn Preprocessing Package
import sklearn.preprocessing as preprocessing

##Import Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier

##Import Linear Regression
from sklearn.linear_model import LinearRegression

##Import train_test_split
from sklearn.model_selection import train_test_split

##Import Accuracy Score
from sklearn.metrics import accuracy_score

#Regular Expressions
import re

#Warning Messages Control
import warnings
warnings.filterwarnings('ignore')

##Handle Date Time Conversions between pandas and matplotlib
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

# Use white grid plot background from seaborn
sns.set(font_scale=1.5, style="whitegrid")

#Use matplotlib's 'bmh' style sheet (named after "Bayesian Methods for Hackers"); this only changes how plots look
plt.style.use('bmh')

#We automatically generate the EDA using the Pandas_profiling package
import pandas_profiling

##Raise the maximum number of columns pandas displays; by default pandas hides columns in wide DataFrames.
pd.set_option('display.max_columns', 100)

#Import plays and game data
plays = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/plays.csv')
games = pd.read_csv('/kaggle/input/nfl-big-data-bowl-2021/games.csv')
players=pd.read_csv('../input/nfl-big-data-bowl-2021/players.csv')

#Return the mean number of defenders in the box, split by whether the pass result is a sack ('S')
plays.groupby(plays['passResult']=='S')['defendersInTheBox'].mean()

#Return the median number of defenders in the box, split by whether the pass result is a sack
plays.groupby(plays['passResult']=='S')['defendersInTheBox'].median()

#Count plays by number of pass rushers, split by whether the pass result is a sack
plays.groupby(by=plays['passResult'] == 'S')['numberOfPassRushers'].value_counts()

#Return the mean number of Pass Rushers grouped by Pass Result
plays.groupby('passResult', as_index=False)['numberOfPassRushers'].mean()

#Create a dictionary of offensive, defensive, and special team roles, then invert it so each position maps to its role
cat_item = {'Offense': ['QB', 'RB', 'FB', 'WR', 'TE', 'HB'], 
            'Defense': ['OLB', 'MLB', 'LB', 'ILB', 'CB', 'DE', 'DT', 'NT', 'DB', 'S', 'SS', 'FS'], 
            'Special': ['K', 'P', 'LS']}
item_cat = {w: k for k, v in cat_item.items() for w in v}

#We are mapping the positions that we have defined to the team_role column. 
players['team_role'] = players['position'].map(item_cat)

#Returns the last five rows of the players dataset
players.tail(5)

#Returns the defensive positions
cat_item['Defense']
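
Any position that does not appear in `cat_item` receives NaN from `map`. A minimal sketch of how such positions could be listed, using a shortened copy of the dictionary and hypothetical toy players ('OT' is chosen here only as an example of an abbreviation missing from the dictionary):

```python
import pandas as pd

# Shortened copy of the role dictionary, for illustration only
cat_item = {"Offense": ["QB", "RB"], "Defense": ["CB", "FS"], "Special": ["K"]}
item_cat = {pos: role for role, positions in cat_item.items() for pos in positions}

# Hypothetical players; 'OT' is not in the dictionary
toy_players = pd.DataFrame({"position": ["QB", "CB", "K", "OT"]})
toy_players["team_role"] = toy_players["position"].map(item_cat)

# Positions left without a role after mapping
unmapped = toy_players.loc[toy_players["team_role"].isna(), "position"].tolist()
print(unmapped)
```

Checking `unmapped` on the real `players` table would show whether the dictionary needs more entries.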

Data Cleaning

#Section 1 - Correcting, Completing, Creating, and Converting

def ProcessPlays(df):
    
    #Return df unchanged if the function has already been executed
    if 'Blitz' in df:
        return df 
    
    ##############################################################
    #Blitz
    #Categorize if it is a blitz
    df['Blitz'] = np.where(df['numberOfPassRushers']>4, 1,0)
    
    ##############################################################
    #Personnel
    #dropna returns a new DataFrame, so the result has to be assigned back
    df = df.dropna(subset=['personnelO','personnelD']).copy()  #lot of missing records in the end
    
    #Offense Team Formation using 'personnelO' column 
    temp = df['personnelO'].str.split(',',n = 2, expand = True)
    df['Off_RB'] = temp[0].str.split(' ',n = 1, expand = True)[0]
    df['Off_TE'] = temp[1].str.split(' ',n = 2, expand = True)[1]
    df['Off_WR'] = temp[2].str.split(' ',n = 2, expand = True)[1]

    #Defense Team Formation using 'personnelD' column
    temp2 = df['personnelD'].str.split(',',n = 2, expand = True)
    df['Def_DL'] = temp2[0].str.split(' ',n = 1, expand = True)[0]
    df['Def_LB'] = temp2[1].str.split(' ',n = 2, expand = True)[1]
    df['Def_DB'] = temp2[2].str.split(' ',n = 2, expand = True)[1]
    
    return df

plays = ProcessPlays(plays)
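
The positional splits in `ProcessPlays` rely on each personnel string listing exactly three groups in a fixed order. As an alternative, the count in front of each position label can be pulled out with a regular expression. This is a minimal sketch under assumptions: the sample strings and the helper `count_for` are hypothetical, and the real `personnelO` values may use other layouts.

```python
import pandas as pd

# Hypothetical personnel strings in the "count POSITION, ..." style
toy = pd.DataFrame({"personnelO": ["1 RB, 1 TE, 3 WR", "6 OL, 2 RB, 1 TE, 1 WR"]})

def count_for(series, position):
    """Return the integer in front of `position`, or 0 when it is not listed."""
    found = series.str.extract(rf"(\d+)\s+{position}\b", expand=False)
    return pd.to_numeric(found).fillna(0).astype(int)

for pos in ["RB", "TE", "WR"]:
    toy[f"Off_{pos}"] = count_for(toy["personnelO"], pos)

print(toy)
```

Because the lookup is by label rather than by position in the string, an extra leading group such as the offensive line in the second row does not shift the other counts.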

Defense Success

#Defense Success

def DefenseSuccess(df):
    #Return df unchanged if the function has already been executed
    if 'Defense_Success' in df:
        return df 
    
    #A pass that is not complete counts as a defensive success
    df['Defense_Success'] = np.where(df['passResult'] != 'C', 1, 0)

    #Fourth down: success if the play gains fewer yards than needed (playResult < yardsToGo)
    df.loc[(df['yardsToGo'] > df['playResult']) & (df['down'] == 4), 'Defense_Success'] = 1
    
    #Other downs: success if the play gains fewer than 3.5 yards
    df.loc[(df['playResult'] < 3.5) & (df['down'] != 4), 'Defense_Success'] = 1

    return df

plays = DefenseSuccess(plays)
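
The three rules in `DefenseSuccess` can also be read as a single logical OR of boolean masks. The sketch below applies them to hypothetical toy plays (the rows are invented for illustration) so that the effect of each rule is visible.

```python
import pandas as pd

# Hypothetical toy plays
toy = pd.DataFrame({
    "passResult": ["C", "I", "C", "C"],
    "down":       [1,   2,   4,   3],
    "yardsToGo":  [10,  7,   2,   5],
    "playResult": [12,  0,   1,   2],
})

not_complete = toy["passResult"] != "C"
fourth_down_stop = (toy["down"] == 4) & (toy["playResult"] < toy["yardsToGo"])
short_gain = (toy["down"] != 4) & (toy["playResult"] < 3.5)

# A play is a defensive success if any one of the rules holds
toy["Defense_Success"] = (not_complete | fourth_down_stop | short_gain).astype(int)
print(toy)
```

Only the first row, a 12-yard completion on 1st down, fails all three rules.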

Additional Cleaning

def AdditionalCleaning(df):
    
    #Return df unchanged if the function has already been executed (playDescription is dropped at the end)
    if 'playDescription' not in df:
        return df 
    
    #Section 3 - Additional Cleaning 
    
    #gameClock - Fill in Missing Values
    
    #Work on an explicit copy to avoid pandas SettingWithCopy warnings
    na_clock = df[df['gameClock'].isnull()].copy()
    #The first token of the description is assumed to hold the clock wrapped in one character on each side, e.g. "(12:34)"
    time_parts = na_clock['playDescription'].str.split(' ', n=1, expand=True)[0].str[1:-1].str.split(':')
    na_clock['mm'] = time_parts.str[0].str.zfill(2)
    na_clock['ss'] = time_parts.str[1].str.zfill(2)
    na_clock['gameClock'] = na_clock['mm'] + ':' + na_clock['ss'] + ':00'
    df.loc[df['gameClock'].isnull(), 'gameClock'] = na_clock['gameClock']

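    #change gameClock to 'SecondsToEndofQuar' (intended as the number of seconds left until the end of the current quarter)
    #Note: if gameClock already holds the time remaining in the quarter, 15*60 minus it gives the seconds elapsed instead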
    
    df['mm'] = df.gameClock.str[:2]
    df['ss'] = df.gameClock.str[3:5]
    df['SecondsToEndofQuar'] = 15*60 - (df['mm'].astype(int)*60 +df['ss'].astype(int))
    
    #drop the unnecessary cols
    df.drop(['mm','ss','gameClock'],axis=1, inplace=True)
    
    ##############################################################
    #possessionTeam and yardlineSide --- too many categories.
    #add a column to define if possessionTeam == yardlineSide (1 if equal, 0 otherwise)
    df['PossYardlineSameSide'] = (df['possessionTeam'] == df['yardlineSide']).astype(int)
    
    #df.drop(['possessionTeam','yardlineSide'],axis=1, inplace=True)
    
    ##############################################################
    #offenseFormation fillna (141 records) based on distribution. 
    OffFor_Dist = df.offenseFormation.value_counts(normalize = True)
    OffFor_na = df['offenseFormation'].isnull()
    df.loc[OffFor_na,'offenseFormation'] = np.random.choice(OffFor_Dist.index,size = len(df[OffFor_na]),p=OffFor_Dist.values)    
    
    ##############################################################
    #typeDropback - fill in missing values
    # array(['TRADITIONAL', 'SCRAMBLE_ROLLOUT_LEFT', 'DESIGNED_ROLLOUT_LEFT', 'SCRAMBLE_ROLLOUT_RIGHT', 'DESIGNED_ROLLOUT_RIGHT', 'SCRAMBLE','UNKNOWN', nan], dtype=object)
    df.loc[df['typeDropback'] == 'UNKNOWN', 'typeDropback'] = np.nan
    DropbackDist = df.typeDropback.value_counts(normalize = True)
    Dropback_na = df['typeDropback'].isnull()
    df.loc[Dropback_na,'typeDropback'] = np.random.choice(DropbackDist.index,size = len(df[Dropback_na]),p=DropbackDist.values)    
    
    ##############################################################
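    #drop na
    #Note: this runs before the column drop below, so a missing value in any column still present (e.g. penaltyCodes) also removes the row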
    df.dropna(how = "any", inplace = True)
    ##############################################################
    #Change from object to int
    df['Off_RB'] = df['Off_RB'].astype(int)
    df['Off_TE'] = df['Off_TE'].astype(int)
    df['Off_WR'] = df['Off_WR'].astype(int)
    df['Def_DL'] = df['Def_DL'].astype(int)
    df['Def_LB'] = df['Def_LB'].astype(int)
    df['Def_DB'] = df['Def_DB'].astype(int)
    
    
    ##############################################################
    #drop some of the columns; 
    df.drop(['playDescription',
             'playType',
             #'typeDropback',
             #'preSnapVisitorScore',
             #'preSnapHomeScore',
             #'numberOfPassRushers',  
             'penaltyCodes',         #-Remove -- "result"
             'penaltyJerseyNumbers', #-player
             'epa',
             #,'offenseFormation',
             'personnelO',
             'personnelD',
             'passResult',           #-Remove -- "result" 
             #'offensePlayResult',    #-Remove -- "result" 
             'playResult',           #-Remove -- "result" 
             'isDefensivePI',        #-Remove -- "result" 
             #'Blitz',               
             #'gameId',             
             'playId'                #-Remove -- id column
            ],axis=1, inplace=True)
    return df

plays = AdditionalCleaning(plays)
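
`AdditionalCleaning` fills missing categories by sampling from the observed distribution, so the filled values change from run to run. A minimal sketch of the same idea with a seeded NumPy generator, which makes the fill reproducible; the toy column values and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # the seed value is an arbitrary choice

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({"offenseFormation": ["SHOTGUN", "SHOTGUN", "EMPTY", None, None]})

# Observed share of each category (missing values are excluded by value_counts)
dist = toy["offenseFormation"].value_counts(normalize=True)
missing = toy["offenseFormation"].isna()

# Draw one replacement per missing row, weighted by the observed shares
toy.loc[missing, "offenseFormation"] = rng.choice(
    dist.index.to_numpy(), size=int(missing.sum()), p=dist.to_numpy()
)
print(toy)
```

Sampling preserves the observed category shares better than filling every gap with the most frequent value, at the cost of adding randomness that the seed then pins down.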

EDA with Pandas Profiling

#Exploratory Data Analysis with Pandas Profiling
report_plays = pandas_profiling.ProfileReport(plays)

display(report_plays)

Graphs

Graph 1 – Which Down Is The Most Optimal Down To Blitz In?

Based on our results, 3rd down is by far the best down to blitz, with a defense success rate close to 60%. The other downs were nowhere near as successful: blitzes failed a majority of the time on 1st and 2nd down and were about 50-50 on 4th down. One likely reason is that 3rd down is a high-pressure moment, and the additional pressure of the blitz may create more opportunities for defensive success. One might expect that trend to continue on 4th down. However, when an offense decides to go for it on 4th down, it generally has only a few yards to go before converting, which may explain why the defensive success rate drops there.

#down
df1 = pd.DataFrame(plays.groupby(['down'])['Defense_Success'].mean())
df1['Defense_Fail'] = 1-df1['Defense_Success']
df1 = df1.reset_index()

#Plot
df1.plot(x="down", y=["Defense_Fail", "Defense_Success"], kind="bar")
plt.legend(loc='upper left', bbox_to_anchor=(1, 0.5))
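
The grouping above averages `Defense_Success` over every row in `plays`. To look at blitz plays alone, the rows can be filtered on the `Blitz` flag before grouping. A minimal sketch on hypothetical toy rows (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy plays
toy = pd.DataFrame({
    "down":            [1, 1, 3, 3, 3],
    "Blitz":           [1, 0, 1, 1, 0],
    "Defense_Success": [0, 1, 1, 0, 1],
})

# Keep only the plays flagged as a blitz, then average by down
blitz_only = toy[toy["Blitz"] == 1]
rate_by_down = blitz_only.groupby("down")["Defense_Success"].mean()
print(rate_by_down)
```

The same filter could be applied before any of the other groupings in this section.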

Graph 2 – Which Offensive Formation Was The Blitz Most Successful Against?

#offenseFormation
df2 = pd.DataFrame(plays.groupby(['offenseFormation'])['Defense_Success'].mean())
df2['Defense_Fail'] = 1-df2['Defense_Success']
df2 = df2.reset_index()

#Plot
df2.plot(x="offenseFormation", y=["Defense_Fail", "Defense_Success"], kind="bar")
plt.legend(loc='upper left', bbox_to_anchor=(1, 0.5))

If defenses know which offensive formations are more susceptible to the blitz, coordinators and coaches will have an easier time deciding whether to dial one up. Our data shows that the pistol formation had the best defensive success rate, at about 80%. Notably, only the I formation, pistol, and shotgun had defensive success rates over 50%. That said, pistol had by far the highest defense success rate, which suggests that blitzes should be dialed up when the offense is in that formation. Wildcat and jumbo were about 50-50 in terms of defensive success, and singleback and empty were the least effective, with defensive success rates between 30% and 45%.

Graph 3 – How Many PassRushers Should Teams Send?

#numberOfPassRushers
df3 = pd.DataFrame(plays.groupby(['numberOfPassRushers'])['Defense_Success'].mean())
df3['Defense_Fail'] = 1-df3['Defense_Success']
df3 = df3.reset_index()

#Plot
df3.plot(x="numberOfPassRushers", y=["Defense_Fail", "Defense_Success"], kind="bar")
plt.legend(loc='upper left', bbox_to_anchor=(1, 0.5))

As defined earlier, a blitz is when more than 4 pass rushers bring pressure to sack the quarterback or disrupt the play. The key question here is how many pass rushers the defense should send to maximize defensive success. When teams sent 4 or fewer pass rushers (not a blitz), there was less defensive success than when blitzes occurred with 5 or more pass rushers. One critical point, however, is that sending 7 pass rushers produced the worst defensive success percentage, at only around 20%, while sending 6 produced the highest, at around 60%. This indicates that a team can send too many pass rushers, and based on our data, that number is 7. A likely reason is that every extra pass rusher leaves wide receivers, tight ends, and running backs more open, because fewer defensive players are covering them. While the quarterback may have less time to throw, he may have an easier time finding an open teammate, which is consistent with what the data shows.
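
A rate such as the roughly 20% quoted for 7 pass rushers is easier to interpret alongside the number of plays behind it. A minimal sketch that reports both in one table, using hypothetical toy rows rather than the real `plays` data:

```python
import pandas as pd

# Hypothetical toy plays
toy = pd.DataFrame({
    "numberOfPassRushers": [4, 4, 4, 6, 6, 7],
    "Defense_Success":     [0, 1, 0, 1, 1, 0],
})

# Success rate and sample size for each number of pass rushers
summary = toy.groupby("numberOfPassRushers")["Defense_Success"].agg(["mean", "count"])
print(summary)
```

A group with very few plays can show an extreme rate by chance, so the `count` column helps judge how much weight each bar deserves.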

Graph 4 – Each Team’s Blitz Percentage

gameTeam = games[['gameId','homeTeamAbbr','visitorTeamAbbr','week']]

#Merge to get the Defense Team
game_play = plays.merge(gameTeam,on='gameId')
game_play = game_play[['possessionTeam','homeTeamAbbr','visitorTeamAbbr','week','Blitz']].copy()
#The defense is whichever team does not have possession
game_play['DefenseTeam'] = np.where(game_play['possessionTeam'] == game_play['homeTeamAbbr'],
                                    game_play['visitorTeamAbbr'],
                                    game_play['homeTeamAbbr'])

#Plot the percentage of Blitz for each team
fig, ax = plt.subplots(figsize = (12,3))
game_play.groupby(['DefenseTeam'])['Blitz'].mean().plot(kind = 'bar')
ax.set_title('% of Blitz for Each Team')

NFL teams and coaches have philosophies and formations that are part of their defensive identity, and some teams tend to blitz more than others. Comparing a team's blitz percentage with those of other teams gives that team more insight into its own tendencies and allows it to readjust based on its own metrics for success. Based on the graph above, the Houston Texans and Miami Dolphins have the highest blitz percentage at around 50%, and the Tennessee Titans and Buffalo Bills have the lowest at about 10%. This information is valuable not just to defenses but to offenses as well: if an offense knows the defensive team's tendencies, it can adjust its formations to counteract them.
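
The team on defense is whichever of the two teams in the game does not have possession. A minimal sketch of that rule with `np.where`, on hypothetical toy rows (the team abbreviations are used only as sample values):

```python
import numpy as np
import pandas as pd

# Hypothetical toy plays from a single game
toy = pd.DataFrame({
    "possessionTeam":  ["ATL", "PHI"],
    "homeTeamAbbr":    ["PHI", "PHI"],
    "visitorTeamAbbr": ["ATL", "ATL"],
})

# If the home team has the ball, the visitor defends; otherwise the home team defends
toy["DefenseTeam"] = np.where(
    toy["possessionTeam"] == toy["homeTeamAbbr"],
    toy["visitorTeamAbbr"],
    toy["homeTeamAbbr"],
)
print(toy)
```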

Graph 5 – How Many Defensive Backs Should Be On The Field In A Blitz?

#Def_DB
df5 = pd.DataFrame(plays.groupby(['Def_DB'])['Defense_Success'].mean())
df5['Defense_Fail'] = 1-df5['Defense_Success']
df5 = df5.reset_index()

#Remove DB = 2 to avoid confusion
df5 = df5[1:]

#Plot
df5.plot(x="Def_DB", y=["Defense_Fail", "Defense_Success"], kind="bar")
plt.legend(loc='upper left', bbox_to_anchor=(1, 0.5))
plt.title("Number of Defensive Backs vs. Defense Success Rate")
plt.xlabel("Number of Defensive Backs")

Next we analyze different positions to see which personnel groupings are most successful on a blitz, so that NFL teams can work out the right packages to send in. If an NFL team decides to blitz, our data suggests having exactly 3, 6, or 7 defensive backs on the field, as those are the numbers with the highest defensive success rates, all greater than 50%. While this may seem counterintuitive, it makes sense. A standard defense has at least 4 defensive backs: two cornerbacks and two safeties. Fewer than 4 defensive backs may indicate some sort of goal line formation, where blitzes are more typical and linebackers may be substituted in to rush the passer instead. With 6 or 7 defensive backs, not all of them will be blitzing, but a few typically might be, which puts considerable pressure on the quarterback. This depends on the game plan, design, and context of the play, so it will vary, but these insights are worth exploring further.
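
The plotting cell for this graph removes the DB = 2 row with `df5[1:]`, which depends on that row being first. Filtering on the value itself is a more explicit alternative. A minimal sketch on a hypothetical toy version of `df5` (the rates are invented for illustration):

```python
import pandas as pd

# Hypothetical toy version of df5
toy_df5 = pd.DataFrame({
    "Def_DB":          [2,   3,   4,   5],
    "Defense_Success": [1.0, 0.6, 0.4, 0.45],
})

# Drop the DB = 2 row by value rather than by position
filtered = toy_df5[toy_df5["Def_DB"] != 2]
print(filtered)
```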

Graph 6 – When Is The Best Time To Blitz?

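#SecondsToEndofQuar

#Overlay the two distributions and label them so they can be told apart
sns.distplot(plays[plays.Defense_Success == 0].SecondsToEndofQuar, label="Defense_Fail")
sns.distplot(plays[plays.Defense_Success == 1].SecondsToEndofQuar, label="Defense_Success")
plt.legend()
plt.title("Seconds to End of Quarter vs. Defense Success")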

Conventional wisdom usually holds that it is better to blitz in high-pressure situations. Based on the data, however, blitzes appear to be less successful when there are 200 seconds or fewer left in a quarter. Between 200 and 600 seconds, they appear to be more successful, and the success level varies between 600 and 900 seconds left in a quarter. This is an interesting finding because it suggests that high-pressure 3rd downs are more effective for blitzing than late-game timing. Coaches may be saving their most successful plays for this time, different formations may be used on the offensive side, and scores typically tighten up later in the game, when there is more of a sense of urgency. With the game generally on the line late, offenses appear to do better against the blitz, so blitzing in those moments may be inadvisable.
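
The clock ranges discussed above (0 to 200, 200 to 600, and 600 to 900 seconds) can be checked numerically by binning `SecondsToEndofQuar` and averaging `Defense_Success` within each bin. A minimal sketch with `pd.cut` on hypothetical toy rows (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy plays
toy = pd.DataFrame({
    "SecondsToEndofQuar": [50, 150, 300, 500, 700, 850],
    "Defense_Success":    [0,  0,   1,   1,   1,   0],
})

# Bin edges taken from the ranges discussed in the text
bins = [0, 200, 600, 900]
toy["clock_bin"] = pd.cut(toy["SecondsToEndofQuar"], bins=bins, include_lowest=True)

# Defensive success rate within each clock range
rate_by_bin = toy.groupby("clock_bin", observed=True)["Defense_Success"].mean()
print(rate_by_bin)
```

A table like this complements the overlaid density curves, since it gives one comparable number per range.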

Conclusion

In conclusion, we found several insights into the best situations in which to blitz. Combined with an NFL team's specific packages and formations, these insights can support a strong game plan grounded in the data. This could give teams an edge on game day and, in close games, might be the difference between winning and losing. Used well, this information could change how teams run their defenses.

We have additional insights that we would love to present to you and are looking forward to the opportunity to do so.