Extracting data from Understat
Jason Zivkovic
2024-11-15
Source:vignettes/extract-understat-data.Rmd
Overview
This package is designed to allow users to extract various world football results and player statistics from popular football (soccer) data sites, including FBref, Transfermarkt and Understat.
Installation
As at 2024-06-29, we are no longer including instructions to install from CRAN. The version pushed to CRAN is very much out of date, and with very regular updates to this library, we advise installing from GitHub only.
You can install the released version of worldfootballR from GitHub with:
# install.packages("devtools")
devtools::install_github("JaseZiv/worldfootballR")
Usage
Package vignettes have been built to help you get started with the package.
- For functions to extract data from FBref, see here
- For functions to extract data from Transfermarkt, see here
- For functions to extract data for international matches from FBref, see here
- For functions to load pre-scraped data, see here
This vignette covers the functions to extract data from understat.com.
Understat Helper Functions
Team Names
To get a list of all available team names for a selected league, use the understat_avalaible_teams() function.
You can pass the result of understat_avalaible_teams() directly to the understat_team_meta() function.
team_names <- understat_team_meta(team_name = understat_avalaible_teams(league = 'EPL'))
Team URLs
To get a list of all season team URLs for selected teams, use the understat_team_meta() function. Note that team names should match Understat.com's own spelling, so it is advisable to check the spelling there before passing names to the function:
team_urls <- understat_team_meta(team_name = c("Liverpool", "Manchester City"))
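The season team URLs used in the examples elsewhere in this vignette follow the shape https://understat.com/team/Manchester_City/2020. As a rough sketch only, a URL of that shape can be assembled by hand. Note that build_team_url() is a hypothetical helper, not part of worldfootballR, and it assumes that spaces in a team name become underscores, as in the Manchester_City example.

```r
# Hypothetical helper (not part of worldfootballR): builds a season team URL
# following the pattern seen in this vignette's examples.
build_team_url <- function(team_name, season_start_year) {
  # Assumption: spaces in the team name are replaced with underscores
  slug <- gsub(" ", "_", team_name, fixed = TRUE)
  paste0("https://understat.com/team/", slug, "/", season_start_year)
}

build_team_url(c("Liverpool", "Manchester City"), 2020)
```

Using understat_team_meta() remains the safer route, since it returns the URLs Understat actually serves.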
League Season-Level Data
This section will cover the functions to aid in the extraction of season league statistics from Understat.
The following leagues are currently supported by Understat (these values can be passed to the league argument of most understat_ functions):
- “EPL”
- “La liga”
- “Bundesliga”
- “Serie A”
- “Ligue 1”
- “RFPL”
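Because the league values are plain strings, a typo such as "La Liga" instead of "La liga" is easy to make. The sketch below checks a value against the list above before any scraping is attempted. check_league() is a hypothetical helper, not part of worldfootballR, and it assumes the values must match the listed spelling exactly.

```r
# League values as listed above
supported_leagues <- c("EPL", "La liga", "Bundesliga", "Serie A", "Ligue 1", "RFPL")

# Hypothetical helper (not part of worldfootballR): stop early on a bad value
check_league <- function(league) {
  if (!league %in% supported_leagues) {
    stop("Unsupported league '", league, "'. Use one of: ",
         paste(supported_leagues, collapse = ", "))
  }
  invisible(league)
}

check_league("EPL")        # passes silently
# check_league("La Liga")  # would raise an error: the listed value is "La liga"
```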
Match Results
Understat's match results include not only scores and expected goals (xG), but also forecast probabilities of a win, draw and loss for each match.
To extract the data, use the understat_league_match_results() function:
# to get the EPL results:
epl_results <- understat_league_match_results(league = "EPL", season_start_year = 2020)
dplyr::glimpse(epl_results)
#> Rows: 380
#> Columns: 18
#> $ league <chr> "EPL", "EPL", "EPL", "EPL", "EPL", "EPL", "EPL", "EPL", …
#> $ season <chr> "2020/2021", "2020/2021", "2020/2021", "2020/2021", "202…
#> $ match_id <chr> "14086", "14087", "14090", "14091", "14092", "14093", "1…
#> $ isResult <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ home_id <chr> "228", "78", "87", "81", "76", "82", "238", "220", "72",…
#> $ home_team <chr> "Fulham", "Crystal Palace", "Liverpool", "West Ham", "We…
#> $ home_abbr <chr> "FLH", "CRY", "LIV", "WHU", "WBA", "TOT", "SHE", "BRI", …
#> $ away_id <chr> "83", "74", "245", "86", "75", "72", "229", "80", "76", …
#> $ away_team <chr> "Arsenal", "Southampton", "Leeds", "Newcastle United", "…
#> $ away_abbr <chr> "ARS", "SOU", "LED", "NEW", "LEI", "EVE", "WOL", "CHE", …
#> $ home_goals <dbl> 0, 1, 4, 0, 0, 0, 0, 1, 5, 4, 1, 2, 2, 0, 0, 4, 1, 1, 2,…
#> $ away_goals <dbl> 3, 0, 3, 2, 3, 1, 2, 3, 2, 3, 3, 1, 5, 3, 2, 2, 0, 3, 3,…
#> $ home_xG <dbl> 0.126327, 1.395690, 3.154120, 0.861445, 0.352997, 0.8229…
#> $ away_xG <dbl> 2.162870, 1.262670, 0.269813, 1.659110, 2.955810, 1.2679…
#> $ datetime <chr> "2020-09-12 11:30:00", "2020-09-12 14:00:00", "2020-09-1…
#> $ forecast_win <dbl> 0.0037, 0.3916, 0.9658, 0.1506, 0.0070, 0.2200, 0.1683, …
#> $ forecast_draw <dbl> 0.0476, 0.3022, 0.0296, 0.2480, 0.0358, 0.2977, 0.2906, …
#> $ forecast_loss <dbl> 0.9487, 0.3062, 0.0046, 0.6014, 0.9572, 0.4823, 0.5411, …
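One possible use of the forecast columns is to turn them into expected points. The sketch below re-types the first four matches from the output above by hand and assumes the forecast_* columns are from the home team's perspective, which the source does not state. The same arithmetic agrees with the match stats output later in this vignette, where a home win chance of 0.18 and a draw chance of 0.22 sit alongside a home_xPTS of 0.76 (3 × 0.18 + 0.22).

```r
# First four matches from the output above, typed in by hand for illustration
results <- data.frame(
  home_team     = c("Fulham", "Crystal Palace", "Liverpool", "West Ham"),
  away_team     = c("Arsenal", "Southampton", "Leeds", "Newcastle United"),
  home_goals    = c(0, 1, 4, 0),
  away_goals    = c(3, 0, 3, 2),
  forecast_win  = c(0.0037, 0.3916, 0.9658, 0.1506),
  forecast_draw = c(0.0476, 0.3022, 0.0296, 0.2480),
  forecast_loss = c(0.9487, 0.3062, 0.0046, 0.6014)
)

# Assumption: forecast_* columns describe the home team's chances
results$home_xpts <- 3 * results$forecast_win + results$forecast_draw
results$away_xpts <- 3 * results$forecast_loss + results$forecast_draw

# Actual points won by the home team (3 for a win, 1 for a draw, 0 for a loss)
results$home_pts <- ifelse(results$home_goals > results$away_goals, 3,
                    ifelse(results$home_goals == results$away_goals, 1, 0))

results[, c("home_team", "away_team", "home_pts", "home_xpts", "away_xpts")]
```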
Season Shooting Locations
To get shooting locations for a whole season in supported leagues, use the understat_league_season_shots() function:
ligue1_shot_location <- understat_league_season_shots(league = "Ligue 1", season_start_year = 2020)
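If several seasons are needed, the single-season call can be repeated and the results stacked. The following is a minimal sketch rather than a recommended workflow: collect_seasons() and its pause_seconds argument are hypothetical, not part of worldfootballR, and the sketch assumes every season returns a data frame with the same columns.

```r
# Hypothetical helper (not part of worldfootballR): call `fetch_fun` once per
# season and stack the resulting data frames row-wise.
collect_seasons <- function(fetch_fun, league, season_start_years, pause_seconds = 3) {
  per_season <- lapply(season_start_years, function(yr) {
    Sys.sleep(pause_seconds)  # short pause between requests
    fetch_fun(league = league, season_start_year = yr)
  })
  # Assumption: all seasons share the same columns, so rbind() can stack them
  do.call(rbind, per_season)
}

# Not run here because it needs network access:
# ligue1_shots <- collect_seasons(understat_league_season_shots,
#                                 league = "Ligue 1",
#                                 season_start_years = 2019:2020)
```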
Match-Level Data
The following sections outline the functions available to extract data at the per-match level.
Match Shooting Locations
To get shooting locations for an individual match, use the understat_match_shots() function:
wba_liv_shots <- understat_match_shots(match_url = "https://understat.com/match/14789")
dplyr::glimpse(wba_liv_shots)
#> Rows: 36
#> Columns: 20
#> $ id <chr> "422440", "422441", "422442", "422450", "422456", "422…
#> $ minute <dbl> 9, 11, 14, 35, 46, 47, 50, 61, 70, 77, 2, 3, 5, 23, 26…
#> $ result <chr> "MissedShots", "MissedShots", "Goal", "BlockedShot", "…
#> $ X <dbl> 0.869, 0.965, 0.881, 0.883, 0.957, 0.712, 0.767, 0.942…
#> $ Y <dbl> 0.441, 0.460, 0.356, 0.336, 0.590, 0.403, 0.590, 0.626…
#> $ xG <dbl> 0.0313527, 0.1447450, 0.2382660, 0.2825390, 0.0260821,…
#> $ player <chr> "Semi Ajayi", "Okay Yokuslu", "Hal Robson-Kanu", "Hal …
#> $ home_away <chr> "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "a",…
#> $ player_id <chr> "4490", "6932", "1738", "1738", "964", "7153", "7153",…
#> $ situation <chr> "SetPiece", "SetPiece", "OpenPlay", "OpenPlay", "FromC…
#> $ season <chr> "2020", "2020", "2020", "2020", "2020", "2020", "2020"…
#> $ shotType <chr> "Head", "Head", "LeftFoot", "LeftFoot", "Head", "LeftF…
#> $ match_id <chr> "14789", "14789", "14789", "14789", "14789", "14789", …
#> $ home_team <chr> "West Bromwich Albion", "West Bromwich Albion", "West …
#> $ away_team <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", "L…
#> $ home_goals <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ away_goals <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
#> $ date <chr> "2021-05-16 15:30:00", "2021-05-16 15:30:00", "2021-05…
#> $ player_assisted <chr> "Matheus Pereira", "Darnell Furlong", "Matheus Pereira…
#> $ lastAction <chr> "Cross", "Chipped", "Pass", "HeadPass", "Aerial", "Sta…
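The X and Y columns in the output above all lie between 0 and 1. As a sketch under stated assumptions, they can be rescaled to metres for plotting or distance calculations. The code assumes X and Y are fractions of pitch length and width, that X = 1 is the goal line being attacked, and that the pitch is 105 m by 68 m. None of these are confirmed by the source. The three rows are re-typed by hand from the output above.

```r
# First three shots from the output above, typed in by hand for illustration
shots <- data.frame(
  player = c("Semi Ajayi", "Okay Yokuslu", "Hal Robson-Kanu"),
  X      = c(0.869, 0.965, 0.881),
  Y      = c(0.441, 0.460, 0.356),
  xG     = c(0.0313527, 0.1447450, 0.2382660)
)

# Assumed pitch dimensions in metres
pitch_length <- 105
pitch_width  <- 68

# Rescale the 0-1 coordinates to metres
shots$x_m <- shots$X * pitch_length
shots$y_m <- shots$Y * pitch_width

# Straight-line distance to the centre of the attacked goal,
# assumed to sit at (pitch_length, pitch_width / 2)
shots$dist_m <- sqrt((pitch_length - shots$x_m)^2 +
                     (pitch_width / 2 - shots$y_m)^2)

shots[, c("player", "x_m", "y_m", "dist_m", "xG")]
```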
Match Stats
To get the data from the stats table for an individual match, use the understat_match_stats() function:
wba_liv_stats <- understat_match_stats(match_url = "https://understat.com/match/14789")
dplyr::glimpse(wba_liv_stats)
#> Rows: 1
#> Columns: 20
#> $ match_id <int> 14789
#> $ home_team <chr> "West Bromwich Albion"
#> $ home_chances <dbl> 0.18
#> $ home_goals <int> 1
#> $ home_xG <dbl> 1.14
#> $ home_shots <int> 10
#> $ home_shot_on_target <int> 3
#> $ home_deep <int> 3
#> $ home_PPDA <dbl> 21.86
#> $ home_xPTS <dbl> 0.76
#> $ draw_chances <dbl> 0.22
#> $ away_team <chr> "Liverpool"
#> $ away_chances <dbl> 0.6
#> $ away_goals <int> 2
#> $ away_xG <dbl> 2.08
#> $ away_shots <int> 26
#> $ away_shot_on_target <int> 6
#> $ away_deep <int> 20
#> $ away_PPDA <dbl> 4.05
#> $ away_xPTS <dbl> 2.01
Match Players
To get the data for each player in an individual match, use the understat_match_players() function:
wba_liv_players <- understat_match_players(match_url = "https://understat.com/match/14789")
dplyr::glimpse(wba_liv_players)
#> Rows: 27
#> Columns: 23
#> $ match_id <int> 14789, 14789, 14789, 14789, 14789, 14789, 14789, 14789, …
#> $ id <int> 471471, 471472, 471474, 471473, 471475, 471476, 471477, …
#> $ team_id <int> 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, 76, …
#> $ home_away <chr> "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "…
#> $ player_id <int> 978, 4391, 964, 4490, 8905, 1737, 6932, 9040, 6651, 7153…
#> $ swap_id <int> 471471, 471472, 471474, 471473, 471475, 471476, 471477, …
#> $ player <chr> "Sam Johnstone", "Darnell Furlong", "Kyle Bartley", "Sem…
#> $ position <chr> "GK", "DR", "DC", "DC", "DL", "MR", "MC", "MC", "ML", "A…
#> $ positionOrder <int> 1, 2, 3, 3, 4, 8, 9, 9, 10, 12, 15, 17, 17, 17, 1, 2, 3,…
#> $ time_played <int> 90, 90, 90, 90, 90, 90, 80, 90, 78, 90, 87, 12, 10, 3, 9…
#> $ goals <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
#> $ own_goals <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ shots <int> 0, 1, 1, 2, 0, 0, 1, 0, 0, 2, 3, 0, 0, 0, 1, 4, 2, 1, 0,…
#> $ xG <dbl> 0.0000000, 0.0132741, 0.0260821, 0.0580258, 0.0000000, 0…
#> $ yellow_card <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ red_card <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
#> $ roster_in <int> 0, 0, 0, 0, 0, 0, 471484, 0, 471483, 0, 471482, 0, 0, 0,…
#> $ roster_out <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 471479, 471477, 471481,…
#> $ key_passes <int> 0, 2, 0, 1, 0, 0, 0, 1, 0, 4, 1, 0, 0, 0, 0, 5, 1, 0, 2,…
#> $ assists <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
#> $ xA <dbl> 0.0000000, 0.4272840, 0.0000000, 0.2957020, 0.0000000, 0…
#> $ xGChain <dbl> 0.2825390, 0.8165060, 0.0000000, 0.5339680, 0.2957020, 0…
#> $ xGBuildup <dbl> 0.2825390, 0.5339680, 0.0000000, 0.2382660, 0.2957020, 0…
Team Data
This section covers the functions to get team-level data from Understat.
Team Shooting Locations
To get all shots taken and conceded by a team during a season, use the understat_team_season_shots() function:
# for one team:
man_city_shots <- understat_team_season_shots(team_url = "https://understat.com/team/Manchester_City/2020")
dplyr::glimpse(man_city_shots)
#> Rows: 886
#> Columns: 20
#> $ id <chr> "378528", "378533", "378537", "378538", "378539", "378…
#> $ minute <dbl> 15, 40, 53, 55, 58, 59, 64, 73, 77, 86, 7, 10, 19, 29,…
#> $ result <chr> "BlockedShot", "MissedShots", "MissedShots", "BlockedS…
#> $ X <dbl> 0.789, 0.892, 0.860, 0.811, 0.822, 0.886, 0.869, 0.803…
#> $ Y <dbl> 0.564, 0.409, 0.501, 0.496, 0.398, 0.473, 0.259, 0.467…
#> $ xG <dbl> 0.03422860, 0.03680430, 0.10313500, 0.05339760, 0.0860…
#> $ player <chr> "Pedro Neto", "Raúl Jiménez", "Daniel Podence", "Rúben…
#> $ home_away <chr> "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "a",…
#> $ player_id <chr> "6382", "4105", "8291", "6853", "8291", "4105", "6853"…
#> $ situation <chr> "OpenPlay", "FromCorner", "OpenPlay", "OpenPlay", "Ope…
#> $ season <chr> "2020", "2020", "2020", "2020", "2020", "2020", "2020"…
#> $ shotType <chr> "LeftFoot", "Head", "LeftFoot", "LeftFoot", "RightFoot…
#> $ match_id <chr> "14105", "14105", "14105", "14105", "14105", "14105", …
#> $ home_team <chr> "Wolverhampton Wanderers", "Wolverhampton Wanderers", …
#> $ away_team <chr> "Manchester City", "Manchester City", "Manchester City…
#> $ home_goals <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ away_goals <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
#> $ date <chr> "2020-09-21 19:15:00", "2020-09-21 19:15:00", "2020-09…
#> $ player_assisted <chr> "Daniel Podence", "Adama Traoré", "Adama Traoré", "Ped…
#> $ lastAction <chr> "Pass", "Cross", "Pass", "Pass", "Chipped", "Cross", "…
Team Stat Breakdowns
To get a more granular breakdown of team shooting data for whole seasons, use the understat_team_stats_breakdown() function. This function returns a breakdown of team shooting data based on the following groupings:
- Situation
- Formation
- Game state
- Timing
- Shot zones
- Attack speed
- Result
#----- Can get data for single teams at a time: -----#
team_breakdown <- understat_team_stats_breakdown(team_urls = "https://understat.com/team/Liverpool/2020")
dplyr::glimpse(team_breakdown)
#> Rows: 34
#> Columns: 11
#> $ team_name <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", …
#> $ season_start_year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
#> $ stat_group_name <chr> "situation", "situation", "situation", "situation", …
#> $ stat_name <chr> "OpenPlay", "FromCorner", "SetPiece", "DirectFreekic…
#> $ shots <int> 466, 94, 23, 22, 6, 532, 33, 31, 13, 2, 302, 135, 10…
#> $ goals <int> 49, 11, 2, 0, 6, 59, 6, 1, 2, 0, 32, 15, 8, 11, 2, 6…
#> $ xG <dbl> 59.4529171, 9.1182853, 1.8527929, 1.3437825, 4.56701…
#> $ against.shots <int> 252, 40, 21, 12, 8, 296, 20, 9, 7, 1, 161, 80, 45, 3…
#> $ against.goals <int> 28, 6, 3, 1, 4, 38, 2, 1, 0, 1, 17, 12, 5, 3, 5, 8, …
#> $ against.xG <dbl> 33.1091621, 4.2281575, 3.9210222, 0.6303305, 6.08935…
#> $ time <int> NA, NA, NA, NA, NA, 3147, 216, 134, 81, 12, 1914, 73…
#----- Or for multiple teams: -----#
# team_urls <- c("https://understat.com/team/Liverpool/2020",
# "https://understat.com/team/Manchester_City/2020")
# team_breakdown <- understat_team_stats_breakdown(team_urls = team_urls)
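Because every grouping is stacked in one data frame, a typical first step is to filter on stat_group_name and derive rates from the shots, goals and xG columns. The sketch below re-types the first three "situation" rows from the Liverpool output above by hand. The derived columns conversion and goals_minus_xG are illustrative names, not columns returned by the package.

```r
# First three "situation" rows from the output above, typed in by hand
breakdown <- data.frame(
  stat_group_name = "situation",
  stat_name       = c("OpenPlay", "FromCorner", "SetPiece"),
  shots           = c(466, 94, 23),
  goals           = c(49, 11, 2),
  xG              = c(59.4529171, 9.1182853, 1.8527929)
)

# Keep a single grouping, then derive simple finishing measures
situation <- breakdown[breakdown$stat_group_name == "situation", ]
situation$conversion     <- situation$goals / situation$shots  # goals per shot
situation$goals_minus_xG <- situation$goals - situation$xG     # over/under-performance

situation[, c("stat_name", "shots", "goals", "conversion", "goals_minus_xG")]
```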
Player Data
This section will cover the functions available to aid in the extraction of player data.
Player Shooting Locations
To get shooting locations for all games a player has participated in (for as long as Understat has data), use the understat_player_shots() function:
raheem_sterling_shots <- understat_player_shots(player_url = "https://understat.com/player/618")
dplyr::glimpse(raheem_sterling_shots)
#> Rows: 686
#> Columns: 20
#> $ id <chr> "14490", "14491", "14496", "14497", "14779", "15104", …
#> $ minute <dbl> 20, 22, 47, 53, 8, 7, 69, 74, 65, 81, 19, 25, 47, 50, …
#> $ result <chr> "SavedShot", "Goal", "SavedShot", "MissedShots", "Miss…
#> $ X <dbl> 0.853, 0.856, 0.816, 0.745, 0.857, 0.959, 0.940, 0.968…
#> $ Y <dbl> 0.695, 0.496, 0.377, 0.443, 0.470, 0.615, 0.524, 0.646…
#> $ xG <dbl> 0.0407033, 0.3114090, 0.0576012, 0.0254811, 0.0726696,…
#> $ player <chr> "Raheem Sterling", "Raheem Sterling", "Raheem Sterling…
#> $ home_away <chr> "h", "h", "h", "h", "a", "a", "a", "a", "h", "h", "a",…
#> $ player_id <chr> "618", "618", "618", "618", "618", "618", "618", "618"…
#> $ situation <chr> "OpenPlay", "OpenPlay", "OpenPlay", "OpenPlay", "OpenP…
#> $ season <chr> "2014", "2014", "2014", "2014", "2014", "2014", "2014"…
#> $ shotType <chr> "LeftFoot", "RightFoot", "RightFoot", "RightFoot", "Ri…
#> $ match_id <chr> "4756", "4756", "4756", "4756", "4768", "4777", "4777"…
#> $ home_team <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", "M…
#> $ away_team <chr> "Southampton", "Southampton", "Southampton", "Southamp…
#> $ home_goals <dbl> 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 3, 3, 3, 3, 1, 1, 1, 1, …
#> $ away_goals <dbl> 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ date <chr> "2014-08-17 13:30:00", "2014-08-17 13:30:00", "2014-08…
#> $ player_assisted <chr> "Philippe Coutinho", "Jordan Henderson", "Jordan Hende…
#> $ lastAction <chr> "Pass", "Throughball", "Pass", "Pass", "Chipped", "Pas…
Team Player Season Stats
To get stats for all players of selected teams, run the understat_team_players_stats() function.
Note: Team URLs can be extracted using understat_team_meta().
team_players <- understat_team_players_stats(team_url = c("https://understat.com/team/Liverpool/2020", "https://understat.com/team/Manchester_City/2020"))
dplyr::glimpse(team_players)
#> Rows: 52
#> Columns: 19
#> $ season <chr> "2020/2021", "2020/2021", "2020/2021", "2020/2021", "2020…
#> $ player_id <dbl> 1250, 838, 482, 6854, 771, 1791, 229, 332, 605, 833, 966,…
#> $ player_name <chr> "Mohamed Salah", "Sadio Mané", "Roberto Firmino", "Diogo …
#> $ games <dbl> 37, 35, 36, 19, 38, 36, 24, 10, 21, 5, 13, 33, 38, 24, 17…
#> $ time <dbl> 3085, 2805, 2882, 1114, 2961, 3040, 1865, 701, 1710, 370,…
#> $ goals <dbl> 22, 11, 9, 9, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0…
#> $ xG <dbl> 20.2508505, 14.8285516, 12.8602165, 7.0577230, 2.8174270,…
#> $ assists <dbl> 5, 7, 7, 0, 0, 7, 0, 2, 1, 0, 1, 0, 7, 2, 1, 0, 0, 1, 0, …
#> $ xA <dbl> 6.5285276, 7.7877541, 6.1168645, 1.7625196, 1.6629221, 8.…
#> $ shots <dbl> 126, 94, 83, 46, 31, 55, 22, 5, 14, 4, 8, 1, 19, 19, 15, …
#> $ key_passes <dbl> 55, 61, 44, 12, 21, 77, 30, 3, 14, 0, 2, 0, 65, 12, 7, 0,…
#> $ yellow_cards <dbl> 0, 3, 2, 2, 1, 2, 4, 2, 0, 1, 0, 1, 2, 2, 2, 0, 0, 3, 0, …
#> $ red_cards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ position <chr> "F M S", "F M S", "F M S", "F M S", "M S", "D S", "M S", …
#> $ team_name <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", "Live…
#> $ npg <dbl> 16, 11, 9, 9, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0…
#> $ npxG <dbl> 15.6838341, 14.8285516, 12.8602165, 7.0577230, 2.8174270,…
#> $ xGChain <dbl> 28.9682294, 24.9989162, 25.2714681, 10.9729662, 13.922178…
#> $ xGBuildup <dbl> 9.8002365, 6.0576597, 10.1985496, 4.0760983, 10.4762759, …
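Season totals are hard to compare between players with different playing time, so a common follow-up is to convert them to per-90-minute rates. The sketch below re-types the first three Liverpool rows from the output above by hand and assumes the time column is minutes played. The goals_p90 and xG_p90 columns are illustrative names, not columns returned by the package.

```r
# First three players from the output above, typed in by hand for illustration
players <- data.frame(
  player_name = c("Mohamed Salah", "Sadio Mané", "Roberto Firmino"),
  time        = c(3085, 2805, 2882),
  goals       = c(22, 11, 9),
  xG          = c(20.2508505, 14.8285516, 12.8602165)
)

# Assumption: `time` is minutes played, so multiply by 90 for a per-match rate
players$goals_p90 <- players$goals / players$time * 90
players$xG_p90    <- players$xG / players$time * 90

# Sort by goals per 90, highest first
players[order(-players$goals_p90), ]
```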