10

希望提高我的数据科学技能。我正在练习从体育网站提取 url 数据,并且 json 文件有多个嵌套字典。我希望能够提取这些数据以在 matplotlib 等中映射我自己的自定义排行榜形式,但是很难将 json 转换为可行的 df。

主网站是:https ://www.usopen.com/scoring.html

看看背景,我相信实时信息是从下面的短代码中列出的链接中提取的。我正在使用 Jupyter 笔记本。我可以成功拉取数据。

但正如您所看到的,它正在拉取多个嵌套字典,这使得拉取简单的数据框变得非常困难。

只是在寻找球员,得分达到标准杆,总杆数和拉杆数。任何帮助将不胜感激,谢谢!

import pandas as pd
import urllib as ul
import json
url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
response = ul.request.urlopen(url)
data = json.loads(response.read())
print(data)
4

4 回答 4

11
  • 用于requests.get(url).json()获取数据
  • 用于pandas.json_normalizestandings密钥解包到数据框中
  • roundScores是一个字典列表
    • 该列表必须扩展为.explode
    • dicts 列必须再次规范化
  • 将规范化列连接回数据框df
import requests
import pandas as pd

# load the data
df = pd.json_normalize(requests.get(url).json(), 'standings')

# explode the roundScores column
df = df.explode('roundScores').reset_index(drop=True)

# normalize the dicts in roundScores and join back to df
df = df.join(pd.json_normalize(df.roundScores), rsuffix='_rs').drop(columns=['roundScores']).reset_index(drop=True)

# display(df.head())
   isRecapAvailable player.identifier player.firstName player.lastName player.image.gravity player.image.type                     player.image.identifier player.image.cropMode player.country.name player.country.code player.country.flag.type player.country.flag.identifier  player.isAmateur  toPar.value toPar.format toPar.displayValue  toParToday.value toParToday.format toParToday.displayValue  totalScore.value totalScore.format totalScore.displayValue  position.value position.format position.displayValue  holesThrough.value holesThrough.format holesThrough.displayValue liveVideo.identifier liveVideo.isLive  score.value score.format score.displayValue  toPar.value_rs toPar.format_rs toPar.displayValue_rs
0              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN           66     absolute                 66              -4        absolute                    -4
1              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN           74     absolute                 74               4        absolute                    +4
2              True             56278          Matthew           Wolff               center   imageCloudinary  us-open/players/2020-players/Matthew_Wolff                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -5     absolute                 -5                -5          absolute                      -5             140.0          absolute                     140               1        absolute                     1                  10            absolute                        10                  NaN              NaN            0     absolute                                 -5        absolute                    -5
3              True             34360          Patrick            Reed               center   imageCloudinary   us-open/players/2019-players/Patrick-Reed                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -4     absolute                 -4                 0          absolute                       E             136.0          absolute                     136               2        absolute                     2                   7            absolute                         7                  NaN              NaN           66     absolute                 66              -4        absolute                    -4
4              True             34360          Patrick            Reed               center   imageCloudinary   us-open/players/2019-players/Patrick-Reed                  fill       United States                 usa          imageCloudinary              us-open/flags/usa             False           -4     absolute                 -4                 0          absolute                       E             136.0          absolute                     136               2        absolute                     2                   7            absolute                         7                  NaN              NaN           70     absolute                 70               0        absolute                     E

附加键

  • standings只是下载的 JSON 中的键之一
r = requests.get(url).json()

print(r)
[out]:
dict_keys(['currentRound', 'standings', 'fullLegend', 'shortLegend', 'inlineLegend', 'cutLine', 'meta'])

资源

于 2020-09-19T20:10:40.657 回答
2

您确实需要编写一些自定义代码来从 json 中获取您想要的内容。但是,如果您想将一些播放器详细信息放入 df 中,这里有一些灵感。

df = pd.DataFrame([x['player'] for x in data['standings']])
df['image'] = df['image'].apply(lambda x: x['identifier'])
df['country'] = df['country'].apply(lambda x: x['name'])
于 2020-09-19T19:46:08.233 回答
2

你可能想试试这个:

import requests
import pandas as pd


url = "https://gripapi-static-pd.usopen.com/gripapi/leaderboard.json"
data = pd.DataFrame.from_dict(requests.get(url).json()['standings'])

print(data['totalScore'])

输出:

0      {'value': 140, 'format': 'absolute', 'displayV...
1      {'value': 136, 'format': 'absolute', 'displayV...
2      {'value': 140, 'format': 'absolute', 'displayV...
3      {'value': 138, 'format': 'absolute', 'displayV...
4      {'value': 138, 'format': 'absolute', 'displayV...
                             ...                        
于 2020-09-19T19:44:28.180 回答
2

简单快捷的解决方案。使用 pandas 的 JSON 标准化可能存在更好的解决方案,但这对您的用例来说相当不错。

def func(x):
    if not any(x.isnull()):
        return (x['round'], x['player']['firstName'], x['player']['identifier'], x['toParToday']['value'], x['totalScore']['value'])

df = pd.DataFrame(data['standings'])
df['round'] = data['currentRound']['name']
df = df[['player', 'toPar', 'toParToday', 'totalScore', 'round']]
info = df.apply(func, axis=1)
info_df = pd.DataFrame(list(info.values), columns=['Round', 'player_name', 'pid', 'to_par_today', 'totalScore'])
info_df.head()
于 2020-09-19T20:39:46.913 回答