python - json_normalize produces a KeyError when trying to extract certain attributes

Question

This is a subset of my JSON file:

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

I want to load it into a DataFrame with one row for each question and answer pair.

Python code:

from pandas import json_normalize
import json

fields = ['text','answers.text']

with open(R'response.json') as f:
    d = json.load(f)

data = json_normalize(d['data'],['questions'],errors='ignore')
data = data[fields]

print(data)

This produces a KeyError:

KeyError: "['answers.text'] not in index"

I've been at this for hours and cannot figure it out. I feel like it should be simple, but it never is.

score 1 · Accepted Answer

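Use record_path, meta, and record_prefix so that d can be normalized in one pass.
- pd.json_normalize raises a ValueError when key names overlap between record_path and meta; here 'id' and 'text' appear in both.
- Without record_prefix, the error is: ValueError: Conflicting metadata name id, need distinguishing prefix
- The KeyError occurs because 'answers.text' is not in the DataFrame that .json_normalize() builds from d; the 'answers' list is left as a single column rather than expanded into 'answers.text'.
- If you do not need some of the top-level keys in df, remove them from meta.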
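Both failure modes described above can be reproduced on a trimmed copy of the data. The following is a minimal sketch, assuming a shortened `d` with only two answers; the names `attempt` and `conflict` are illustrative and not part of the original post.

```python
import pandas as pd

# Trimmed copy of the question's data (assumption: fewer keys and answers).
d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]}}

# The original attempt yields one row per question; 'answers' stays a
# column of lists, so no 'answers.text' column is ever created.
attempt = pd.json_normalize(d['data'], ['questions'], errors='ignore')
print(attempt.columns.tolist())

# Without record_prefix, 'id' and 'text' exist in both the records and meta.
conflict = None
try:
    pd.json_normalize(d['data']['questions'],
                      record_path=['answers'],
                      meta=['id', 'text'])
except ValueError as exc:
    conflict = str(exc)
    print(conflict)
```

Printing the column list before indexing with `fields` is a quick way to confirm which names actually exist.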

import pandas as pd

# normalize d
df = pd.json_normalize(data=d['data']['questions'],
                       record_path= ['answers'],
                       meta=['id', 'text', 'instructionalText', 'minimumResponses', 'maximumResponses', 'sortOrder'],
                       record_prefix='answers_')

# display(df)
   answers_id answers_text answers_parentId    id         text     instructionalText minimumResponses maximumResponses sortOrder
0      362949    Answer #1             None  6574  Question #1                                      0             None         1
1      362950    Answer #2             None  6574  Question #1                                      0             None         1
2      362951    Answer #3             None  6574  Question #1                                      0             None         1
3      362952    Answer #4             None  6574  Question #1                                      0             None         1
4      262949    Answer #1             None  4756  Question #2  No cheating, cheater                0             None         1
5      262950    Answer #2             None  4756  Question #2  No cheating, cheater                0             None         1
6      262951    Answer #3             None  4756  Question #2  No cheating, cheater                0             None         1
7      262952    Answer #4             None  4756  Question #2  No cheating, cheater                0             None         1
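If only the question text and the answer text are wanted, as in the question's `fields` list, meta can be reduced to the single key needed. This is a sketch under the assumption of a shortened `d`; the column selection at the end is one possible choice, not part of the original answer.

```python
import pandas as pd

# Trimmed copy of the question's data (assumption: fewer keys and answers).
d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]}}

# Keep only the question's 'text' as metadata; every answer key gets the prefix.
df = pd.json_normalize(data=d['data']['questions'],
                       record_path=['answers'],
                       meta=['text'],
                       record_prefix='answers_')

# Select the two columns that correspond to 'text' and 'answers.text'.
df = df[['text', 'answers_text']]
print(df)
```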

Extended test data

d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'instructionalText': '',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 362951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 362952, 'text': 'Answer #4', 'parentId': None}]},
                            {'id': 4756,
                             'text': 'Question #2',
                             'instructionalText': 'No cheating, cheater',
                             'minimumResponses': 0,
                             'maximumResponses': None,
                             'sortOrder': 1,
                             'answers': [{'id': 262949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 262950, 'text': 'Answer #2', 'parentId': None},
                                         {'id': 262951, 'text': 'Answer #3', 'parentId': None},
                                         {'id': 262952, 'text': 'Answer #4', 'parentId': None}]}]}}

Regarding the other answer, .apply(pd.Series) is not recommended because it is very slow.
- See the timing analysis in SO: Splitting dictionary/list inside a Pandas Column into separate Columns
- 53 minutes for 10M rows
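For readers who still prefer the explode approach, one common way to avoid the row-by-row .apply(pd.Series) is to build the answer columns with a single pd.DataFrame call. This is a minimal sketch, not taken from the linked post; it assumes every question has at least one answer (an empty list would explode to NaN and would need separate handling), and the names `answers` and `out` are illustrative.

```python
import pandas as pd

# Trimmed copy of the question's data (assumption: fewer keys and answers).
d = {'data': {'questions': [{'id': 6574,
                             'text': 'Question #1',
                             'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
                                         {'id': 362950, 'text': 'Answer #2', 'parentId': None}]}]}}

# One row per answer; each cell of 'answers' now holds a single dict.
df = pd.json_normalize(d['data']['questions']).explode('answers').reset_index(drop=True)

# Expand all dicts in one call instead of creating one pd.Series per row.
answers = pd.DataFrame(df['answers'].tolist()).add_suffix('_ans')

# Both frames share the same 0..n-1 index, so a plain join lines them up.
out = df.drop(columns='answers').join(answers)
print(out)
```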

score 0 · Accepted Answer

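This is the technique I usually use:

1. json_normalize() the top-level list
2. explode() the child list, then reset_index() in preparation for step 3
3. expand the dict inside the child list with apply(pd.Series)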

import pandas as pd

d = {'data': {'questions': [{'id': 6574,
    'text': 'Question #1',
    'instructionalText': '',
    'minimumResponses': 0,
    'maximumResponses': None,
    'sortOrder': 1,
    'answers': [{'id': 362949, 'text': 'Answer #1', 'parentId': None},
     {'id': 362950, 'text': 'Answer #2', 'parentId': None},
     {'id': 362951, 'text': 'Answer #3', 'parentId': None},
     {'id': 362952, 'text': 'Answer #4', 'parentId': None}]}]}}

df = pd.json_normalize(d["data"]["questions"]).explode("answers").reset_index(drop=True)
df = df.join(df["answers"].apply(pd.Series), rsuffix="_ans").drop(columns="answers")

     id         text  sortOrder  id_ans   text_ans
0  6574  Question #1          1  362949  Answer #1
1  6574  Question #1          1  362950  Answer #2
2  6574  Question #1          1  362951  Answer #3
3  6574  Question #1          1  362952  Answer #4
