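I'm analyzing political ads from Facebook, a dataset published by ProPublica here.
I want to analyze the entire 'targets' column, but it is formatted so that each observation is a string containing a list of dicts, e.g. "[{k1: v1}, {k2: v2}]".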
import pandas as pd
data = {0: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]', 1: '[{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]', 2: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]', 3: '[]', 4: '[{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]'}
df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])
# display(df)
targets
0 [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]
1 [{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
2 [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]
3 []
4 [{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]
I need each "target" value to become a column header, with the corresponding "segment" value as a value in that column.
Or is the solution to write a function that reads each dictionary key in every row and counts the frequency?
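A minimal sketch of that frequency-counting alternative, assuming each observation is valid JSON that `json.loads` can parse. The `rows` list and the name `target_counts` are illustrative stand-ins, not part of the dataset:

```python
import json
from collections import Counter

# Illustrative stand-in for a few values of the 'targets' column
rows = [
    '[{"target": "Age", "segment": "18 and older"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]',
    '[]',
    '[{"target": "Age", "segment": "34 to 49"}, {"target": "Gender", "segment": "men"}]',
]

# Count how often each "target" value appears across all rows.
# Only the "target" key is read, so dicts without "segment" are fine here.
target_counts = Counter(d["target"] for raw in rows for d in json.loads(raw))
```

This only tallies which targets occur. It does not keep the segment values, so it would not produce one column per target.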
The desired output, with one column per target, should look like this:
NAge MinAge Retargeting Region ... Interest Location Granularity Country Gender
0 21 and older 21 people who may be similar to their customers the United States ... NaN NaN NaN NaN
1 18 and older 18 NaN NaN ... Republican Party (United States) country the United States NaN
2 18 and older 18 NaN NaN ... NaN country the United States women
Someone on Reddit posted this solution:
import json
for id,row in enumerate(df.targets):
    for d in json.loads(row):
        df.loc[id,d['target']] = d['segment']

df = df.drop(columns=['targets'])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-53-339ae1670258> in <module>
      2 for id,row in enumerate(df.targets):
      3     for d in json.loads(row):
----> 4         df.loc[id,d['target']] = d['segment']
      5
      6 df = df.drop(columns=['targets'])
KeyError: 'segment'
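The KeyError appears to come from entries such as {"target": "List"} in row 2, which have no "segment" key. Below is a minimal sketch of one possible workaround, assuming a missing segment can be treated as a missing value. The helper name `targets_to_columns` and the `sample` data are hypothetical and not from the Reddit answer:

```python
import json

import pandas as pd


def targets_to_columns(series):
    """Parse each JSON string into one row of {target: segment}."""
    rows = []
    for raw in series:
        # dict.get returns None when "segment" is absent, e.g. {"target": "List"}
        rows.append({d["target"]: d.get("segment") for d in json.loads(raw)})
    # One row per original observation; targets absent from a row become NaN
    return pd.DataFrame(rows, index=series.index)


# Illustrative stand-in for the 'targets' column
sample = pd.Series([
    '[{"target": "Age", "segment": "18 and older"}, {"target": "List"}]',
    '[]',
    '[{"target": "Gender", "segment": "men"}]',
])
wide = targets_to_columns(sample)
```

With the frame from the question, this could be called as `targets_to_columns(df['targets'])`. One caveat of this sketch: a target with no segment ends up as a missing value, so it cannot be told apart from a target that was not present in that row at all.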