
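I'm analyzing political ads from Facebook, a dataset released here by ProPublica.

I want to analyze the entire 'targets' column, but it is formatted so that every observation is a string representation of a list of dicts (e.g. "[{k1: v1}, {k2: v2}]").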

import pandas as pd

data = {0: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]', 1: '[{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]', 2: '[{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]', 3: '[]', 4: '[{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]'}

df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])

# display(df)
                                                                                                                                                                                                                                                                            targets
0                                                   [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Segment", "segment": "Multicultural affinity: African American (US)."}, {"target": "Region", "segment": "the United States"}]
1                                                 [{"target": "Age", "segment": "45 and older"}, {"target": "MinAge", "segment": "45"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}]
2                                                                                                                               [{"target": "Age", "segment": "18 and older"}, {"target": "MinAge", "segment": "18"}, {"target": "Region", "segment": "Texas"}, {"target": "List"}]
3                                                                                                                                                                                                                                                                                []
4  [{"target": "Interest", "segment": "The Washington Post"}, {"target": "Gender", "segment": "men"}, {"target": "Age", "segment": "34 to 49"}, {"target": "MinAge", "segment": "34"}, {"target": "MaxAge", "segment": "49"}, {"target": "Region", "segment": "the United States"}]

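I need to separate each "target" value into a column header, with each corresponding "segment" value as a value in that column.

Or is the solution to create a function that calls each dictionary key in each row and counts the frequency?

This is what the output should look like: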

           NAge MinAge                                   Retargeting             Region  ...                          Interest Location Granularity            Country Gender
0  21 and older     21  people who may be similar to their customers  the United States  ...                               NaN                  NaN                NaN    NaN
1  18 and older     18                                           NaN                NaN  ...  Republican Party (United States)              country  the United States    NaN
2  18 and older     18                                           NaN                NaN  ...                               NaN              country  the United States  women

Someone on Reddit posted this solution:

import json

for id,row in enumerate(df.targets):
    for d in json.loads(row):
        df.loc[id,d['target']] = d['segment']

df = df.drop(columns=['targets'])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-53-339ae1670258> in <module>
      2 for id,row in enumerate(df.targets):
      3     for d in json.loads(row):
----> 4         df.loc[id,d['target']] = d['segment']
      5 
      6 df = df.drop(columns=['targets'])

KeyError: 'segment'
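
The KeyError appears to come from entries such as {"target": "List"} in row 2 of the sample, which have no "segment" key. A minimal sketch of one way around it, assuming a missing segment should simply be recorded as a missing value (the shortened sample data and the name `wide` are illustrative, not part of the dataset):

```python
import json

import pandas as pd

# Shortened sample; row 1 contains a dict with no "segment" key
data = {
    0: '[{"target": "Age", "segment": "18 and older"}, {"target": "Region", "segment": "Texas"}]',
    1: '[{"target": "MinAge", "segment": "18"}, {"target": "List"}]',
    2: '[]',
}
df = pd.DataFrame.from_dict(data, orient='index', columns=['targets'])

rows = []
for raw in df['targets']:
    # dict.get returns None instead of raising KeyError when "segment" is absent
    rows.append({d['target']: d.get('segment') for d in json.loads(raw)})

# One column per target; rows that lack a target get a missing value there
wide = pd.DataFrame(rows, index=df.index)
print(wide)
```

Building the rows first and creating the frame once also avoids growing `df` one cell at a time with `df.loc`.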

1 Answer

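  • def fix() is not vectorized; even so, it takes only 591 ms to apply to the 222186 rows in the file.
  • Replace the NaNs in the column using .fillna(); otherwise literal_eval raises ValueError: malformed node or string: nan
  • Replace 'null' with 'None'; otherwise literal_eval raises ValueError: malformed node or string: <_ast.Name object at 0x000002219927A0A0>
  • The values in the rows of 'targets' are all of type str, and can be converted to lists with ast.literal_eval
  • def fix() iterates through the dicts in each list, then uses only the values to create a key-value pair in a dict, thereby converting each list of dicts into a single dict.
    • Empty lists are replaced with empty dicts, which is required for .json_normalize() to work on the column.
  • pandas.json_normalize() can then easily be used on the column.
  • Also see How to split a pandas column with a list of dicts into separate columns for each key for an alternative approach with the same data.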
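
The .fillna() and 'null' replacement steps above can be illustrated on a tiny, hypothetical Series (the three rows below are made up for the example). Since the sample strings look like valid JSON, json.loads, which the Reddit snippet in the question already uses, may be a simpler route; that assumes every row in the file is valid JSON, which is not verified here.

```python
import json
from ast import literal_eval

import pandas as pd

# Hypothetical sample: one normal row, one missing value, one JSON null
targets = pd.Series([
    '[{"target": "Age", "segment": "18 and older"}, {"target": "List"}]',
    None,
    '[{"target": "Region", "segment": null}]',
])

# Missing values become the string '[]' so every row can be parsed
filled = targets.fillna('[]')

# Route 1 (as in this answer): turn the text into a valid Python literal, then evaluate it
via_literal_eval = filled.str.replace('null', 'None', regex=False).apply(literal_eval)

# Route 2 (assumes every row is valid JSON): json.loads maps null to None by itself
via_json = filled.apply(json.loads)

print(via_json.tolist())
```

One difference worth keeping in mind: the plain string replacement in route 1 would also rewrite a segment whose text happens to contain "null", while route 2 leaves string contents untouched.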
import pandas as pd
from ast import literal_eval

# load the file
df = pd.read_csv('en-US.csv')

# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')

# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')

# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)

# function to transform the list of dicts in each row
def fix(col):
    dd = dict()
    for d in col:
        values = list(d.values())
        if len(values) == 2:
            dd[values[0]] = values[1]
    return dd

# apply the function to targets
df.targets = df.targets.apply(fix)

# display(df.targets.head())
                                                                                                                                  targets
0     {'Age': '18 and older', 'MinAge': '18', 'Segment': 'Multicultural affinity: African American (US).', 'Region': 'the United States'}
1   {'Age': '45 and older', 'MinAge': '45', 'Retargeting': 'people who may be similar to their customers', 'Region': 'the United States'}
2                                                                              {'Age': '18 and older', 'MinAge': '18', 'Region': 'Texas'}
3                                                                                                                                      {}
4  {'Interest': 'The Washington Post', 'Gender': 'men', 'Age': '34 to 49', 'MinAge': '34', 'MaxAge': '49', 'Region': 'the United States'}

# normalize the targets column
normalized = pd.json_normalize(df.targets)

# join normalized back to df if desired
df = df.join(normalized).drop(columns=['targets'])

normalized in wide format, for the sample data

# display(normalized.head())
            Age MinAge                                         Segment             Region                                   Retargeting             Interest Gender MaxAge
0  18 and older     18  Multicultural affinity: African American (US).  the United States                                           NaN                  NaN    NaN    NaN
1  45 and older     45                                             NaN  the United States  people who may be similar to their customers                  NaN    NaN    NaN
2  18 and older     18                                             NaN              Texas                                           NaN                  NaN    NaN    NaN
3           NaN    NaN                                             NaN                NaN                                           NaN                  NaN    NaN    NaN
4      34 to 49     34                                             NaN  the United States                                           NaN  The Washington Post    men     49

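normalized in wide format, for the full dataset

  • As can be seen from .info(), the targets column contains many different keys, but not every row contains every key, so there are many NaNs.
  • To get unique value counts with the data in this wide format, use something like normalized.Age.value_counts().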
print(normalized.info())
[out]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222186 entries, 0 to 222185
Data columns (total 26 columns):
 #   Column                           Non-Null Count   Dtype 
---  ------                           --------------   ----- 
 0   Age                              157816 non-null  object
 1   MinAge                           156531 non-null  object
 2   Segment                          12288 non-null   object
 3   Region                           111638 non-null  object
 4   Retargeting                      39286 non-null   object
 5   Interest                         31514 non-null   object
 6   Gender                           7194 non-null    object
 7   MaxAge                           7767 non-null    object
 8   City                             23685 non-null   object
 9   State                            23685 non-null   object
 10  Website                          6235 non-null    object
 11  Language                         2584 non-null    object
 12  Audience Owner                   17859 non-null   object
 13  Location Granularity             29770 non-null   object
 14  Location Type                    29770 non-null   object
 15  Agency                           400 non-null     object
 16  List                             5034 non-null    object
 17  Custom Audience Match Key        1144 non-null    object
 18  Mobile App                       50 non-null      object
 19  Country                          22118 non-null   object
 20  Activity on the Facebook Family  3382 non-null    object
 21  Like                             855 non-null     object
 22  Education                        151 non-null     object
 23  Job Title                        15 non-null      object
 24  Relationship Status              22 non-null      object
 25  Employer                         4 non-null       object
dtypes: object(26)
memory usage: 44.1+ MB
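
As a sketch of the frequency counts asked about in the question, the snippet below uses a small hypothetical stand-in for `normalized` (three columns taken from the sample rows); the names `target_counts`, `age_counts`, and `long_counts` are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for `normalized`, based on the sample rows
normalized = pd.DataFrame({
    'Age': ['18 and older', '45 and older', '18 and older', None, '34 to 49'],
    'Region': ['the United States', 'the United States', 'Texas', None, 'the United States'],
    'Gender': [None, None, None, None, 'men'],
})

# How many rows use each target (the Non-Null Count shown by .info())
target_counts = normalized.count().sort_values(ascending=False)

# Frequency of each segment within a single target
age_counts = normalized['Age'].value_counts()

# Long format: one row per (target, segment) pair, counted across all columns
long_counts = (
    normalized.melt(var_name='target', value_name='segment')
    .dropna(subset=['segment'])
    .groupby(['target', 'segment'])
    .size()
)

print(target_counts, age_counts, long_counts, sep='\n\n')
```

The long-format count is only one option; calling value_counts() per column gives the same numbers one target at a time.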
answered 2021-01-08T06:13:36.113