0

我正在尝试将深度嵌套的 JSON 导入熊猫数据框。这是 JSON 文件的结构(这只是第一条记录 (retweets[:1]):

[{'lang': 'en',
  'author_id': '1076979440372965377',
  'reply_settings': 'everyone',
  'entities': {'mentions': [{'start': 3,
     'end': 17,
     'username': 'Terry81987010',
     'url': '',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    {'start': 19,
     'end': 32,
     'username': 'DineshDSouza',
     'location': 'United States',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
      'description': {'urls': [{'start': 80,
         'end': 103,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
     'created_at': '2009-11-22T22:32:41.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 1748832,
      'following_count': 5355,
      'tweet_count': 65674,
      'listed_count': 6966},
     'name': "Dinesh D'Souza",
     'verified': True,
     'pinned_tweet_id': '1393309917239562241',
     'id': '91882544',
     'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe: ",
     'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}]},
  'conversation_id': '1253462541881106433',
  'created_at': '2020-04-23T23:15:32.000Z',
  'id': '1253462541881106433',
  'possibly_sensitive': False,
  'referenced_tweets': [{'type': 'retweeted',
    'id': '1253052684489437184',
    'in_reply_to_user_id': '91882544',
    'attachments': {'media_keys': ['3_1253052312144293888',
      '3_1253052620937277442'],
     'media': [{}, {}]},
    'entities': {'annotations': [{'start': 126,
       'end': 128,
       'probability': 0.514,
       'type': 'Organization',
       'normalized_text': 'CDC'},
      {'start': 145,
       'end': 146,
       'probability': 0.5139,
       'type': 'Place',
       'normalized_text': 'NY'}],
     'mentions': [{'start': 0,
       'end': 13,
       'username': 'DineshDSouza',
       'location': 'United States',
       'entities': {'url': {'urls': [{'start': 0,
           'end': 23,
           'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
           'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
        'description': {'urls': [{'start': 80,
           'end': 103,
           'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
           'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
       'created_at': '2009-11-22T22:32:41.000Z',
       'protected': False,
       'public_metrics': {'followers_count': 1748832,
        'following_count': 5355,
        'tweet_count': 65674,
        'listed_count': 6966},
       'name': "Dinesh D'Souza",
       'verified': True,
       'pinned_tweet_id': '1393309917239562241',
       'id': '91882544',
       'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe:  ",
       'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}],
     'urls': [{'start': 187,
       'end': 210,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
       'display_url': 'pic.twitter.com/H4NpN5ZMkW'},
      {'start': 187,
       'end': 210,
       'expanded_url': 'https://twitter.com/Terry81987010/status/1253052684489437184/photo/1',
       'display_url': 'pic.twitter.com/H4NpN5ZMkW'}]},
    'lang': 'en',
    'author_id': '1091471553437593605',
    'reply_settings': 'everyone',
    'conversation_id': '1253050942716551168',
    'created_at': '2020-04-22T20:06:55.000Z',
    'possibly_sensitive': False,
    'referenced_tweets': [{'type': 'replied_to', 'id': '1253050942716551168'}],
    'public_metrics': {'retweet_count': 208,
     'reply_count': 57,
     'like_count': 402,
     'quote_count': 38},
    'source': 'Twitter Web App',
    'text': "@DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according to the CDC. And of course NY getting paid for each cv death $60,000",
    'context_annotations': [{'domain': {'id': '10',
       'name': 'Person',
       'description': 'Named people in the world like Nelson Mandela'},
      'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
     {'domain': {'id': '35',
       'name': 'Politician',
       'description': 'Politicians in the world, like Joe Biden'},
      'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
    'author': {'url': '',
     'username': 'Terry81987010',
     'location': 'Florida',
     'entities': {'description': {'hashtags': [{'start': 29,
         'end': 32,
         'tag': '2A'}]}},
     'created_at': '2019-02-01T23:01:11.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 520,
      'following_count': 567,
      'tweet_count': 34376,
      'listed_count': 1},
     'name': "Terry's Take",
     'verified': False,
     'id': '1091471553437593605',
     'description': 'Less government more Freedom #2A is a constitutional right. Trump2020, common sense rules, God bless America! Vet 82nd Airborne F/A, proud Republican',
     'profile_image_url': 'https://pbs.twimg.com/profile_images/1289626661911134208/WfztLkr1_normal.jpg'},
    'in_reply_to_user': {'username': 'DineshDSouza',
     'location': 'United States',
     'entities': {'url': {'urls': [{'start': 0,
         'end': 23,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]},
      'description': {'urls': [{'start': 80,
         'end': 103,
         'expanded_url': 'https://podcasts.apple.com/us/podcast/the-dinesh-dsouza-podcast/id1547827376',
         'display_url': 'podcasts.apple.com/us/podcast/the…'}]}},
     'created_at': '2009-11-22T22:32:41.000Z',
     'protected': False,
     'public_metrics': {'followers_count': 1748832,
      'following_count': 5355,
      'tweet_count': 65674,
      'listed_count': 6966},
     'name': "Dinesh D'Souza",
     'verified': True,
     'pinned_tweet_id': '1393309917239562241',
     'id': '91882544',
     'description': "I am an author, filmmaker, and host of the Dinesh D'Souza Podcast.\n\nSubscribe:  ",
     'profile_image_url': 'https://pbs.twimg.com/profile_images/890967538292711424/8puyFbiI_normal.jpg'}}],
  'public_metrics': {'retweet_count': 208,
   'reply_count': 0,
   'like_count': 0,
   'quote_count': 0},
  'source': 'Twitter for iPhone',
  'text': "RT @Terry81987010: @DineshDSouza Here's some proof of artificially inflating the cv deaths. Noone is dying of pneumonia anymore according t…",
  'context_annotations': [{'domain': {'id': '10',
     'name': 'Person',
     'description': 'Named people in the world like Nelson Mandela'},
    'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}},
   {'domain': {'id': '35',
     'name': 'Politician',
     'description': 'Politicians in the world, like Joe Biden'},
    'entity': {'id': '1138120064119369729', 'name': "Dinesh D'Souza"}}],
  'author': {'url': '',
   'username': 'set1952',
   'location': 'Etats-Unis',
   'created_at': '2018-12-23T23:14:42.000Z',
   'protected': False,
   'public_metrics': {'followers_count': 103,
    'following_count': 44,
    'tweet_count': 44803,
    'listed_count': 0},
   'name': 'SunSet1952',
   'verified': False,
   'id': '1076979440372965377',
   'description': '',
   'profile_image_url': 'https://abs.twimg.com/sticky/default_profile_images/default_profile_normal.png'},
  '__twarc': {'url': 'https://api.twitter.com/2/tweets/search/all?expansions=author_id%2Cin_reply_to_user_id%2Creferenced_tweets.id%2Creferenced_tweets.id.author_id%2Centities.mentions.username%2Cattachments.poll_ids%2Cattachments.media_keys%2Cgeo.place_id&user.fields=created_at%2Cdescription%2Centities%2Cid%2Clocation%2Cname%2Cpinned_tweet_id%2Cprofile_image_url%2Cprotected%2Cpublic_metrics%2Curl%2Cusername%2Cverified%2Cwithheld&tweet.fields=attachments%2Cauthor_id%2Ccontext_annotations%2Cconversation_id%2Ccreated_at%2Centities%2Cgeo%2Cid%2Cin_reply_to_user_id%2Clang%2Cpublic_metrics%2Ctext%2Cpossibly_sensitive%2Creferenced_tweets%2Creply_settings%2Csource%2Cwithheld&media.fields=duration_ms%2Cheight%2Cmedia_key%2Cpreview_image_url%2Ctype%2Curl%2Cwidth%2Cpublic_metrics&poll.fields=duration_minutes%2Cend_datetime%2Cid%2Coptions%2Cvoting_status&place.fields=contained_within%2Ccountry%2Ccountry_code%2Cfull_name%2Cgeo%2Cid%2Cname%2Cplace_type&max_results=500&query=retweets_of%3ATerry81987010&start_time=2020-03-09T00%3A00%3A00%2B00%3A00&end_time=2020-04-24T00%3A00%3A00%2B00%3A00',
   'version': '2.0.8',
   'retrieved_at': '2021-05-17T17:13:17+00:00'}},

这是我的代码:

retweets = []
for line in open('Data/usersRetweetsFlatten_sample.json', 'r'):
    retweets.append(json.loads(line))

df = json_normalize(
    retweets, 'referenced_tweets', ['referenced_tweets', 'type'],
    meta_prefix= ".",
    errors='ignore'
)
df[['author_id',  'type', '.type', 'id', 'in_reply_to_user_id', 'referenced_tweets']].head()

这是生成的数据框:

在此处输入图像描述

如您所见,该列referenced_tweets尚未展平(请注意,referenced_tweets我的 JSON 文件中有两个不同的数组:一个位于另一个“referenced_tweets”内部的更深层次)。例如,更高级别的返回:

 >>> retweets[0]["referenced_tweets"][0]["type"]
"retweeted"

而更深层次的返回:

>>> retweets[0]["referenced_tweets"][0]["referenced_tweets"][0]["type"]
'replied_to'

问题:我想知道如何才能使更深的referenced_tweets. 我想要两个单独的列作为referenced_tweets.typereferenced_tweets.id,上面示例中列的值referenced_tweets.type应该是replied_to

4

1 回答 1

1

我认为这里的问题是您的数据是双重嵌套的referenced_tweets...... referenced_tweets.

import json
from pandas import json_normalize

with open("flatten.json", "r") as file:
    data = json.load(file)

df = json_normalize(
    data, 
    record_path=["referenced_tweets", "referenced_tweets"],
    meta=[
        "author_id",
        
        # ["author", "username"], # not possible
        # "author", # possible but not useful
        
        ["referenced_tweets", "id"],
        ["referenced_tweets", "type"],
        ["referenced_tweets", "in_reply_to_user_id"],
        ["referenced_tweets", "in_reply_to_user", "username"], 
    ]
)

print(df)

在此处输入图像描述

另请参阅:https ://stackoverflow.com/a/37668569/42659


注意:如果缺少第二个嵌套,上面的代码将失败referenced_tweet


编辑:或者,您可以通过额外的手动迭代进一步规范化您的问题中的数据(您已经用代码部分规范化了这些数据)。请参见下面的示例。注意:代码未优化,根据数据量可能会很慢。

# load your `data` with `json.load()` or `json.loads()`

df = json_normalize(
    data, 
    record_path="referenced_tweets",
    meta=["referenced_tweets", "type"],
    meta_prefix= ".",
    errors="ignore",
)

columns = [*df.columns, "_type", "_id"]
normalized_data = []

def append(row, type, id):
    normalized_row = [*row.to_list(), type, id]
    normalized_data.append(normalized_row)

for _, row in df.iterrows():
    
    # a list/array is expected
    if type(row["referenced_tweets"]) is list:
        for tweet in row["referenced_tweets"]:
            append(row, tweet["type"], tweet["id"])
        # if empty list
        else:
            append(row, None, None)
    else:
        append(row, None, None)

enhanced_df = pd.DataFrame(data=normalized_data, columns=columns)
enhanced_df.drop("referenced_tweets", 1)

print(enhanced_df)

在此处输入图像描述


编辑2: referenced_tweets应该是一个数组。但是,如果没有引用的推文,Twitter API 似乎会referenced_tweets完全省略。在这种情况下,单元格值为NaN(float) 而不是空列表。我更新了上面的代码以考虑到这一点。

于 2021-08-06T22:22:07.803 回答