pandas - Reformat a pandas DataFrame containing tokens so that each token is in its own row

Question

Suppose I have a pandas DataFrame raw_corpus with the columns unique_ID and 'tokenized_recipes', as shown below:

unique_ID   tokenized_recipes
0   11530   ['photo', 'video', '500px', 'new', 'photo', 'from', 'anyone', 'tagged', 'with', 'phrase', 'change', 'new', 'tab', 'background', 'google', 'chrome', 'other']

1   17176   ['environment', 'control', 'monitoring', 'nest', 'protect', 'smoke', 'alarm', 'warning', 'activate', 'shortcut', 'wink', 'shortcuts', 'smart', 'hubs', 'systems']

2   6984    ['security', 'monitoring', 'systems', 'dlink', 'motion', 'sensor', 'motion', 'detected', 'post', 'to', 'channel', 'slack', 'communication']

I want to reorganize this data and write it to a tab-delimited csv, like this:

unique_ID   tokenized_recipes
11530       'photo'
11530       'video'
11530       '500px'
11530       'new'
 ...
17176       'environment'
17176       'control'
 ...

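I tried 2 of the solutions from the question linked above, which has 11 answers. I reordered the columns of my DataFrame to match the order those solutions expect.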

My DataFrame variable 'tokenized_recipes' is already a list.

The more complex, generic solution raised an error saying that I have a zero-dimensional array.

I then tried to explode the DataFrame id_token with this code and got NameError: name 'Series' is not defined.

# now explode the dataframe id_token string entry to separate rows
pd.concat([Series(row['unique_ID'], row['tokenized_recipes'].split(','))
           for _, row in id_token.iterrows()]).reset_index()
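
The NameError in the snippet above most likely occurs because only the pandas module is imported (as pd), so the bare name Series is not defined; pd.Series would resolve it. A second problem would then appear: .split(',') is a string method, and each 'tokenized_recipes' cell is already a list. A minimal sketch of two ways to get one token per row follows, assuming pandas 0.25 or later for DataFrame.explode. The sample frame is a shortened, hypothetical stand-in for raw_corpus, and the names long_form, via_concat, and tsv_text are made up for illustration.

```python
import pandas as pd

# Hypothetical, shortened stand-in for raw_corpus (token lists truncated)
raw_corpus = pd.DataFrame({
    "unique_ID": [11530, 17176],
    "tokenized_recipes": [
        ["photo", "video", "500px"],
        ["environment", "control"],
    ],
})

# Option 1: explode turns each list element into its own row and
# repeats unique_ID alongside it (requires pandas >= 0.25)
long_form = raw_corpus.explode("tokenized_recipes").reset_index(drop=True)

# Option 2: the concat approach from the question, with Series qualified
# as pd.Series and without .split(), since each cell is already a list
via_concat = pd.concat(
    [pd.Series(row["unique_ID"], index=row["tokenized_recipes"])
     for _, row in raw_corpus.iterrows()]
).reset_index()
via_concat.columns = ["tokenized_recipes", "unique_ID"]
via_concat = via_concat[["unique_ID", "tokenized_recipes"]]

# Tab-separated text without the index column
tsv_text = long_form.to_csv(sep="\t", index=False)
print(tsv_text)
```

To write a file instead of a string, pass a path as the first argument, for example long_form.to_csv("tokens.tsv", sep="\t", index=False), where the file name is only a placeholder. Option 1 is usually the shorter and faster choice; Option 2 is shown only because it stays close to the original attempt.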
