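I am trying to use the jieba package in R to segment the Chinese sentences in a "content" column into words, and then create a new column, "words", in which each row holds the segmented words for the corresponding row of "content".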
df$content (3 rows):
我喜歡吃雞翅;我不喜歡吃雞;哇這是什麼醬做得雞翅?
desired df$words (3 rows):
我 喜歡 吃 雞翅;我 不 喜歡 吃 雞;哇 這 是 什麼 醬 做 得 雞翅?
That is, the words column should have 3 rows, each one the segmented version of the matching row in the content column.
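To make the example reproducible, here is a minimal sketch of the toy data. It assumes the semicolons in the listing above only separate rows and are not part of the text, and that `df` is a plain data frame; the name `desired` is just a placeholder for the target column.

```r
# Toy data frame mirroring the three example rows above.
# Assumption: ";" in the listing separates rows and is not part of the text.
df <- data.frame(
  content = c("我喜歡吃雞翅", "我不喜歡吃雞", "哇這是什麼醬做得雞翅?"),
  stringsAsFactors = FALSE
)

# Hand-written target: one space-separated string per row of df$content.
desired <- c(
  "我 喜歡 吃 雞翅",
  "我 不 喜歡 吃 雞",
  "哇 這 是 什麼 醬 做 得 雞翅?"
)

# The new column needs exactly one entry per existing row.
stopifnot(length(desired) == nrow(df))
```

The point is that whatever produces `words` has to return one string per input row, not one element per word.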
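The jieba package segments the Chinese words well, but I am having trouble keeping the segmented words of each sentence within a single row. The jieba segmenter seems to segment the whole "content" column at once and then treat every word as its own row. I am really stuck on how to solve this. Do I need to change how the vector is recycled? Any help would be greatly appreciated.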
Here is my code:
df$words <- qseg <= df$content
It returns this error:
Error: Assigned data `df$words <- qseg <= df$content` must be compatible with existing data.
x Existing data has 29175 rows.
x Assigned data has 1327701 rows.
ℹ Only vectors of size 1 are recycled.
Run `rlang::last_error()` to see where the error occurred.
15. stop(fallback)
14. signal_abort(cnd)
13. cnd_signal(error_assign_incompatible_size(nrow, value, j, i_arg, value_arg))
12. (function (cnd) { cnd_signal(error_assign_incompatible_size(nrow, value, j, i_arg, value_arg)) ...
11. signalCondition(cnd)
10. signal_abort(cnd)
9. abort(message, class = c(class, "vctrs_error"), ...)
8. stop_vctrs(x_size = x_size, y_size = size, x_arg = x_arg, class = c("vctrs_error_incompatible_size", "vctrs_error_recycle_incompatible_size"))
7. stop_recycle_incompatible_size(x_size = 1327701L, size = 29175L, x_arg = "")
6. vec_recycle(value[[j]], nrow)
5. withCallingHandlers(for (j in seq_along(value)) { if (!is.null(value[[j]])) { value[[j]] <- vec_recycle(value[[j]], nrow) } ...
4. vectbl_recycle_rhs(value, fast_nrow(x), length(j), i_arg = NULL, value_arg)
3. tbl_subassign(x, i = NULL, as_string(name), list(value), i_arg = NULL, j_arg = name, value_arg = substitute(value))
2. `$<-.tbl_df`(`*tmp*`, testing, value = c("网友", "爆料", "网友", "在", "宝鸡", "贴", "吧", "发帖", "称", "有人", "在", "铁路", "门口", "摆放", "花圈", "灵堂", "抗议", "据", "未", "证实", "消息", "说", "期间", "新", "与", ...
1. `$<-`(`*tmp*`, testing, value = c("网友", "爆料", "网友", "在", "宝鸡", "贴", "吧", "发帖", "称", "有人", "在", "铁路", "门口", "摆放", "花圈", "灵堂", "抗议", "据", "未", "证实", "消息", "说", "期间", "new", "与", "争执", ...
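The row counts in the error (29175 existing rows against 1327701 assigned values) suggest the segmenter returned one flat vector of words for the whole column. One direction that might avoid this, as a sketch only: segment each row separately and paste its words back into a single string. This assumes the package is jiebaR (where `qseg` comes from) and uses its `worker()` and `segment()` functions; `segment_row` is a hypothetical helper name, and the toy data stands in for the real data frame.

```r
library(jiebaR)

# Toy data standing in for the real data frame (assumed layout).
df <- data.frame(
  content = c("我喜歡吃雞翅", "我不喜歡吃雞", "哇這是什麼醬做得雞翅?"),
  stringsAsFactors = FALSE
)

# A segmenter with default settings. With the defaults, punctuation may be
# dropped from the output, so a trailing "?" may not survive.
wk <- worker()

# Hypothetical helper: segment ONE sentence and collapse its words into
# a single space-separated string.
segment_row <- function(text, seg) {
  paste(segment(text, seg), collapse = " ")
}

# vapply() returns exactly one string per input row, so the result has
# the same length as the data frame and can be assigned as a column.
df$words <- vapply(df$content, segment_row, character(1),
                   seg = wk, USE.NAMES = FALSE)
```

The exact word boundaries depend on the dictionary in use, so the output may not match the hand-segmented target above word for word. On 29175 rows the per-row loop may also be slow.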