r - Stemming words in R

Question

I am having a hard time understanding how stemming works in R.

In my example, I created the following corpus object:

library(tm)  # stemming = TRUE below also needs the SnowballC package installed
a <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))

So a is

a[[1]]$content

[1] "device so much more funand  unlike most android torrent download clients"

The first word in that string is "device". I then created the term-document matrix

b <- TermDocumentMatrix(a, control = list(stemming = TRUE))

and got this as output:

dimnames(b)$Terms
[1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"      "much"     "torrent" 
[10] "unlik"

What I would like to know is why I lose the "e" in "device" and "unlike", but not in "more".

How can I avoid this happening for these and some other words?

Thanks.

score 0 · Accepted Answer

I assume you are using the tm and SnowballC packages.

Stemming in these packages uses the Porter stemming algorithm for English.
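As a rough illustration of why "device" and "unlike" lose their final "e" while "more" keeps it, the base R sketch below reimplements only the final-e rule (step 5a) of the original Porter algorithm. Under that rule the "e" is dropped when the measure m of the remaining stem (the number of vowel-to-consonant transitions) is greater than 1, or when m equals 1 and the stem does not end in consonant-vowel-consonant. The helper names (is_vowel, porter_measure, ends_cvc, drops_final_e) are made up for this example and are not part of tm or SnowballC, and the real stemmer runs several earlier steps that this sketch ignores.

```r
# Flag each letter as vowel (TRUE) or consonant (FALSE), Porter-style:
# a, e, i, o, u are vowels; y counts as a vowel only after a consonant.
is_vowel <- function(word) {
  chars <- strsplit(word, "")[[1]]
  v <- chars %in% c("a", "e", "i", "o", "u")
  for (i in seq_along(chars)) {
    if (chars[i] == "y" && i > 1 && !v[i - 1]) v[i] <- TRUE
  }
  v
}

# Porter's measure m: how many times a vowel is directly followed by a consonant.
porter_measure <- function(stem) {
  v <- is_vowel(stem)
  if (length(v) < 2) return(0L)
  sum(v[-length(v)] & !v[-1])
}

# Porter's *o condition: stem ends consonant-vowel-consonant,
# where the last consonant is not w, x or y.
ends_cvc <- function(stem) {
  v <- is_vowel(stem)
  n <- length(v)
  if (n < 3) return(FALSE)
  last <- substr(stem, n, n)
  !v[n - 2] && v[n - 1] && !v[n] && !(last %in% c("w", "x", "y"))
}

# Step 5a only: is the trailing "e" removed?
drops_final_e <- function(word) {
  stopifnot(endsWith(word, "e"))
  stem <- substr(word, 1, nchar(word) - 1)
  m <- porter_measure(stem)
  m > 1 || (m == 1 && !ends_cvc(stem))
}

sapply(c("device", "unlike", "more"), drops_final_e)
# device unlike   more
#   TRUE   TRUE  FALSE
```

Here "devic" and "unlik" both have m = 2, so the "e" goes, while "mor" has m = 1 and ends in consonant-vowel-consonant, so the "e" stays.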

If you want to see which stemming algorithms are available, you can run:

getStemLanguages()

and try the others. The only other English stemmer built in is this one:

wordStem(words, language = "english")

For your data, it returns the same result:

 [1] "android"  "client"   "devic"    "download" "funand"   "more"     "most"     "much"     "torrent" 
[10] "unlik"
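To check this without building a term-document matrix, the two English stemmers can be called directly on a character vector. This is a minimal sketch that assumes SnowballC is installed; the `words` vector is just a sample taken from the question.

```r
library(SnowballC)

# A few of the words from the question
words <- c("device", "unlike", "more", "clients")

# "porter" is the original Porter stemmer, "english" the revised one
porter  <- wordStem(words, language = "porter")
english <- wordStem(words, language = "english")

# Side-by-side comparison of the two stemmers
data.frame(word = words, porter = porter, english = english)
```

For these words both columns should agree, which matches the output above.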

score 0 · Accepted Answer

Another option is to use the MorphAdorner lemmatizer from Northwestern University. This answer has the code for the lemmatize(...) function.

library(tm)
a     <- Corpus(VectorSource("device so much more funand  unlike most android torrent download clients"))
words <- Terms(TermDocumentMatrix(a))
lemmatize(words)
#    android    clients     device   download     funand       more       most       much    torrent     unlike 
#  "android"   "client"   "device" "download"   "funand"     "more"     "most"     "much"  "torrent"   "unlike"

As you can see, it removes the "s" from "clients" but does not remove the "e" from "device".
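If a lemmatizer is not available, a cruder workaround (a different technique from lemmatization) is to keep stemming but exempt a hand-picked list of words. The sketch below assumes SnowballC is installed; `stem_except` and its `keep` argument are hypothetical names invented for this example, not functions from tm or SnowballC.

```r
library(SnowballC)

# Hypothetical helper: stem every word except those listed in `keep`.
stem_except <- function(words, keep, language = "english") {
  stemmed <- wordStem(words, language = language)
  protected <- words %in% keep
  stemmed[protected] <- words[protected]  # restore the protected words unchanged
  stemmed
}

words <- c("android", "clients", "device", "download", "unlike")
stem_except(words, keep = c("device", "unlike"))
# [1] "android"  "client"   "device"   "download" "unlike"
```

The obvious cost is that the protected list has to be maintained by hand, so this only suits a small, known vocabulary.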
