A few points in your question are unclear to me; still, here is a proposed solution. Check whether it suits your needs.
I assume you want to find the best-matching label for each weather text. If so, you can use the stringsim function from library(stringdist) as follows.
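First note: the results will be more accurate if you clean the \n characters out of your data. I removed them in this example, but you can keep them if you prefer.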
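As a minimal sketch of that cleaning step, the snippet below collapses line breaks and repeated whitespace with base R. The raw strings are an assumption about what the uncleaned data might look like (the original input is not shown here), and the variable names raw.synonym and clean.synonym are made up for illustration.

```r
# Hypothetical raw input: synonyms separated by "\n" (an assumption,
# not the asker's actual data).
raw.synonym <- c(
  "hot and drizzling\nsunny and raining",
  "sunny and clear sky\ndry sunny day",
  "cold winds and raining\nsnowing"
)

# Replace every run of whitespace (including "\n") with a single space,
# then strip leading and trailing whitespace.
clean.synonym <- trimws(gsub("[[:space:]]+", " ", raw.synonym))

clean.synonym
```

The cleaned strings can then be used as the synonym column of the lookup data frame.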
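Second note: you can change the similarity measure by choosing a different method. Here I used cosine similarity, which is a reasonably good starting point. For the alternatives, see the function's reference: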
?stringsim
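To make the cosine method less of a black box, here is a rough base-R illustration of cosine similarity on q-gram counts. It is a sketch of the idea only, not the stringdist implementation, and the helper names qgram.counts and cosine.sim are hypothetical.

```r
# Count the overlapping q-grams (substrings of length q) in one string.
qgram.counts <- function(s, q = 2) {
  n <- nchar(s)
  if (n < q) return(table(character(0)))
  grams <- substring(s, 1:(n - q + 1), q:n)
  table(grams)
}

# Cosine similarity between the q-gram count vectors of two strings:
# dot product divided by the product of the vector lengths.
cosine.sim <- function(a, b, q = 2) {
  ca <- qgram.counts(a, q)
  cb <- qgram.counts(b, q)
  shared <- intersect(names(ca), names(cb))
  dot <- sum(as.numeric(ca[shared]) * as.numeric(cb[shared]))
  dot / (sqrt(sum(as.numeric(ca)^2)) * sqrt(sum(as.numeric(cb)^2)))
}

cosine.sim("a warm dry day", "sunny and clear sky dry sunny day")
```

Strings that share many two-character sequences score close to 1 and strings that share none score 0, which is why word order matters less than shared wording with this method.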
The cleaned data is as follows:
df.word <- data.frame(
label = c("warm wet", "warm dry", "cold wet"),
synonym = c(
"hot and drizzling sunny and raining",
"sunny and clear sky dry sunny day",
"cold winds and raining snowing"
)
)
df.string <- data.frame(
day = c(1, 2, 3, 4),
weather = c(
"there would be some drizzling at dawn but we will have a hot day",
"today there are cold winds and a bit of raining or snowing at night",
"a sunny and clear sky is what we have today",
"a warm dry day"
)
)
Install the library and load it:
install.packages('stringdist')
library(stringdist)
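Create an n x m matrix of similarity scores between every text and every synonym group. Each row is a weather text and each column is a synonym group, so a row shows how well that text matches each group.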
match.scores <- sapply( ## Create a nested loop with sapply
seq_along(df.word$synonym), ## Loop for each synonym as 'i'
function(i) {
sapply(
seq_along(df.string$weather), ## Loop for each weather as 'j'
function(j) {
stringsim(df.word$synonym[i], df.string$weather[j], ## Check similarity
method = "cosine", ## Method cosine
q = 2 ## Size of the q-gram: 2
)
}
)
}
)
r$> match.scores
[,1] [,2] [,3]
[1,] 0.3657341 0.1919924 0.24629819
[2,] 0.6067799 0.2548236 0.73552828
[3,] 0.3333974 0.6300619 0.21791793
[4,] 0.1460593 0.4485426 0.03688556
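If the nested sapply feels heavy, the same kind of matrix can probably be built in one call with base R's outer, since stringsim is vectorised over its first two arguments. This is a sketch under that assumption; the vector names weather and synonym and the result name match.scores.outer are introduced here for illustration.

```r
library(stringdist)

synonym <- c(
  "hot and drizzling sunny and raining",
  "sunny and clear sky dry sunny day",
  "cold winds and raining snowing"
)
weather <- c(
  "there would be some drizzling at dawn but we will have a hot day",
  "today there are cold winds and a bit of raining or snowing at night",
  "a sunny and clear sky is what we have today",
  "a warm dry day"
)

# outer() expands both vectors to every (weather, synonym) pair and calls
# stringsim() once on the expanded vectors; method and q are passed through.
match.scores.outer <- outer(weather, synonym, stringsim,
                            method = "cosine", q = 2)

dim(match.scores.outer)  # rows are texts, columns are synonym groups
```

Rows and columns keep the same orientation as the sapply version, so the later which.max step applies unchanged.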
To get the best match for each text row, find the label with the highest match score and add those labels to the data frame.
ranked.match <- apply(match.scores, 1, which.max)
df.string$extract <- df.word$label[ranked.match]
df.string
r$> df.string
day weather extract
1 1 there would be some drizzling at dawn but we will have a hot day warm wet
2 2 today there are cold winds and a bit of raining or snowing at night cold wet
3 3 a sunny and clear sky is what we have today warm dry
4 4 a warm dry day warm dry
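Note that which.max always returns a label, even when the best score is weak. If that matters for your data, one option is to keep a label only above a cutoff. The sketch below reuses the scores printed above; the cutoff value min.score is a hypothetical choice you would need to tune, and extract.checked is an illustrative name.

```r
# Scores copied from the match.scores output shown earlier.
match.scores <- matrix(
  c(0.3657341, 0.1919924, 0.24629819,
    0.6067799, 0.2548236, 0.73552828,
    0.3333974, 0.6300619, 0.21791793,
    0.1460593, 0.4485426, 0.03688556),
  nrow = 4, byrow = TRUE
)
label <- c("warm wet", "warm dry", "cold wet")

best.col   <- apply(match.scores, 1, which.max)  # column of the top score
best.score <- apply(match.scores, 1, max)        # the top score itself

min.score <- 0.4  # hypothetical cutoff, not from the original answer

# Keep the label only when the top score reaches the cutoff, else NA.
extract.checked <- ifelse(best.score >= min.score, label[best.col], NA_character_)
extract.checked
```

With this cutoff the first text is left unlabelled because its best score is about 0.37, while the other three keep their labels.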