r - Keep the best string matched through fuzzy matching in R

Question

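I have two data frames in R. One (df.word) holds the phrases I want to match, with their synonyms in a second column. The other (df.string) holds the strings I want to match against, along with a code. The real strings are complicated, but to keep it simple, say we have: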

df.word <- data.frame(label = c('warm wet', 'warm dry', 'cold wet'),
                 synonym = c('hot and drizzling\nsunny and raining','sunny and clear sky\ndry sunny day', 'cold winds and raining\nsnowing'))

df.string <- data.frame(day = c(1,2,3,4),
                       weather = c('there would be some drizzling at dawn but we will have a hot day', 'today there are cold winds and a bit of raining or snowing at night', 'a sunny and clear sky is what we have today', 'a warm dry day'))

I want to create df.string$extract, which should hold the best match for each string.

A column like this:

df.string$extract <- c('warm wet', 'cold wet', 'warm dry', 'warm dry')

Thanks in advance for any help.

score 2 · Accepted Answer

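A few points in your question are not clear to me; still, here is a possible solution. Check whether it works for you.

I assume you want to find the best-matching label for each weather text. If so, you can use the stringsim function from the stringdist package in the following way.

First note: the results will be more accurate if you remove the \n from your data. I cleaned them in this example, but you can keep them if you prefer.

Second note: you can change the similarity measure by choosing a different method. Here I used cosine similarity, which is a reasonable starting point. For the alternatives, see the function's reference: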

?stringsim
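
As a rough illustration of the second note, the sketch below scores one pair of strings from the example with several of the methods that stringsim accepts. The variable names (a, b, methods, scores) are made up for this illustration, and the selection of methods is arbitrary. According to the stringdist documentation, q only applies to the q-gram based methods (qgram, jaccard, cosine), so it is passed to every call here for simplicity.

```r
library(stringdist)

# One synonym group and one weather text from the example data
a <- "sunny and clear sky dry sunny day"
b <- "a sunny and clear sky is what we have today"

# An arbitrary selection of methods supported by stringsim()
methods <- c("cosine", "jaccard", "qgram", "lv", "jw")

# One similarity score per method, each between 0 and 1
scores <- sapply(methods, function(m) stringsim(a, b, method = m, q = 2))

print(round(scores, 3))
```

The scores are not directly comparable across methods, so it is probably best to pick one method and check its results against a few strings whose correct label you already know.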

The cleaned data are as follows:

df.word <- data.frame(
    label = c("warm wet", "warm dry", "cold wet"),
    synonym = c(
        "hot and drizzling sunny and raining",
        "sunny and clear sky dry sunny day", 
        "cold winds and raining snowing"
    )
)

df.string <- data.frame(
    day = c(1, 2, 3, 4),
    weather = c(
        "there would be some drizzling at dawn but we will have a hot day",
        "today there are cold winds and a bit of raining or snowing at night", 
        "a sunny and clear sky is what we have today", 
        "a warm dry day"
    )
)
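
The cleaned synonyms above were typed out by hand. A minimal sketch of doing the same cleaning programmatically, assuming the only thing to remove is the literal newline between synonyms (df.word.raw and df.word.clean are names invented for this illustration):

```r
# Original data from the question, with "\n" separating the synonyms
df.word.raw <- data.frame(
    label = c("warm wet", "warm dry", "cold wet"),
    synonym = c(
        "hot and drizzling\nsunny and raining",
        "sunny and clear sky\ndry sunny day",
        "cold winds and raining\nsnowing"
    ),
    stringsAsFactors = FALSE
)

# Replace every newline with a single space; fixed = TRUE treats the
# pattern as a plain string rather than a regular expression
df.word.clean <- df.word.raw
df.word.clean$synonym <- gsub("\n", " ", df.word.raw$synonym, fixed = TRUE)

print(df.word.clean$synonym)
```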

Install the library and load it:

install.packages('stringdist')
library(stringdist)

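Create an n x m matrix of similarity scores between each text and each synonym group. Each row corresponds to one weather text and each column to one synonym group.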

match.scores <- sapply(          ## Create a nested loop with sapply
    seq_along(df.word$synonym),  ## Loop for each synonym as 'i'
    function(i) {
        sapply(
            seq_along(df.string$weather), ## Loop for each weather as 'j'
            function(j) {
                stringsim(df.word$synonym[i], df.string$weather[j], ## Check similarity 
                    method = "cosine", ## Method cosine  
                    q = 2 ## Size of the q-gram: 2
                )
            }
        )
    }
)

r$> match.scores
          [,1]      [,2]       [,3]
[1,] 0.3657341 0.1919924 0.24629819
[2,] 0.6067799 0.2548236 0.73552828
[3,] 0.3333974 0.6300619 0.21791793
[4,] 0.1460593 0.4485426 0.03688556
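
Because stringsim is vectorized over its two string arguments, the nested sapply could arguably be replaced by a single call to base R's outer. This is only a sketch of an alternative, not part of the original answer; the names weather, synonym and match.scores.outer are introduced here for illustration.

```r
library(stringdist)

weather <- c(
    "there would be some drizzling at dawn but we will have a hot day",
    "today there are cold winds and a bit of raining or snowing at night",
    "a sunny and clear sky is what we have today",
    "a warm dry day"
)
synonym <- c(
    "hot and drizzling sunny and raining",
    "sunny and clear sky dry sunny day",
    "cold winds and raining snowing"
)

# outer() expands both vectors to every (text, synonym) pair and calls the
# function once on the expanded vectors; rows are texts, columns are synonyms
match.scores.outer <- outer(
    weather, synonym,
    function(a, b) stringsim(a, b, method = "cosine", q = 2)
)

print(dim(match.scores.outer))
```

The result has the same layout as match.scores, so the steps that follow should work on it unchanged.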

Get the best match for each text row: find the label with the highest score in that row, then add those labels to the data frame.

ranked.match <- apply(match.scores, 1, which.max)
df.string$extract <- df.word$label[ranked.match]

df.string

r$> df.string
  day                                                             weather  extract
1   1    there would be some drizzling at dawn but we will have a hot day warm wet
2   2 today there are cold winds and a bit of raining or snowing at night cold wet
3   3                         a sunny and clear sky is what we have today warm dry
4   4                                                      a warm dry day warm dry
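
If you prefer to keep the \n separators, one possible variant is to split each synonym group into its individual synonyms, score every synonym on its own, and keep the highest score per label. This is a sketch under that assumption and not the method used above; it may rank labels differently from the joined-string approach. All names below (label, synonym.raw, weather, synonym.list, best.per.label, extract) are illustrative.

```r
library(stringdist)

label <- c("warm wet", "warm dry", "cold wet")
synonym.raw <- c(
    "hot and drizzling\nsunny and raining",
    "sunny and clear sky\ndry sunny day",
    "cold winds and raining\nsnowing"
)
weather <- c(
    "there would be some drizzling at dawn but we will have a hot day",
    "today there are cold winds and a bit of raining or snowing at night",
    "a sunny and clear sky is what we have today",
    "a warm dry day"
)

# One character vector of individual synonyms per label
synonym.list <- strsplit(synonym.raw, "\n", fixed = TRUE)

# For each label, score every text against every synonym of that label
# and keep the best score per text; the result is a texts x labels matrix
best.per.label <- sapply(synonym.list, function(syns) {
    s <- outer(weather, syns,
               function(a, b) stringsim(a, b, method = "cosine", q = 2))
    apply(s, 1, max)
})

# Label with the highest best-synonym score for each text
extract <- label[apply(best.per.label, 1, which.max)]
print(extract)
```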
