r - How can I find out which operations make up the Levenshtein distance between two strings?

Question

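With the function stringdist, I can compute the Levenshtein distance between strings: it counts the number of deletions, insertions and substitutions needed to transform one string into another. For example, stringdist("abc abc","abcd abc") = 1 because "d" was inserted into the second string.

Is it possible to find out which operations were used to obtain the Levenshtein distance between two strings? Or to find out which characters differ between the two strings (in this example, only "d")? Thanks.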

library(stringdist)
stringdist("abc abc", "abcde acc")
# [1] 3

I would like to know that:

"d" was inserted
"e" was inserted
"b" was replaced with "c"

Or, more simply, I would like the list ("d", "e", "c").

score 10 · Accepted Answer

With adist(), you can retrieve the operations:

drop(attr(adist("abc abc", "abcde acc", counts = TRUE), "counts"))

ins del sub 
  2   0   1

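From ?adist:

If counts is TRUE, the transformation counts are returned as the "counts" attribute of this matrix, as a 3-dimensional array with dimensions corresponding to the elements of x, the elements of y, and the type of transformation (insertions, deletions and substitutions), respectively.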
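The counts only say how many operations of each kind there are. The same help page also documents a "trafos" attribute, returned alongside "counts", that encodes the sequence of steps as a string of M (match), I (insertion), D (deletion) and S (substitution). A minimal sketch of reading it, using only base R (the variable names are just for illustration):

```r
d <- adist("abc abc", "abcde acc", counts = TRUE)

# The transformation sequence for the single (x, y) pair
trafos <- attr(d, "trafos")[1, 1]
trafos
# [1] "MMMIIMMSM"

# Split into single steps and tabulate them; the I and S tallies
# should agree with the "ins" and "sub" entries of the counts attribute
steps <- strsplit(trafos, "")[[1]]
table(steps)
```

Reading the string left to right: three matches, two insertions, two matches, one substitution, one match.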

score 8 · Accepted Answer

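This is known as the Needleman-Wunsch algorithm. It computes the distance between two strings as well as a so-called traceback, which allows you to reconstruct the alignment.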
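To make the idea of a traceback concrete, here is a rough base R sketch of a unit-cost edit-distance table followed by a walk back from the bottom-right cell. It is not how {Biostrings} or adist() implement it; the function name lev_traceback and the tie-breaking order (diagonal first, then insertion, then deletion) are assumptions made for this illustration.

```r
lev_traceback <- function(x, y) {
  a <- strsplit(x, "")[[1]]
  b <- strsplit(y, "")[[1]]
  n <- length(a)
  m <- length(b)

  # cost[i + 1, j + 1] = distance between the first i chars of x
  # and the first j chars of y
  cost <- matrix(0L, n + 1, m + 1)
  cost[, 1] <- 0:n
  cost[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      sub_cost <- if (a[i] == b[j]) 0L else 1L
      cost[i + 1, j + 1] <- min(
        cost[i, j] + sub_cost,  # match or substitution
        cost[i, j + 1] + 1L,    # deletion of a[i]
        cost[i + 1, j] + 1L     # insertion of b[j]
      )
    }
  }

  # Traceback: walk from the bottom-right cell to the top-left one,
  # recording which move explains each cell's value
  ops <- character(0)
  i <- n
  j <- m
  while (i > 0 || j > 0) {
    if (i > 0 && j > 0 && cost[i + 1, j + 1] == cost[i, j] + (a[i] != b[j])) {
      ops <- c(if (a[i] == b[j]) "M" else "S", ops)
      i <- i - 1
      j <- j - 1
    } else if (j > 0 && cost[i + 1, j + 1] == cost[i + 1, j] + 1L) {
      ops <- c("I", ops)
      j <- j - 1
    } else {
      ops <- c("D", ops)
      i <- i - 1
    }
  }

  list(distance = cost[n + 1, m + 1], ops = paste(ops, collapse = ""))
}

res <- lev_traceback("abc abc", "abcde acc")
res$distance
res$ops
```

When several alignments have the same cost, the tie-breaking order decides which one is reported, so the operation string is one valid answer rather than the only one.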

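Since this problem arises mostly in biology, when comparing biological sequences, the algorithm (and related ones) is implemented in the R package {Biostrings}, which is part of Bioconductor.

Since this package implements a more general solution than the simple Levenshtein distance, it is unfortunately more complex to use, and the usage vignette is correspondingly long. But the basic usage for your purpose is as follows: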

library(Biostrings)

dist_mat = diag(27L)
colnames(dist_mat) = rownames(dist_mat) = c(letters, ' ')

result = pairwiseAlignment(
    "abc abc", "abcde acc",
    substitutionMatrix = dist_mat,
    gapOpening = 1, gapExtension = 1
)

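However, this will not simply give you the list c('b', 'c', 'c'), because that list does not fully represent what actually happens here. Instead, it returns an alignment between the two strings. This can be represented as a sequence with substitutions and gaps: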

score(result)
# [1] 3
aligned(result)
as.matrix(aligned(result))
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,] "a"  "b"  "c"  "-"  "-"  " "  "a"  "b"  "c"
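aligned(result) provides, for each character in the second string, the corresponding character in the original string, with inserted characters replaced by -. Essentially, this is a "recipe" for transforming the first string into the second. Note that it contains only insertions and substitutions, not deletions. To get those, you need to perform the alignment the other way round (i.e. swap the string arguments).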
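As a small illustration of reading that "recipe", the sketch below compares the aligned row with the second string position by position. To keep it runnable without {Biostrings}, the aligned row is hard-coded from the as.matrix(aligned(result)) output shown above; the variable names are made up for this example.

```r
# Aligned version of the first string, copied from the matrix output above
aligned_first <- c("a", "b", "c", "-", "-", " ", "a", "b", "c")
second <- strsplit("abcde acc", "")[[1]]

# A gap in the aligned row means the character was inserted
inserted <- second[aligned_first == "-"]

# A non-gap position that differs from the second string is a substitution
is_sub <- aligned_first != "-" & aligned_first != second
substituted <- data.frame(
  from = aligned_first[is_sub],
  to = second[is_sub],
  stringsAsFactors = FALSE
)

inserted
substituted
```

This simple comparison would misread a literal "-" in the input as a gap, so it only suits strings that do not contain that character.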

score 0 · Accepted Answer

Here is code that extracts the number of changes of each type, followed by the characters involved in each type of operation:

source_string <- "12234"
target_string <- "02345"
lev <- adist(source_string, target_string, counts = TRUE)

# Number of operations of each kind
attributes(lev)$counts[, , "ins"]
attributes(lev)$counts[, , "del"]
attributes(lev)$counts[, , "sub"]

# Split both strings into single characters once, up front
source_chars <- strsplit(source_string, "")[[1]]
target_chars <- strsplit(target_string, "")[[1]]

substitution_bank <- deletion_bank <- insertion_bank <- match_bank <- NULL

# One letter per step: M = match, I = insertion, D = deletion, S = substitution
changes <- strsplit(attributes(lev)$trafos, "")[[1]]

# Current positions in the source and target strings
counter_source <- counter_target <- 1
for (j in changes) {
  if (j == "S") {
    # Substitution: record the (source, target) pair and advance both positions
    substitution_bank <- rbind(substitution_bank,
                               cbind(source_chars[counter_source], target_chars[counter_target]))
    counter_source <- counter_source + 1
    counter_target <- counter_target + 1
  }
  if (j == "I") {
    # Insertion: the character exists only in the target
    insertion_bank <- rbind(insertion_bank, target_chars[counter_target])
    counter_target <- counter_target + 1
  }
  if (j == "D") {
    # Deletion: the character exists only in the source
    deletion_bank <- rbind(deletion_bank, source_chars[counter_source])
    counter_source <- counter_source + 1
  }
  if (j == "M") {
    # Match: same character in both strings, advance both positions
    match_bank <- rbind(match_bank, source_chars[counter_source])
    counter_source <- counter_source + 1
    counter_target <- counter_target + 1
  }
}

substitution_bank
deletion_bank
insertion_bank
match_bank

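Honestly, I am ashamed of this code: stepping through one character at a time seems wasteful. But with insertions and deletions in the mix, I could not figure out how to extract the right characters... so more elegant answers are welcome!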
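One possibly tidier variant of the loop above, offered as a sketch rather than a definitive answer: every step except an insertion consumes one source character, and every step except a deletion consumes one target character, so both strings can be laid onto the step sequence with two vectorised assignments. The function name edit_operations and the column names are made up for this example; only base R is used.

```r
edit_operations <- function(source_string, target_string) {
  lev <- adist(source_string, target_string, counts = TRUE)
  ops <- strsplit(attr(lev, "trafos")[1, 1], "")[[1]]
  source_chars <- strsplit(source_string, "")[[1]]
  target_chars <- strsplit(target_string, "")[[1]]

  from <- rep(NA_character_, length(ops))
  to <- rep(NA_character_, length(ops))
  from[ops != "I"] <- source_chars  # M, S and D each consume one source character
  to[ops != "D"] <- target_chars    # M, S and I each consume one target character

  data.frame(op = ops, from = from, to = to, stringsAsFactors = FALSE)
}

steps <- edit_operations("abc abc", "abcde acc")

# Characters that were inserted or substituted in, in order
steps$to[steps$op %in% c("I", "S")]
# [1] "d" "e" "c"

# Substitution pairs only
steps[steps$op == "S", c("from", "to")]
```

The same call works for the "12234" / "02345" pair; deletions then show up as rows where to is NA.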
