I did something similar. I started by creating all combinations of the two terms.
library(dplyr)  # provides %>%, filter(), select(), distinct()

dat <- tibble::tribble(
  ~Verbatim,                    ~LowestlevelTerm,
  "Acute Bronchitis",           "Acute Bronchitis",
  "Sinusitis Maxillaris Acuta", "Acute Maxillary Sinusitis",
  "Increase In Eosinophils",    "Eosinophil Count Increased",
  "Bronchitis Acuta",           "Bronchitis Acute",
  "Acute Sinusitis Maxillaris", "Acute Sinusitis, Maxillary",
  "Eosinophil Increase",        "Eosinophil Count Increased",
  "Increase In Eosinophilia",   "Eosinophilia"
)

# Cross join dat with itself (by = NULL gives the Cartesian product),
# then pair each Verbatim with the LowestlevelTerm of every other row.
dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct()
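Then I calculated a bunch of different metrics with stringdist. For the purposes of this answer I'll show all of them, but use the Levenshtein edit distance (lev) as your "clustering" metric. In other words, for each Verbatim, look for the combination that minimizes lev.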
library(dplyr)       # %>%, filter(), select(), distinct(), mutate(), arrange()
library(stringdist)

dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct() %>%
  mutate(
    lev     = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lv"),              # Levenshtein: like lcs, but also permits substitutions
    osa     = stringdist(Verbatim.x, LowestlevelTerm.y, method = "osa"),             # lv + transpositions of adjacent characters; each substring is edited at most once
    dl      = stringdist(Verbatim.x, LowestlevelTerm.y, method = "dl"),              # full Damerau-Levenshtein: like osa, but a substring may be edited more than once
    lcs     = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lcs"),             # edit distance using insertions and deletions only
    qgram   = stringdist(Verbatim.x, LowestlevelTerm.y, method = "qgram", q = 2),    # counts q-grams that are not shared
    cosine  = stringdist(Verbatim.x, LowestlevelTerm.y, method = "cosine"),          # cosine distance between q-gram count vectors (q = 1 by default)
    jaccard = stringdist(Verbatim.x, LowestlevelTerm.y, method = "jaccard", q = 2)   # compares sets of q-grams: 0 is all matching, 1 is none matching
  ) %>%
  arrange(Verbatim.x, lev)
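To make the differences between the edit-based methods more concrete, here is a small sketch on toy strings. The toy strings "ab", "ba", "ca" and "abc" are my own illustration and not part of the data; the last call uses one pair from dat.

```r
library(stringdist)

# A swap of two adjacent characters: "ab" -> "ba"
stringdist("ab", "ba", method = "lv")    # 2: two substitutions
stringdist("ab", "ba", method = "osa")   # 1: one transposition
stringdist("ab", "ba", method = "dl")    # 1: one transposition
stringdist("ab", "ba", method = "lcs")   # 2: one deletion + one insertion

# Where osa and dl differ: dl may edit the same substring more than once
stringdist("ca", "abc", method = "osa")  # 3
stringdist("ca", "abc", method = "dl")   # 2

# One pair from the data: insert "Count " (6 characters) and a trailing "d"
stringdist("Eosinophil Increase", "Eosinophil Count Increased", method = "lv")  # 7
```

For the terms in this question the three Levenshtein-style columns come out identical, so the choice between lv, osa and dl matters little here.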
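From this point on, it is more art than science. A cutoff of lev < 15 seems to "cluster" similar things fairly well, though not perfectly: in the output below, "Acute Bronchitis" vs. "Eosinophilia" also scores 12.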
head(dat3, 20)
    Verbatim.x                  LowestlevelTerm.y           lev  osa   dl  lcs  qgram      cosine    jaccard
 1  Acute Bronchitis            Bronchitis Acute             12   12   12   12      4  0.00000000  0.2352941
 2  Acute Bronchitis            Eosinophilia                 12   12   12   18     24  0.47559558  0.9600000
 3  Acute Bronchitis            Acute Maxillary Sinusitis    13   13   13   17     23  0.26902612  0.7419355
 4  Acute Bronchitis            Acute Sinusitis, Maxillary   16   16   16   20     24  0.27637277  0.7500000
 5  Acute Bronchitis            Eosinophil Count Increased   23   23   23   30     36  0.27700119  0.9473684
 6  Acute Sinusitis Maxillaris  Acute Bronchitis             16   16   16   20     24  0.26893401  0.7419355
 7  Acute Sinusitis Maxillaris  Acute Maxillary Sinusitis    17   17   17   19      5  0.02028473  0.1538462
 8  Acute Sinusitis Maxillaris  Bronchitis Acute             20   20   20   30     24  0.26893401  0.7419355
 9  Acute Sinusitis Maxillaris  Eosinophilia                 21   21   21   28     30  0.34684389  0.9062500
10  Acute Sinusitis Maxillaris  Eosinophil Count Increased   24   24   24   38     44  0.34461985  0.9347826
11  Bronchitis Acuta            Acute Bronchitis             12   12   12   12      6  0.04545455  0.3333333
12  Bronchitis Acuta            Eosinophilia                 13   13   13   16     24  0.42792245  0.9600000
13  Bronchitis Acuta            Acute Sinusitis, Maxillary   19   19   19   28     28  0.24622164  0.8235294
14  Bronchitis Acuta            Eosinophil Count Increased   20   20   20   26     36  0.30843593  0.9473684
15  Bronchitis Acuta            Acute Maxillary Sinusitis    21   21   21   29     27  0.23856887  0.8181818
16  Eosinophil Increase         Eosinophil Count Increased    7    7    7    7      7  0.06910435  0.2800000
17  Eosinophil Increase         Eosinophilia                  8    8    8    9     11  0.21106794  0.5500000
18  Eosinophil Increase         Bronchitis Acute             15   15   15   21     29  0.32696355  0.9354839
19  Eosinophil Increase         Acute Bronchitis             17   17   17   25     29  0.32696355  0.9354839
20  Eosinophil Increase         Acute Maxillary Sinusitis    21   21   21   34     36  0.36333027  0.9230769
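One way to turn the cutoff into an actual assignment is to keep, for each Verbatim, only the closest term under the threshold. The following is a minimal sketch of that idea, not a definitive implementation: the names pairs, lev_cutoff and best are hypothetical, the pairs are a handful copied from the output above so the snippet runs on its own, and dropping ties with with_ties = FALSE is my assumption about what you would want.

```r
library(dplyr)
library(stringdist)

# A few candidate pairs copied from the output above (illustrative subset)
pairs <- tibble::tribble(
  ~Verbatim.x,           ~LowestlevelTerm.y,
  "Eosinophil Increase", "Eosinophil Count Increased",
  "Eosinophil Increase", "Eosinophilia",
  "Eosinophil Increase", "Acute Maxillary Sinusitis",
  "Bronchitis Acuta",    "Acute Bronchitis",
  "Bronchitis Acuta",    "Eosinophil Count Increased"
) %>%
  mutate(lev = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lv"))

# Hypothetical threshold taken from the text; tune it for your own data
lev_cutoff <- 15

best <- pairs %>%
  filter(lev < lev_cutoff) %>%                  # drop pairs that are too far apart
  group_by(Verbatim.x) %>%
  slice_min(lev, n = 1, with_ties = FALSE) %>%  # keep the single closest term
  ungroup()

best
```

On the full dat3 the same pipeline applies unchanged; a Verbatim with no term under the cutoff simply drops out of best, which you may want to review by hand.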