
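I know similar questions may have been asked on this or other forums, but I think my requirement is different. I have a data frame with 2 columns, shown below: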

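Verbatim                      LowestlevelTerm
Acute Bronchitis              Acute Bronchitis
Sinusitis Maxillaris Acuta    Acute Maxillary Sinusitis
Increase In Eosinophils       Eosinophil Count Increased
Bronchitis Acuta              Bronchitis Acute
Acute Sinusitis Maxillaris    Acute Sinusitis, Maxillary
Eosinophil Increase           Eosinophil Count Increased
Increase In Eosinophilia      Eosinophilia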

I am trying to get the following output with my code, but I have had no luck so far:

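Verbatim                      LowestlevelTerm               Cluster id
Acute Bronchitis              Acute Bronchitis              1
Bronchitis Acuta              Bronchitis Acute              1
Sinusitis Maxillaris Acuta    Acute Maxillary Sinusitis     2
Acute Sinusitis Maxillaris    Acute Sinusitis, Maxillary    2
Increase In Eosinophils       Eosinophil Count Increased    3
Eosinophil Increase           Eosinophil Count Increased    3
Increase In Eosinophilia      Eosinophilia                  3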

This is the code I used to try to achieve that:

new_df <- df %>%
  group_by(LowestlevelTerm) %>%
  summarise(Clusterid = toString(ID))

Could you let me know whether there is a simple way to cluster these terms using some other function?


1 Answer


I did something similar. I first created all combinations of the two terms.

dat<-tibble::tribble(
                     ~Verbatim,             ~LowestlevelTerm,
            "Acute Bronchitis",           "Acute Bronchitis",
  "Sinusitis Maxillaris Acuta",  "Acute Maxillary Sinusitis",
     "Increase In Eosinophils", "Eosinophil Count Increased",
            "Bronchitis Acuta",           "Bronchitis Acute",
  "Acute Sinusitis Maxillaris", "Acute Sinusitis, Maxillary",
         "Eosinophil Increase", "Eosinophil Count Increased",
    "Increase In Eosinophilia",               "Eosinophilia"
  )

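library(dplyr)  # provides %>%, filter(), select(), distinct()

# by = NULL makes merge() return the Cartesian product of dat with itself
dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%      # drop pairs whose Verbatim values are identical
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct()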
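To see what the cross join is doing, here is a minimal base R sketch on a toy data frame. The values "a1", "b1", "A" and "B" and the names toy, all_pairs and cross_pairs are made up for illustration and are not part of the original data.

```r
# Toy data, hypothetical values chosen only to keep the output small
toy <- data.frame(
  Verbatim         = c("a1", "b1"),
  LowestlevelTerm  = c("A", "B"),
  stringsAsFactors = FALSE
)

# by = NULL asks merge() for the Cartesian product:
# every row of the first argument paired with every row of the second
all_pairs <- merge(toy, toy, by = NULL)

# Clashing column names receive the default suffixes .x and .y
names(all_pairs)
nrow(all_pairs)  # 2 rows x 2 rows = 4 pairs

# Drop pairs with identical Verbatim values and keep the two compared columns
cross_pairs <- all_pairs[all_pairs$Verbatim.x != all_pairs$Verbatim.y,
                         c("Verbatim.x", "LowestlevelTerm.y")]
cross_pairs
```

Note that this filter means a Verbatim is only ever compared against terms that come from other rows.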

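Then I computed a bunch of different metrics with stringdist. For the purposes of this answer I show all of them, but I use the Levenshtein edit distance as the "clustering" metric. In other words, for each Verbatim, the unique combination that minimizes lev is the one to pick.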

library(dplyr)
library(stringdist)

dat3 <- merge(dat, dat, by = NULL) %>%
  filter(Verbatim.x != Verbatim.y) %>%
  select(Verbatim.x, LowestlevelTerm.y) %>%
  distinct() %>%
  mutate(
    # Levenshtein: insertions, deletions and substitutions
    lev = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lv"),
    # optimal string alignment: lv plus transpositions of adjacent characters
    osa = stringdist(Verbatim.x, LowestlevelTerm.y, method = "osa"),
    # full Damerau-Levenshtein: like osa, but a substring may be edited more than once
    dl = stringdist(Verbatim.x, LowestlevelTerm.y, method = "dl"),
    # longest common substring distance: insertions and deletions only
    lcs = stringdist(Verbatim.x, LowestlevelTerm.y, method = "lcs"),
    # q-gram distance: counts q-grams that are not shared (q = 2)
    qgram = stringdist(Verbatim.x, LowestlevelTerm.y, method = "qgram", q = 2),
    # cosine distance between q-gram count vectors (default q = 1 here)
    cosine = stringdist(Verbatim.x, LowestlevelTerm.y, method = "cosine"),
    # Jaccard distance on q-gram sets: 0 is all matching, 1 is none matching (q = 2)
    jaccard = stringdist(Verbatim.x, LowestlevelTerm.y, method = "jaccard", q = 2)
  ) %>%
  arrange(Verbatim.x, lev)
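As a quick sanity check on what the lev column measures, the sketch below recomputes one distance with adist() from base R, which implements a generalized Levenshtein distance. It is only an illustration on one pair of terms from the data, not part of the pipeline above. The variable names a, b and d are arbitrary.

```r
# adist() ships with base R (package utils); with default costs it is the
# plain Levenshtein distance
a <- "Eosinophil Increase"
b <- "Eosinophil Count Increased"

# adist() returns a 1 x 1 matrix here; drop() reduces it to a single number
d <- drop(adist(a, b))
d  # 7: insert "Count " (6 characters) plus the trailing "d" (1 character)
```

Because the first string can be turned into the second using insertions alone, the Levenshtein and lcs distances coincide for this pair.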

From this point on, it is more art than science. A cutoff of lev < 15 seems to do a decent job of "clustering" similar things.
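The selection step described above (smallest lev per Verbatim, then the cutoff) could be sketched in base R as follows. The four rows are copied from the head(dat3, 20) output in this answer. The names scores, cutoff and best are assumptions made for this example.

```r
# A few rows copied from the head(dat3, 20) output (lev column only)
scores <- data.frame(
  Verbatim.x        = c("Eosinophil Increase", "Eosinophil Increase",
                        "Bronchitis Acuta", "Bronchitis Acuta"),
  LowestlevelTerm.y = c("Eosinophil Count Increased", "Eosinophilia",
                        "Acute Bronchitis", "Eosinophilia"),
  lev               = c(7, 8, 12, 13),
  stringsAsFactors  = FALSE
)

cutoff <- 15  # the threshold suggested above; tune it for your own data

# Sort by distance, then keep the first (smallest lev) row for each Verbatim
scores <- scores[order(scores$Verbatim.x, scores$lev), ]
best <- scores[!duplicated(scores$Verbatim.x), ]

# Discard matches that are too far away to be trusted
best <- best[best$lev < cutoff, ]
best
```

Ties in lev are broken by row order here, so a real data set may need a second metric as a tie-breaker.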

head(dat3, 20)
                   Verbatim.x          LowestlevelTerm.y lev osa dl lcs qgram     cosine   jaccard
1            Acute Bronchitis           Bronchitis Acute  12  12 12  12     4 0.00000000 0.2352941
2            Acute Bronchitis               Eosinophilia  12  12 12  18    24 0.47559558 0.9600000
3            Acute Bronchitis  Acute Maxillary Sinusitis  13  13 13  17    23 0.26902612 0.7419355
4            Acute Bronchitis Acute Sinusitis, Maxillary  16  16 16  20    24 0.27637277 0.7500000
5            Acute Bronchitis Eosinophil Count Increased  23  23 23  30    36 0.27700119 0.9473684
6  Acute Sinusitis Maxillaris           Acute Bronchitis  16  16 16  20    24 0.26893401 0.7419355
7  Acute Sinusitis Maxillaris  Acute Maxillary Sinusitis  17  17 17  19     5 0.02028473 0.1538462
8  Acute Sinusitis Maxillaris           Bronchitis Acute  20  20 20  30    24 0.26893401 0.7419355
9  Acute Sinusitis Maxillaris               Eosinophilia  21  21 21  28    30 0.34684389 0.9062500
10 Acute Sinusitis Maxillaris Eosinophil Count Increased  24  24 24  38    44 0.34461985 0.9347826
11           Bronchitis Acuta           Acute Bronchitis  12  12 12  12     6 0.04545455 0.3333333
12           Bronchitis Acuta               Eosinophilia  13  13 13  16    24 0.42792245 0.9600000
13           Bronchitis Acuta Acute Sinusitis, Maxillary  19  19 19  28    28 0.24622164 0.8235294
14           Bronchitis Acuta Eosinophil Count Increased  20  20 20  26    36 0.30843593 0.9473684
15           Bronchitis Acuta  Acute Maxillary Sinusitis  21  21 21  29    27 0.23856887 0.8181818
16        Eosinophil Increase Eosinophil Count Increased   7   7  7   7     7 0.06910435 0.2800000
17        Eosinophil Increase               Eosinophilia   8   8  8   9    11 0.21106794 0.5500000
18        Eosinophil Increase           Bronchitis Acute  15  15 15  21    29 0.32696355 0.9354839
19        Eosinophil Increase           Acute Bronchitis  17  17 17  25    29 0.32696355 0.9354839
20        Eosinophil Increase  Acute Maxillary Sinusitis  21  21 21  34    36 0.36333027 0.9230769
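Once every Verbatim has been assigned a term, turning those terms into integer cluster ids is a small step. The sketch below assumes a hypothetical matching result (the matched data frame and its Term column are invented for illustration and are not output of the code above) and numbers the terms in order of first appearance.

```r
# Hypothetical result of the matching step: one chosen term per Verbatim
matched <- data.frame(
  Verbatim = c("Bronchitis Acuta", "Eosinophil Increase", "Increase In Eosinophils"),
  Term     = c("Acute Bronchitis", "Eosinophil Count Increased",
               "Eosinophil Count Increased"),
  stringsAsFactors = FALSE
)

# Number the distinct terms in order of first appearance; equal terms share an id
matched$Clusterid <- match(matched$Term, unique(matched$Term))
matched
```

This only groups rows whose chosen terms are identical. Terms that differ but belong together, such as "Eosinophilia" and "Eosinophil Count Increased" in the question, would still need to be merged by hand or by a further distance-based step.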
answered 2019-11-26T17:13:15.227