r - Finding nearby words across many articles in R

Question

I have a tibble (mydf) with 100 rows and 5 columns. Each article consists of many paragraphs.

library(tibble)

ID<-c(1,2)
Date<-c("31/01/2018","15/02/2018") 

article1<-c("This is the first article. It is not long. It is not short. It 
comprises of many words and many sentences. This ends paragraph one.  
Paragraph two starts here. It is just a continuation.")

article2<-c("This is the second article. It is longer than first article by 
number of words. It also does not communicate anything of value. Reading it 
can put you to sleep or jumpstart your imagination. Let your imagination 
take you to some magical place. Enjoy the ride.")

Articles<-c(article1,article2)

FirstWord<-c("first","starts")
SecondWord<-c("jumpstart","magical")

mydf<-tibble(ID,Date, FirstWord,SecondWord,Articles)

ID    Date    FirstWord    SecondWord    Articles
 1    xxxx     xxx           xxx          xxx
 2     etc
 3     etc

I want to add a new column to the table that gives TRUE/FALSE depending on whether FirstWord occurs within 30 words of SecondWord in the article.

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

I followed this StackOverflow example to compute the distance: How to calculate proximity of words to a specific term in a document

library(tidytext)
library(dplyr)

all_words <- mydf %>%
unnest_tokens(word, Articles) %>%
mutate(position = row_number()) 

library(fuzzyjoin)

nearby_words <- all_words %>%
filter(word == FirstWord) %>%
select(focus_term = word, focus_position = position) %>%
difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
mutate(distance = abs(focus_position - position))

I get a table like this:

  focus_term   focus_position  ID    Date    FirstWord    SecondWord   word  position

How can I get the result in this format:

ID    Date    FirstWord    SecondWord    Articles   distance
 1    xxxx     xxx           xxx          xxx        TRUE
 2     etc                                           FALSE
 3     etc

Thanks for your help :)

score 2 · Accepted Answer

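Because you are tokenizing the Articles column, it gets converted into the word column. To keep the original Articles column, just copy it into a new column (say, new_column) before tokenizing. In nearby_words, I simply selected the columns you want in the output. I also added a boolean that is TRUE if the distance equals 30.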

mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)

# Copy Articles first, because unnest_tokens() replaces it with one word per row
all_words <- mydf %>%
  mutate(new_column = Articles) %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())

nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position)) %>%
  # TRUE only when the distance is exactly 30
  mutate(distance = distance == 30) %>%
  select(ID, Date, FirstWord, SecondWord, new_column, distance)
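Note that the pipeline above returns one row per neighbouring word, and that position is numbered across all articles rather than within each one. If what is wanted is a single TRUE/FALSE per article for "FirstWord occurs within 30 words of SecondWord", one possible alternative is a small base R helper applied row by row. This is only a sketch under assumptions: words_within and the demo data frame are hypothetical names invented for illustration, and the tokenizer is a crude regex split, so word positions may differ slightly from what unnest_tokens() produces.

```r
# Hypothetical helper (not part of tidytext or fuzzyjoin): TRUE if `first`
# and `second` both occur in `text` and at least one pair of occurrences
# is at most `max_dist` words apart.
words_within <- function(text, first, second, max_dist = 30) {
  # Crude tokenizer (an assumption): lower-case, split on non-alphanumerics
  tokens <- strsplit(tolower(text), "[^[:alnum:]']+")[[1]]
  tokens <- tokens[nzchar(tokens)]

  pos_first  <- which(tokens == tolower(first))
  pos_second <- which(tokens == tolower(second))
  if (length(pos_first) == 0 || length(pos_second) == 0) return(FALSE)

  # Smallest gap over all pairs of occurrences
  min(abs(outer(pos_first, pos_second, "-"))) <= max_dist
}

# Small made-up data with the same column names as mydf
demo <- data.frame(
  ID = c(1, 2),
  FirstWord = c("first", "sleep"),
  SecondWord = c("jumpstart", "imagination"),
  Articles = c(
    "This is the first article. It is not long.",
    "Reading it can put you to sleep or jumpstart your imagination."
  ),
  stringsAsFactors = FALSE
)

# One logical value per row; positions are counted within each article
demo$distance <- mapply(words_within, demo$Articles, demo$FirstWord,
                        demo$SecondWord, USE.NAMES = FALSE)
```

The same mapply() call should work on mydf by swapping demo for mydf, since it only relies on the Articles, FirstWord and SecondWord columns. Because each article is tokenized separately, a match can never span two articles.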
