I have a tibble (mydf) with 100 rows and 5 columns. Each article consists of many paragraphs.
library(tibble)
ID <- c(1, 2)
Date <- c("31/01/2018", "15/02/2018")
article1 <- c("This is the first article. It is not long. It is not short. It
comprises of many words and many sentences. This ends paragraph one.
Paragraph two starts here. It is just a continuation.")
article2 <- c("This is the second article. It is longer than first article by
number of words. It also does not communicate anything of value. Reading it
can put you to sleep or jumpstart your imagination. Let your imagination
take you to some magical place. Enjoy the ride.")
Articles <- c(article1, article2)
FirstWord <- c("first", "starts")
SecondWord <- c("jumpstart", "magical")
mydf <- tibble(ID, Date, FirstWord, SecondWord, Articles)
ID Date FirstWord SecondWord Articles
1 xxxx xxx xxx xxx
2 etc
3 etc
I want to add a new column to the table that gives TRUE/FALSE depending on whether FirstWord appears within 30 words of SecondWord in the article.
ID Date FirstWord SecondWord Articles distance
1 xxxx xxx xxx xxx TRUE
2 etc FALSE
3 etc
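To make the TRUE/FALSE rule concrete, here is a minimal base R sketch of the check for a single article. The helper name `words_within`, its tokenising regex, and the shortened sample strings are assumptions for illustration and are not part of mydf. It treats "distance" as the difference between word positions and returns FALSE when either word is missing.

```r
# Hypothetical helper: TRUE if `w1` and `w2` occur within `max_dist`
# word positions of each other in `text`, FALSE otherwise.
words_within <- function(text, w1, w2, max_dist = 30) {
  # Lower-case, then split on anything that is not a letter, digit or apostrophe
  tokens <- strsplit(tolower(text), "[^a-z0-9']+")[[1]]
  tokens <- tokens[tokens != ""]
  p1 <- which(tokens == tolower(w1))
  p2 <- which(tokens == tolower(w2))
  # If either word never occurs, there is no distance to measure
  if (length(p1) == 0 || length(p2) == 0) return(FALSE)
  # Smallest gap between any occurrence of w1 and any occurrence of w2
  min(abs(outer(p1, p2, "-"))) <= max_dist
}

# Shortened sample data (assumed, for illustration only)
articles    <- c("This is the first article. Paragraph two starts here.",
                 "Reading it can jumpstart your imagination.")
first_word  <- c("first", "starts")
second_word <- c("starts", "magical")

# Apply the check row by row
distance <- mapply(words_within, articles, first_word, second_word,
                   USE.NAMES = FALSE)
```

With the real table, the same call could presumably be written as `mapply(words_within, mydf$Articles, mydf$FirstWord, mydf$SecondWord, USE.NAMES = FALSE)` and assigned to a new `distance` column.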
I followed this StackOverflow example to compute the distance: "How to calculate proximity of words to a specific term in a document".
library(tidytext)
library(dplyr)
# One row per word; note that position is counted across all articles, not per ID
all_words <- mydf %>%
  unnest_tokens(word, Articles) %>%
  mutate(position = row_number())
library(fuzzyjoin)
# Keep rows where the word equals that row's FirstWord, then pull in every word within 30 positions
nearby_words <- all_words %>%
  filter(word == FirstWord) %>%
  select(focus_term = word, focus_position = position) %>%
  difference_inner_join(all_words, by = c(focus_position = "position"), max_dist = 30) %>%
  mutate(distance = abs(focus_position - position))
I get a table like this:
focus_term focus_position ID Date FirstWord SecondWord word position
How can I get the result in this format instead:
ID Date FirstWord SecondWord Articles distance
1 xxxx xxx xxx xxx TRUE
2 etc FALSE
3 etc
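A rough sketch of one way to reach that shape with tidytext and dplyr, under some assumptions: positions are counted within each ID rather than across the whole table, the words are compared in lower case because `unnest_tokens` lower-cases tokens by default, and a row is FALSE when either word is absent. The small sample table and the name `flags` are made up for illustration.

```r
library(tibble)
library(dplyr)
library(tidytext)

# Small stand-in for mydf (assumed sample data)
mydf <- tibble(
  ID         = c(1, 2),
  FirstWord  = c("first", "it"),
  SecondWord = c("long", "magical"),
  Articles   = c("This is the first article. It is not long.",
                 "It is longer than the first article. Enjoy the ride.")
)

flags <- mydf %>%
  unnest_tokens(word, Articles) %>%     # one row per word, Articles column dropped
  group_by(ID) %>%
  mutate(position = row_number()) %>%   # word position within each article
  summarise(
    distance = {
      p1 <- position[word == tolower(first(FirstWord))]
      p2 <- position[word == tolower(first(SecondWord))]
      # TRUE only if both words occur and their closest occurrences are <= 30 apart
      length(p1) > 0 && length(p2) > 0 && min(abs(outer(p1, p2, "-"))) <= 30
    }
  )

# Attach the flag to the original table, keeping the Articles column
result <- left_join(mydf, flags, by = "ID")
```

Summarising to one row per ID before joining back is what collapses the long word-level table into a single TRUE/FALSE per article.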
Thanks for your help :)