json - Jaccard distance between tweets

Question

I am currently trying to measure the Jaccard distance between tweets in a dataset.

This is where the dataset is located:

http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json

I have tried a few ways of measuring the distance.

This is what I have so far.

I saved the linked dataset to a file called Tweets.json:

library(jsonlite)  # assumption: fromJSON comes from jsonlite; the question does not name the package
json_alldata <- fromJSON(sprintf("[%s]", paste(readLines("Tweets.json"), collapse = ",")))

Then I copied json_alldata into tweet.features and dropped the geo column:

# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL

This is what the first two tweets look like:

> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

The first thing I tried was the stringdist function from the stringdist package:

install.packages("stringdist")
library(stringdist)

#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")

When I run it, I get

[1] 0.1621622

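I'm not sure this is correct, though. A intersect B = 23 and A union B = 25. The Jaccard distance is (A intersect B) / (A union B), right? So by my calculation the Jaccard distance should be 0.92?

So I figured I could do it with sets: just compute the intersection and the union, then divide.

This is what I tried: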

# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])

When I try the intersection, I get this (the output is just list()):

 Intersection <- intersect(A1, A2)
 list()

When I try the union, I get this:

union(A1, A2)

[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"

[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

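This does not seem to combine the words into a single set.

I figured I could divide the intersection by the union. But I think I need the program to count the number of words in each set and then do the calculation.

Needless to say, I'm a bit stuck, and I'm not sure I'm on the right track.

Any help would be appreciated. Thank you.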

score 3 · Accepted Answer

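intersect and union expect vectors (as.set is not a base R function). I think you want to compare words, so you can use strsplit, but how you split is up to you. An example: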

tweet.features <- list(tweet1="RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
                       tweet2=          "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")

jaccard_i <- function(tw1, tw2){
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  list(i=i, u=u, j=i/u)
}

jaccard_i(tweet.features[[1]], tweet.features[[2]])

$i
[1] 20

$u
[1] 23

$j
[1] 0.8695652
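Note that j here is the Jaccard similarity (intersection over union). The Jaccard distance is usually defined as one minus that. A minimal sketch using the counts printed above (the variable names are only illustrative):

```r
# Counts taken from the jaccard_i() output above
i <- 20  # size of the intersection
u <- 23  # size of the union

similarity <- i / u           # Jaccard similarity: 20/23
distance   <- 1 - similarity  # Jaccard distance: 3/23

similarity
distance
```

So for this particular way of splitting, the two tweets are at a distance of about 0.13 from each other.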

Is this what you want?
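As for the 0.1621622 that stringdist returned in the question: as far as I know, method = "jaccard" compares q-grams of characters (with q = 1 by default), not words, and it returns a distance rather than a similarity. The base R sketch below does that character-level calculation by hand. char_jaccard_dist is a made-up helper name, not part of stringdist.

```r
tw1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tw2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

# Hypothetical helper: Jaccard distance over the sets of distinct characters
char_jaccard_dist <- function(a, b) {
  ca <- unique(strsplit(a, "")[[1]])  # distinct characters of a
  cb <- unique(strsplit(b, "")[[1]])  # distinct characters of b
  1 - length(intersect(ca, cb)) / length(union(ca, cb))
}

char_jaccard_dist(tw1, tw2)
```

The two tweets share 31 distinct characters out of 37 overall, which gives 1 - 31/37, about 0.162. That matches the stringdist figure, so the function is probably working as designed, just at the character level.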

In jaccard_i, strsplit splits on every space or dot. You may want to refine the split argument of strsplit and replace " |\\." with something more specific (see ?regex).
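One possible refinement, sketched under my own assumptions about what should count as a word: lower-case the text, split on any run of characters that is not an ASCII letter, a digit, # or @, and drop empty tokens. The names jaccard_words and tokenize are made up for this example.

```r
tw1 <- "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
tw2 <- "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

# Hypothetical helper: word-level Jaccard similarity and distance
jaccard_words <- function(a, b) {
  tokenize <- function(x) {
    # Assumption: tokens are runs of ASCII letters, digits, '#' and '@'
    tokens <- unlist(strsplit(tolower(x), "[^a-z0-9#@]+"))
    unique(tokens[tokens != ""])  # drop empty strings and duplicates
  }
  wa <- tokenize(a)
  wb <- tokenize(b)
  i <- length(intersect(wa, wb))
  u <- length(union(wa, wb))
  list(i = i, u = u, similarity = i / u, distance = 1 - i / u)
}

res <- jaccard_words(tw1, tw2)
res
```

With this split the empty token and the trailing colons and dot disappear, so the two example tweets share 20 of 22 tokens and differ only in the @ handle. Whether hashtags and handles should count as words at all is a choice that depends on what you want the distance to capture.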
