The High-Performance and Parallel Computing with R task view says that tm can use snow for parallel text mining. However, I have not found any examples of how to do this, although I did find some discussion of parallel computing with tm (R/Finance 2012). Can anyone shed light on how to interface tm with a cluster created by snow?

Edit: see BenBarnes's comment below. Specifically:

According to ?tm_startCluster, that function looks for an MPI cluster (not a SOCK cluster) and "allow[s] tm to take advantage of a cluster". Perhaps this could be an alternative to hadoop, since snow can set up an MPI cluster given a few prerequisites.
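
For concreteness, a minimal sketch of that prerequisite step, creating an MPI cluster with snow. It assumes Rmpi and a working MPI installation (e.g. Open MPI); the tm_startCluster()/tm_stopCluster() step is left as a comment because it is only an assumption based on the help-page reference above, not something I have tested:

library(Rmpi)   # MPI bindings; needs an MPI installation such as Open MPI
library(snow)

cl <- makeCluster(2, type = "MPI")   # spawn 2 MPI worker processes

## Per ?tm_startCluster (cited above), tm should then be able to take
## advantage of such an MPI cluster, presumably via tm_startCluster() /
## tm_stopCluster() -- untested assumption, check the help page.

stopCluster(cl)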

1 Answer

An LMGTFY using "r-project tm parallel" as the search strategy turns this up as the third hit:

Distributed Text Mining with tm

Copied straight from the slides. Solution:
1. Distributed storage: the data set is copied to a DFS ('DistributedCorpus'); only meta information about the corpus remains in memory.
2. Parallel computation: computational operations (Map) on all elements of the corpus run in parallel, following the MapReduce paradigm; the workhorses are tm_map() and TermDocumentMatrix(); processed documents can be retrieved on demand (revisions).

This is implemented in a "plugin" package for tm: tm.plugin.dc.

# Distributed Text Mining in R
> library("tm.plugin.dc")
> dc <- DistributedCorpus(DirSource("Data/reuters"),
                          list(reader = readReut21578XML))
> dc <- as.DistributedCorpus(Reuters21578)
> summary(dc)
# A corpus with 21578 text documents
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator
# Available variables in the data frame are:
#   MetaID
# --- Distributed Corpus ---
# Available revisions:
#   20100417144823
# Active revision: 20100417144823
# DistributedCorpus: Storage
# - Description: Local Disk Storage
# - Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
# - Current chunk size [bytes]: 10485760
> dc <- tm_map(dc, stemDocument)
> print(object.size(Reuters21578), units = "Mb")
# 109.5 Mb
> dc
# A corpus with 21578 text documents
> dc_storage(dc)
# DistributedCorpus: Storage
# - Description: Local Disk Storage
# - Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
# - Current chunk size [bytes]: 10485760
> dc[[3]]
# ----------
# Texas Commerce Bancshares Inc's Texas Commerce Bank-Houston said it
# filed an application with the Comptroller of the Currency in an effort
# to create the largest banking network in Harris County.
# The bank said the network would link 31 banks having
# 13.5 billion dlrs in assets and 7.5 billion dlrs in deposits.
# Reuter
# ----------
> print(object.size(dc), units = "Mb")
# 0.6 Mb
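
The slides quoted above also name TermDocumentMatrix() as one of the parallelized workhorses. Continuing the session, building the matrix on the distributed corpus would presumably look like the usual tm call; this is a sketch based only on that description (the control options are just an example), with tm.plugin.dc expected to run the per-document work via MapReduce behind the scenes:

## Sketch: the standard tm interface, applied to the distributed corpus above
tdm <- TermDocumentMatrix(dc, control = list(stopwords = TRUE))
dim(tdm)   # terms x documents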

A further search using the terms tm, snow, parLapply ... turns up this link:

Using this code:

library(snow)

## a SOCK cluster with 4 workers on the local machine
cl <- makeCluster(4, type = "SOCK")

## pause between the two timing plots
par(ask = TRUE)

## dummy worker function: just sleeps, ignoring the large matrix argument
bigsleep <- function(sleeptime, mat) Sys.sleep(sleeptime)
bigmatrix <- matrix(0, 2000, 2000)
sleeptime <- rep(1, 100)

## time clusterApply; note that the variable tm here is the snow timing
## object, unrelated to the tm package
tm <- snow.time(clusterApply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for clusterApply: %f\n", tm$elapsed))

## time parLapply, which splits the work into one chunk per worker
tm <- snow.time(parLapply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for parLapply: %f\n", tm$elapsed))

stopCluster(cl)
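
Tying this back to the question: the same kind of SOCK cluster can also be pointed at per-document tm work directly with parLapply. A rough sketch follows; the toy texts and the choice of stemDocument are purely illustrative (not from the answer above), and stemDocument needs the SnowballC package installed on the workers:

library(snow)
library(tm)

cl <- makeCluster(4, type = "SOCK")

## each worker needs tm loaded so that stemDocument is available there
clusterEvalQ(cl, library(tm))

## toy documents; with a real corpus you would extract the text first
texts <- c("running runners ran quickly",
           "text mining with tm and snow",
           "parallel computing across several workers")

## stem each document on a worker, then rebuild a corpus from the results
stemmed <- parLapply(cl, texts, stemDocument)
corpus  <- Corpus(VectorSource(unlist(stemmed)))

inspect(corpus)

stopCluster(cl)
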
Answered 2012-06-19T00:17:06.380