
r - How does tm interact with snow?

Reposted. Author: 行者123. Updated: 2023-12-01 05:31:09

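The High-Performance Computing Task View states that tm can use snow for parallel text mining (High-Performance and Parallel Computing with R). However, I have not found any example showing how to do this, although I did find some discussion of parallel computing with tm (R/Finance 2012). Can anyone explain how tm interfaces with a cluster created by snow?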

Edit: see BenBarnes's comment below. Specifically:

According to ?tm_startCluster, that function looks for an MPI cluster (not a SOCK cluster) and "allow[s] 'tm' to use a cluster". Perhaps that would be an alternative to hadoop, since, given a few prerequisites, snow can set up an MPI cluster.

Best answer

LMGTFY, using "r-project tm parallel" as the search strategy, turns this up as the third hit:

Distributed Text Mining with tm

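Copied directly from the slides:
Solution:
1. Distributed storage
Data set copied to a DFS ("DistributedCorpus")
Only meta information about the corpus remains in memory
2. Parallel computation
Computational operations (Map) on all elements in parallel
MapReduce paradigm
Workhorses: tm_map() and TermDocumentMatrix()
Processed documents (revisions) can be retrieved on demand.

Implemented in a "plugin" package for tm: tm.plugin.dc.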

#Distributed Text Mining in R 
> library("tm.plugin.dc")
> dc <- DistributedCorpus(DirSource("Data/reuters"),
list(reader = readReut21578XML) )
> dc <- as.DistributedCorpus(Reuters21578)
> summary(dc)
#A corpus with 21578 text documents
#The metadata consists of 2 tag-value pairs and a data frame
#Available tags are:
#create_date creator
#Available variables in the data frame are:
#MetaID
#--- Distributed Corpus ---
#Available revisions:
#20100417144823
#Active revision: 20100417144823
#DistributedCorpus: Storage
#- Description: Local Disk Storage
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
#- Current chunk size [bytes]: 10485760
> dc <- tm_map(dc, stemDocument)
> print(object.size(Reuters21578), units = "Mb")
#109.5 Mb
> dc
#A corpus with 21578 text documents
> dc_storage(dc)
#DistributedCorpus: Storage
#- Description: Local Disk Storage
#- Base directory on storage: /tmp/RtmpuxX3W7/file5bd062c2
#- Current chunk size [bytes]: 10485760
> dc[[3]]
#----------
#Texas Commerce Bancshares Inc's Texas
#Commerce Bank-Houston said it filed an application with the
#Comptroller of the Currency in an effort to create the largest
#banking network in Harris County.
#The bank said the network would link 31 banks having
#13.5 billion dlrs in assets and 7.5 billion dlrs in deposits.
#Reuter
#---------
> print(object.size(dc), units = "Mb")
# 0.6 Mb
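
The Map and Reduce steps named in the slides can be illustrated without tm.plugin.dc. The following is only a minimal sketch under stated assumptions: it uses the parallel package (which ships with R and incorporates snow's makeCluster/parLapply interface) instead of snow itself, a made-up three-document character vector instead of a real corpus, and a hypothetical helper count_terms. It is not how tm or tm.plugin.dc implement tm_map() or TermDocumentMatrix().

```r
library(parallel)  # ships with R; snow-style makeCluster()/parLapply()

# Hypothetical toy "corpus": one document per element (not real Reuters data)
docs <- c("the bank said the network would link banks",
          "the bank filed an application",
          "assets and deposits of the bank")

# Map step (hypothetical helper): one document -> table of term counts
count_terms <- function(doc) {
  words <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]
  table(words)
}

cl <- makeCluster(2)                         # two local worker processes
per_doc <- parLapply(cl, docs, count_terms)  # Map, run on the workers
stopCluster(cl)

# Reduce step: merge the per-document counts into corpus-wide totals
all_counts <- unlist(lapply(per_doc, function(tab) {
  setNames(as.integer(tab), names(tab))
}))
totals <- vapply(split(all_counts, names(all_counts)), sum, integer(1))
print(sort(totals, decreasing = TRUE))
```

Only the per-document counts travel back to the master process, which is the same idea as keeping just the meta information in memory while the documents themselves stay in distributed storage.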

Searching further with the terms tm, snow, parLapply ... produces this link:

With this code:
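library(snow)
# Start four local worker processes connected over sockets
cl <- makeCluster(4, type="SOCK")

# Pause before each new plot is drawn
par(ask=TRUE)

# Dummy task: sleeps, but forces the large matrix to be shipped to the worker
bigsleep <- function(sleeptime, mat) Sys.sleep(sleeptime)
bigmatrix <- matrix(0, 2000, 2000)
sleeptime <- rep(1, 100)

# Note: 'tm' here is just a variable holding timing data, not the tm package
tm <- snow.time(clusterApply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for clusterApply: %f\n", tm$elapsed))

tm <- snow.time(parLapply(cl, sleeptime, bigsleep, bigmatrix))
plot(tm)
cat(sprintf("Elapsed time for parLapply: %f\n", tm$elapsed))

# Shut the workers down
stopCluster(cl)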
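
The two timings above differ mainly in how the work is handed out. As a rough sketch, and assuming the parallel package that ships with R (it keeps snow's function names) so that nothing extra has to be installed: parLapply first splits its input into one chunk per worker, so extra arguments such as bigmatrix are shipped once per chunk, whereas clusterApply ships them with every element. The function square below is a made-up stand-in for real work.

```r
library(parallel)  # ships with R; same makeCluster/clusterSplit/parLapply names as snow

cl <- makeCluster(4)

# parLapply divides its input into one consecutive chunk per worker
chunks <- clusterSplit(cl, 1:100)
print(lengths(chunks))

# Hypothetical stand-in for real per-element work
square <- function(x) x^2

# Same results either way; only the number of messages sent differs
res_apply  <- clusterApply(cl, 1:100, square)  # one call per element
res_lapply <- parLapply(cl, 1:100, square)     # one call per chunk

stopCluster(cl)

print(all.equal(unlist(res_apply), unlist(res_lapply)))
```

With a large argument and many short tasks, the per-chunk dispatch is what would be expected to keep communication overhead down.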

Regarding "r - How does tm interact with snow?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11092621/
