r - How to add metadata to a tm Corpus object using tm_map

Reposted by: 行者123. Updated: 2023-12-05 01:23:39

I have been reading various questions and answers (especially here and here), but I have not managed to apply any of them to my case.

I have a matrix of 11,390 rows with the attributes id, author, and text, for example:

library(tm)

m <- cbind(c("01","02","03","04","05","06"),
           c("Author1","Author2","Author2","Author3","Author3","Auhtor4"),
           c("Text1","Text2","Text3","Text4","Text5","Text6"))

I want to create a tm corpus from it. Creating the corpus itself is quick:

tm_corpus <- Corpus(VectorSource(m[,3]))

For my 11,390-row matrix, this finishes in

   user  system elapsed 
  2.383   0.175   2.557

But when I try to add metadata to the corpus with

meta(tm_corpus, type="local", tag="Author") <- m[,2]

the call ran for more than 15 minutes without finishing, at which point I stopped it.

According to the discussion here, using tm_map can significantly reduce the time needed to process a corpus, with something like

tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])

I am still not sure how to do this. It would probably look something like

addMeta <- function(text, vector) {
  meta(text, tag="Author") = vector[??]
  text
}

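For one thing, how do I pass tm_map the vector of values that should be assigned to each text in the corpus? Should I call the function from a loop? Should I wrap the tm_map call in vapply?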

Best answer

Have you already tried the excellent readTabular?

## your sample data
matrix <- cbind(c("01","02","03","04","05","06"),
       c("Author1","Author2","Author2","Author3","Author3","Auhtor4"),
       c("Text1","Text2","Text3","Text4","Text5","Text6"))

## simple transformations
matrix <- as.data.frame(matrix)
names(matrix) <- c("id", "author", "content")

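Your former matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular wants you to define a reader, which in turn takes a mapping. In the mapping, "content" points to the text data, and the other names point to the metadata.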

## define myReader, which will be used in creation of Corpus
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))

The corpus is now created just as before, apart from a few small changes:

## create the corpus
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus,
    readerControl = list(reader=myReader))

Now take a look at the content and metadata of the documents:

lapply(tm_corpus, as.character)
lapply(tm_corpus, meta)
## output just as expected.
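
One caveat: in recent tm releases (the 0.7 series, as far as I can tell) readTabular is no longer available, and DataframeSource instead expects a data frame with the columns doc_id and text, treating any further columns as document-level metadata. A minimal sketch under that assumption, with the sample data cut down to three rows; the names docs and corpus are only illustrative:

```r
library(tm)

# Assumes a tm version in which DataframeSource requires the columns
# "doc_id" and "text"; any other column becomes document-level metadata.
docs <- data.frame(
  doc_id = c("01", "02", "03"),
  text   = c("Text1", "Text2", "Text3"),
  author = c("Author1", "Author2", "Author2"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))

# Document-level metadata is returned as a data frame, one row per document
meta(corpus)
```

If your installed version still ships readTabular, the reader-based code above should keep working and this variant is not needed.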

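This should be fast, since it is part of the package, and it is extremely adaptable. In my own project I use it on a data.table with about 20 variables, and it works like a charm.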

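However, I cannot provide a benchmark against the answer you have already accepted as suitable. I am only guessing that this is faster and more efficient.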
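
For comparison, the per-document assignment the question is reaching for can be written as a plain loop. As far as I understand, tm_map hands the same extra arguments to every call, so on its own it cannot pick the i-th author for the i-th document, which is why the sketch below indexes explicitly. It assumes a VCorpus, uses a three-row version of the sample matrix, and has not been benchmarked against either approach above:

```r
library(tm)

# Three-row version of the sample matrix: id, author, text
m <- cbind(c("01", "02", "03"),
           c("Author1", "Author2", "Author2"),
           c("Text1", "Text2", "Text3"))

tm_corpus <- VCorpus(VectorSource(m[, 3]))

# Set the "author" tag on each document from the matching row of m
for (i in seq_len(length(tm_corpus))) {
  meta(tm_corpus[[i]], tag = "author") <- m[i, 2]
}

# Read the tag back from the second document
meta(tm_corpus[[2]], "author")
```

This stores the value in each document's own metadata rather than in the corpus-level metadata table.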

A similar question to "r - How to add metadata to a tm Corpus object using tm_map" can be found on Stack Overflow: https://stackoverflow.com/questions/21036032/
