r - DocumentTermMatrix error on Corpus argument

Reposted by: 行者123 · Updated: 2023-12-03 05:48:54

I have the following code:

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.

corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)

news_dtm <- DocumentTermMatrix(corpus_clean) # errors here

When I call DocumentTermMatrix(), it fails with the following error:

Error: inherits(doc, "TextDocument") is not TRUE

Why am I getting this error? Aren't my rows text documents?

Here is the output when inspecting corpus_clean:

[[153]]
[1] obama holds technical school model us

[[154]]
[1] oil boom produces jobs bonanza archaeologists

[[155]]
[1] islamic terrorist group expands territory captures tikrit

[[156]]
[1] republicans democrats feel eric cantors loss

[[157]]
[1] tea party candidates try build cantor loss

[[158]]
[1] vehicles materials stored delaware bridges

[[159]]
[1] hill testimony hagel defends bergdahl trade

[[160]]
[1] tweet selfpropagates tweetdeck

[[161]]
[1] blackwater guards face trial iraq shootings

[[162]]
[1] calif man among soldiers killed afghanistan

[[163]]
[1] stocks fall back world bank cuts growth outlook

[[164]]
[1] jabhat alnusra longer useful turkey

[[165]]
[1] catholic bishops keep focus abortion marriage

[[166]]
[1] barbra streisand visits hill heart disease

[[167]]
[1] rand paul cantors loss reason stop talking immigration

[[168]]
[1] israeli airstrike kills northern gaza

Edit: here is my data:

type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods

I read it in like this:

news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)

Edit: here is the traceback():

> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"), 
       ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)

When I evaluate inherits(corpus_clean, "TextDocument"), it returns FALSE.

Best answer

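It seems this would have worked fine in tm 0.5.10, but changes in tm 0.6.0 appear to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (the older version may have done the conversion automatically). Instead they return plain character vectors, and DocumentTermMatrix isn't sure how to handle a corpus of characters.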
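As a side note, the inherits() check in the question is applied to the corpus as a whole, whereas the traceback shows the assertion being made on each individual document (doc). A corpus is a container rather than a TextDocument, so the corpus-level check comes back FALSE even when every document is fine. The following is a minimal sketch of a per-document check, assuming tm 0.6.0 or later; the headlines vector and the is_doc name are illustrative, with the sample strings taken from the question's data.

```r
library(tm)

# Three headlines from the question's data (illustrative sample only).
headlines <- c("The week in 32 photos",
               "Inside rebel tunnels in Homs",
               "Voices from Ukraine")

corpus <- VCorpus(VectorSource(headlines))

# Wrapping tolower keeps each element a TextDocument.
corpus <- tm_map(corpus, content_transformer(tolower))

# Per-document check: the same condition the traceback shows failing.
is_doc <- vapply(content(corpus), inherits, logical(1), what = "TextDocument")
print(is_doc)

# Corpus-level check: FALSE even for a healthy corpus,
# because the corpus itself is not a TextDocument.
print(inherits(corpus, "TextDocument"))
```

If any entry of is_doc is FALSE after a tm_map() step, that step is likely the one returning bare character vectors.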

So you could change that step to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

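after all of your non-standard transformations (those not listed in getTransformations()) are done and just before you create the DocumentTermMatrix. That should ensure all of your data is in PlainTextDocument form and should make DocumentTermMatrix happy.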
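Putting the first option together, here is a sketch of the question's pipeline with both custom functions (tolower and trim) wrapped in content_transformer(). It assumes tm 0.6.0 or later and uses VCorpus explicitly; the three-element headlines vector stands in for news_raw$text and is only a sample of the question's data.

```r
library(tm)

# Returns the string without leading or trailing whitespace (from the question).
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Stand-in for news_raw$text: three headlines from the question's data.
headlines <- c("The week in 32 photos",
               "Inside rebel tunnels in Homs",
               "Voices from Ukraine")

news_corpus <- VCorpus(VectorSource(headlines))

# Non-tm functions are wrapped so they return TextDocuments, not characters.
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("english"))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, content_transformer(trim))

news_dtm <- DocumentTermMatrix(corpus_clean)
print(dim(news_dtm))
```

The built-in transformations (removeNumbers, removeWords, removePunctuation, stripWhitespace) are passed to tm_map() unwrapped, as in the question; only the functions that are not part of tm need the wrapper.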

The original question, "r - DocumentTermMatrix error on Corpus argument", is on Stack Overflow: https://stackoverflow.com/questions/24191728/
