I have text files, and each file contains text featuring spoilers for a TV series; each document is a different series. I want to compare the most frequently used words in each series, and I figured I could plot this with ggplot, with "series 1 terms occurring at least x times" on one axis and "series 2 terms occurring at least x times" on the other. I believe what I need is a data frame with three columns, "Term", "Series X" and "Series Y", where the Series X and Series Y columns hold the number of times each word occurs.
I have tried several ways of doing this but failed. The closest I have got is reading in the corpus and creating a data frame that contains all the terms, like this:
library("tm")
corpus <-Corpus(DirSource("series"))
corpus.p <-tm_map(corpus, removeWords, stopwords("english")) #removes stopwords
corpus.p <-tm_map(corpus.p, stripWhitespace) #removes stopwords
corpus.p <-tm_map(corpus.p, tolower)
corpus.p <-tm_map(corpus.p, removeNumbers)
corpus.p <-tm_map(corpus.p, removePunctuation)
dtm <-DocumentTermMatrix(corpus.p)
docTermMatrix <- inspect(dtm)
termCountFrame <- data.frame(Term = colnames(docTermMatrix))
termCountFrame$seriesX <- colSums(docTermMatrix)
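Roughly, the shape I am after would be built along these lines (just a sketch, assuming for illustration that the corpus holds exactly two documents, one per series):
# sketch of the target data frame: one row per term, one count column per series;
# index the document-term matrix by row (one row per document) instead of
# summing over all documents with colSums()
docTermMatrix <- as.matrix(dtm)   # documents in rows, terms in columns
termCountFrame <- data.frame(Term    = colnames(docTermMatrix),
                             seriesX = docTermMatrix[1, ],  # counts in the first document
                             seriesY = docTermMatrix[2, ],  # counts in the second document
                             row.names = NULL)
head(termCountFrame)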
Best Answer
If your data are in a document-term matrix, you can use tm::findFreqTerms
to get the most frequently used terms in a document. Here is a reproducible example:
require(tm)
data(crude)
dtm <- DocumentTermMatrix(crude)
dtm
A document-term matrix (20 documents, 1266 terms)
Non-/sparse entries: 2255/23065
Sparsity : 91%
Maximal term length: 17
Weighting : term frequency (tf)
# find most frequent terms in all 20 docs
findFreqTerms(dtm, 2, 100)
# find the doc names
dtm$dimnames$Docs
[1] "127" "144" "191" "194" "211" "236" "237" "242" "246" "248" "273" "349" "352" "353" "368" "489" "502"
[18] "543" "704" "708"
# do freq words on one doc
findFreqTerms(dtm[dtm$dimnames$Docs == "127"], 2, 100)
[1] "crude" "cut" "diamond" "dlrs" "for" "its" "oil" "price"
[9] "prices" "reduction" "said." "that" "the" "today" "weak"
# find freq words for each doc, one by one
list_freqs <- lapply(dtm$dimnames$Docs,
                     function(i) findFreqTerms(dtm[dtm$dimnames$Docs == i], 2, 100))
list_freqs
[[1]]
[1] "crude" "cut" "diamond" "dlrs" "for" "its" "oil" "price"
[9] "prices" "reduction" "said." "that" "the" "today" "weak"
[[2]]
[2] "\"opec" "\"the" "15.8" "ability" "above" "address" "agreement"
[8] "analysts" "and" "before" "bpd" "but" "buyers" "current"
[15] "demand" "emergency" "energy" "for" "has" "have" "higher"
[22] "hold" "industry" "its" "keep" "market" "may" "meet"
[29] "meeting" "mizrahi" "mln" "must" "next" "not" "now"
[36] "oil" "opec" "organization" "prices" "problem" "production" "said"
[43] "said." "set" "that" "the" "their" "they" "this"
[50] "through" "will"
[[3]]
[3] "canada" "canadian" "crude" "for" "oil" "price" "texaco" "the"
[[4]]
[4] "bbl." "crude" "dlrs" "for" "price" "reduced" "texas" "the" "west"
[[5]]
[5] "and" "discounted" "estimates" "for" "mln" "net" "pct" "present"
[9] "reserves" "revenues" "said" "study" "that" "the" "trust" "value"
[[6]]
[6] "ability" "above" "ali" "and" "are" "barrel."
[7] "because" "below" "bpd" "bpd." "but" "daily"
[13] "difficulties" "dlrs" "dollars" "expected" "for" "had"
[19] "has" "international" "its" "kuwait" "last" "local"
[25] "march" "markets" "meeting" "minister" "mln" "month"
[31] "official" "oil" "opec" "opec\"s" "prices" "producing"
[37] "pumping" "qatar," "quota" "referring" "said" "said."
[43] "sheikh" "such" "than" "that" "the" "their"
[49] "they" "this" "was" "were" "which" "will"
[[7]]
[7] "\"this" "and" "appears" "are" "areas" "bank"
[7] "bankers" "been" "but" "crossroads" "crucial" "economic"
[13] "economy" "embassy" "fall" "for" "general" "government"
[19] "growth" "has" "have" "indonesia\"s" "indonesia," "international"
[25] "its" "last" "measures" "nearing" "new" "oil"
[31] "over" "rate" "reduced" "report" "say" "says"
[37] "says." "sector" "since" "the" "u.s." "was"
[43] "which" "with" "world"
[[8]]
[8] "after" "and" "deposits" "had" "oil" "opec" "pct" "quotes"
[9] "riyal" "said" "the" "were" "yesterday."
[[9]]
[9] "1985/86" "1986/87" "1987/88" "abdul-aziz" "about" "and" "been"
[8] "billion" "budget" "deficit" "expenditure" "fiscal" "for" "government"
[15] "had" "its" "last" "limit" "oil" "projected" "public"
[22] "qatar," "revenue" "riyals" "riyals." "said" "sheikh" "shortfall"
[29] "that" "the" "was" "would" "year" "year's"
[[10]]
[10] "15.8" "about" "above" "accord" "agency" "ali" "among" "and"
[9] "arabia" "are" "dlrs" "for" "free" "its" "kuwait" "market"
[17] "market," "minister," "mln" "nazer" "oil" "opec" "prices" "producing"
[25] "quoted" "recent" "said" "said." "saudi" "sheikh" "spa" "stick"
[33] "that" "the" "they" "under" "was" "which" "with"
[[11]]
[11] "1.2" "and" "appeared" "arabia's" "average" "barrel." "because" "below"
[9] "bpd" "but" "corp" "crude" "december" "dlrs" "export" "exports"
[17] "february" "fell" "for" "four" "from" "gulf" "january" "january,"
[25] "last" "mln" "month" "month," "neutral" "official" "oil" "opec"
[33] "output" "prices" "production" "refinery" "said" "said." "saudi" "sell"
[41] "sources" "than" "the" "they" "throughput" "week" "yanbu" "zone"
[[12]]
[12] "and" "arab" "crude" "emirates" "gulf" "ministers" "official" "oil"
[9] "states" "the" "wam"
[[13]]
[13] "accord" "agency" "and" "arabia" "its" "nazer" "oil" "opec" "prices" "saudi" "the"
[12] "under"
[[14]]
[14] "crude" "daily" "for" "its" "oil" "opec" "pumping" "that" "the" "was"
[[15]]
[15] "after" "closed" "new" "nuclear" "oil" "plant" "port" "power" "said" "ship"
[11] "the" "was" "when"
[[16]]
[16] "about" "and" "development" "exploration" "for" "from" "help"
[8] "its" "mln" "oil" "one" "present" "prices" "research"
[15] "reserve" "said" "strategic" "the" "u.s." "with" "would"
[[17]]
[17] "about" "and" "benefits" "development" "exploration" "for" "from"
[8] "group" "help" "its" "mln" "oil" "one" "policy"
[15] "present" "prices" "protect" "research" "reserve" "said" "strategic"
[22] "study" "such" "the" "u.s." "with" "would"
[[18]]
[18] "1.50" "company" "crude" "dlrs" "for" "its" "lowered" "oil" "posted" "prices"
[11] "said" "said." "the" "union" "west"
[[19]]
[19] "according" "and" "april" "before" "can" "change" "efp"
[8] "energy" "entering" "exchange" "for" "futures" "has" "hold"
[15] "increase" "into" "mckiernan" "new" "not" "nymex" "oil"
[22] "one" "position" "prices" "rule" "said" "spokeswoman." "that"
[29] "the" "traders" "transaction" "when" "will"
[[20]]
[20] "1986," "1987" "billion" "cubic" "fiscales" "january" "mln"
[8] "pct" "petroliferos" "yacimientos"
# from here http://stackoverflow.com/a/7196565/1036500
L <- list_freqs
cfun <- function(L) {
  pad.na <- function(x, len) {
    c(x, rep(NA, len - length(x)))
  }
  maxlen <- max(sapply(L, length))
  do.call(data.frame, lapply(L, pad.na, len = maxlen))
}
# make dataframe of words (but probably you want words as rownames and cells with counts?)
tab_freqa <- cfun(L)
# convert dtm to matrix
mat <- as.matrix(dtm)
# make data frame similar to "3 columns 'Terms',
# 'Series x', 'Series Y'. With series x and y
# having the number of times that word occurs"
cb <- data.frame(doc1 = mat['127',], doc2 = mat['144',])
# keep only words that are in at least one doc
cb <- cb[rowSums(cb) > 0, ]
# plot
require(ggplot2)
ggplot(cb, aes(doc1, doc2)) +
  geom_text(label = rownames(cb),
            position = position_jitter())
# this is the typical method to turn a
# dtm into a df...
df <- as.data.frame(as.matrix(dtm))
# and transpose for plotting
df <- data.frame(t(df))
# plot
require(ggplot2)
ggplot(df, aes(X127, X144)) +
  geom_text(label = rownames(df),
            position = position_jitter())
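To tie this back to the "terms occurring at least x times" idea from the question, the transposed data frame can also be filtered before plotting (a sketch only, using x = 2 and the same two crude documents; the column names X127 and X144 come from the document names above):
# sketch: keep only terms occurring at least x times in either document before plotting
x <- 2
df2 <- df[df$X127 >= x | df$X144 >= x, ]
ggplot(df2, aes(X127, X144)) +
  geom_text(label = rownames(df2),
            position = position_jitter())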
On "r - Count the words in individual documents of a corpus in R and put them into a data frame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17294824/