
r - How to select only a subset of corpus terms when creating a TermDocumentMatrix in tm

Reposted. Author: 行者123. Updated: 2023-12-02 09:35:21

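I have a huge corpus, and I am only interested in the occurrences of a handful of terms that I know in advance. Is there a way to create a term-document matrix from the corpus with the tm package so that only the terms I specify up front are used and included?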

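I know I can subset the resulting TermDocumentMatrix of the corpus, but because of memory constraints I would like to avoid building the full term-document matrix in the first place.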

Best answer

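You can modify the corpus so that it keeps only the terms you want by building a custom transformation function. See the vignette for the tm package and the help for the content_transformer function for more information: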

library(tm)

# Create a corpus from the text listed below
corp = VCorpus(VectorSource(doc))

# Custom function to keep only the terms in "pattern" and remove everything else
(f <- content_transformer(function(x, pattern)
  regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))))

(FYI, the second line of the code above is adapted from this SO answer.)

# The pattern we'll search for
keep = "sleep|dream|die"

# Run the transformation function using the pattern above
tm_map(corp, f, keep)[[1]]

Here is the result of running the transformation function:

<<PlainTextDocument (metadata: 7)>>
c("die", "sleep", "sleep", "die", "sleep", "sleep", "Dream")
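The filtered corpus can then be passed to TermDocumentMatrix, which is what the question ultimately asks for. The following is a minimal sketch rather than part of the original answer: the shortened sample text and the names keep_terms and filtered are illustrative assumptions. It differs from the function above in two ways. It pastes the matches back into a single string so the tokenizer sees ordinary text, and it wraps the pattern in \\b word boundaries so that "die" does not also match inside words such as "died".

```r
library(tm)

# Shortened sample text (an assumption for this sketch)
doc <- "To die, to sleep, perchance to Dream"
corp <- VCorpus(VectorSource(doc))

# Keep only whole-word matches of `pattern`, joined back into one string
keep_terms <- content_transformer(function(x, pattern) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE, ignore.case = TRUE))
  paste(tolower(unlist(hits)), collapse = " ")
})

filtered <- tm_map(corp, keep_terms, "\\b(sleep|dream|die)\\b")

# The matrix is built from the filtered text only
tdm <- TermDocumentMatrix(filtered)
inspect(tdm)
```

Because every document is reduced to the wanted terms before tokenization, the vocabulary of the resulting matrix should be limited to those terms.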

Here is the original text I used to create the corpus:

doc = "To be, or not to be, that is the question—
Whether 'tis Nobler in the mind to suffer
The Slings and Arrows of outrageous Fortune,
Or to take Arms against a Sea of troubles,
And by opposing, end them? To die, to sleep—
No more; and by a sleep, to say we end
The Heart-ache, and the thousand Natural shocks
That Flesh is heir to? 'Tis a consummation
Devoutly to be wished. To die, to sleep,
To sleep, perchance to Dream; Aye, there's the rub"
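As an alternative to rewriting the corpus, the control list of TermDocumentMatrix accepts a dictionary entry that restricts tabulation to a given character vector of terms. The sketch below is not from the original answer; the two sample documents and the name wanted are assumptions made for illustration. removePunctuation = TRUE is set because the default tokenizer splits on whitespace, so a token such as "die," would otherwise not match the dictionary entry "die".

```r
library(tm)

# Two short sample documents (assumptions for this sketch)
docs <- c("To die, to sleep, perchance to Dream",
          "The Slings and Arrows of outrageous Fortune")
corp <- VCorpus(VectorSource(docs))

# Terms known in advance
wanted <- c("sleep", "dream", "die")

# Tabulate only the dictionary terms; lower-casing is on by default
tdm <- TermDocumentMatrix(
  corp,
  control = list(dictionary = wanted, removePunctuation = TRUE)
)
inspect(tdm)
```

With this approach the rows of the matrix are limited to the dictionary terms, which should keep memory use small on a large corpus, although each document is still tokenized in full.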

A similar question on "r - How to select only a subset of corpus terms when creating a TermDocumentMatrix in tm" can be found on Stack Overflow: https://stackoverflow.com/questions/27008306/
