
r - 2-word phrase collocations using quanteda in R

Reposted. Author: 行者123. Updated: 2023-12-05 07:37:33

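This is about the textstat_collocations function in the quanteda package in R. I am getting phrases of more than 2 words in my output, even though I only asked for 2-word phrases.

The processing steps are as follows (corpus1 has already been created with the corpus function):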

collocations_two_words <- textstat_collocations(corpus1, method = "lambda", size = 2, min_count = 5, smoothing = 0.5, tolower = TRUE)

collocations_two_words <- collocations_two_words[collocations_two_words$count >= 10,]

tokens1 <- tokens(tolower(corpus1), what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_symbols = TRUE, remove_separators = TRUE, remove_url = TRUE, remove_hyphens = TRUE)

tokens1 <- tokens_remove(tokens1, stopwords("english"), padding = TRUE)

tokens2 <- tokens_compound(tokens1, pattern = collocations_two_words)

quantdfm <- dfm(tokens2, remove_punct = TRUE, remove_numbers = TRUE)

quantdfm <- dfm_trim(quantdfm, min_count = 5, min_docfreq = 5, verbose = TRUE)

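When I inspect the quantdfm object (using tail(quantdfm)), I get phrases longer than 2 words. Can someone point out where I might be going wrong?

Sample output looks like this:

Document   choosing_dark_chocolate_can  eat_dark_chocolate
text43979                            0                   0
text43980                            0                   0
text43981                            0                   0
text43982                            0                   0
text43983                            0                   0
text43984                            0                   0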
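To see how many features are affected, one option is to count the underscore-separated words in each feature name. The sketch below is base R only and uses a hypothetical vector `feats`: the first two names are copied from the sample output above, and the last two are made up for illustration. With a real dfm, the names would presumably come from `featnames(quantdfm)`.

```r
# Hypothetical feature names; the first two come from the sample output,
# the last two are illustrative assumptions.
feats <- c("choosing_dark_chocolate_can", "eat_dark_chocolate",
           "human_rights", "toothbrush")

# Count the words in each feature by splitting on the "_" concatenator.
words_per_feature <- lengths(strsplit(feats, "_", fixed = TRUE))

# Features made up of more than two words.
long_feats <- feats[words_per_feature > 2]

print(table(words_per_feature))
print(long_feats)
```

This only diagnoses the symptom. It assumes "_" is the concatenator and that no original token already contains an underscore.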

Output of dput(head(corpus1,5)):
structure(list(documents = structure(list(texts = c("..., video game consoles, stereos, smartphone chargers, and other similar devices constantly draw power into their power supplies. Unplug all of your chargers, whether it's for a tablet or a toothbrush. Electronics with standby or \"\"sleep\"\" modes: Desktop PCs, televisions, cable boxes, DVD-ray players, alarm clocks, radios, and anything with a remote",
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions, the impugned order is in the teeth of the recommendations of the said Committee, as communicated in its letter dated 14.05.2017",
"...' focus to the ayurveda sector, especially in oral care. A year ago, Colgate launched its first India-focused ayurvedic brand, Cibaca Vedshakti, aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products, including toothpaste, under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian",
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising, products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali even though both of these have enough local and multinational competitors in the organised",
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees, it has not been able to hear cases of human rights violations in Maharashtra. A division"
)), .Names = "texts", row.names = c("text1", "text2", "text3",
"text4", "text5"), class = "data.frame"), metadata = structure(list(
source = "D:/Users/ajoshi/Documents/* on x86-64 by ajoshi",
created = "Fri Jan 26 19:42:21 2018"), .Names = c("source",
"created")), settings = structure(list(stopwords = NULL, collocations = NULL,
dictionary = NULL, valuetype = "glob", stem = FALSE, delimiter_word = " ",
delimiter_sentence = ".!?", delimiter_paragraph = "\n\n",
clean_tolower = TRUE, clean_remove_digits = TRUE, clean_remove_punct = TRUE,
units = "documents"), .Names = c("stopwords", "collocations",
"dictionary", "valuetype", "stem", "delimiter_word", "delimiter_sentence",
"delimiter_paragraph", "clean_tolower", "clean_remove_digits",
"clean_remove_punct", "units"), class = c("settings", "list")),
tokens = NULL), .Names = c("documents", "metadata", "settings",
"tokens"), class = c("corpus", "list"))

Output of R sessionInfo(): R version 3.4.3
other attached packages:
[1] servr_0.8 LDAvis_0.3.2 text2vec_0.5.1 stringr_1.2.0 data.table_1.10.4-3
[6] quanteda_0.99.22

loaded via a namespace (and not attached):
[1] Rcpp_0.12.15 compiler_3.4.3 pillar_1.1.0 futile.logger_1.4.3 plyr_1.8.4
[6] futile.options_1.0.0 iterators_1.0.9 tools_3.4.3 digest_0.6.14 lubridate_1.7.1
[11] tibble_1.4.1 gtable_0.2.0 lattice_0.20-35 rlang_0.1.6 Matrix_1.2-12
[16] foreach_1.4.4 fastmatch_1.1-0 mlapi_0.1.0 grid_3.4.3 R6_2.2.2
[21] RJSONIO_1.3-0 ggplot2_2.2.1 lambda.r_1.2 spacyr_0.9.3 magrittr_1.5
[26] scales_0.5.0 codetools_0.2-15 mime_0.5 colorspace_1.3-2 httpuv_1.3.5
[31] stringi_1.1.6 proxy_0.4-21 RcppParallel_4.3.20 lazyeval_0.2.1 munsell_0.4.3

Best answer

Here is the result on my system, using quanteda v1.0.0:

require(quanteda)
txt <- c("..., video game consoles, stereos, smartphone chargers, and other similar devices constantly draw power into their power supplies. Unplug all of your chargers, whether it's for a tablet or a toothbrush. Electronics with standby or \"\"sleep\"\" modes: Desktop PCs, televisions, cable boxes, DVD-ray players, alarm clocks, radios, and anything with a remote",
"...its judgment and order dated 02.05.2016 in Modern Dental College Research Centre (supra) authorizing it to oversee all statutory functions under the Act and leaving it at liberty to issue appropriate remedial directions, the impugned order is in the teeth of the recommendations of the said Committee, as communicated in its letter dated 14.05.2017",
"...' focus to the ayurveda sector, especially in oral care. A year ago, Colgate launched its first India-focused ayurvedic brand, Cibaca Vedshakti, aimed squarely at countering Dant Kanti. HUL too launched araft of ayurvedic personal care products, including toothpaste, under the Ayush brand. RIVAL TO WATCH OUT FOR Colgate Palmolive global CEO Ian",
"...founder of Increate Value Advisors. Patanjali has brought the focus back on product efficacy. Rising above the noise of advertising, products have to first deliver value to the consumers. Ghee and tooth paste are the two most popular products of Patanjali even though both of these have enough local and multinational competitors in the organised",
"The Bombay High Court today came down heavily on the Maharashtra government for not providing space and or hiring enough employees for the State Human Rights Commission. The commission has been left a toothless tiger as due to a lack of space and employees, it has not been able to hear cases of human rights violations in Maharashtra. A division")
corp <- corpus(txt)
col <- textstat_collocations(corp, method = "lambda", size = 2, min_count = 1, smoothing = 0.5, tolower = TRUE)

head(col)

        collocation count count_nested length   lambda        z
1      human rights     2            0      2 7.742836 3.689434
2  colgate launched     1            0      2 5.030438 3.553188
3 rights commission     1            0      2 5.030438 3.553188
4   ayurvedic brand     1            0      2 5.030438 3.553188
5  enough employees     1            0      2 5.030438 3.553188
6      launched its     1            0      2 5.030438 3.553188

table(col$length)

2
226

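All of the collocations have two elements. My guess is that you are seeing longer collocations because your text was not tokenized correctly.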
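One other possibility, offered only as an assumption and not confirmed by anything in the question, is that longer tokens appear at the compounding step and not in the collocations themselves: if two 2-word patterns overlap in the text and overlapping matches are joined, the result is a 3-word compound. The base R sketch below illustrates that idea with a hypothetical helper `compound_joined`; it is not quanteda's implementation. tokens_compound has a `join` argument that controls this behavior, so checking ?tokens_compound for the installed version may be worthwhile.

```r
# Illustration only: join adjacent tokens whenever the pair is a known bigram.
# `compound_joined` is a hypothetical helper, not part of quanteda.
compound_joined <- function(toks, bigrams, concatenator = "_") {
  n <- length(toks)
  if (n < 2) return(toks)
  # linked[i] is TRUE when toks[i] and toks[i + 1] form a listed bigram.
  linked <- paste(toks[-n], toks[-1]) %in% bigrams
  # Start a new group wherever a token is not linked to the one before it.
  group <- cumsum(c(TRUE, !linked))
  unname(vapply(split(toks, group), paste, character(1),
                collapse = concatenator))
}

toks <- c("eat", "dark", "chocolate", "daily")
bigrams <- c("eat dark", "dark chocolate")  # both have exactly two words
result <- compound_joined(toks, bigrams)
print(result)
```

Here both patterns have two words, yet the overlapping matches chain into a single three-word token.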

Regarding "r - 2-word phrase collocations using quanteda in R", the original question is on Stack Overflow: https://stackoverflow.com/questions/48495714/
