
r - Character frequency of a vector in R

Reposted. Author: 行者123. Updated: 2023-12-03 20:30:28

I have a text file of an ebook named Frankenstein.txt, and I want to know how many times each letter is used in the novel.

My setup:

I imported the text file like this to get a character vector, character_array:

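# filesize was not defined in the original snippet; the file size in bytes is one common choice
filesize <- file.info("Frankenstein.txt")$size
string <- readChar("Frankenstein.txt", filesize)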
character_array <- unlist(strsplit(string, ""))
character_array gives me something like this:
 "F" "r" "a" "n" "k" "e" "n" "s" "t" "e" "i" "n" "\r", ...

My goal:

I want the number of occurrences of each character in the text file. In other words, I want a count for each element of unique(character_array):
 [1] "F"  "r"  "a"  "n"  "k"  "e"  "s"  "t"  "i"  "\r" "\n" "b"  "y"  "M" 
[15] " " "W" "o" "l" "c" "f" "(" "G" "d" "w" ")" "S" "h" "C"
[29] "O" "N" "T" "E" "L" "1" "2" "3" "4" "p" "5" "6" "7" "8"
[43] "9" "0" "_" "." "v" "," "g" "P" "u" "D" "—" "Y" "j" "m"
[57] "I" "z" "?" ";" "x" "q" "B" "U" "’" "H" "-" "A" "!" ":"
[71] "R" "J" "“" "”" "æ" "V" "K" "[" "]" "‘" "ê" "ô" "é" "è"

My attempt:
When I call plot(as.factor(character_array)) I get a nice chart that shows visually what I want.
However, I need the exact count for each character. I would like something like a two-dimensional array:
    [,1]   [,2] [,3] [,4] ... 
[1,] "a" "A" "b" "B" ...
[2,] "1202" "50" "12" "9" ...

Best answer

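A nice way to build this kind of text-processing pipeline is with the magrittr::%>% pipe. Here is one approach, assuming your text is in "frank.txt" (each step is explained at the bottom):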

library(magrittr)

# read the text in
frank_txt <- readLines("frank.txt")

# then send the text down this pipeline:
frank_txt %>%
  paste(collapse="") %>%
  strsplit(split="") %>% unlist %>%
  `[`(!. %in% c("", " ", ".", ",")) %>%
  table %>%
  barplot

Note that you can stop at table() and assign the result to a variable, which you can then manipulate however you like, for example by plotting it:
char_counts <- frank_txt %>% paste(collapse="") %>%
  strsplit(split="") %>% unlist %>% `[`(!. %in% c("", " ", ".", ",")) %>%
  table

barplot(char_counts)
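
As a rough sketch of what "manipulating" the stored table could look like, the example below sorts the counts, looks up a single character, and converts counts to proportions. It uses base R only, and the short string in sample_txt is a made-up stand-in for the contents of frank.txt, not text from the original post.

```r
# Hypothetical sample text standing in for the lines read from frank.txt
sample_txt <- c("banana", " bread")

# Same steps as the pipeline above, written with base R only
chars <- unlist(strsplit(paste(sample_txt, collapse = ""), split = ""))
chars <- chars[!chars %in% c("", " ", ".", ",")]
char_counts <- table(chars)

# Sort from most to least frequent character
sorted_counts <- sort(char_counts, decreasing = TRUE)
print(sorted_counts)

# Look up the exact count for one character by name
a_count <- char_counts[["a"]]
print(a_count)

# Relative frequencies instead of raw counts
char_props <- prop.table(char_counts)
print(round(char_props, 3))
```

Because a table keeps the characters as names, indexing by name with [[ ]] is usually enough to get the exact value for any one character.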

You can also convert the table to a data frame, which makes later manipulation and plotting easier:
counts_df <- data.frame(
  char = names(char_counts),
  count = as.numeric(char_counts),
  stringsAsFactors=FALSE)

head(counts_df)
## char count
## a 13
## b 2
## c 7
## d 5
## e 24
## f 6
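
The question asked for something like a two-row array with characters in the first row and counts in the second. One possible way to get that shape from the same table is sketched below, alongside ordering the data frame by count. The sample string and the name counts_matrix are illustrative assumptions, not part of the original answer.

```r
# Hypothetical sample text; in practice char_counts would come from frank.txt
chars <- unlist(strsplit("banana bread", split = ""))
chars <- chars[chars != " "]
char_counts <- table(chars)

# Data frame as in the answer above
counts_df <- data.frame(
  char = names(char_counts),
  count = as.numeric(char_counts),
  stringsAsFactors = FALSE)

# Order rows from most to least frequent
counts_df <- counts_df[order(-counts_df$count), ]
print(counts_df)

# Two-row character matrix: row 1 holds the characters, row 2 their counts
counts_matrix <- rbind(names(char_counts),
                       as.character(as.vector(char_counts)))
print(counts_matrix)
```

Note that a matrix can hold only one type, so the counts in the second row become strings; the data frame is usually the more convenient form when the numbers are needed for further calculation.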

Step-by-step explanation: here is the full pipeline again, with a comment explaining each step:
# going to send this text down a pipeline:
frank_txt %>%
  # combine lines into a single string (makes things easier downstream)
  paste(collapse="") %>%
  # tokenize by character (strsplit returns a list, so unlist it)
  strsplit(split="") %>% unlist %>%
  # remove instances of characters you don't care about
  `[`(!. %in% c("", " ", ".", ",")) %>%
  # make a frequency table of the characters
  table %>%
  # then plot them
  barplot

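Note that this is exactly equivalent to the following horrible ("monstrous"?!?!) code. The forward pipe %>% simply applies the function on its right to the value on its left (and . is a pronoun referring to the left-hand value; see the intro vignette):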
barplot(table(
  unlist(strsplit(paste(frank_txt, collapse=""), split=""))[
    !unlist(strsplit(paste(frank_txt, collapse=""), split="")) %in%
      c(""," ",".",",")]))
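
The nested form has to build the character vector twice, once to index and once to filter. A middle ground that needs no pipe is to store each intermediate result in a variable. The sketch below does this in base R and checks that both forms give the same counts. The sample text and the variable names nested, stepwise, and same_counts are assumptions made for illustration, and barplot() is left out so only the counts are compared.

```r
# Hypothetical sample text standing in for the lines read from frank.txt
sample_txt <- c("It was on a dreary night", "of November.")
drop_chars <- c("", " ", ".", ",")

# Nested form, as in the code above but without the final barplot()
nested <- table(
  unlist(strsplit(paste(sample_txt, collapse = ""), split = ""))[
    !unlist(strsplit(paste(sample_txt, collapse = ""), split = "")) %in%
      drop_chars])

# Stepwise form: each stage is computed once and named
one_string <- paste(sample_txt, collapse = "")
chars <- unlist(strsplit(one_string, split = ""))
kept <- chars[!chars %in% drop_chars]
stepwise <- table(kept)

# Both forms should yield the same characters with the same counts
same_counts <- identical(names(nested), names(stepwise)) &&
  identical(as.vector(nested), as.vector(stepwise))
print(same_counts)
```

The stepwise version trades the pipeline's compactness for named intermediate values, which can be handy when a step needs to be inspected on its own.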

Regarding "r - Character frequency of a vector in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49350040/
