
r - Converting a list of strings into a data.frame in a tidy way

Reposted. Author: 行者123. Updated: 2023-12-04 11:50:59

I was wondering if anyone has tips or tricks for converting data that looks like this:

library(tidyverse)
example.list = list(" 1 North Carolina State University at Raleigh 15 9 12 13 22 15 32 19 14 20 12 17 19 20 19 25 283",
" 2 Iowa State University 9 8 5 11 14 4 11 13 14 9 15 28 14 9 18 27 209",
" 3 University of Wisconsin-Madison 5 6 14 9 20 13 15 12 13 9 13 10 13 24 15 17 208",
" 4 Stanford University* 10 12 14 6 9 10 5 9 13 7 13 10 4 9 23 6 160",
" 5 Texas A & M University-College Station 6 12 18 10 7 4 5 11 16 18 10 7 15 4 8 8 159",
" 9 University of Michigan-Ann Arbor 8 5 3 3 8 9 12 11 7 11 13 9 8 11 13 9 140",
"10 University of California-Los Angeles 2 2 2 6 9 7 9 8 7 11 11 8 6 12 13 10 123",
"19 Rice University 3 3 5 11 4 7 7 11 2 6 4 6 3 8 7 7 94")

into something like the output of this pipeline:
example.list %>%
substring(3) %>%
str_replace_all("[^[:alnum:]]", " ") %>%
str_squish() %>%
strsplit(split = "(?<=[a-zA-Z])\\s*(?=[0-9])", perl = TRUE) %>%
unlist() %>%
matrix(ncol = 2, byrow = TRUE) %>%
data.frame() %>%
separate("X2",into = paste0("X",2:18),sep = " ")

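The general pattern to extract is: all characters up to the first digit (after the leading index) go into their own column, and everything after that is split on whitespace into the remaining columns.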
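As a rough illustration of that pattern, here is a minimal base R sketch that captures the name and the numeric block with one regular expression and then splits the numeric block on whitespace. The variable names (`rows`, `pattern`, `result`) are made up for this example, and it assumes that no institution name contains a digit.

```r
# Two rows copied from the question's sample data
rows <- c(
  " 2 Iowa State University 9 8 5 11 14 4 11 13 14 9 15 28 14 9 18 27 209",
  " 5 Texas A & M University-College Station 6 12 18 10 7 4 5 11 16 18 10 7 15 4 8 8 159"
)

# Leading index, then a name without digits (group 1),
# then the block of numbers up to the end of the line (group 2)
pattern <- "^\\s*\\d+\\s+(\\D+?)\\s+(\\d[\\d\\s]*)$"
parts <- regmatches(rows, regexec(pattern, rows, perl = TRUE))

# Element 1 of each match is the full match; 2 and 3 are the capture groups
name <- vapply(parts, function(p) p[2], character(1))
nums <- lapply(parts, function(p) as.integer(strsplit(trimws(p[3]), "\\s+")[[1]]))

# One row per input line: the name plus 17 integer columns (X1..X17)
result <- data.frame(name = name, do.call(rbind, nums))
```

The regex only separates the name from the numbers; splitting the numbers themselves is left to `strsplit()`, which keeps the pattern short.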

I would be interested to know whether most of this can be done with a single regex pattern, or without regex at all.
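For the regex-free route, one possibility is to split on literal spaces and count fields from the right. This is only a sketch: `parse_row` and `n_num` are hypothetical names, and it assumes every row ends in exactly 17 numeric fields, as in the sample data.

```r
# Two rows copied from the question's sample data
rows <- c(
  " 2 Iowa State University 9 8 5 11 14 4 11 13 14 9 15 28 14 9 18 27 209",
  "19 Rice University 3 3 5 11 4 7 7 11 2 6 4 6 3 8 7 7 94"
)

parse_row <- function(row, n_num = 17L) {
  # fixed = TRUE treats the separator as a literal space, not a regex
  tokens <- strsplit(row, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]  # drop empty tokens left by repeated spaces
  n <- length(tokens)
  values <- as.integer(tokens[(n - n_num + 1):n])
  data.frame(
    # tokens[1] is the leading index; everything up to the numbers is the name
    name = paste(tokens[2:(n - n_num)], collapse = " "),
    t(setNames(values, paste0("V", seq_len(n_num)))),
    stringsAsFactors = FALSE
  )
}

result <- do.call(rbind, lapply(rows, parse_row))
```

Because empty tokens are dropped, the same function should also cope with rows where the columns are padded with a varying number of spaces.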

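I'm mainly trying to get better at string processing, since I haven't used it much! The use case here is something like extracting tabular data from a pdf/html into a data.frame.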

Edit:

I appreciate all the suggestions and different perspectives!

I realized I had left out some test cases worth mentioning:
example2.list = list(" 2 Iowa State University                                9      8     5    11     14     4    11    13   14      9   15     28    14      9    18     27     209", 
" 3 University of Wisconsin-Madison 5 6 14 9 20 13 15 12 13 9 13 10 13 24 15 17 208",
" 4 Stanford University* 10 12 14 6 9 10 5 9 13 7 13 10 4 9 23 6 160",
" 5 Texas A & M University-College Station 6 12 18 10 7 4 5 11 16 18 10 7 15 4 8 8 159",
" 9 University of Michigan-Ann Arbor 8 5 3 3 8 9 12 11 7 11 13 9 8 11 13 9 140",
"10 University of California-Los Angeles 2 2 2 6 9 7 9 8 7 11 11 8 6 12 13 10 123",
"19 Rice University 3 3 5 11 4 7 7 11 2 6 4 6 3 8 7 7 94",
"52 Bowling Green State University 0 0 0 0 0 1 5 2 2 2 4 7 3 4 4 3 37",
"55 University of New Mexico 4 2 3 1 3 0 5 3 2 1 1 2 3 2 3 0 35")

The real data is not as neatly aligned as the first example suggests.

The full dataset, slightly cleaned up:
library(pdftools)
library(tidyverse)
data.loc = "https://ww2.amstat.org/misc/StatsPhD2003-MostRecent.pdf"
data.full =
pdf_text(data.loc) %>%
read_lines() %>%
head(-2) %>%
tail(-3) %>%
lapply(function(ele) if(ele == "") NULL else ele) %>%
compact()

Here is my second attempt:
library(tidyverse)
library(magrittr)
# Ignores column names
data.full[-1] %>%
# Removing excess whitespace
str_squish() %>%
# Removes index
str_remove("^\\s*\\d*\\s*") %>%
# Split on all whitespace occurring before digits
str_split("\\s+(?=\\d)") %>%
# Turn list into a matrix
unlist() %>%
matrix(ncol = 18, byrow = TRUE) %>%
# Handling variables names
set_colnames(value =
data.full[1] %>%
str_squish() %>%
str_split("\\s+(?=\\d)") %>%
unlist) %>%
as_tibble() %>%
# Converting variables to numeric
type_convert()
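One fragile step in an approach like this is `matrix(ncol = 18, byrow = TRUE)`: if any line splits into a different number of fields, the values shift into the wrong columns. A possible safeguard, sketched here in base R, is to count the fields per row first. The third row below is deliberately truncated (it is not real data) to show what a malformed line looks like, and the names `rows`, `fields`, `bad` and `m` are made up for the example.

```r
rows <- c(
  "19 Rice University 3 3 5 11 4 7 7 11 2 6 4 6 3 8 7 7 94",
  "52 Bowling Green State University 0 0 0 0 0 1 5 2 2 2 4 7 3 4 4 3 37",
  "55 University of New Mexico 4 2 3"  # hypothetical truncated line
)

# Remove the leading index, then split on whitespace that precedes a digit
body <- sub("^\\s*\\d+\\s+", "", rows)
fields <- strsplit(body, "\\s+(?=\\d)", perl = TRUE)

# Each well-formed row should give 1 name + 17 numbers = 18 fields
n_fields <- lengths(fields)
bad <- which(n_fields != 18L)

# Build the matrix only from rows with the expected shape
m <- do.call(rbind, fields[n_fields == 18L])
```

Inspecting `rows[bad]` before going further makes it easier to see whether the split pattern or the input itself needs fixing.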

Best answer

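Here is one approach you could take. It drops the first three characters of each string (the index) and then splits on runs of two or more spaces, so it relies on the columns being separated by at least two spaces: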

library(magrittr)
library(data.table)


gsub("^...", "", example.list) %>%
tstrsplit(" {2,}", type.convert = TRUE, names = TRUE) %>%
as.data.frame()

#                                           V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
# 1 North Carolina State University at Raleigh 15  9 12 13 22 15 32 19  14  20  12  17  19  20  19  25 283
# 2                      Iowa State University  9  8  5 11 14  4 11 13  14   9  15  28  14   9  18  27 209
# 3            University of Wisconsin-Madison  5  6 14  9 20 13 15 12  13   9  13  10  13  24  15  17 208
# 4                       Stanford University* 10 12 14  6  9 10  5  9  13   7  13  10   4   9  23   6 160
# 5     Texas A & M University-College Station  6 12 18 10  7  4  5 11  16  18  10   7  15   4   8   8 159
# 6          University of Michigan-Ann Arbor  8  5  3  3  8  9 12 11   7  11  13   9   8  11  13   9 140
# 7       University of California-Los Angeles  2  2  2  6  9  7  9  8   7  11  11   8   6  12  13  10 123
# 8                            Rice University  3  3  5 11  4  7  7 11   2   6   4   6   3   8   7   7  94
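If the columns are separated by single spaces in some rows and by wide padding in others, a split on two or more spaces will not separate them consistently. The following base R variant is a sketch of one way around that, not part of the answer above: it replaces the whitespace before each number with a tab and lets `read.table()` do the parsing and type conversion. It assumes that no institution name contains a digit; `rows`, `tabbed` and `df` are made-up names.

```r
# Two rows copied from the question's sample data
rows <- c(
  " 4 Stanford University* 10 12 14 6 9 10 5 9 13 7 13 10 4 9 23 6 160",
  "10 University of California-Los Angeles 2 2 2 6 9 7 9 8 7 11 11 8 6 12 13 10 123"
)

# Remove the leading index
body <- sub("^\\s*\\d+\\s+", "", rows)

# Turn any run of whitespace that precedes a digit into a single tab
tabbed <- gsub("\\s+(?=\\d)", "\t", body, perl = TRUE)

# quote = "" and comment.char = "" keep characters such as ' or # in names literal
df <- read.table(text = tabbed, sep = "\t", quote = "", comment.char = "",
                 stringsAsFactors = FALSE)
```

The result has the name in `V1` as character and the 17 counts in `V2` to `V18` as integers.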

Regarding r - Converting a list of strings into a data.frame in a tidy way, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/61605707/
