
r - Pivot data of varying widths from wide to long using a flexible call (for use in a loop)

Reposted. Author: 行者123. Updated: 2023-12-03 14:53:29

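I need to pivot some wide time-series data to long format, i.e. use pivot_longer() on data of varying width.

The data is quarterly, but I receive it in yearly blocks (four quarters) and in six-month blocks (only two quarters), i.e. the data varies in width.

I would like to find a simple, flexible solution that can be used in a loop, as I need to import several yearly and six-month blocks (and, since I need to convince my research group to use R, that is what I am asking for here, preferably).

The data in a yearly block looks something like this,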

dta_wide1 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2019", "label1", "", "7", "5", "6", "1", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.5", "0.4", "0.8", "0.3", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "21", "21", "87", "21", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2019", "label1", "", "5", "2", "3", "7", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.46", "0.72", "0.7", "0.8", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "19", "22", "85", "25", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", ""), V10 = c("", "", "", "", "As of 30/09/2019", "label1", "", "4", "1", "4", "8", "", ""), V11 = c("", "", "", "", "", "", "label2", "0.1", "0.3", "0.6", "0.22", "", ""), V12 = c("", "", "", "", "", "label4", "label4a", "21", "23", "71", "27", "", ""), V13 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V14 = c("", "", "", "", "As of 31/12/2019", "label1", "", "8", "6", "9", "9", "", ""), V15 = c("", "", "", "", "", "", "label2", "0.7", "0.87", "0.55", "0.33", "", ""), V16 = c("", "", "", "", "", "label4", "label4a", "24", "25", "99", "35", "", ""), V17 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

and like this in a six-month block,
dta_wide2 <- structure(list(V1 = c("", "", "", "", "", "", "", "peach", "dragonfruit", "honeydew", "huckleberry", "", ""), V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "", "2", "3", "4", "8", "(a) some useless clutter", "(b) more not relevent information"), V3 = c("", "", "", "", "", "", "label2", "0.1", "0.2", "0.3", "0.8", "", ""), V4 = c("", "", "", "", "", "label4", "label4a", "10", "11", "12", "9", "", ""), V5 = c("", "", "", "", "", "", "label4b", "0.3", "0.1", "0.4", "0.2", "", ""), V6 = c("", "", "", "", "As of 30/06/2020", "label1", "", "4", "6", "8", "16", "", ""), V7 = c("", "", "", "", "", "", "label2", "0.22", "0.33", "0.44", "0.55", "", ""), V8 = c("", "", "", "", "", "label4", "label4a", "11", "12", "13", "10", "", ""), V9 = c("", "", "", "", "", "", "label4b", "0.4", "0.1", "0.3", "0.2", "", "")), class = "data.frame", row.names = c(NA, -13L))

i.e. (for the six-month block)
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tidyverse)
dta_wide2 %>% as_tibble
# A tibble: 13 x 9
   V1        V2              V3      V4       V5      V6        V7     V8      V9
   <chr>     <chr>           <chr>   <chr>    <chr>   <chr>     <chr>  <chr>   <chr>
 1 ""        "ABC"           ""      ""       ""      ""        ""     ""      ""
 2 ""        "some info"     ""      ""       ""      ""        ""     ""      ""
 3 ""        "Store A"       ""      ""       ""      ""        ""     ""      ""
 4 ""        ""              ""      ""       ""      ""        ""     ""      ""
 5 ""        "As of 31/03/~  ""      ""       ""      "As of ~  ""     ""      ""
 6 ""        "label1"        ""      "label~  ""      "label1"  ""     "labe~  ""
 7 ""        ""              "labe~  "label~  "labe~  ""        "lab~  "labe~  "labe~
 8 "peach"   "2"             "0.1"   "10"     "0.3"   "4"       "0.2~  "11"    "0.4"
 9 "dragon~  "3"             "0.2"   "11"     "0.1"   "6"       "0.3~  "12"    "0.1"
10 "honeyd~  "4"             "0.3"   "12"     "0.4"   "8"       "0.4~  "13"    "0.3"
11 "huckle~  "8"             "0.8"   "9"      "0.2"   "16"      "0.5~  "10"    "0.2"
12 ""        "(a) some use~  ""      ""       ""      ""        ""     ""      ""
13 ""        "(b) more not~  ""      ""       ""      ""        ""     ""      ""

The date keys in dta_wide2 are scattered across the row like this,
> dta_wide2[5,] %>% str_sub(start= -10) %>% lubridate::dmy()
[1] NA "2020-03-31" NA NA NA
[6] "2020-06-30" NA NA NA

So I tried to tidy it up like this,
dta_wide2 %>% 
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble

# A tibble: 13 x 12
   V1    store date1      V2    V3    V4    date2      V5    V6    V7
   <chr> <chr> <date>     <chr> <chr> <chr> <date>     <chr> <chr> <chr>
 1 ""    Stor~ 2020-03-31 "ABC" ""    ""    2020-06-30 ""    ""    ""
 2 ""    Stor~ 2020-03-31 "som~ ""    ""    2020-06-30 ""    ""    ""
 3 ""    Stor~ 2020-03-31 "Sto~ ""    ""    2020-06-30 ""    ""    ""
 4 ""    Stor~ 2020-03-31 ""    ""    ""    2020-06-30 ""    ""    ""
 5 ""    Stor~ 2020-03-31 "As ~ ""    ""    2020-06-30 ""    "As ~ ""
 6 ""    Stor~ 2020-03-31 "lab~ ""    "lab~ 2020-06-30 ""    "lab~ ""
 7 ""    Stor~ 2020-03-31 ""    "lab~ "lab~ 2020-06-30 "lab~ ""    "lab~
 8 "pea~ Stor~ 2020-03-31 "2"   "0.1" "10"  2020-06-30 "0.3" "4"   "0.2~
 9 "dra~ Stor~ 2020-03-31 "3"   "0.2" "11"  2020-06-30 "0.1" "6"   "0.3~
10 "hon~ Stor~ 2020-03-31 "4"   "0.3" "12"  2020-06-30 "0.4" "8"   "0.4~
11 "huc~ Stor~ 2020-03-31 "8"   "0.8" "9"   2020-06-30 "0.2" "16"  "0.5~
12 ""    Stor~ 2020-03-31 "(a)~ ""    ""    2020-06-30 ""    ""    ""
13 ""    Stor~ 2020-03-31 "(b)~ ""    ""    2020-06-30 ""    ""    ""
# ... with 2 more variables: V8 <chr>, V9 <chr>

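Now I need to pivot this longer using, if I understand correctly, pivot_longer(). My challenge is how to do it in a flexible way when I also get data that looks like dta_wide1, i.e. with four quarters, so that the same call works for both dta_wide1 and dta_wide2.

I have been working on this for a while, and any help making it work, simplifying it, or cleaning it up would be much appreciated.

This is where I am at the moment, but it is not correct, not flexible, and not simple,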
dta_wide2_foo <- dta_wide2
names(dta_wide2_foo) <- c('goods', paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_1'), paste0(dta_wide2[6,2:5], dta_wide2[7,2:5], sep = '_2'))
dta_wide2_foo %>%
add_column(date1 = dta_wide2[5,2] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 2) %>%
add_column(date2 = dta_wide2[5,6] %>% str_sub(start= -10) %>% lubridate::dmy(), .before = 6) %>%
add_column(store = dta_wide2[3,2], .before = 2) %>% as_tibble %>% .[8:11,] %>%
pivot_longer(-c(goods, store, date1, date2), values_to = "Value", names_to = "variable") %>% print(n = 100)
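
One detail in the names() call above that may be worth checking: as far as I know, paste0() has no sep argument, so sep = '_1' is simply treated as one more string to append. A minimal illustration with two label strings from the sample (a and b are made-up names):

```r
# paste0() accepts only ... and collapse, so a named `sep` is pasted on
# as an extra piece instead of being used as a separator
a <- paste0("label4", "label4a", sep = "_1")

# paste() does honour sep and places it between the pieces
b <- paste("label4", "label4a", sep = "_")

print(a)
print(b)
```

So the suffix still ends up at the end of each name, but the two label rows are glued together without any separator between them.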

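Or a generic snippet, which is neither simple, smart, nor clean, but which can be used in a loop to get the positions of the dates in both sample data sets,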
dta <- dta_wide2
dta[5,] %>% str_sub(start= -10) %>% lubridate::dmy() %>% { which(!is.na(.)) }
[1] 2 6

Or, cleaner,
dta <- dta_wide1
dta[5,] %>% grep("As ",.)
[1] 2 6 10 14
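
The positions found this way can also give the width of each block, which is what differs between the yearly and the six-month files. A rough sketch in base R, assuming every block starts with an "As of ..." cell and runs until the next one; header is a toy stand-in for row 5 of the sample data, and blocks is a name made up here:

```r
# Toy header row standing in for dta_wide2[5, ] (assumption: each block
# begins with a cell that starts with "As of ")
header <- c("", "As of 31/03/2020", "", "", "",
            "As of 30/06/2020", "", "", "")

starts <- grep("^As of ", header)             # first column of each block
ends   <- c(starts[-1] - 1L, length(header))  # last column of each block
dates  <- as.Date(sub("^As of ", "", header[starts]), format = "%d/%m/%Y")

# one row per block, however many blocks the file contains
blocks <- data.frame(date = dates, first_col = starts, last_col = ends)
print(blocks)
```

The same lines should work unchanged on a four-quarter header, since nothing in them depends on the number of blocks.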

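Update 2020-06-08 07:45:18Z

My goal is to combine the long data sets in order to plot the data (Wimpel suggests below that I combine the different wide data sets, i.e. dta_wide1, dta_wide2, ..., dta_widen, using an lapply() call). I imagine the data would look something like this,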
> dta_long
# A tibble: 96 x 5
   product label   value date       store
   <chr>   <chr>   <dbl> <date>     <chr>
 1 peach   label1   7    2019-03-31 Store A
 2 peach   label2   0.5  2019-03-31 Store A
 3 peach   label4a 21    2019-03-31 Store A
 4 peach   label4b  0.3  2019-03-31 Store A
 5 peach   label1   5    2019-06-30 Store A
 6 peach   label2   0.46 2019-06-30 Store A
 7 peach   label4a 19    2019-06-30 Store A
 8 peach   label4b  0.4  2019-06-30 Store A
 9 peach   label1   4    2019-09-30 Store A
10 peach   label2   0.1  2019-09-30 Store A
# ... with 86 more rows

and then plot the data with something like this,
dta_long %>% filter(label == 'label1') %>% ggplot(aes(date, value, colour = product)) + 
geom_line() + scale_x_date(date_breaks = "3 months",
date_labels = "%b-%y", limits = c((min(dta_long$date)-34), max = max(dta_long$date)))


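Best answer

I saved your two sample data sets and stored them in separate .xlsb files.
The data looks like this:

[two images of the spreadsheets]

Maybe this will help... The solution works for both sample sets provided, so give it a try.
The code assumes that all data has the same format, so all information is always in the same rows, and the store name is always in the same column.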

library( readxlsb )
library( cellranger )
library( tidyverse )
library( data.table )

# get the file names to read
read.these.files <- list.files( path = "./temp/",
                                pattern = ".*\\.xlsb",
                                full.names = TRUE,
                                recursive = FALSE )
# now read the data into a list, using lapply()
# assuming the data needed is in the first sheet of the .xlsb file
L <- lapply( read.these.files, readxlsb::read_xlsb, sheet = 1, range = cellranger::cell_limits() )
# now we can loop over the data in list 'L' and perform operations
L.dt <- lapply( L, function(x) {
  # get store_name
  store_name = x[2,2]
  # get the data
  df1 <- x[7:10,]
  # set the column names (= labels) right
  colnames <- x[5:6,]
  colnames[ colnames == "" ] <- NA
  names(df1) <- colnames %>% tidyr::fill( names(colnames) ) %>% slice(2)
  names(df1)[1] <- "product"
  # melt df1 to long format
  df1 <- df1 %>% tidyr::pivot_longer( cols = tidyselect::starts_with("label"), names_to = "label" )
  # set the dates right
  dates <- x[4, ]
  dates <- dates %>% tidyr::pivot_longer( cols = tidyselect::everything())
  dates[ dates == "" ] <- NA
  dates <- tidyr::fill( dates, value ) %>% dplyr::slice(2:n() )
  # add the dates and store name, and drop the .copy column
  df1 <- df1 %>%
    dplyr::mutate( date = rep( dates$value, nrow(df1) / length( dates$value) ),
                   store = store_name ) %>%
    dplyr::select( -.copy )
})
# name the list elements, based on the source file names
names(L.dt) <- basename( read.these.files )
# now, bind the list of altered data together into one _long_ data set
L.dt_tbl <- bind_rows(L.dt, .id = 'id')
# keep the last 10 characters of each date string and parse them as day/month/year
L.dt_tbl %>% dplyr::mutate(date = str_sub(date, start= -10) %>%
                             lubridate::dmy() ) -> L.dt_tbl
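
For data that is already in memory as data frames like dta_wide1 and dta_wide2 (rather than read from .xlsb files), the same idea might be sketched as below. This is not the code from the answer above: tidy_block() and its arguments are hypothetical names, the default row positions are those of the pasted sample data (store in row 3, dates in row 5, labels in rows 6 and 7, values in rows 8 to 11), and toy is a cut-down stand-in with the same layout.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: turn one wide block into long format.
tidy_block <- function(x, store_cell = c(3, 2), date_row = 5,
                       label_rows = 6:7, data_rows = 8:11) {
  store_name <- x[store_cell[1], store_cell[2]]
  hdr   <- unlist(x[date_row, -1], use.names = FALSE)
  upper <- unlist(x[label_rows[1], -1], use.names = FALSE)
  lower <- unlist(x[label_rows[2], -1], use.names = FALSE)

  # carry each "As of ..." cell forward across its block
  # (assumes the first value column holds a date cell)
  date_chr <- hdr[hdr != ""][cumsum(hdr != "")]
  # use the lower label row where it is filled, otherwise the upper one
  labels <- ifelse(lower != "", lower, upper)

  out <- x[data_rows, ]
  names(out) <- c("product", paste(date_chr, labels, sep = "__"))

  out %>%
    as_tibble() %>%
    pivot_longer(-product, names_to = c("date", "label"),
                 names_sep = "__", values_to = "value") %>%
    mutate(date  = as.Date(sub("^As of ", "", date), format = "%d/%m/%Y"),
           value = as.numeric(value),
           store = store_name)
}

# Toy data: two quarters, two labels per quarter
toy <- data.frame(
  V1 = c(rep("", 7), "peach", "dragonfruit", "honeydew", "huckleberry"),
  V2 = c("ABC", "some info", "Store A", "", "As of 31/03/2020", "label1", "",
         "2", "3", "4", "8"),
  V3 = c(rep("", 6), "label2", "0.1", "0.2", "0.3", "0.8"),
  V4 = c(rep("", 4), "As of 30/06/2020", "label1", "", "4", "6", "8", "16"),
  V5 = c(rep("", 6), "label2", "0.22", "0.33", "0.44", "0.55"),
  stringsAsFactors = FALSE
)
# a narrower block: one quarter only, with a made-up date
toy_narrow <- toy[, 1:3]
toy_narrow[5, 2] <- "As of 30/09/2020"

# the same call handles both widths
toy_long <- bind_rows(lapply(list(toy, toy_narrow), tidy_block))
print(toy_long)
```

Because the dates travel inside the column names until pivot_longer() splits them off, the number of quarters per block does not have to be known in advance.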

Convert value to type double,
dta_long <- type_convert(L.dt_tbl, cols(
  product = col_character(),
  label = col_character(),
  value = col_double(),
  store = col_character()
))
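
As a small illustration of what type_convert() does in this step, assuming readr is attached (it is part of library(tidyverse)); chr_tbl and num_tbl are made-up names, and the two values are taken from the first rows of the sample:

```r
library(readr)

# a character-only table, as it comes out of the pivot step
chr_tbl <- data.frame(label = c("label1", "label2"),
                      value = c("7", "0.5"),
                      stringsAsFactors = FALSE)

# only `value` gets an explicit type; other character columns are guessed
num_tbl <- type_convert(chr_tbl, col_types = cols(value = col_double()))
str(num_tbl)
```

Columns that are not character to begin with, such as the parsed date, are left untouched by type_convert().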

The final data,
dta_long
# A tibble: 96 x 6
   id             product label   value date       store
   <chr>          <chr>   <chr>   <dbl> <date>     <chr>
 1 dta_wide1.xlsb peach   label1   7    2019-03-31 Store A
 2 dta_wide1.xlsb peach   label2   0.5  2019-03-31 Store A
 3 dta_wide1.xlsb peach   label4a 21    2019-03-31 Store A
 4 dta_wide1.xlsb peach   label4b  0.3  2019-03-31 Store A
 5 dta_wide1.xlsb peach   label1   5    2019-06-30 Store A
 6 dta_wide1.xlsb peach   label2   0.46 2019-06-30 Store A
 7 dta_wide1.xlsb peach   label4a 19    2019-06-30 Store A
 8 dta_wide1.xlsb peach   label4b  0.4  2019-06-30 Store A
 9 dta_wide1.xlsb peach   label1   4    2019-09-30 Store A
10 dta_wide1.xlsb peach   label2   0.1  2019-09-30 Store A
# ... with 86 more rows

Regarding "r - Pivot data of varying widths from wide to long using a flexible call (for use in a loop)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62153859/
