
r - Convert an unstructured CSV file into a data frame

Reposted. Author: 行者123. Last updated: 2023-12-01 08:26:26

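I am learning R for text mining. I have a TV program schedule in CSV form. Programs usually start at 06:00 AM and run until 05:00 AM the next day, which is called a broadcast day. For example, the programs for 15/11/2015 start at 06:00 AM and end at 05:00 AM the next day.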

Here is some sample code showing what the schedule looks like:

 read.table(textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|"), header = F, sep = "|", stringsAsFactors = F)

Its output looks like this:
  V1|V2
Sunday |
01-Nov-15 |
6 | Tom
some information about the program |
23.3 | Jerry
some information about the program |
5 | Avatar
some information about the program |
5.3 | Panda
some information about the program |
Monday |
02-Nov-15 |
6 | Jerry
some information about the program |
6.25 | Panda
some information about the program |
23.3 | Avatar
some information about the program |
7.25 | Tom
some information about the program |

I want to convert the data above into a data.frame of this form:
Date            |Program|Synopsis
2015-11-1 06:00 |Tom | some information about the program
2015-11-1 23:30 |Jerry | some information about the program
2015-11-2 05:00 |Avatar | some information about the program
2015-11-2 05:30 |Panda | some information about the program
2015-11-2 06:00 |Jerry | some information about the program
2015-11-2 06:25 |Panda | some information about the program
2015-11-2 23:30 |Avatar | some information about the program
2015-11-3 07:25 |Tom | some information about the program

I would appreciate any suggestions or hints about functions or packages I should look at.

Best answer

An alternative solution using data.table:

library(data.table)
library(zoo)
library(splitstackshape)

txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]

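# Full weekday names, hard-coded because weekdays() only accepts date-time objects
wd <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

DT <- DT[, temp := tv %chin% wd
][, day := ifelse(temp, tv, NA_character_) # weekday name on weekday rows, NA elsewhere
][, day := na.locf(day)                    # carry the weekday down to the rows below it
][, temp := NULL
][, idx := rleid(day)
][, date := tv[2], by = idx
][, .SD[-c(1,2)], by = idx]

# length.out = .N recycles the three labels over every row of the group
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx]
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]

# %b (month abbreviation) is parsed according to the current locale
DT <- DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & hour(datetime) < 6) * 24 * 60 * 60)
][, .(datetime, Program, Info)]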

Result:
> DT
              datetime Program                               Info
1: 2015-11-01 06:00:00     Tom some information about the program
2: 2015-11-01 23:30:00   Jerry some information about the program
3: 2015-11-02 05:00:00  Avatar some information about the program
4: 2015-11-02 06:00:00     Tom some information about the program
5: 2015-11-02 23:30:00   Jerry some information about the program
6: 2015-11-03 05:00:00  Avatar some information about the program

Explanation:

1: Read the data, convert it to a data.table and remove the trailing |:
txt <- textConnection("Sunday|\n 01-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|\nMonday|\n 02-Nov-15|\n 6|Tom\n some information about the program|\n 23.3|Jerry\n some information about the program|\n 5|Avatar\n some information about the program|")
tv <- readLines(txt)
DT <- data.table(tv)[, tv := gsub('[|]$', '', tv)]
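
To see what the gsub() call in step 1 does in isolation, here is a minimal sketch on a few hand-written lines. The vector `lines` is made up for illustration and is not part of the answer. The pattern `[|]$` matches a literal pipe only at the end of a string, so the pipe that separates a time from a program name is left alone.

```r
# Hypothetical sample lines, in the same shape as the schedule
lines <- c("Sunday|", " 01-Nov-15|", " 6|Tom", " some information about the program|")

# '[|]$' = a literal pipe anchored at the end of the string
cleaned <- gsub("[|]$", "", lines)
print(cleaned)
```

Note that the leading spaces survive this step; only the trailing pipe is removed.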

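2: Extract the weekdays into a new column
wd <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday") # a vector with the full weekday names
DT[, temp := tv %chin% wd
][, day := ifelse(temp, tv, NA_character_) # weekday name on weekday rows, NA elsewhere
][, day := na.locf(day)                    # carry the weekday down to the rows below it
][, temp := NULL]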
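
The forward fill in step 2 is what ties every schedule line to its weekday. As a rough illustration of what na.locf() does there, the sketch below reproduces the fill in base R on a made-up vector. The names `lines`, `day_filled` and `fill_forward` are hypothetical, and the answer itself uses zoo::na.locf() rather than this helper.

```r
wd <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
lines <- c("Sunday", " 01-Nov-15", " 6|Tom", "Monday", " 02-Nov-15", " 6|Jerry")

# Keep the text where a line is a weekday name, NA elsewhere
day <- ifelse(lines %in% wd, lines, NA_character_)

# Forward fill: carry the last non-NA value down
# (assumes the first element is not NA, as in the schedule)
fill_forward <- function(x) {
  seen <- cumsum(!is.na(x))   # position of the most recent non-NA value
  x[!is.na(x)][seen]
}

day_filled <- fill_forward(day)
print(day_filled)
```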

3: Create an index for each day and a column containing the date
DT[, idx := rleid(day)][, date := tv[2], by = idx]
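
One plausible reason for using rleid() in step 3 instead of grouping on the weekday name directly: a schedule longer than a week repeats weekday names, and rleid() gives each consecutive run its own id. The small sketch below uses an invented `day` vector to show this, next to a base R equivalent built on rle().

```r
library(data.table)

# Hypothetical weekday column after the forward fill
day <- c("Sunday", "Sunday", "Sunday", "Monday", "Monday", "Sunday")

# rleid() numbers consecutive runs, so the second "Sunday" block gets a new id
idx <- rleid(day)
print(idx)

# A base R equivalent built on rle()
runs <- rle(day)
idx_base <- rep(seq_along(runs$lengths), runs$lengths)
```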

4: Remove the rows that are no longer needed
DT <- DT[, .SD[-c(1,2)], by = idx]
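
Step 4 relies on the first two rows of every day being the weekday and the date. The following sketch shows the effect of `.SD[-c(1, 2)]` per group on a tiny invented table (the values in `DT` and the name `kept` are illustrative only).

```r
library(data.table)

DT <- data.table(
  idx = c(1L, 1L, 1L, 1L, 2L, 2L, 2L),
  tv  = c("Sunday", "01-Nov-15", "6|Tom", "info A",
          "Monday", "02-Nov-15", "6|Jerry")
)

# Within each idx group, drop row 1 (weekday) and row 2 (date)
kept <- DT[, .SD[-c(1, 2)], by = idx]
print(kept)
```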

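5: Split the time and the program name into separate rows and create a label column
DT <- cSplit(DT, sep="|", "tv", "long")[, lbl := rep(c("Time","Program","Info"), length.out = .N), by = idx] # length.out = .N recycles the labels over the whole group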
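
For readers without splitstackshape installed, the same split-and-label idea can be sketched with strsplit() inside data.table. This is a stand-in for cSplit(), not the answer's method, and it assumes every program contributes exactly three pieces (time, name, description). The table contents and the name `long` are made up.

```r
library(data.table)

DT <- data.table(
  idx = c(1L, 1L, 1L, 1L),
  tv  = c("6|Tom", "info about Tom", "23.3|Jerry", "info about Jerry")
)

# One output row per piece: "6|Tom" becomes "6" and "Tom"
long <- DT[, .(tv = unlist(strsplit(tv, "|", fixed = TRUE))), by = idx]

# Label the pieces; length.out = .N recycles the three labels to the group size
long[, lbl := rep(c("Time", "Program", "Info"), length.out = .N), by = idx]
print(long)
```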

6: Reshape to wide format using the rowid function (only in the development version of data.table when this answer was written)
DT <- dcast(DT, idx + day + date + rowid(lbl) ~ lbl, value.var = "tv")[, lbl := NULL]
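
The reshape in step 6 needs something that tells the first Time/Program/Info triple apart from the second. As a hedged illustration, the sketch below stores rowid() in an explicit column before calling dcast(), which is a slight variation on the one-liner above. The column name `prog` and the sample values are hypothetical.

```r
library(data.table)

long <- data.table(
  idx = 1L,
  tv  = c("6", "Tom", "info about Tom", "23.3", "Jerry", "info about Jerry"),
  lbl = rep(c("Time", "Program", "Info"), 2)
)

# rowid(lbl) counts occurrences of each label: 1st Time, 1st Program, ..., 2nd Time, ...
long[, prog := rowid(lbl)]

# One row per (idx, prog); the labels become columns
wide <- dcast(long, idx + prog ~ lbl, value.var = "tv")
print(wide)
```

Without the row id, dcast() would have several values per cell and would fall back to aggregating them.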

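7: Create a datetime column and move the late-night times to the next day
DT[, datetime := as.POSIXct(paste(as.character(date), sprintf("%01.2f",as.numeric(as.character(Time)))), format = "%d-%b-%y %H.%M")
][, datetime := datetime + (+(datetime < shift(datetime, fill=datetime[1]) & hour(datetime) < 6) * 24 * 60 * 60)] # hour() extracts the hour; comparing the POSIXct itself with 6 would never be TRUE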
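
The rule in step 7 compares each time only with the row before it, so two consecutive after-midnight entries (such as 5 followed by 5.3 in the question's schedule) would need extra care. One possible alternative, sketched here in base R under stated assumptions, counts how often the clock has wrapped within a broadcast day and shifts every later entry accordingly. The helper name `broadcast_datetime`, the ISO date input and the UTC time zone are assumptions made for this sketch, not part of the answer.

```r
# Hypothetical helper for one broadcast day.
# date: a single ISO date string; time: numeric H.MM values in schedule order
broadcast_datetime <- function(date, time) {
  hhmm  <- sprintf("%05.2f", time)   # 6 -> "06.00", 23.3 -> "23.30"
  stamp <- as.POSIXct(paste(date, hhmm), format = "%Y-%m-%d %H.%M", tz = "UTC")
  minutes <- as.integer(substr(hhmm, 1, 2)) * 60L + as.integer(substr(hhmm, 4, 5))
  # Number of times the clock has gone backwards so far, i.e. passed midnight
  wraps <- cumsum(c(FALSE, diff(minutes) < 0))
  stamp + wraps * 24 * 60 * 60
}

out <- broadcast_datetime("2015-11-01", c(6, 23.3, 5, 5.3))
print(format(out, "%Y-%m-%d %H:%M"))
```

In a data.table pipeline this would be applied per day, for example with `by = idx`, after converting the date column to ISO format.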

8: Keep the required columns
DT <- DT[, .(datetime, Program, Info)]
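
Since the question asks for a data.frame, it may help to note that a data.table already inherits from data.frame. If a plain data.frame is needed, a conversion along these lines should work. The small `DT` below is a made-up stand-in for the final result.

```r
library(data.table)

DT <- data.table(
  datetime = as.POSIXct(c("2015-11-01 06:00", "2015-11-01 23:30"), tz = "UTC"),
  Program  = c("Tom", "Jerry"),
  Info     = "some information about the program"
)

# A data.table is also a data.frame
is_df <- is.data.frame(DT)

# as.data.frame() returns a plain data.frame copy; setDF(DT) would convert in place
DF <- as.data.frame(DT)
print(class(DF))
```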

Regarding "r - Convert an unstructured CSV file into a data frame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33719058/
