
Regex to insert new lines (\n) at specific points in a long text string file

Repost. Author: 行者123. Updated: 2023-12-05 05:30:12

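I have a text file of CSV data containing hundreds of thousands of what should be separate records, but whoever produced it forgot to add the new lines. There is a repeating pattern that marks where each new line should start: just before a time, a comma, and a name, e.g. "07:04:08.401,Buzzard" in the data below. But because the string runs on for thousands of records in the file, I can't use the start ^ or end $ anchors.

My plan is to match backwards from the start of each of these points to the previous comma, so that I can str_replace() the match with itself plus a trailing "\n", thereby inserting new lines where I want them.

I need help with both parts.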

library(stringr)
library(data.table)

Data_raw <- c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.0007:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.3107:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.")

look_x <- function(rx) str_view_all(Data_raw, rx)
look_x("[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)")

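This gets the four characters before each timestamp. But the number of characters between the timestamp and the preceding comma varies. In the data above, for example, that field ranges from "0.00" to "-401.31" and "Obj 2 N.A.". So the only reliable boundary is the comma, and I have been trying: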

look_x("(?<=,).(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)")

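...and failing to match every character that comes after a "," and before any hh:mm:ss.sss that is followed by Buzz.

I also need help with the next step, for which I have tried: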

Data_st_rep_all_2 <- data.frame(str_replace_all("[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)",
paste0(str_extract(Data_raw, "[:graph:]{4}(?=\\d\\d:\\d\\d:\\d\\d.\\d\\d\\d,Buzz)"),"\n"), Data_raw))

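Although I am now wondering whether this can work at all, since the matched pieces are not all the same.

I'm stuck. Can anyone help?!

No doubt there is a really neat solution that I have completely missed!

Thanks.

The final result should look like this: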

Data_1 <- data.frame(Records = c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00",
"07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31",
"07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.",
"07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."))
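# Count the fields in the widest record, then split every record into that many columns
mysplits <- max(lengths(strsplit(Data_1$Records, ",")))
Data_2 <- setDT(Data_1)[, paste0("column", 1:mysplits) := tstrsplit(Records, ",", fixed = TRUE)]
# Drop the original unsplit column
Data_2[, Records := NULL]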

Or, alternatively:

Data_raw_2 <- c("07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00\n07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31\n07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.\n07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A.")
wd <- getwd()
# write_lines() comes from readr, which is not attached above, so call it with its namespace
readr::write_lines(Data_raw_2, file.path(wd, "Data_raw_2.txt"))

Best answer

Is this what you need?

library(stringr)
str_split(Data_raw, "(?<!^)(?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)")
[[1]]
[1] "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,326800.31,6749792.66,BIG Box,0.00,0.00,0.00"
[2] "07:04:08.401,Buzzard Brook,123456.78,1234567.89,196.25,-0.41,-0.60,0.07,LARS,123456.78,1234567.89,BIG Box,0.00,0.00,-401.31"
[3] "07:02:55.357,Buzzard Brook,123456.78,1234567.89,50.41,-0.42,-0.01,0.01,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."
[4] "07:03:10.364,Buzzard Brook,123456.78,1234567.89,50.27,-0.20,-0.03,0.00,LARS,123456.78,1234567.89,BIG Box,Obj 2 N.A.,Obj 2 N.A.,Obj 2 N.A."
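To get from the split records to the table the question asks for, something like the following might work. It is only a sketch: `raw` is a shortened, made-up stand-in for `Data_raw` with four fields per record, and the `column1`, `column2`, ... names simply follow the naming used in the question.

```r
library(stringr)
library(data.table)

# Hypothetical, shortened stand-in for Data_raw (same layout, fewer fields)
raw <- "07:04:08.401,Buzzard Brook,0.00,0.0007:04:08.401,Buzzard Brook,0.00,-401.3107:02:55.357,Buzzard Brook,Obj 2 N.A.,Obj 2 N.A."

# Split before every timestamp that is followed by ",Buzzard Brook"
boundary <- "(?<!^)(?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)"
records <- str_split(raw, boundary)[[1]]

# tstrsplit() returns one vector per field; turn that list into a data.table
dt <- as.data.table(tstrsplit(records, ",", fixed = TRUE))
setnames(dt, paste0("column", seq_along(dt)))

print(dt)
```

All columns come back as character here, so numeric fields would still need converting, and values such as "Obj 2 N.A." would need handling before that.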

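How this works:

  • (?<!^) : a negative lookbehind asserting that we do not split at the very start of the string
  • (?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook) : a positive lookahead asserting that the split point must be followed by a timestamp-like expression, a comma, and the string "Buzzard Brook"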
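Because both assertions are zero-width, the same pattern can also be used to insert "\n" at each boundary, which is what the question originally set out to do. The following is a minimal sketch under that assumption, not part of the original answer; `raw` is a shortened, made-up stand-in for `Data_raw`, and the output goes to a temporary file rather than a real path.

```r
library(stringr)

# Hypothetical, shortened stand-in for Data_raw (same layout, fewer fields)
raw <- "07:04:08.401,Buzzard Brook,0.00,0.0007:04:08.401,Buzzard Brook,0.00,-401.3107:02:55.357,Buzzard Brook,Obj 2 N.A.,Obj 2 N.A."

# Zero-width boundary: not at the start, and followed by a timestamp plus ",Buzzard Brook"
boundary <- "(?<!^)(?=\\d{2}:\\d{2}:\\d{2}\\.\\d{3},Buzzard Brook)"

# Replacing a zero-width match consumes no characters, so this only inserts "\n"
with_newlines <- str_replace_all(raw, boundary, "\n")

# Write the repaired text out and read it back as one record per line
out_file <- tempfile(fileext = ".txt")
writeLines(with_newlines, out_file)
records <- readLines(out_file)

print(records)
```

Nothing is removed from the data this way, so the field before each timestamp can have any length ("0.00", "-401.31", "Obj 2 N.A.") without affecting the match.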

Regarding using a regex to insert new lines (\n) at specific points in a long text string file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/74761274/
