
regex - Non-greedy gsub

Reposted. Author: 行者123. Updated: 2023-12-04 20:42:26

I have a log dataset:

V1  duration  id  startpoint
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771 1 2012-05-07_12-29-51
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211

I am trying to extract information from the first column (time point, process, pid, url, and so on). At first I tried:
df$timepoint <- gsub("T<=>(.*)[=].*", "\\1", df$V1)

It returns something like 161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<, so I then tried:
df$timepoint <- gsub("T<=>([0-9]*).*", "\\1", df$V1)

That works, but not for text fields such as the process name, so I searched for "regex minimal match" and found the term non-greedy. I tried again:
df$timepoint <- gsub("T<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$process <- gsub(".*P<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$pid <- gsub(".*I<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$url <- gsub(".*U<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$addr <- gsub(".*A<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$tab <- gsub(".*B<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$ver <- gsub(".*V<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$window <- gsub(".*W<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$name <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1)
df$company <- gsub(".*C<=>(.*?)", "\\1", df$V1)
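As a small illustration of why the first attempt over-matched and the non-greedy version does not: an unescaped [=] is a character class that matches a single "=", and a greedy .* runs to the last place where the rest of the pattern can still match. The sketch below uses a shortened, made-up record and perl = TRUE (an assumption chosen here so the behaviour is that of PCRE; the code above uses R's default engine).

```r
# A shortened sample record in the same layout as V1 (illustrative only).
x <- "T<=>161[=]P<=>explorer.exe[=]I<=>1820"

# Greedy capture with an unescaped [=]: the class matches any "=",
# so the capture extends to just before the LAST "=" in the string.
greedy_unescaped <- sub("T<=>(.*)[=].*", "\\1", x, perl = TRUE)

# Non-greedy capture with the separator escaped as a literal "[=]":
# the capture stops at the FIRST "[=]" after "T<=>".
lazy_escaped <- sub("T<=>(.*?)\\[=\\].*", "\\1", x, perl = TRUE)

print(greedy_unescaped)
print(lazy_escaped)
```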

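Not every row contains every field, and that is where the problem arises. If there is no software name or company name, R simply copies V1 into the new variable. If the software version is at the end of V1, the regex ".*V<=>(.*?)\\[=\\].*" also copies the whole string into the new variable: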
V1  duration  id  startpoint  timepoint process pid url addr  tab ver window  name  company
T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 7771 1 2012-05-07_12-29-51 161 explorer.exe 1820 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 20094 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512 T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512
T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 195 360Safe.exe 1732 T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn 7, 5, 0, 1501 1017e 360安全卫士 360.cn
T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn 7771 1 2012-05-07_12-29-51 203 360chrome.exe 436 NULL 2027a 20290 5.2.0.804 T<=>203[=]P<=>360chrome.exe[=]I<=>436[=]U<=>NULL[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804[=]N<=>360极速浏览器[=]C<=>360.cn 360极速浏览器 360.cn
T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 7771 1 2012-05-07_12-29-51 209 360Safe.exe 1732 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 1017e T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501 T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501
T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 7771 1 2012-05-07_12-29-51 211 360chrome.exe 436 www.hao123.com 2027a 20290 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804 T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804

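I expected that if R cannot find 'C<=>' (for example), there would be no (.*?) after it, so the result would be an empty string; instead the output is the whole string. Can anyone help me fix this? Thanks!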
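The behaviour described above follows from how sub() and gsub() work: when the pattern does not match at all, nothing is replaced and the input comes back unchanged. A minimal sketch of one way to get NA instead, assuming two made-up records and a pattern that accepts either a following "[=]" or the end of the string (perl = TRUE is used for the non-capturing group):

```r
# Two illustrative records: the first has a company field, the second does not.
x <- c("P<=>a.exe[=]C<=>360.cn", "P<=>b.exe")

# Capture what follows "C<=>" up to the next "[=]" or the end of the string.
pattern <- ".*C<=>(.*?)(?:\\[=\\].*|$)"

# sub() leaves non-matching elements untouched, so x[2] is returned as-is.
unchanged <- sub(pattern, "\\1", x, perl = TRUE)

# Guard with grepl() so that rows without the field become NA.
company <- ifelse(grepl(pattern, x, perl = TRUE),
                  sub(pattern, "\\1", x, perl = TRUE),
                  NA_character_)
print(company)
```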

Update

Thanks to MrFlick's comment, I just arrived at a solution based on this answer:

Taking the extraction of the software name as an example,
ind1 <- grep(".*N<=>(.*?)\\[=\\].*", df$V1, value = FALSE) # see if pattern exists with follow-up
ind2 <- grep(".*N<=>(.*?)", df$V1, value = FALSE) # see if pattern exists
df$name <- ""
df$name[ind2] <- gsub(".*N<=>(.*?)", "\\1", df$V1[ind2]) # replace the ones with pattern match
df$name[ind1] <- gsub(".*N<=>(.*?)\\[=\\].*", "\\1", df$V1[ind1]) # replace the ones with pattern match and follow-up

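But this snippet looks ugly, and if it is the final solution I would have to repeat it for every other field (process, pid, version, company, and so on). Can someone help me improve it? Thanks!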

Best answer

Here is another strategy. We can use gregexpr to separate each piece of the stacked data. Here is the data in a vector:

V1<-c("T<=>161[=]P<=>explorer.exe[=]I<=>1820[=]W<=>20094[=]V<=>6.00.2900.5512", 
"T<=>195[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501[=]N<=>360安全卫士[=]C<=>360.cn",
"T<=>209[=]P<=>360Safe.exe[=]I<=>1732[=]W<=>1017e[=]V<=>7, 5, 0, 1501",
"T<=>211[=]P<=>360chrome.exe[=]I<=>436[=]U<=>www.hao123.com[=]A<=>2027a[=]B<=>20290[=]V<=>5.2.0.804")

Now we can find the matches with
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl=T)

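Pulling out the captured matches can be messy, but I use a function, regcapturedmatches, to easily get all of the captured data. I use it just as you would use the built-in regmatches: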
data <- regcapturedmatches(V1,m)
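regcapturedmatches is not a base R function, so the line above needs that helper to be defined first. For readers who do not have it, here is a minimal sketch of a stand-in built on the "capture.start" and "capture.length" attributes that gregexpr attaches when perl = TRUE. The name captured_matches is hypothetical, the sample data is a shortened ASCII-only version of V1, and the sketch only aims to return a similar shape (one two-column matrix per input string), not to reproduce the original helper.

```r
# Shortened ASCII-only sample in the same "key<=>value[=]key<=>value" layout.
V1 <- c("T<=>161[=]P<=>explorer.exe",
        "T<=>211[=]U<=>www.hao123.com[=]B<=>20290")
m <- gregexpr("(\\w)<=>(.*?)(?:\\[=\\]|$)", V1, perl = TRUE)

# Hypothetical stand-in for regcapturedmatches(): returns, for each input
# string, a character matrix with one row per match and one column per
# capture group.
captured_matches <- function(x, m) {
  unname(Map(function(s, mi) {
    st <- attr(mi, "capture.start")
    if (mi[1] == -1) {
      # No match in this string: return an empty matrix with the same columns.
      return(matrix(character(0), nrow = 0, ncol = ncol(st)))
    }
    len <- attr(mi, "capture.length")
    # substring() drops dimensions, so rebuild the matrix (column-major order).
    matrix(substring(s, st, st + len - 1), nrow = nrow(st), ncol = ncol(st))
  }, x, m))
}

data <- captured_matches(V1, m)
print(data[[2]])
```

Each element can then be passed to data.frame() in the same way as the output of the original helper, giving columns X1 (the key) and X2 (the value).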

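Then, if you inspect data, you can see that all the information is there. The remaining problem is that we need it as columns rather than as rows, as it is now. For that, I use reshape2: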
library(reshape2)

#combine list into one data.frame
sdata<-do.call(rbind, lapply(1:length(data),
function(i) data.frame(data[[i]], S=i)))

#turn rows into columns
dcast(sdata, S~X1, value.var="X2")

This returns
  S    I             P   T              V     W      C           N     A     B
1 1 1820 explorer.exe 161 6.00.2900.5512 20094 <NA> <NA> <NA> <NA>
2 2 1732 360Safe.exe 195 7, 5, 0, 1501 1017e 360.cn 360安全卫士 <NA> <NA>
3 3 1732 360Safe.exe 209 7, 5, 0, 1501 1017e <NA> <NA> <NA> <NA>
4 4 436 360chrome.exe 211 5.2.0.804 <NA> <NA> <NA> 2027a 20290
U
1 <NA>
2 <NA>
3 <NA>
4 www.hao123.com

You can rename the columns and so on, but doing all the transformations at once really is not that much code.
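For comparison, the same rows-to-columns result can be sketched in base R alone by splitting on the literal separators rather than using regex captures. This is a different technique from the one in the answer, shown under the assumptions that "[=]" and "<=>" never occur inside a value and that every piece has the form key<=>value. The sample data is shortened and the names pieces, records, keys and parsed are illustrative.

```r
# Shortened sample records (illustrative only).
V1 <- c("T<=>161[=]P<=>explorer.exe[=]I<=>1820",
        "T<=>195[=]P<=>360Safe.exe[=]V<=>7, 5, 0, 1501[=]C<=>360.cn")

# Split each record into "key<=>value" pieces on the literal "[=]" separator.
pieces <- strsplit(V1, "[=]", fixed = TRUE)

# Turn each record into a named character vector (names = keys).
records <- lapply(pieces, function(p) {
  kv <- strsplit(p, "<=>", fixed = TRUE)
  setNames(vapply(kv, `[`, character(1), 2),
           vapply(kv, `[`, character(1), 1))
})

# Collect every key seen, then look each one up per record;
# keys that a record lacks come back as NA.
keys <- unique(unlist(lapply(records, names)))
parsed <- as.data.frame(
  do.call(rbind, lapply(records, function(r) unname(r[keys]))),
  stringsAsFactors = FALSE
)
names(parsed) <- keys
print(parsed)
```

Splitting with fixed = TRUE avoids escaping the brackets altogether, at the cost of being less tolerant of malformed pieces than the regex approach.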

Regarding regex - non-greedy gsub, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/23915953/
