
R: reading a comma-separated text file in which one column contains commas

Reposted. Author: 行者123. Updated: 2023-12-04 17:48:10

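I have some logs of user browsing behaviour. They come from a data collector, who evidently used commas to separate the variables. However, some of the URLs contain commas themselves, so I cannot read the txt file into R.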

20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank

The URLs in the lines above should be:
http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1

https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp

How can I tell R that each line has exactly 10 variables and that the commas inside the URL should be kept? Thanks!
df <- read.table('2009.txt', sep= ',', quote= '', comment.char= '', stringsAsFactors= F)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 130 did not have 10 elements
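
Before attempting a fix, it can help to confirm which lines are affected. A minimal sketch, assuming the lines are available as a character vector; the two sample lines are copied from the log above, and `n_fields` and `bad` are illustrative names:

```r
# Two lines from the log above: the first has a comma inside its URL,
# the second has an empty URL field.
Lines <- c(
  "20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C",
  "2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown"
)

# count.fields() reports how many separator-delimited fields each line has.
con <- textConnection(Lines)
n_fields <- count.fields(con, sep = ",", quote = "", comment.char = "")
close(con)

# Lines whose count differs from 10 are the ones read.table() stops on.
bad <- which(n_fields != 10)
print(n_fields)
print(bad)
```

For a file on disk, the file name (here `'2009.txt'`) could be passed to `count.fields()` in place of the connection.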

Best answer

You could try:

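# (*SKIP)(*F) discards the span from "http:" up to the ",www" or ",," that follows the URL,
# so only the commas outside the URL are replaced by "*".
dat <- read.table(text=gsub("http:.*(?=(,www)|,,)(*SKIP)(*F)|,", "*",
Lines, perl=TRUE), sep="*", header=FALSE, stringsAsFactors=FALSE)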


dat
# V1 V2 V3 V4 V5
#1 20091 2009-06-02 22:06:14 84 taobao.com search1.taobao.com
#2 20092 2009-06-16 12:25:35 8 sohu.com www.wap.sohu.com
#3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
# V6
#1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
#2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
#3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
# V7 V8 V9 V10
#1 www.taobao.com shopping e-commerce C2C
#2 www.sohu.com portal entertainment mobile
#3 others marketing enterprise
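
The pattern relies on two PCRE backtracking verbs. Its first alternative matches the URL from `http:` up to the comma that precedes the next field and then deliberately fails with `(*SKIP)(*F)`, which makes the regex engine resume after the URL; its second alternative, `,`, then matches only the commas that remain. A small illustration on a made-up line (the string `x` is hypothetical, not taken from the log):

```r
# Hypothetical line: the URL field "http://x.com/p,q" contains one comma.
x <- "a,b,http://x.com/p,q,www.x.com,c"

# Alternative 1 consumes the URL and fails on purpose, so its commas are skipped;
# alternative 2 matches every remaining comma.
res <- gsub("http:.*(?=,www)(*SKIP)(*F)|,", "*", x, perl = TRUE)
print(res)
# "a*b*http://x.com/p,q*www.x.com*c"

# Splitting on the new separator recovers the URL in one piece.
fields <- strsplit(res, "*", fixed = TRUE)[[1]]
print(fields)
```

The replacement separator has to be a character that never occurs in the data; `*` is assumed to be safe for this log.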

Data
Lines <- readLines(textConnection(txt))  # `txt` from @Richard Scriven

Update

Using your new dataset:
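# Keep only the lines that contain a URL.
indx <- grep("http", Lines)
Lines1 <- Lines[indx]
# First host label of the last URL on each line (e.g. "www", "search1"), joined with "|".
pat1 <- paste(unique(gsub(".*http[s]?.{3}(\\w+)\\..*", "\\1", Lines1)), collapse="|")
# Skip the URL up to the comma before one of those labels or before an empty field;
# "https?" makes the skip apply to https URLs as well.
pat1N <- paste0("https?:.*(?=,(", pat1, "|,))(*SKIP)(*F)|,")

dat1 <- read.table(text=gsub(pat1N, "*", Lines, perl=TRUE),
sep="*", header=FALSE, stringsAsFactors=FALSE)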

dat1
# V1 V2 V3 V4 V5
#1 20091 2009-06-02 22:06:14 84 taobao.com search1.taobao.com
#2 20092 2009-06-16 12:25:35 8 sohu.com www.wap.sohu.com
#3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
#4 20096 2009-06-30 07:51:38 7 taobao.com search1.taobao.com
#5 2009184 2009-06-25 14:40:39 6 mktginc.com surv.mktginc.com
#6 20092 2009-06-07 15:13:06 32 ccb.com.cn ibsbjstar.ccb.com.cn
# V6
#1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
#2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
#3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
#4 http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1
#5
#6 https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp
# V7 V8 V9 V10
#1 www.taobao.com shopping e-commerce C2C
#2 www.sohu.com portal entertainment mobile
#3 others marketing enterprise
#4 search1.taobao.com shopping e-commerce C2C
#5 unknown unknown unknown
#6 e-bank finance e-bank
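
The update no longer hard-codes `,www` in the lookahead. Instead it collects the first host label of each URL and treats those labels as the possible beginnings of the field that follows the URL. The extraction step can be seen in isolation in the sketch below; the two shortened lines are hypothetical stand-ins for the log, and `hosts` is an illustrative name:

```r
# Hypothetical shortened lines (not from the log): one http and one https URL.
Lines1 <- c(
  "1,t1,5,taobao.com,search1.taobao.com,http://search1.taobao.com/a,b.htm,search1.taobao.com,shopping,e-commerce,C2C",
  "2,t2,3,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/login.jsp,,e-bank,finance,e-bank"
)

# The leading ".*" runs up to the last "http" on the line, "[s]?.{3}" steps over
# the optional "s" and "://", and "(\\w+)" captures the first host label.
hosts <- unique(gsub(".*http[s]?.{3}(\\w+)\\..*", "\\1", Lines1))
print(hosts)

# Joined with "|", the labels form the alternation placed inside the lookahead.
pat1 <- paste(hosts, collapse = "|")
print(pat1)
```

This is a heuristic: it appears to work only when the field after the URL is empty or begins with a host label that occurs in some URL of the file.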

Data
 txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank'

Lines <- readLines(textConnection(txt))
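
Since the question states that every line has exactly 10 fields and only the URL (field 6) can contain commas, another option is to anchor a pattern at both ends of the line, so that the URL is simply whatever lies between the first five and the last four comma-free fields. A sketch under that assumption, not part of the original answer; `field`, `pat`, `m` and `dat2` are illustrative names, and the two lines are copied from the sample data:

```r
# One line with a single comma inside the URL, and one whose URL contains
# several commas and is followed by an empty field.
Lines <- c(
  "20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile",
  "20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise"
)

# Five comma-free fields, then anything (the URL), then four comma-free fields.
field <- "([^,]*)"
pat <- paste0("^", paste(rep(field, 5), collapse = ","),
              ",(.*),",
              paste(rep(field, 4), collapse = ","), "$")

# regexec() yields the full match plus the 10 captured groups for each line.
m <- regmatches(Lines, regexec(pat, Lines))

# Drop the full match and bind the groups into a 10-column data frame (V1..V10).
dat2 <- as.data.frame(do.call(rbind, lapply(m, function(g) g[-1])),
                      stringsAsFactors = FALSE)
print(dat2$V6)
```

All columns come back as character, so numeric columns such as `V1` and `V3` would still need converting, and a line with fewer than 10 fields produces no match and would need separate handling.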

On reading a comma-separated text file in R where one column contains commas, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/26116483/
