
R: Convert text of irregular length into a data frame

Reposted. Author: 行者123. Updated: 2023-12-02 18:12:14

A simplified example of the text after importing it with readLines:

text <- c("just", "stuff", "nothing", "interesting", "date", "06.05.2022", 
"number", "1/3892", "adress", "north street 45", "name", "peter miller",
"just", "stuff", "nothing", "interesting", "date", "06.05.2022",
"number", "5/7283", "adress", "south street 11, fareaway", "west street 4",
"name", "john snow", "just", "stuff", "nothing", "interesting",
"date", "06.05.2022", "number", "7/112563", "adress", "island street 348",
"planet street 11, tortuga", "calvary road 9", "name", "hogson, michael",
"jobs, steve", "just", "stuff", "nothing", "interesting", "date",
"06.05.2022", "number", "2/1575", "adress", "bowland road 2, mexiko",
"name", "michael myers", "terry jones", "olivia wilde", "just",
"stuff", "nothing", "interesting", "date", "06.05.2022", "number",
"1/93375", "adress", "sunset boulevard", "name", "harrison ford")

The same pattern always repeats, and I would like a data frame like this:

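| date | number | adress | name |
|------|--------|--------|------|
| 06.05.2022 | 5/7283 | south street 11, fareaway, west street 4 | john snow |
| 06.05.2022 | 7/112563 | island street 348, planet street 11, tortuga, calvary road 9 | hogson, michael, jobs, steve |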

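There is always exactly one date, one number, one or more addresses, and one or more names. The "just stuff nothing interesting" block is also always the same, so it can be used reliably to detect where the names end.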

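I guess this could be done with a loop, but I gave up trying. Or is there a function that can handle this kind of irregularity? (I'm not even sure "length" is the right word; I hope my meaning is clear...)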

Best answer

In base R, you can rewrite the text as valid DCF (Debian Control File format) and read it in with read.dcf.

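# Collapse the vector into one string, then drop the filler phrase.
x <- paste(text, collapse = ' ')
x <- gsub('just stuff nothing interesting', '', x)
# Turn each keyword into a DCF tag ("keyword:") at the start of a new line.
x <- gsub('(name|number|adress)', '\n\\1:', x)
# A blank line before each "date:" starts a new record.
x <- gsub("date", "\n\ndate:", x)
read.dcf(textConnection(x), all = TRUE)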

        date   number                                                     adress                                   name
1 06.05.2022   1/3892                                            north street 45                           peter miller
2 06.05.2022   5/7283                    south street 11, fareaway west street 4                              john snow
3 06.05.2022 7/112563 island street 348 planet street 11, tortuga calvary road 9            hogson, michael jobs, steve
4 06.05.2022   2/1575                                     bowland road 2, mexiko michael myers terry jones olivia wilde
5 06.05.2022  1/93375                                           sunset boulevard                          harrison ford
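One caveat with the substitutions above: they match anywhere in the collapsed string, so a value that happens to contain a keyword (for example an address such as "name street 3") would be split at that point. The following is a minimal sketch of a variant that tags only whole elements before collapsing. The sample vector `sample_text` and the helper names `keys`, `filler`, and `body` are made up for illustration and are not part of the original answer.

```r
# Hypothetical one-record sample; "name street 3" contains a keyword inside a value.
sample_text <- c("just", "stuff", "nothing", "interesting",
                 "date", "06.05.2022", "number", "1/3892",
                 "adress", "name street 3", "name", "peter miller")

keys   <- c("date", "number", "adress", "name")
filler <- c("just", "stuff", "nothing", "interesting")

# Assumes no real value is exactly equal to a filler word or a keyword.
body   <- sample_text[!sample_text %in% filler]   # drop filler elements
is_key <- body %in% keys                          # whole-element matches only
body[is_key] <- paste0("\n", body[is_key], ":")   # keyword -> "\nkeyword:"
body[body == "\ndate:"] <- "\n\ndate:"            # blank line starts a new record

con <- textConnection(paste(body, collapse = " "))
df  <- read.dcf(con, all = TRUE)
close(con)
df
```

Because the keywords are compared with `%in%` on whole elements, "name street 3" stays intact as the `adress` value.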

Note that you can run cat(x) to see what the valid DCF looks like.
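For reference, here is a minimal sketch of DCF input and how read.dcf parses it. The two records are written out by hand from the sample data rather than produced by the code above, and the names `dcf_lines` and `dcf_df` are illustrative. Each line has the form "tag: value", and an empty line separates records.

```r
# Hand-written DCF lines: "tag: value", with an empty string as the record separator.
dcf_lines <- c(
  "date: 06.05.2022",
  "number: 1/3892",
  "adress: north street 45",
  "name: peter miller",
  "",
  "date: 06.05.2022",
  "number: 5/7283",
  "adress: south street 11, fareaway west street 4",
  "name: john snow"
)

con    <- textConnection(dcf_lines)
dcf_df <- read.dcf(con, all = TRUE)  # all = TRUE returns a data frame
close(con)

dcf_df
```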

Using the tidyverse:

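library(tidyverse)  # provides stringr and the %>% pipe

text %>%
  # Tag elements that are exactly a keyword: "name" -> "\nname:"
  str_replace("^(number|name|adress|date)$", "\n\\1:") %>%
  # Extra newline before "date:" so each record starts after a blank line
  str_replace("^(\ndate)", "\n\\1") %>%
  str_c(collapse = " ") %>%
  str_remove_all("just stuff nothing interesting") %>%
  textConnection() %>%
  read.dcf(all = TRUE)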
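Both DCF approaches join multiple addresses or names with a plain space, which makes the individual items hard to tell apart afterwards. As an alternative, here is a loop-free base R sketch that assigns every element to a record and a field with cumsum and then pastes the values per cell with an explicit separator. It is not part of the original answer. The shortened `sample_text`, the helper names, and the "; " separator are assumptions chosen for illustration.

```r
# Shortened copy of the sample data (records 2 and 3 of the question).
sample_text <- c("just", "stuff", "nothing", "interesting", "date", "06.05.2022",
                 "number", "5/7283", "adress", "south street 11, fareaway",
                 "west street 4", "name", "john snow",
                 "just", "stuff", "nothing", "interesting", "date", "06.05.2022",
                 "number", "7/112563", "adress", "island street 348",
                 "planet street 11, tortuga", "calvary road 9",
                 "name", "hogson, michael", "jobs, steve")

keys   <- c("date", "number", "adress", "name")
filler <- c("just", "stuff", "nothing", "interesting")

# Assumes no real value is exactly equal to a filler word or a keyword.
body   <- sample_text[!sample_text %in% filler]  # drop filler elements
is_key <- body %in% keys                         # whole-element keyword matches
field  <- body[is_key][cumsum(is_key)]           # keyword each element falls under
record <- cumsum(body == "date")                 # every "date" opens a new record

vals <- !is_key                                  # keep only the value elements
m <- tapply(body[vals],
            list(record[vals], factor(field[vals], levels = keys)),
            paste, collapse = "; ")
result <- as.data.frame(m, stringsAsFactors = FALSE)
result
```

A semicolon is used as the separator here because several values already contain commas.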

Regarding "R: Convert text of irregular length into a data frame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/72144999/
