gpt4 book ai didi

R将字符串转换为数据框

转载 作者:行者123 更新时间:2023-12-05 02:31:24 24 4
gpt4 key购买 nike

这是我拥有的较大字符串的一小部分示例(无空格)。它包含个人的虚构细节。

每个个体由分隔。每个个体有10个属性。

txt = "EREKSON(Andrew,Hélène),female10/06/2011@Geneva(Switzerland),PPF,2000X007707,dist.093,Dt.043/996.BOUKAR(Mohamed,El-Hadi),male04/12/1956@London(England),PPF,2001X005729,dist.097,Dt.043/997.HARIMA(Olak,N’nassik,Gerad,Elisa,Jeremie),female25/06/2013@Paris(France),PPF,2009X005729,dist.088,Dt.043/998.THOMAS(Hajil,Pau,Joëli),female03/03/1980@Berlin(Germany),VAT,2010X006016,dist.078,Dt.043/999."

我想将其解析为一个数据框,其中包含与个体一样多的观察结果,每个变量有 10 列。

我试过使用正则表达式并查看了 stackoverflow 上的其他文本提取解决方案,但无法获得我想要的输出。

这是我想到的最终数据框,基于字符串输入——

result = data.frame(first_names = c('Hélène Andrew','Mohamed El-Hadi','Olak N’nassik Gerad Elisa Jeremie','Joëli Pau Hajil'),
family_name = c('EREKSON','BOUKAR','HARIMA','THOMAS'),
gender = c('male','male','female','female'),
birthday = c('10/06/2011','04/12/1956','25/06/2013','03/03/1980'),
birth_city = c('Geneva','London','Paris','Berlin'),
birth_country = c('Switzerland','England','France','Germany'),
acc_type = c('PPF','PPF','PPF','VAT'),
acc_num = c('2000X007707','2001X005729','2009X005729','2010X006016'),
district = c('dist.093','dist.097','dist.088','dist.078'),
code = c('Dt.043/996','Dt.043/997','Dt.043/998','Dt.043/999'))

任何帮助将不胜感激

最佳答案

这是一个使用 tidyr 函数 separate_rowsextract 的简洁解决方案:

library(tidyr)
data.frame(txt) %>%
# separate `txt` into rows using the dot `.` *if*
# preceded by `Dt\\.\\d{3}/\\d{3}` as splitting pattern:
separate_rows(txt, sep = "(?<=Dt\\.\\d{3}/\\d{3})\\.(?!$)") %>%
extract(
# select column from which to extract:
txt,
# define column names into which to extract:
into = c("family_name","first_names","gender",
"birthday","birth_city","birth_country",
"acc_type","acc_num","district","code"),
# describe the string exhaustively using capturing groups
# `(...)` to delimit what's to be extracted:
regex = "([A-Z]+)\\(([\\w,]+)\\),([a-z]+)([\\d/]+)@(\\w+)\\((\\w+)\\),([A-Z]+),(\\w+),dist.(\\d+),Dt\\.([\\d/]+)")
# A tibble: 4 × 10
family_name first_names gender birthday birth_city birth_country acc_type acc_num
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 EREKSON Andrew,Peter male 10/06/2011 Geneva Switzerland PPF 2000X007…
2 OBAMA Barack,Hussian male 04/12/1956 London England PPF 2001X005…
3 CLINTON Hillary female 25/06/2013 Paris France PPF 2009X005…
4 GATES Melinda female 03/03/1980 Berlin Germany VAT 2010X006…
# … with 2 more variables: district <chr>, code <chr>

关于R将字符串转换为数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/71539049/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com