
Regex: splitting one column into multiple columns in R when there is no clear delimiter

Reposted. Author: 行者123. Last updated: 2023-12-01 09:52:04

My dataset has a column named "Market.Pair" that holds the origin and destination of certain flights. For example:

input <- data.frame(Market.Pair = c("US to/from CA", "HOU to/from DFW/DAL", "EWR/JFK to/from LAX/SFO", "US-NYC to/from FR-PAR", "US to/from Asia"))
input

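Every two-letter code represents a country (e.g. US, CA). Every three-letter code (or several three-letter codes separated by "/") represents an airport (e.g. HOU, DFW/DAL). Every code of the form XX-XXX represents a city (e.g. US-NYC). Anything else represents a region, such as Asia or Europe.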

I want to split this column into multiple columns:

output <- data.frame(
  Air.1 = c("HOU", "EWR/JFK", "", "", ""),
  Air.2 = c("DFW/DAL", "LAX/SFO", "", "", ""),
  City.1 = c("", "", "US-NYC", "", ""),
  City.2 = c("", "", "FR-PAR", "", ""),
  Country.1 = c("", "", "", "US", "US"),
  Country.2 = c("", "", "", "CA", ""),
  Region.1 = c("", "", "", "", "Asia"),
  Region.2 = c("", "", "", "", ""))
output

I am new to regular expressions, so any help is much appreciated!

Best answer

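This is a fairly manual approach, but it should still be quite efficient. It uses cSplit from my "splitstackshape" package to split the column, then uses "data.table" to subset by condition and assign new values by reference. Finally, it uses dcast (also from "data.table") to reshape the result into wide format.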

Here is some new sample data that includes the conditions you described in the comments.

input <- data.frame(
  Market.Pair = c(
    "US to/from CA", "HOU to/from DFW/DAL",  # Your sample data
    "EWR/JFK to/from LAX/SFO",
    "US-NYC to/from FR-PAR", "US to/from Asia",
    "Latin America/Mexico to EMEA/India",    # Some only "to", exception to "/"
    "EWR to HKG/NRT, JFK to HKG"))           # Some > 1 pair of values per row

Here is one possible approach:

library(splitstackshape)
library(data.table)  # attached explicitly so that dcast() is available below
## First, take care of data combined in single rows
x <- cSplit(input, "Market.Pair", ",", "long")

## Add indicator for row names
x[, rn := 1:nrow(x)]

## Split on to/from or to
x <- cSplit(x, "Market.Pair", " to/from | to ", "long", fixed = FALSE,
            stripWhite = FALSE, type.convert = FALSE)

## Add a column named "type" filled with 'Region' as the value
x[, type := "Region"]

## Using your defined conditions, you can replace the values in the
## 'type' column by reference. Here's 'Air'...
x[nchar(Market.Pair) == 3 | grepl("^.../...$", Market.Pair), type := "Air"]

## ... here's 'Country'
x[nchar(Market.Pair) == 2, type := "Country"]

## ... and here's 'City'
x[grepl("^..-...$", Market.Pair), type := "City"]

## Add an indicator variable...
x[, ind := sequence(.N), by = .(rn, type)]
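
The type rules above are deliberately loose: `nchar()` and the `.` wildcard accept any characters, and `"^.../...$"` only covers exactly two airports. If the codes are known to be uppercase ASCII letters (an assumption, not something stated in the question), a stricter classifier could be sketched in base R as follows. The helper name `classify_place` is hypothetical and is not part of either package.

```r
# Hypothetical helper: label each place string using the four rules
# from the question. Assumes codes are uppercase ASCII letters.
classify_place <- function(x) {
  type <- rep("Region", length(x))                                      # fallback
  type[grepl("^[A-Z]{2}$", x, perl = TRUE)] <- "Country"                # "US"
  type[grepl("^[A-Z]{3}(/[A-Z]{3})*$", x, perl = TRUE)] <- "Air"        # "HOU", "DFW/DAL"
  type[grepl("^[A-Z]{2}-[A-Z]{3}$", x, perl = TRUE)] <- "City"          # "US-NYC"
  type
}

places <- c("US", "HOU", "DFW/DAL", "EWR/JFK/LGA", "US-NYC", "Asia")
place_types <- classify_place(places)
print(data.frame(place = places, type = place_types))
```

Unlike the two-airport pattern, the `(/[A-Z]{3})*` group also accepts three or more airports joined by "/", so a value such as "EWR/JFK/LGA" is labelled "Air" rather than falling through to "Region".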

Now you can use dcast from "data.table" to reshape the data into wide format:

dcast(x, rn ~ type + ind, value.var = "Market.Pair", fill = "")
#    rn   Air_1   Air_2 City_1 City_2 Country_1 Country_2             Region_1   Region_2
# 1:  1                                      US        CA
# 2:  2     HOU DFW/DAL
# 3:  3 EWR/JFK LAX/SFO
# 4:  4                 US-NYC FR-PAR
# 5:  5                                      US                           Asia
# 6:  6                                                   Latin America/Mexico EMEA/India
# 7:  7     EWR HKG/NRT
# 8:  8     JFK     HKG
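
For readers who do not have the two packages installed, the same steps (split on commas, split on " to/from " or " to ", classify, number within each row and type, spread to wide) can be approximated in base R. This is only a sketch that reuses the answer's loose type rules; it is not the answer's method, and the names `pairs`, `routes`, `parts`, `long`, `keys`, and `wide` are illustrative.

```r
pairs <- c("US to/from CA", "HOU to/from DFW/DAL", "EWR/JFK to/from LAX/SFO",
           "US-NYC to/from FR-PAR", "US to/from Asia",
           "Latin America/Mexico to EMEA/India",
           "EWR to HKG/NRT, JFK to HKG")

# 1. One route per element: split entries that hold several pairs
routes <- trimws(unlist(strsplit(pairs, ",", fixed = TRUE)))

# 2. Split each route on " to/from " or " to " and stack into long form
parts <- strsplit(routes, " to/from | to ")
long <- data.frame(rn = rep(seq_along(parts), lengths(parts)),
                   place = unlist(parts),
                   stringsAsFactors = FALSE)

# 3. Classify with the same rules as the data.table code
long$type <- "Region"
long$type[nchar(long$place) == 3 | grepl("^.../...$", long$place)] <- "Air"
long$type[nchar(long$place) == 2] <- "Country"
long$type[grepl("^..-...$", long$place)] <- "City"

# 4. Running counter within each (rn, type) group
long$ind <- ave(seq_along(long$place), long$rn, long$type, FUN = seq_along)

# 5. Spread to wide: fill an empty character matrix by (row, column) index
long$key <- paste(long$type, long$ind, sep = "_")
keys <- sort(unique(long$key))
wide <- matrix("", nrow = max(long$rn), ncol = length(keys),
               dimnames = list(NULL, keys))
wide[cbind(long$rn, match(long$key, keys))] <- long$place
wide <- data.frame(rn = seq_len(nrow(wide)), wide, stringsAsFactors = FALSE)

print(wide)
```

The matrix fill in step 5 only creates columns for type and counter combinations that actually occur in the data, which is also how the dcast call behaves.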

For more on splitting one column into multiple columns in R when there is no clear delimiter, see the similar question on Stack Overflow: https://stackoverflow.com/questions/36001436/
