Regular expressions: split one column into multiple columns in R with no clear delimiter

Reposted by: 行者123. Last updated: 2023-12-01 09:52:04

My dataset has a column named "Market.Pair" that holds the origin and destination of certain flights. For example:

input <- data.frame(Market.Pair = c("US to/from CA", "HOU to/from DFW/DAL", "EWR/JFK to/from LAX/SFO", "US-NYC to/from FR-PAR", "US to/from Asia"))
input

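Every two-letter word is a country (e.g. US, CA). Every three-letter word, or several three-letter words separated by "/", is an airport (e.g. HOU, DFW/DAL). Every word of the form XX-XXX is a city (e.g. US-NYC). Any other word is a region, such as Asia or Europe.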

I would like to split this column into multiple columns:

output <- data.frame(
  Air.1     = c("HOU", "EWR/JFK", "", "", ""),
  Air.2     = c("DFW/DAL", "LAX/SFO", "", "", ""),
  City.1    = c("", "", "US-NYC", "", ""),
  City.2    = c("", "", "FR-PAR", "", ""),
  Country.1 = c("", "", "", "US", "US"),
  Country.2 = c("", "", "", "CA", ""),
  Region.1  = c("", "", "", "", "Asia"),
  Region.2  = c("", "", "", "", ""))
output

I am new to regular expressions, so any help is greatly appreciated!

Best answer

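This is a fairly manual approach, but it should still be quite efficient. It uses cSplit from my "splitstackshape" package to split the column, then uses "data.table" to subset by condition and assign new values by reference. Finally, it uses dcast (also from "data.table") to reshape the result into wide format.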

Here is some new sample data that includes the conditions you described in the comments.

input <- data.frame(
  Market.Pair = c(
    "US to/from CA", "HOU to/from DFW/DAL",            # Your sample data
    "EWR/JFK to/from LAX/SFO", 
    "US-NYC to/from FR-PAR", "US to/from Asia", 
    "Latin America/Mexico to EMEA/India",              # Some only "to", exception to "/"
    "EWR to HKG/NRT, JFK to HKG"))                     # Some > 1 pair of values per row

Here is one possible approach:

library(splitstackshape)
library(data.table)  # attach explicitly so that dcast() is available below
## First, take care of data combined in single rows
x <- cSplit(input, "Market.Pair", ",", "long")

## Add indicator for row names
x[, rn := 1:nrow(x)]

## Split on to/from or to
x <- cSplit(x, "Market.Pair", " to/from | to ", "long", fixed = FALSE, 
            stripWhite = FALSE, type.convert = FALSE)

## Add a column named "type" filled with 'Region' as the value
x[, type := "Region"]

## Using your defined conditions, you can replace the values in the
##   'type' column by reference. Here's 'Air'...
x[nchar(Market.Pair) == 3 | grepl("^.../...$", Market.Pair), type := "Air"]

## ... here's 'Country'
x[nchar(Market.Pair) == 2, type := "Country"]

## ... and here's 'City'
x[grepl("^..-...$", Market.Pair), type := "City"]

## Add an indicator variable...
x[, ind := sequence(.N), by = .(rn, type)]
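The `ind` column numbers the values within each combination of `rn` and `type`, which is what later produces column names such as `Air_1` and `Air_2`. As an illustration only, the same numbering can be reproduced in base R with `ave()`. The small `toy` data frame below is made up for this sketch and is not part of the answer's data.

```r
# Made-up example: two rows, each holding two values of the same type.
toy <- data.frame(
  rn    = c(1, 1, 2, 2),
  type  = c("Country", "Country", "Air", "Air"),
  value = c("US", "CA", "HOU", "DFW/DAL")
)

# Number the values within each (rn, type) group: 1, 2, ... per group.
toy$ind <- ave(seq_len(nrow(toy)), toy$rn, toy$type, FUN = seq_along)
toy$ind
# [1] 1 2 1 2
```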

Now you can use dcast from "data.table" to reshape the data into wide format:

dcast(x, rn ~ type + ind, value.var = "Market.Pair", fill = "")
#    rn   Air_1   Air_2 City_1 City_2 Country_1 Country_2             Region_1   Region_2
# 1:  1                                      US        CA                                
# 2:  2     HOU DFW/DAL                                                                  
# 3:  3 EWR/JFK LAX/SFO                                                                  
# 4:  4                 US-NYC FR-PAR                                                    
# 5:  5                                      US                           Asia           
# 6:  6                                                   Latin America/Mexico EMEA/India
# 7:  7     EWR HKG/NRT                                                                  
# 8:  8     JFK     HKG
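The type rules above rely on `nchar()` plus two `grepl()` patterns, and the airport pattern `^.../...$` covers exactly two codes. As a rough sketch of a stricter variant using base R only, the helper below (`classify_token` is a hypothetical name, not part of the original answer) uses anchored character classes and accepts any number of "/"-separated airport codes. It assumes that codes consist of upper-case letters, which the question implies but does not state.

```r
# Hypothetical helper: classify tokens according to the rules in the question.
# Assumes upper-case codes; anything that matches no rule stays "Region".
classify_token <- function(token) {
  type <- rep("Region", length(token))
  type[grepl("^[[:upper:]]{2}$", token)] <- "Country"
  type[grepl("^[[:upper:]]{3}(/[[:upper:]]{3})*$", token)] <- "Air"
  type[grepl("^[[:upper:]]{2}-[[:upper:]]{3}$", token)] <- "City"
  type
}

# Split one market pair on " to/from " or " to "; the surrounding spaces
# are part of the separator, so only the standalone words are matched.
tokens <- strsplit("EWR/JFK to/from LAX/SFO", " to/from | to ")[[1]]
tokens
# [1] "EWR/JFK" "LAX/SFO"

classify_token(c(tokens, "US", "US-NYC", "Asia", "EWR/JFK/LGA"))
# [1] "Air"     "Air"     "Country" "City"    "Region"  "Air"
```

The resulting vector could be assigned to the `type` column in place of the three conditional updates, although that has not been checked against the full pipeline here.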

For more on splitting one column into multiple columns in R with no clear delimiter, see the similar question on Stack Overflow: https://stackoverflow.com/questions/36001436/