gpt4 book ai didi

r - 基于多列和阈值合并数据框

转载 作者:行者123 更新时间:2023-12-03 14:51:40 27 4
gpt4 key购买 nike

我有两个 data.frame 和多个公共(public)列(这里: datecityctry 和( other_ ) number )。

我现在想将它们合并到上述列中,但可以容忍某种程度的差异:

threshold.numbers <- 3
threshold.date <- 5 # in days

如果 date 条目之间的差异是 > threshold.date (以天为单位)或 > threshold.numbers ,我不希望合并这些行。
同样,如果 city 中的条目是 df 列中另一个 city 条目的子字符串,我希望合并这些行。 [如果有人有更好的想法来测试实际城市名称的相似性,我很乐意听到。] (并保留第一个 df 的条目 datecitycountry 但两者( other_ ) number 列和 df 中的所有其他列。

考虑以下示例:
df1 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17",
"1999-06-30", "1999-03-16", "1999-07-16", "2001-08-29", "2002-07-30"),
city = c("Berlin", "Paris", "London", "Rome", "Bern",
"Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"),
ctry = c("Germany", "France", "UK", "Italy", "Switzerland",
"Denmark", "Poland", "Russia", "Tunisia", "Austria"),
number = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100),
col = c("apple", "banana", "pear", "banana", "lemon", "cucumber", "apple", "peach", "cherry", "cherry"))


df2 <- data.frame(date = c("2003-08-29", "1999-06-12", "2000-08-29", "1999-02-24", "2001-04-17", # all identical to df1
"1999-06-29", "1999-03-14", "1999-07-17", # all 1-2 days different
"2000-01-29", "2002-07-01"), # all very different (> 2 weeks)
city = c("Berlin", "East-Paris", "near London", "Rome", # same or slight differences
"Zurich", # completely different
"Copenhagen", "Warsaw", "Moscow", "Tunis", "Vienna"), # same
ctry = c("Germany", "France", "UK", "Italy", "Switzerland", # all the same
"Denmark", "Poland", "Russia", "Tunisia", "Austria"),
other_number = c(13, 17, 3100, 45, 51, 61, 780, 85, 90, 101), # slightly different to very different
other_col = c("yellow", "green", "blue", "red", "purple", "orange", "blue", "red", "black", "beige"))

现在,我想合并 data.frames 并收到一个 df,如果满足上述条件,则合并行。

(第一列只是为了您的方便:在表示原始大小写的第一个数字后面,它显示了合并的行( . )或行是来自 df1 ( 1 )还是 df2 ( 2 )。
          date        city        ctry number other_col other_number    other_col2          #comment
1. 2003-08-29 Berlin Germany 10 apple 13 yellow # matched on date, city, number
2. 1999-06-12 Paris France 20 banana 17 green # matched on date, city similar, number - other_number == threshold.numbers
31 2000-08-29 London UK 30 pear <NA> <NA> # not matched: number - other_number > threshold.numbers
32 2000-08-29 near London UK <NA> <NA> 3100 blue #
41 1999-02-24 Rome Italy 40 banana <NA> <NA> # not matched: number - other_number > threshold.numbers
42 1999-02-24 Rome Italy <NA> <NA> 45 red #
51 2001-04-17 Bern Switzerland 50 lemon <NA> <NA> # not matched: cities different (dates okay, numbers okay)
52 2001-04-17 Zurich Switzerland <NA> <NA> 51 purple #
6. 1999-06-30 Copenhagen Denmark 60 cucumber 61 orange # matched: date difference < threshold.date (cities okay, dates okay)
71 1999-03-16 Warsaw Poland 70 apple <NA> <NA> # not matched: number - other_number > threshold.numbers (dates okay)
72 1999-03-14 Warsaw Poland <NA> <NA> 780 blue #
81 1999-07-16 Moscow Russia 80 peach <NA> <NA> # not matched: number - other_number > threshold.numbers (dates okay)
82 1999-07-17 Moscow Russia <NA> <NA> 85 red #
91 2001-08-29 Tunis Tunisia 90 cherry <NA> <NA> # not matched: date difference < threshold.date (cities okay, dates okay)
92 2000-01-29 Tunis Tunisia <NA> <NA> 90 black #
101 2002-07-30 Vienna Austria 100 cherry <NA> <NA> # not matched: date difference < threshold.date (cities okay, dates okay)
102 2002-07-01 Vienna Austria <NA> <NA> 101 beige #

我尝试了合并它们的不同实现,但无法实现阈值。

编辑
为不清楚的表述道歉 - 我想保留所有行并接收指示该行是否匹配、不匹配和来自 df1 或不匹配和来自 df2。

伪代码是:
  if there is a case where abs("date_df2" - "date_df1") <= threshold.date:
if "ctry_df2" == "ctry_df1":
if "city_df2" ~ "city_df1":
if abs("number_df2" - "number_df1") <= threshold.numbers:
merge and go to next row in df2
else:
add row to df1```

最佳答案

我首先将城市名称转换为字符向量,因为(如果我理解正确的话)您想要包含包含在 df2.xml 中的城市名称。

df1$city<-as.character(df1$city)
df2$city<-as.character(df2$city)

然后按国家合并:
df = merge(df1, df2, by = ("ctry"))

> df
ctry date.x city.x number col date.y city.y other_number other_col
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue

图书馆 stringr将允许您在此处查看 city.x 是否在 city.y 内(请参见最后一列):
library(stringr)
df$city_keep<-str_detect(df$city.y,df$city.x) # this returns logical vector if city.x is contained in city.y (works one way)
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE

然后你可以得到日期之间的天数差异:
df$dayDiff<-abs(as.POSIXlt(df$date.x)$yday - as.POSIXlt(df$date.y)$yday)

以及数量上的差异:
df$numDiff<-abs(df$number - df$other_number)

以下是生成的数据框的样子:
> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
1 Austria 2002-07-30 Vienna 100 cherry 2002-07-01 Vienna 101 beige TRUE 29 1
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3
5 Italy 1999-02-24 Rome 40 banana 1999-02-24 Rome 45 red TRUE 0 5
6 Poland 1999-03-16 Warsaw 70 apple 1999-03-14 Warsaw 780 blue TRUE 2 710
7 Russia 1999-07-16 Moscow 80 peach 1999-07-17 Moscow 85 red TRUE 1 5
8 Switzerland 2001-04-17 Bern 50 lemon 2001-04-17 Zurich 51 purple FALSE 0 1
9 Tunisia 2001-08-29 Tunis 90 cherry 2000-01-29 Tunis 90 black TRUE 212 0
10 UK 2000-08-29 London 30 pear 2000-08-29 near London 3100 blue TRUE 0 3070

但是我们想删除在 city.y 中没有找到 city.x 的东西,其中日差大于 5 或​​数字差大于 3:
df<-df[df$dayDiff<=5 & df$numDiff<=3 & df$city_keep==TRUE,]

> df
ctry date.x city.x number col date.y city.y other_number other_col city_keep dayDiff numDiff
2 Denmark 1999-06-30 Copenhagen 60 cucumber 1999-06-29 Copenhagen 61 orange TRUE 1 1
3 France 1999-06-12 Paris 20 banana 1999-06-12 East-Paris 17 green TRUE 0 3
4 Germany 2003-08-29 Berlin 10 apple 2003-08-29 Berlin 13 yellow TRUE 0 3

剩下的是上面的三行(第 1 列中包含点)。

现在我们可以从 df2 中删除我们创建的三列以及日期和城市:
> df<-subset(df, select=-c(city.y, date.y, city_keep, dayDiff, numDiff))
> df
ctry date.x city.x number col other_number other_col
2 Denmark 1999-06-30 Copenhagen 60 cucumber 61 orange
3 France 1999-06-12 Paris 20 banana 17 green
4 Germany 2003-08-29 Berlin 10 apple 13 yellow

关于r - 基于多列和阈值合并数据框,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/58715919/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com