r - Merging two data frames using a fuzzy merge/sqldf

Reposted by: 行者123 · Updated: 2023-12-04 17:09:08


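I have the following data frames (df11 and df22) that I want to merge (full join) on UserID = UserID and a date difference of at most 30 days. That is, if the UserID matches and the two dates are 30 days or fewer apart, I want the rows combined into a single row. I've looked at fuzzy joins here and at sqldf here, but I can't work out how to apply either of them to my data frames.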

df1 <- structure(list(
  UserID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L),
  Full.Name = c("John Smith", "Jack Peters", "Bob Brown", "Jane Doe",
                "Jackie Jane", "Sarah Brown", "Chloe Brown", "John Smith"),
  Info = c("yes", "no", "yes", "yes", "yes", "yes", "no", "yes"),
  EncounterID = c(13L, 14L, 15L, 16L, 17L, 18L, 19L, 13L),
  DateTime = c("1/2/21 00:00", "1/5/21 12:00", "1/1/21 1:31", "1/5/21 3:34",
               "5/9/21 5:33", "5/8/21 3:39", "12/12/21 2:30", "12/11/21 9:21"),
  Temp = c("100", "103", "104", "103", "101", "102", "103", "105"),
  misc = c("(null)", "no", "(null)", "(null)", "(null)", "(null)", "(null)", "(null)")
), class = "data.frame", row.names = c(NA, -8L))

df2 <- structure(list(
  UserID = c(1L, 2L, 3L, 4L, 5L, 6L),
  Full.Name = c("John Smith", "Jack Peters", "Bob Brown", "Jane Doe",
                "Jackie Jane", "Sarah Brown"),
  DOB = c("1/1/90", "1/10/90", "1/2/90", "2/20/80", "2/2/80", "12/2/80"),
  EncounterID = c(13L, 14L, 15L, 16L, 17L, 18L),
  EncounterDate = c("1/1/21", "1/2/21", "1/1/21", "1/6/21", "5/7/21", "5/8/21"),
  Type = c("Intro", "Intro", "Intro", "Intro", "Care", "Out"),
  responses = c("(null)", "no", "yes", "no", "no", "unsat")
), class = "data.frame", row.names = c(NA, -6L))

# install.packages(c("Rcpp", "lubridate"))  # run once if these are not installed
library(dplyr)      # mutate(), select(), across(), %>%
library(tidyr)      # separate()
library(lubridate)  # mdy(), as_datetime()

# Split DateTime into date and time, parse the date, and drop the time part
df11 <- df1 %>%
  separate(DateTime, c("Date", "Time"), sep = " ") %>%
  mutate(Date = as_datetime(mdy(Date))) %>%
  select(-Time) %>%
  as_tibble()

# Parse EncounterDate the same way so both date columns are date-times
df22 <- df2 %>%
  mutate(across(c(EncounterDate), ~ as_datetime(mdy(.x)))) %>%
  as_tibble()

@r2evans After running the first block of code I get the following output. It differs slightly from yours.

df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))
df11
# A tibble: 8 x 7
  UserID Full.Name   Info  EncounterID Date                Temp  misc  
   <int> <chr>       <chr>       <int> <dttm>              <chr> <chr> 
1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null)
2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no    
3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null)
4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null)
5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null)
6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null)
7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null)
8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null)

Best answer

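One approach is to first add "+/- 30 days" columns to one of the data frames, then do a standard date-range join. Using sqldf:

Prep:

library(dplyr)
library(lubridate)  # provides days(), %m-% and %m+%
df11 <- mutate(df11, Date_m30 = Date %m-% days(30), Date_p30 = Date %m+% days(30))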
df11
# # A tibble: 8 x 9
#   UserID Full.Name   Info  EncounterID Date                Temp  misc   Date_m30            Date_p30           
#    <int> <chr>       <chr>       <int> <dttm>              <chr> <chr>  <dttm>              <dttm>             
# 1      1 John Smith  yes            13 2021-01-02 00:00:00 100   (null) 2020-12-03 00:00:00 2021-02-01 00:00:00
# 2      2 Jack Peters no             14 2021-01-05 00:00:00 103   no     2020-12-06 00:00:00 2021-02-04 00:00:00
# 3      3 Bob Brown   yes            15 2021-01-01 00:00:00 104   (null) 2020-12-02 00:00:00 2021-01-31 00:00:00
# 4      4 Jane Doe    yes            16 2021-01-05 00:00:00 103   (null) 2020-12-06 00:00:00 2021-02-04 00:00:00
# 5      5 Jackie Jane yes            17 2021-05-09 00:00:00 101   (null) 2021-04-09 00:00:00 2021-06-08 00:00:00
# 6      6 Sarah Brown yes            18 2021-05-08 00:00:00 102   (null) 2021-04-08 00:00:00 2021-06-07 00:00:00
# 7      7 Chloe Brown no             19 2021-12-12 00:00:00 103   (null) 2021-11-12 00:00:00 2022-01-11 00:00:00
# 8      1 John Smith  yes            13 2021-12-11 00:00:00 105   (null) 2021-11-11 00:00:00 2022-01-10 00:00:00
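
The window columns work because "the two dates are at most 30 days apart" is the same test as "the other date lies between Date - 30 and Date + 30", and a range test is something a SQL join condition can express directly. A minimal base-R sketch of that equivalence, using made-up dates rather than the data above:

```r
# Hypothetical dates, for illustration only
date  <- as.Date("2021-01-02")
other <- as.Date(c("2020-12-03", "2021-01-01", "2021-02-01", "2021-02-02"))

# Test 1: absolute difference in days
within_30 <- abs(as.numeric(other - date)) <= 30

# Test 2: range check against precomputed window bounds
lower <- date - 30
upper <- date + 30
in_window <- other >= lower & other <= upper

within_30                        # TRUE TRUE TRUE FALSE
identical(within_30, in_window)  # TRUE
```

Both bounds are inclusive here, which matches the behaviour of SQL's between.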

The join:

sqldf::sqldf("
    select df11.*, df22.DOB, df22.EncounterDate, df22.Type, df22.responses
    from df11
      left join df22 on df11.UserID = df22.UserID
        and df22.EncounterDate between df11.Date_m30 and df11.Date_p30") %>%
  select(-Date_m30, -Date_p30)
#   UserID   Full.Name Info EncounterID                Date Temp   misc     DOB       EncounterDate  Type responses
# 1      1  John Smith  yes          13 2021-01-01 19:00:00  100 (null)  1/1/90 2020-12-31 19:00:00 Intro    (null)
# 2      2 Jack Peters   no          14 2021-01-04 19:00:00  103     no 1/10/90 2021-01-01 19:00:00 Intro        no
# 3      3   Bob Brown  yes          15 2020-12-31 19:00:00  104 (null)  1/2/90 2020-12-31 19:00:00 Intro       yes
# 4      4    Jane Doe  yes          16 2021-01-04 19:00:00  103 (null) 2/20/80 2021-01-05 19:00:00 Intro        no
# 5      5 Jackie Jane  yes          17 2021-05-08 20:00:00  101 (null)  2/2/80 2021-05-06 20:00:00  Care        no
# 6      6 Sarah Brown  yes          18 2021-05-07 20:00:00  102 (null) 12/2/80 2021-05-07 20:00:00   Out     unsat
# 7      7 Chloe Brown   no          19 2021-12-11 19:00:00  103 (null)    <NA>                <NA>  <NA>      <NA>
# 8      1  John Smith  yes          13 2021-12-10 19:00:00  105 (null)    <NA>                <NA>  <NA>      <NA>
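
Since the question also asks about fuzzy joins, here is a sketch of the same match using fuzzyjoin::fuzzy_left_join(). It was not part of the original answer. It assumes the fuzzyjoin package is installed, rebuilds trimmed-down copies of the two data frames (x and y are illustrative names), and stores the dates as Date values rather than date-times. The times in the sqldf output above (19:00 and 20:00) look like the original UTC midnights printed in a local time zone; plain dates have no time-of-day component, so that shift does not come up.

```r
library(fuzzyjoin)

# Trimmed-down copies of the question's data (x and y are illustrative names)
x <- data.frame(
  UserID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L),
  Temp   = c("100", "103", "104", "103", "101", "102", "103", "105"),
  Date   = as.Date(c("1/2/21", "1/5/21", "1/1/21", "1/5/21",
                     "5/9/21", "5/8/21", "12/12/21", "12/11/21"),
                   format = "%m/%d/%y")
)
y <- data.frame(
  UserID        = 1:6,
  Type          = c("Intro", "Intro", "Intro", "Intro", "Care", "Out"),
  EncounterDate = as.Date(c("1/1/21", "1/2/21", "1/1/21", "1/6/21",
                            "5/7/21", "5/8/21"),
                          format = "%m/%d/%y")
)

# Window bounds: adding a number to a Date shifts it by that many days
x$Date_m30 <- x$Date - 30
x$Date_p30 <- x$Date + 30

# One comparison function per column pair:
# UserID must be equal, and Date_m30 <= EncounterDate <= Date_p30
res <- fuzzy_left_join(
  x, y,
  by = c("UserID"   = "UserID",
         "Date_m30" = "EncounterDate",
         "Date_p30" = "EncounterDate"),
  match_fun = list(`==`, `<=`, `>=`)
)
res
```

The result keeps all eight rows of x. The two rows with no encounter inside the window (user 7 and the December row for user 1) carry NA in the columns that come from y. Because UserID exists in both inputs, it should come back as UserID.x and UserID.y. If unmatched rows from both sides are wanted, as the question's "full join" suggests, fuzzy_full_join() takes the same arguments.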

Regarding "r - Merging two data frames using a fuzzy merge/sqldf", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69856811/
