r - Conditional merge on multiple variables in R

Reposted. Author: 行者123. Updated: 2023-12-04 08:52:58

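Please help me find an efficient way to merge these two data frames without using a for loop. The real data has more columns and rows, but I have simplified it for this example.
I am looking for: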

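  • A left join that keeps the rows of df intact and brings in column D from lookup.
  • A join based on two columns.
  • First, check column x with a fuzzy match. I want to take x from df and see whether any x in lookup is a partial string match (the lookup x string is contained in the df x string). If there is no match, it should fall back to the "All Else" x value.
  • Then, once the x value has been chosen, check for an exact match on y and return the corresponding D value.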


  • These are the two tables I am starting with:
    df = structure(list(x = c("San Francisco", "Work at Home", "Arlington VA", 
    "Work at Home", "Arlington"), y = c(1, 5, 1, 6, 2)), row.names = c(NA,
    -5L), class = c("tbl_df", "tbl", "data.frame"))

    lookup = structure(list(x = c("Arlington", "Arlington", "Arlington", "Arlington",
    "Arlington", "Arlington", "Arlington", "Arlington", "Arlington",
    "Arlington", "Arlington", "Arlington", "Arlington", "Chicago",
    "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago",
    "Chicago", "Chicago", "Chicago", "Chicago", "Chicago", "Chicago",
    "San Diego", "San Diego", "San Diego", "San Diego", "San Diego",
    "San Diego", "San Diego", "San Diego", "San Diego", "San Diego",
    "San Diego", "San Diego", "San Diego", "Lisle", "Lisle", "Lisle",
    "Lisle", "Lisle", "Lisle", "Lisle", "Lisle", "Lisle", "Lisle",
    "Lisle", "Lisle", "Lisle", "Brea", "Brea", "Brea", "Brea", "Brea",
    "Brea", "Brea", "Brea", "Brea", "Brea", "Brea", "Brea", "Brea",
    "Boston", "Boston", "Boston", "Boston", "Boston", "Boston", "Boston",
    "Boston", "Boston", "Boston", "Boston", "Boston", "Boston", "Austin",
    "Austin", "Austin", "Austin", "Austin", "Austin", "Austin", "Austin",
    "Austin", "Austin", "Austin", "Austin", "Austin", "Dallas", "Dallas",
    "Dallas", "Dallas", "Dallas", "Dallas", "Dallas", "Dallas", "Dallas",
    "Dallas", "Dallas", "Dallas", "Dallas", "Miami", "Miami", "Miami",
    "Miami", "Miami", "Miami", "Miami", "Miami", "Miami", "Miami",
    "Miami", "Miami", "Miami", "Bedford", "Bedford", "Bedford", "Bedford",
    "Bedford", "Bedford", "Bedford", "Bedford", "Bedford", "Bedford",
    "Bedford", "Bedford", "Bedford", "All Else", "All Else", "All Else",
    "All Else", "All Else", "All Else", "All Else", "All Else", "All Else",
    "All Else", "All Else", "All Else", "All Else"), y = c(1, 2,
    3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8,
    9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6,
    7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
    13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5,
    6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
    12, 13, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4,
    5, 6, 7, 8, 9, 10, 11, 12, 13), D = c(0.88, 0.7, 0.19, 0.12,
    0.26, 0.68, 0.1, 1, 0.68, 0.96, 0.75, 0.08, 0.25, 0.3, 0.64,
    0.35, 0.94, 0.21, 0.15, 0.19, 0.84, 0.94, 0.03, 0.39, 0.42, 0.76,
    0.48, 0.71, 0.75, 0.87, 0.18, 0.53, 0.45, 0.1, 0.66, 0.01, 0.22,
    0.11, 0.79, 0.82, 0.11, 0.66, 0.91, 0.59, 0.55, 0.66, 0.29, 0.58,
    0.26, 0.36, 0.07, 0.47, 0.47, 0.45, 0.15, 0.07, 0.49, 0.67, 0.8,
    0.82, 0.89, 0.36, 0.3, 0.57, 0.44, 0.09, 0.59, 0.65, 0.12, 0.05,
    0.87, 0.47, 0.24, 0.17, 0.56, 0.13, 0.84, 0.17, 0.61, 0.73, 0.31,
    0.79, 0.64, 0.6, 0.63, 0.36, 0.41, 0.15, 0.79, 0.59, 0.2, 0.59,
    0.89, 0.46, 0.77, 0.79, 0.5, 0.99, 0.22, 0.77, 0.9, 0.86, 0.6,
    0.41, 0.95, 0.38, 0.86, 0.82, 0.68, 0.3, 0.75, 0.29, 0.16, 0.88,
    0.3, 0.53, 0.14, 0.23, 0.16, 0.88, 0.93, 0.63, 0.41, 0.72, 0.58,
    0.58, 0.63, 0.66, 0.98, 0.25, 0.68, 0.92, 0.67, 0.67, 0.11, 0.16,
    0.3, 0.36, 0.32, 0.66, 0.34, 0.89, 0.33)), row.names = c(NA,
    -143L), class = c("tbl_df", "tbl", "data.frame"))
    This is the output I want:
    output = structure(list(x = c("San Francisco", "Work at Home", "Arlington VA", 
    "Work at Home", "Arlington"), y = c(1, 5, 1, 6, 2), D = c(0.68,
    0.11, 0.88, 0.16, 0.7)), row.names = c(NA, -5L), class = c("tbl_df",
    "tbl", "data.frame"))

    Best answer

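    You can use the dplyr and stringr packages for this problem.
    First, build a single regular expression that covers all the lookup patterns, using distinct, pull, and paste.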

    library(dplyr)
    library(stringr)

    xvec <- paste(paste0(paste0("\\b",lookup %>% distinct(x) %>% pull()),"\\b"), collapse = '|')

    >xvec
    [1] "\\bArlington\\b|\\bChicago\\b|\\bSan Diego\\b|\\bLisle\\b|\\bBrea\\b|\\bBoston\\b|\\bAustin\\b|\\bDallas\\b|\\bMiami\\b|\\bBedford\\b|\\bAll Else\\b"
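The `\\b` markers are word boundaries, so each key only matches as a whole word inside the df string. A minimal sketch of that behaviour, assuming stringr is installed; the string "Arlingtonville" is a made-up example that does not appear in the question's data:

```r
library(stringr)

# One key wrapped in word boundaries, as in xvec
pattern <- "\\bArlington\\b"

# "Arlingtonville" is a hypothetical value used only to show the boundary effect
samples <- c("Arlington VA", "Arlingtonville", "San Francisco")

# TRUE only where "Arlington" appears as a complete word
is_match <- str_detect(samples, pattern)
is_match
```

Without the boundaries, "Arlingtonville" would also match. If the lookup keys could contain regex metacharacters such as `.` or `(`, they would presumably need escaping before being pasted into the pattern; the keys in this example do not.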
    Now you can use the str_match function from the stringr package. case_when sets the new column xnew to "All Else" when there is no match. The result is stored in the dfnew table.
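    dfnew <- df %>%
      # str_match() returns a matrix; [, 1] keeps the full match as a character vector
      mutate(xnew = str_match(x, xvec)[, 1]) %>%
      # rows with no match fall back to "All Else"
      mutate(xnew = case_when(!is.na(xnew) ~ xnew, TRUE ~ "All Else"))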

    >dfnew
      x                 y xnew
      <chr>         <dbl> <chr>
    1 San Francisco     1 All Else
    2 Work at Home      5 All Else
    3 Arlington VA      1 Arlington
    4 Work at Home      6 All Else
    5 Arlington         2 Arlington
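The fallback step can also be written more compactly. This is a minimal sketch rather than the answer's own code: it assumes `str_extract` (which returns a plain character vector) in place of `str_match`, and dplyr's `coalesce` in place of `case_when`. The shortened `xvec` is only for illustration; in practice it would be built from lookup as shown earlier.

```r
library(dplyr)
library(stringr)

# df as given in the question
df <- tibble(
  x = c("San Francisco", "Work at Home", "Arlington VA", "Work at Home", "Arlington"),
  y = c(1, 5, 1, 6, 2)
)

# Shortened pattern for illustration only (assumption: two keys instead of all of them)
xvec <- "\\bArlington\\b|\\bChicago\\b"

dfnew <- df %>%
  # first matching key, or "All Else" where str_extract() returns NA
  mutate(xnew = coalesce(str_extract(x, xvec), "All Else"))

dfnew
```

Note that `str_extract` keeps only the first match, so a df string containing two lookup keys would be assigned whichever key appears first in the string.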
    Finally, join the tables. To do this, group dfnew by xnew and y, and group lookup by x and y. After the join you get the desired output.
    output <- dfnew %>%
      group_by(xnew, y) %>%
      left_join(lookup %>% group_by(x, y), by = c("xnew" = "x", "y" = "y")) %>%
      ungroup() %>%
      select(-xnew)

    >output
      x                 y     D
      <chr>         <dbl> <dbl>
    1 San Francisco     1  0.68
    2 Work at Home      5  0.11
    3 Arlington VA      1  0.88
    4 Work at Home      6  0.16
    5 Arlington         2  0.7
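Because `left_join` matches rows purely on the columns named in `by`, the grouping is not strictly required for the join itself. A minimal sketch of the same join without `group_by`, assuming a cut-down table `lookup_small` (a hypothetical name) that holds only the five lookup rows this example needs, with D values copied from the question's data:

```r
library(dplyr)

# dfnew as produced by the matching step
dfnew <- tibble(
  x    = c("San Francisco", "Work at Home", "Arlington VA", "Work at Home", "Arlington"),
  y    = c(1, 5, 1, 6, 2),
  xnew = c("All Else", "All Else", "Arlington", "All Else", "Arlington")
)

# Hypothetical subset of lookup: only the rows used by this example
lookup_small <- tibble(
  x = c("Arlington", "Arlington", "All Else", "All Else", "All Else"),
  y = c(1, 2, 1, 5, 6),
  D = c(0.88, 0.70, 0.68, 0.11, 0.16)
)

output <- dfnew %>%
  # match the chosen key (xnew) against lookup x, and y against y
  left_join(lookup_small, by = c("xnew" = "x", "y" = "y")) %>%
  # drop the helper column so only x, y, D remain
  select(-xnew)

output
```

Every row of dfnew is kept, and a row whose (xnew, y) pair is missing from the lookup would get NA in D.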

    Regarding r - Conditional merge on multiple variables in R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63999961/
