
r - Joining two data frames on a condition (grepl)

Repost. Author: 行者123. Updated: 2023-12-02 16:05:16

I want to join two data frames on a condition, in this case that one string is contained within another. Say I have two data frames,

df1 <- data.frame(fullnames=c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"), 
ages = c(30, 51, 45, 38, 20))

       fullnames ages
1       Jane Doe   30
2 Mr. John Smith   51
3 Nate Cox, Esq.   45
4   Bill Lee III   38
5 Ms. Kate Smith   20

df2 <- data.frame(lastnames=c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages=c(30, 45, 20, 28, 51, 38),
homestate=c("NJ", "CT", "MA", "RI", "MA", "NY"))

  lastnames ages homestate
1       Doe   30        NJ
2       Cox   45        CT
3     Smith   20        MA
4      Jung   28        RI
5     Smith   51        MA
6       Lee   38        NY

I want to left join these two data frames on ages and on the rows where df2$lastnames is contained in df1$fullnames. I thought fuzzy_join might do it, but it does not seem to like my grepl:

joined_dfs <- fuzzy_join(df1, df2, by = c("ages", "fullnames"="lastnames"), 
+ match_fun = c("=", "grepl()"),
+ mode="left")
Error in which(m) : argument to 'which' is not logical

Desired result: a data frame identical to the first, but with a "homestate" column appended. Any ideas?

Best answer

TL;DR

You only need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...

Background

You had the right idea, but you misinterpreted the match_fun argument of fuzzyjoin::fuzzy_join(). According to the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
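
To make that contract concrete, here is a minimal base-R sketch of what "vectorized function given two columns" means. The vectors x and y and the helper within_one() are hypothetical illustrations, not part of fuzzyjoin.

```r
# Two equal-length "columns", in the shape a match function receives them.
x <- c(30, 51, 45)
y <- c(30, 20, 45)

# `==` is an ordinary function in R; called by name, it compares element-wise
# and returns one TRUE/FALSE per pair.
matches <- `==`(x, y)
print(matches)

# A hypothetical custom matcher with the same shape: TRUE when the two
# values differ by at most 1.
within_one <- function(a, b) abs(a - b) <= 1
print(within_one(x, y))
```

Any function with this shape (two vectors in, one logical vector out) fits the description quoted above.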

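Solution

A simple correction does the trick, with further formatting via dplyr. For conceptual clarity, I have typographically aligned each by column with the function used to match it: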

library(dplyr)
library(fuzzyjoin)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        = c("ages", "fullnames" = "lastnames"),
  #             |----|  |-----------------------|
  match_fun = list(`==` , stringr::str_detect    ),
  #                |--|   |-----------------|
  # Match by equality ^   ^ Match by detection of `lastnames` in `fullnames`

  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given the sample data you reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because each value of ages happens to be a unique key, the following join on the *names columns alone

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

better illustrates the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where you went wrong

Type error

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Rather than providing a list of symbols

match_fun = list(`=`, grepl)

you mistakenly provided a vector of character strings:

match_fun = c("=", "grepl()")

Syntax error

You should have named the functions

`=`
grepl

yet you mistakenly attempted to call them:

=
grepl()

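Naming them passes the functions themselves to match_fun, as intended, whereas calling them passes their return values*. In R, an operator such as = is named using backticks: `=`.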

* Assuming the calls do not fail with an error. Here, they do fail.
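
The difference between naming and calling can be checked directly in base R. The following is a small sketch; the variable call_result is a hypothetical name used only for illustration.

```r
# Naming a function yields the function object itself.
print(is.function(`==`))       # TRUE
print(is.function(grepl))      # TRUE

# A character string that merely spells a call is not a function.
print(is.function("grepl()"))  # FALSE

# Calling grepl() with no arguments raises an error, so there is no
# return value that could be handed to match_fun.
call_result <- tryCatch(
  grepl(),
  error = function(e) "error"
)
print(call_result)             # "error"
```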

Inappropriate functions

To compare two values for equality, here the numeric vectors df1$ages and df2$ages, you should use the relational operator ==; yet you mistakenly supplied the assignment operator =.

Furthermore, grepl() is not vectorized in quite the way match_fun expects. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.
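
This behavior is easy to see with a few of the sample names. The sketch below uses base R only; the vectors are a hypothetical subset of the question's data, reordered so that the two approaches give different answers, and mapply() is shown purely to illustrate what element-by-element pairing would look like.

```r
fullnames <- c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.")
lastnames <- c("Doe", "Smith", "Cox")

# grepl() uses only the first pattern ("Doe"), so every full name is tested
# against "Doe". R warns about this; the warning is suppressed here.
first_only <- suppressWarnings(grepl(lastnames, fullnames))
print(first_only)

# Pairing pattern i with string i requires an explicit element-wise loop,
# here via mapply(). fixed = TRUE treats each pattern as a literal string.
pairwise <- unname(
  mapply(grepl, lastnames, fullnames, MoreArgs = list(fixed = TRUE))
)
print(pairwise)
```

Only the pairwise version compares each last name with the full name in the same position, which is the behavior a two-column match function needs.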

The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and which wraps stringi::stri_detect().

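Note

Since you simply want to detect whether the literal strings in df1$fullnames contain the literal strings in df2$lastnames, you do not want to accidentally treat the strings in df2$lastnames as regular expression patterns. As it stands, your df2$lastnames column is statistically unlikely to contain names with special regex characters. The lone exception is -, which is interpreted literally outside of [], and brackets are very unlikely to be found in a name.

If you are still worried about accidental regular expressions, you may want to consider the alternative search methods stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively, by byte or by "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.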
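
As a rough illustration of the difference, the sketch below compares regex and fixed matching in stringi. The names and patterns are hypothetical, chosen so that a "." inside a pattern changes the outcome; they do not come from the question's data.

```r
library(stringi)

fullnames <- c("Bill Lee III", "Dr. A. Smith")
patterns  <- c("Le.", "A.")  # hypothetical "last names" containing a period

# As regular expressions, "." matches any character, so "Le." matches "Lee".
as_regex <- stri_detect_regex(fullnames, patterns)
print(as_regex)

# As fixed (literal) strings, "Le." must appear verbatim, and it does not.
as_fixed <- stri_detect_fixed(fullnames, patterns)
print(as_fixed)
```

Because stri_detect_fixed() is vectorized over both of its arguments, it has the two-column shape described for match_fun, so it could presumably be supplied in place of stringr::str_detect in the solution above; that substitution is an assumption rather than something tested here.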

Regarding "r - Joining two data frames on a condition (grepl)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/69574373/
