
r - Merge two tables by LIKE, but match on whole strings rather than parts of strings

Repost. Author: 行者123. Updated: 2023-12-03 01:17:56

This is my first post/question, so please be kind. I have a data frame like this:

  id       product
1 00109290 Wax Salt; Pepper
2 23243242 Wood Stuff
3 23242433 Magic Unicorn Powder and My Tears
4 23778899 gelatin
5 25887766 tin;
6 7786655  fart noises, and things
7 3432422  --spearmint bacon& hydrangia leaves

I have a lookup table like this:

        ingredients
1 wax
2 salt
3 wood
4 my tears
5 unicorn powder
6 gelatin
7 tin
8 hydrangia leaves
9 spearmint
10 bacon

I want to merge them on whole strings, so that I get this:

   id       product                             ingredients
1  00109290 Wax Salt; Pepper                    wax
2  00109290 Wax Salt; Pepper                    salt
3  23243242 Wood Stuff                          wood
4  23242433 Magic Unicorn Powder and My Tears   my tears
5  23242433 Magic Unicorn Powder and My Tears   unicorn powder
6  23778899 gelatin                             gelatin
7  25887766 tin;                                tin
8  3432422  --spearmint bacon& hydrangia leaves hydrangia leaves
9  3432422  --spearmint bacon& hydrangia leaves spearmint
10 3432422  --spearmint bacon& hydrangia leaves bacon

Instead, I get this (note that row 7 is unwanted):

   id       product                             ingredients
1  00109290 Wax Salt; Pepper                    wax
2  00109290 Wax Salt; Pepper                    salt
3  23243242 Wood Stuff                          wood
4  23242433 Magic Unicorn Powder and My Tears   my tears
5  23242433 Magic Unicorn Powder and My Tears   unicorn powder
6  23778899 gelatin                             gelatin
7  23778899 gelatin                             tin
8  25887766 tin;                                tin
9  3432422  --spearmint bacon& hydrangia leaves hydrangia leaves
10 3432422  --spearmint bacon& hydrangia leaves spearmint
11 3432422  --spearmint bacon& hydrangia leaves bacon

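I am very close, but I am incorrectly matching "gelatin" with "tin". I want to match whole words, not parts of words. I have tried many different techniques; the closest I have come is: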

library(sqldf)

# Product table: ids are kept as strings so leading zeros survive
id <- c('00109290', '23243242', '23242433',
        '23778899', '25887766', '7786655',
        '3432422')
product <- c('Wax Salt; Pepper', 'Wood Stuff',
             'Magic Unicorn Powder and My Tears',
             'gelatin', 'tin;', 'fart noises, and things',
             '--spearmint bacon& hydrangia leaves')

# Lookup table of ingredients
ingredients <- c('wax', 'salt', 'wood', 'my tears',
                 'unicorn powder', 'gelatin', 'tin',
                 'hydrangia leaves',
                 'spearmint', 'bacon')

products <- data.frame(id, product)
ingred <- data.frame(ingredients)

# LIKE '%...%' is a substring match, so 'tin' also matches inside 'gelatin'
new_df <- sqldf("SELECT * from products
                 join ingred on product LIKE '%' || ingredients || '%'")

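Any suggestions are greatly appreciated. Maybe a completely different approach is needed? I also welcome advice on the quality of the question; this is my first one, so please point out any problems right away.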

Best answer

A solution using the fuzzyjoin package and str_detect from stringr:

library(fuzzyjoin)
library(stringr)

f <- function(x, y) {
  # tests whether y is an ingredient of x
  str_detect(x, regex(paste0("\\b", y, "\\b"), ignore_case = TRUE))
}

fuzzy_join(products,
           ingred,
           by = c("product" = "ingredients"),
           match_fun = f)
#   id       product                           ingredients
# 1 109290   Wax Salt; Pepper                  wax
# 2 109290   Wax Salt; Pepper                  salt
# 3 23243242 Wood Stuff                        wood
# 4 23242433 Magic Unicorn Powder and My Tears my tears
# 5 23242433 Magic Unicorn Powder and My Tears unicorn powder
# 6 23778899 gelatin                           gelatin
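
To see why the `\\b` anchors in the matching function matter, here is a minimal base R sketch comparing a plain substring test (the behaviour of `LIKE '%tin%'`) with a whole-word test. The variable names are only illustrative, and `grepl()` with `perl = TRUE` stands in for `str_detect()` here.

```r
# Substring test, analogous to LIKE '%tin%': "gelatin" contains "tin".
substring_hit <- grepl("tin", "gelatin", fixed = TRUE)

# Whole-word test: \\b requires a word boundary on both sides of "tin".
pattern <- "\\btin\\b"
word_in_gelatin <- grepl(pattern, "gelatin", ignore.case = TRUE, perl = TRUE)
word_in_tin     <- grepl(pattern, "tin;", ignore.case = TRUE, perl = TRUE)

# Multi-word ingredients work too; ignore.case handles "My Tears".
word_in_tears <- grepl("\\bmy tears\\b", "Magic Unicorn Powder and My Tears",
                       ignore.case = TRUE, perl = TRUE)

print(c(substring_hit, word_in_gelatin, word_in_tin, word_in_tears))
# [1]  TRUE FALSE  TRUE  TRUE
```

The semicolon in "tin;" is not a word character, so a boundary still exists after "tin" and the match succeeds.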

Data

products <- read.table(text = "
  id       product
1 00109290 'Wax Salt; Pepper'
2 23243242 'Wood Stuff'
3 23242433 'Magic Unicorn Powder and My Tears'
4 23778899 gelatin
", stringsAsFactors = FALSE)

ingred <- read.table(text = "
  ingredients
1 wax
2 salt
3 wood
4 'my tears'
5 'unicorn powder'
6 gelatin
7 tin
", stringsAsFactors = FALSE)

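The answer's sample data covers only the first four products, and `read.table` parses `id` as an integer, which is why the leading zeros disappear in its output. As a rough sketch of the same word-boundary idea applied to the full data from the question without extra packages, the following base R code builds every product/ingredient pair and keeps only the whole-word matches. The helper `has_word` is a hypothetical name, and the code assumes the ingredient strings contain no regex metacharacters.

```r
# Full data from the question; ids are strings so leading zeros are kept.
products <- data.frame(
  id = c("00109290", "23243242", "23242433", "23778899",
         "25887766", "7786655", "3432422"),
  product = c("Wax Salt; Pepper", "Wood Stuff",
              "Magic Unicorn Powder and My Tears",
              "gelatin", "tin;", "fart noises, and things",
              "--spearmint bacon& hydrangia leaves"),
  stringsAsFactors = FALSE
)
ingred <- data.frame(
  ingredients = c("wax", "salt", "wood", "my tears", "unicorn powder",
                  "gelatin", "tin", "hydrangia leaves", "spearmint", "bacon"),
  stringsAsFactors = FALSE
)

# Cartesian product: every product paired with every ingredient (7 x 10 rows).
pairs <- merge(products, ingred, by = NULL)

# Hypothetical helper: TRUE when `word` occurs in `text` as a whole word.
# Assumes `word` has no regex metacharacters.
has_word <- function(text, word) {
  grepl(paste0("\\b", word, "\\b"), text, ignore.case = TRUE, perl = TRUE)
}

# grepl() is not vectorised over patterns, so test the pairs one by one.
keep <- unname(mapply(has_word, pairs$product, pairs$ingredients))

# Keep the matches and restore the original product order.
result <- pairs[keep, ]
result <- result[order(match(result$id, products$id)), ]
rownames(result) <- NULL
print(result)
```

The cross join grows with the product of the two table sizes, so this is only reasonable for small lookup tables like the one here.
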
Regarding "r - Merge two tables by LIKE, but match on whole strings rather than parts of strings", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44728871/
