
r - Create a binary yes/no animal variable based on a match with any term in a dictionary, "animal", in R

Reposted. Author: 行者123. Updated: 2023-12-02 19:04:16

Following up on this question: "R: Create category column reflecting match between a dictionary and column in df". I have a large dataset, "df", with 30,000 rows, and two large dictionary data frames: (1) animal, 600k rows; (2) nature, 300k rows.

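I am just trying to figure out how to create two simple binary variables, "df$content_animal" and "df$content_nature", based on whether each row of df$content matches any term in the "animal" or "nature" dictionary (1 = match, 0 = no match).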

Here is a sample of the data; I can't include the whole dataset here:

df <- tibble(content= c("hello turkey feet blah blah blah", "i love rabbits haha", "wow this sunlight is amazing", "omg did u see the rainbow?!", "turtles like swimming in the water", "i love running across grassy lawns with my dog"))

animal=c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature=c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")

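I tried the following code, based on multi-pattern matching, without success. I suspect the cause is the sheer size of my dataset and dictionaries/patterns: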

df$content_animal <- grepl(paste(animal,collapse="|"),df$content,ignore.case=TRUE)
df$content_nature <- grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)

This returns the errors:

Error in grepl(paste(animal,collapse="|"), df$content,  : 
  invalid regular expression, reason 'Out of memory'
Error in grepl(paste(nature,collapse="|"), df$content, :
  invalid regular expression, reason 'Out of memory'

I also tried:

df<-df %>%
mutate(
content_animal = case_when(grepl(animal, content) ~ "1")
)
df<-df %>%
mutate(
content_nature = case_when(grepl(nature, content) ~ "1")
)

This returns:

Problem with `mutate()` input `content_animal`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_animal` is `case_when(grepl(animal, content) ~ "1")`.
argument 'pattern' has length > 1 and only the first element will be used
Problem with `mutate()` input `content_nature`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_nature` is `case_when(grepl(nature, content) ~ "1")`.
argument 'pattern' has length > 1 and only the first element will be used

I also tried:

bench::mark(basic = mutate(df, content_animal = 1L*map_lgl(content, ~any(str_detect(.x, animal))),
content_nature = 1L*map_lgl(content, ~any(str_detect(.x, nature)))),
fixed = mutate(df, content_animal = 1L*map_lgl(content, ~any(str_detect(.x, fixed(animal)))),
content_nature = 1L*map_lgl(content, ~any(str_detect(.x, fixed(nature))))))

This ran for more than two hours without giving me any output.

I really don't know what I should do. Does anyone have any ideas? Is there a better package or code I could use for data of this size?

Best answer

It may be better to loop over the terms with lapply and combine the per-term results with Reduce:

Reduce(`|`, lapply(nature, function(x) grepl(x, df$content, ignore.case = TRUE)))
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
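
To get the 0/1 columns the question asks for, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: match_any() is a hypothetical name, the dictionary entries are taken to be plain words (so fixed = TRUE is safe), and a plain data.frame stands in for the tibble. It accumulates the result in a loop rather than calling lapply first, because lapply would hold one logical vector per term in memory at once, which for 600k terms and 30,000 rows could run to tens of gigabytes.

```r
# Sample data from the question (base data.frame, so no packages are needed)
df <- data.frame(
  content = c("hello turkey feet blah blah blah",
              "i love rabbits haha",
              "wow this sunlight is amazing",
              "omg did u see the rainbow?!",
              "turtles like swimming in the water",
              "i love running across grassy lawns with my dog"),
  stringsAsFactors = FALSE
)
animal <- c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature <- c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")

# match_any() is a hypothetical helper, not part of any package.
# Assumption: terms are literal strings, not regular expressions.
match_any <- function(terms, text) {
  # fixed = TRUE cannot be combined with ignore.case, so lower-case both sides
  text <- tolower(text)
  hits <- logical(length(text))  # all FALSE to start with
  for (term in tolower(terms)) {
    # OR each term's matches into the running result; only one
    # temporary logical vector exists at a time
    hits <- hits | grepl(term, text, fixed = TRUE)
  }
  as.integer(hits)  # 1 = match, 0 = no match
}

df$content_animal <- match_any(animal, df$content)
df$content_nature <- match_any(nature, df$content)
df
```

On the sample data this gives content_animal = 1 1 0 0 1 1 and content_nature = 0 0 1 1 1 1. Matching is by substring, so "rabbit" matches "rabbits" and "grass" matches "grassy", just as in the grepl calls above.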

The Reduce call gives the same result as the single alternation pattern on the sample data:
grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
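
A middle ground between one huge pattern and one grepl call per term is to split the dictionary into chunks and build a smaller alternation for each. The sketch below is untested at the scale described in the question; chunked_match() is a hypothetical helper, chunk_size = 1000 is an arbitrary starting value, and the terms are assumed to contain no regex metacharacters. Smaller patterns may avoid the 'Out of memory' error, but that would need to be checked on the real dictionaries.

```r
# chunked_match() is a hypothetical helper, not part of any package.
# Assumption: terms contain no regex metacharacters (no escaping is done).
chunked_match <- function(terms, text, chunk_size = 1000) {
  # Assign terms to consecutive groups of at most chunk_size entries
  groups <- split(terms, ceiling(seq_along(terms) / chunk_size))
  hits <- logical(length(text))  # all FALSE to start with
  for (g in groups) {
    pattern <- paste(g, collapse = "|")  # one alternation per chunk
    hits <- hits | grepl(pattern, text, ignore.case = TRUE)
  }
  as.integer(hits)  # 1 = match, 0 = no match
}

content <- c("hello turkey feet blah blah blah",
             "i love rabbits haha",
             "wow this sunlight is amazing",
             "omg did u see the rainbow?!",
             "turtles like swimming in the water",
             "i love running across grassy lawns with my dog")
nature <- c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")

# A tiny chunk size here just to exercise the chunking on the sample data
res <- chunked_match(nature, content, chunk_size = 3)
res
```

The chunk size trades the number of grepl passes over the text against the size of each compiled pattern, so it is worth timing a few values on a subset of the data.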

Regarding "r - Create a binary yes/no animal variable based on a match with any term in a dictionary, "animal", in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65259967/
