r - Vectorizing data.table's like, grepl, or similar for string comparison on big data

Reposted. Author: 行者123. Updated: 2023-12-04 22:19:43

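I need to check, for every row, whether the string in one column contains the corresponding (numeric) value from another column of the same row.

If I were only checking the strings against a single pattern, this would be straightforward with data.table's like or with grepl. However, my pattern value is different for each row.

There is a somewhat related question here, but unlike that question, I need to create a logical flag indicating whether the pattern is present.

Suppose this is my dataset: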

library(data.table)

DT <- structure(list(category = c("administration", "nurse practitioner",
"trucking", "administration", "warehousing", "warehousing", "trucking",
"nurse practitioner", "nurse practitioner"), industry = c("admin",
"truck", "truck", "admin", "nurse", "admin", "truck", "nurse",
"truck")), .Names = c("category", "industry"), class = "data.frame", row.names = c(NA,
-9L))
setDT(DT)
> DT
             category industry
1:     administration    admin
2: nurse practitioner    truck
3:           trucking    truck
4:     administration    admin
5:        warehousing    nurse
6:        warehousing    admin
7:           trucking    truck
8: nurse practitioner    nurse
9: nurse practitioner    truck

The result I want is a vector like this:
> DT
   matches
1:    TRUE
2:   FALSE
3:    TRUE
4:    TRUE
5:   FALSE
6:   FALSE
7:    TRUE
8:    TRUE
9:   FALSE

Of course, 1 and 0 would be just as good as TRUE and FALSE.

Here are some things I have tried that did not work:
> apply(DT,1,grepl, pattern = DT[,2], x = DT[,1])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> apply(DT,1,grepl, pattern = DT[,1], x = DT[,2])
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

> grepl(DT[,2], DT[,1])
[1] FALSE

> DT[Vectorize(grepl)(industry, category, fixed = TRUE)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> DT[stringi::stri_detect_fixed(category, industry)]
             category industry
1:     administration    admin
2:           trucking    truck
3:     administration    admin
4:           trucking    truck
5: nurse practitioner    nurse

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1]))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> for(i in 1:nrow(DT)){print(grepl(DT[i,2], DT[i,1], fixed = T))}
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE

> DT[category %like% industry]
         category industry
1: administration    admin
2: administration    admin
Warning message:
In grepl(pattern, vector) :
  argument 'pattern' has length > 1 and only the first element will be used

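Best answer

In the OP's code, the comma (,) was not used inside the brackets. data.table therefore treats the expression as the i index and subsets the rows for which it is TRUE.

However, if we add the comma, the expression is evaluated in j, and we get the logical vector itself as the result: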

DT[, stringi::stri_detect_fixed(category, industry)]
#[1] TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
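
The difference between the i and j positions can be seen side by side in a minimal sketch. It assumes the data.table and stringi packages are installed, and it uses a three-row subset of the question's data for brevity; the names rows and flag are illustrative only.

```r
library(data.table)
library(stringi)

# Three-row subset of the question's data (illustrative)
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking"),
  industry = c("admin", "truck", "truck")
)

# No comma: the logical vector is used as i, so matching rows are returned
rows <- DT[stri_detect_fixed(category, industry)]

# Leading comma: the expression is evaluated in j and returned as a vector
flag <- DT[, stri_detect_fixed(category, industry)]

print(rows)
print(flag)
```

Here stri_detect_fixed is vectorized over both arguments, so each category is compared only with the industry from the same row.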

If we wrap it in a list, we get a data.table with a single column instead:
DT[, list(match = stringi::stri_detect_fixed(category, industry))]
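
If the flag should sit next to the original columns rather than in a new table, one option is data.table's := operator, which adds a column by reference. The following is a sketch under the same assumptions as before (data.table and stringi installed, a three-row subset of the data); the column names matches and matches_int and the variable base_flag are hypothetical, and the mapply call is a base R alternative that is not part of the original answer.

```r
library(data.table)
library(stringi)

# Three-row subset of the question's data (illustrative)
DT <- data.table(
  category = c("administration", "nurse practitioner", "trucking"),
  industry = c("admin", "truck", "truck")
)

# := adds the logical flag to DT in place, keeping the existing columns
DT[, matches := stri_detect_fixed(category, industry)]

# as.integer() gives the 1/0 form that the question says is equally acceptable
DT[, matches_int := as.integer(matches)]

# Base R alternative: mapply() calls grepl(pattern, x) once per row
base_flag <- mapply(grepl, DT$industry, DT$category,
                    MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(DT)
print(base_flag)
```

The mapply version loops at the R level, so on large data the vectorized stringi call is likely to be the faster of the two, although that is worth measuring on the actual data.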

Regarding "r - Vectorizing data.table's like, grepl, or similar for string comparison on big data", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35660709/