
Remove similar but longer duplicates from a vector

Reposted. Author: 行者123. Updated: 2023-12-02 09:22:06

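For a database cleanup, I have a vector of, say, dishes, and I want to remove all variants of a "base" dish, keeping only the base dish itself. For example, if I have...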

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
"HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
"PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

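...I want to remove every entry for which a shorter matching version already exists in the vector. The resulting vector would therefore contain only: "DAL BHAT", "HAMBURGER", "PIZZA".

A nested for loop that checks every entry against every other entry would work for this example, but it would take a long time on the large dataset at hand, and I would call it ugly code.

It can be assumed that all entries are uppercase and that the vector is already sorted. It cannot be assumed that the first entry of the next base dish is always shorter than the previous entry.

Any suggestions on how to solve this efficiently?

Bonus question: ideally, I would only remove an item from the initial vector if it is at least 3 characters longer than its shorter counterpart. In the case above, this means "HAMBURGER2" would also remain in the result vector.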

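Best answer

Here is the approach I would take. I would write a function that covers a few conditions I need to consider and apply it to the input. I have added comments to explain what happens inside the function.

The function takes 4 arguments:

  • invec: the input character vector.
  • thresh: how many leading characters to use to identify a "base" dish. Default = 5.
  • minlen: your "bonus" question. Variants at most this many characters longer than their base dish are kept. Default = 3.
  • strict: logical. If a base dish has fewer characters (nchar) than thresh, do you want to lower thresh, or be strict about it? Default = FALSE. See the last example for how strict works.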
myfun <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  # Bookkeeping -- sort, unique, all upper case
  invec <- sort(unique(toupper(invec)))
  # More bookkeeping -- thresh should not be longer
  # than the shortest base dish unless strict = TRUE
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  # Use `thresh` to get the `stubs`
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  # Loop through the stubs and do two things:
  # - Match the dishes that contain the stub
  # - Return the base dish and any dishes within minlen characters of it
  unlist(
    lapply(stubs, function(x) {
      temp <- grep(x, invec, value = TRUE, fixed = TRUE)
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
    }),
    use.names = FALSE)
}
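
The stub-selection step can be reproduced by hand to see what `myfun` works with internally. This is only an illustrative sketch that repeats the first three statements of the function on the question's sample vector; the intermediate name `prefixes` is not part of the original answer.

```r
dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

# Same bookkeeping as in myfun
invec <- sort(unique(toupper(dishes)))

# strict = FALSE: thresh is capped at the length of the shortest entry
thresh <- min(min(nchar(invec)), 5)   # 5, because "PIZZA" has 5 characters

# Leading `thresh` characters of every entry (illustrative helper variable)
prefixes <- substr(invec, 1, thresh)
unique(prefixes)
# [1] "DAL B" "HAMBU" "PIZZA"

# The first entry for each prefix becomes the stub (base dish)
stubs <- invec[!duplicated(prefixes)]
stubs
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA"
```

Because a string always sorts before any longer string that starts with it, the first entry seen for each prefix is the shortest candidate for that base dish.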

Your sample data:

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
"HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
"PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

The results:

myfun(dishes, minlen = 0)
# [1] "DAL BHAT" "HAMBURGER" "PIZZA"

myfun(dishes)
# [1] "DAL BHAT" "HAMBURGER" "HAMBURGER2" "PIZZA"
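
To see why "HAMBURGER2" survives with the default `minlen = 3` but not with `minlen = 0`, the per-stub filter inside `myfun` can be run by hand for a single stub. This is a minimal sketch; the names `candidates` and `extra` are introduced here for illustration only.

```r
invec <- c("HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2")
x <- "HAMBURGER"   # the stub (base dish)

# Same matching call as in myfun: a literal substring match, not a regex
candidates <- grep(x, invec, value = TRUE, fixed = TRUE)

# Number of characters each candidate adds to the stub
extra <- nchar(candidates) - nchar(x)
extra
# [1] 0 4 1

candidates[extra <= 0]   # equivalent to minlen = 0
# [1] "HAMBURGER"

candidates[extra <= 3]   # equivalent to minlen = 3
# [1] "HAMBURGER"  "HAMBURGER2"
```

Two details are worth keeping in mind. The comparison is `<=`, so a variant exactly `minlen` characters longer than its base dish is kept. Also, `grep(..., fixed = TRUE)` matches the stub anywhere in a string; if only leading matches should count, base R's `startsWith(invec, x)` is one possible substitute.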

Here is some more sample data. Note that in "dishes2" the data is no longer sorted and there is a new item, "DAL"; in "dishes3" there are also dishes in lowercase.

dishes2 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
"HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
"PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL")

dishes3 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
"HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
"PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL", "pizza!!")

Here is the function applied to these vectors:

myfun(dishes2, 4)
# [1] "DAL" "HAMBURGER" "HAMBURGER2" "PIZZA"

myfun(dishes3)
# [1] "DAL" "HAMBURGER" "HAMBURGER2" "PIZZA" "PIZZA!!"

myfun(dishes3, strict = TRUE)
# [1] "DAL" "DAL BHAT" "HAMBURGER" "HAMBURGER2" "PIZZA" "PIZZA!!"
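
The difference `strict` makes comes down to which `thresh` is used when picking stubs. The sketch below replays that step on "dishes2" (chosen for brevity; the same reasoning applies to "dishes3"). The names `thresh_default` and `thresh_strict` are illustrative and do not appear in the function.

```r
dishes2 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL")

invec <- sort(unique(toupper(dishes2)))

shortest <- min(nchar(invec))           # 3, from "DAL"
thresh_default <- min(shortest, 5)      # strict = FALSE: thresh is lowered to 3
thresh_strict  <- 5                     # strict = TRUE: thresh stays as given

# With the lowered thresh, "DAL" and "DAL BHAT" share the prefix "DAL"
invec[!duplicated(substr(invec, 1, thresh_default))]
# [1] "DAL"       "HAMBURGER" "PIZZA"

# With the strict thresh, "DAL" and "DAL B" are different prefixes
invec[!duplicated(substr(invec, 1, thresh_strict))]
# [1] "DAL"       "DAL BHAT"  "HAMBURGER" "PIZZA"
```

With `strict = TRUE`, "DAL BHAT" is therefore treated as its own base dish rather than as a variant of "DAL".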

On removing similar but longer duplicates from a vector, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47894225/
