
R: Grouping similar addresses together

Reposted. Author: 行者123. Updated: 2023-12-03 17:04:13

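I have a 400,000-row file of manually entered addresses that need to be geocoded. The file contains many different variations of the same addresses, so spending several API calls on the same address seems wasteful.
To cut this down, I would like to reduce these five rows: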

    Address
1 Main Street, Country A, World
1 Main St, Country A, World
1 Maine St, Country A, World
2 Side Street, Country A, World
2 Side St. Country A, World
down to two:
    Address
1 Main Street, Country A, World
2 Side Street, Country A, World
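With stringdist you can group the "word" parts of the strings together, but string-matching algorithms do not treat numbers specially. As a result, two different house numbers on the same street get grouped as the same address.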
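To make the problem concrete, here is a minimal sketch using only base R. It assumes a normalized Levenshtein similarity of the form 1 - distance / longest length, computed with utils::adist(); the helper name lev_sim is hypothetical and is not part of stringdist.

```r
# Normalized Levenshtein similarity, built on base R's adist().
# lev_sim is a hypothetical helper used for illustration only.
lev_sim <- function(x, y) {
  1 - adist(x, y)[1, 1] / max(nchar(x), nchar(y))
}

full        <- "1 main street, country a, world"
other_house <- "2 main street, country a, world"  # different house number
abbreviated <- "1 main st, country a, world"      # same house, abbreviated

sim_other_house <- lev_sim(full, other_house)  # one substituted character
sim_abbreviated <- lev_sim(full, abbreviated)  # four deleted characters

# The wrong house scores higher than the harmless abbreviation.
sim_other_house > sim_abbreviated
```

A single differing digit costs one edit, while "Street" versus "St" costs four, so any threshold loose enough to merge the abbreviation also merges the neighbouring house.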
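To get around this, I came up with two approaches. The first was to use regular expressions to manually split the numbers and the rest of the address into separate columns and then rejoin them. The problem is that with so many manually entered addresses there seem to be hundreds of different edge cases, and it becomes unwieldy.
Building on one answer about grouping and another about converting numbers to words, I have a second method that handles the edge cases but is very expensive computationally. Is there a better third way to do this?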
library(gsubfn)      # gsubfn(): regex replacement driven by a lookup list
library(english)     # as.english(): numbers to English words
library(qdap)        # replace_number(): digits to words
library(stringdist)
library(tidyverse)

# Assign every address to the first address it is sufficiently similar to.
similarGroups <- function(x, thresh = 0.8, method = "lv") {
  grp <- integer(length(x))
  Address <- x
  x <- tolower(x)
  for (i in seq_along(Address)) {
    if (!is.na(Address[i])) {
      sim <- stringdist::stringsim(x[i], x, method = method)
      k <- which(sim > thresh & !is.na(Address))
      grp[k] <- i
      is.na(Address) <- k  # mark as assigned so later iterations skip them
    }
  }
  grp
}

df <- data.frame(Address = c("1 Main Street, Country A, World",
                             "1 Main St, Country A, World",
                             "1 Maine St, Country A, World",
                             "2 Side Street, Country A, World",
                             "2 Side St. Country A, World"))

df1 <- df %>%
  # Converts numbers into letters
  mutate(Address = replace_number(Address),
         # Groups similar addresses together
         Address = Address[similarGroups(Address, thresh = 0.8, method = "lv")],
         # Converts letters back into numbers
         Address = gsubfn("\\w+", setNames(as.list(1:1000), as.english(1:1000)), Address)
  ) %>%
  # Removes the duplicates
  unique()
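One possible cheaper variant, offered as a sketch rather than a tested solution for the full file: block the addresses by house number first and compare strings only within each block. This avoids both the digit-to-word round trip and most of the pairwise comparisons. The function name group_by_number is hypothetical, base R's adist() stands in for stringdist, and the code assumes the house number is the first run of digits in the string, which will not hold for every manually entered address.

```r
# Hypothetical helper: block by house number, then group within each block.
group_by_number <- function(x, thresh = 0.8) {
  key <- tolower(trimws(x))
  key[is.na(key)] <- ""

  # Assumption: the first run of digits is the house number.
  m <- regexpr("[0-9]+", key)
  num <- rep("none", length(key))
  num[m > 0] <- regmatches(key, m)

  rep_idx <- seq_along(key)
  for (block in split(seq_along(key), num)) {
    remaining <- block
    while (length(remaining) > 0) {
      i <- remaining[1]
      d <- adist(key[i], key[remaining])[1, ]
      sim <- 1 - d / pmax(nchar(key[i]), nchar(key[remaining]))
      # union() guarantees progress even if a similarity is NaN.
      hit <- union(i, remaining[which(sim > thresh)])
      rep_idx[hit] <- i
      remaining <- setdiff(remaining, hit)
    }
  }
  x[rep_idx]  # each address replaced by its group's first member
}

addresses <- c("1 Main Street, Country A, World",
               "1 Main St, Country A, World",
               "1 Maine St, Country A, World",
               "2 Side Street, Country A, World",
               "2 Side St. Country A, World")

grouped <- group_by_number(addresses)
unique(grouped)
```

Comparisons are still quadratic inside a block, and the threshold still needs tuning: a long shared suffix such as ", Country A, World" inflates similarity, so two different streets with the same house number can still merge.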

Best answer

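You might want to look into OpenRefine, or the refinr package for R, which is much less visual but still good. It has two functions, key_collision_merge and n_gram_merge, which take several parameters. If you have a dictionary of good addresses, you can pass it to key_collision_merge.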
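For intuition, key-collision clustering in the OpenRefine style reduces each string to a normalized "fingerprint" and merges strings whose fingerprints collide. The sketch below is a simplified base R illustration of that idea, not refinr's actual implementation; the function name fingerprint is hypothetical, and the ASCII normalization step that OpenRefine also applies is left out.

```r
# Simplified fingerprint: lowercase, strip punctuation, then sort and
# de-duplicate the whitespace-separated tokens.
fingerprint <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)
  tokens <- strsplit(x, "[[:space:]]+")
  vapply(tokens,
         function(tk) paste(sort(unique(tk)), collapse = " "),
         character(1))
}

variants <- c("1 Main Street, Country A, World",
              "1 main street country a world",
              "Country A, World, 1 Main Street.")

keys <- fingerprint(variants)
split(variants, keys)  # all three variants fall under one key
```

Case, punctuation, and word order disappear in the key, but "St" and "Street" remain different tokens and so produce different keys.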
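It is probably a good idea to note the abbreviations you see often (St., Blvd., Rd., etc.) and replace all of them. There is surely a good table of these abbreviations somewhere, such as https://www.pb.com/docs/US/pdf/SIS/Mail-Services/USPS-Suffix-Abbreviations.pdf .
Then: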

library(tidyverse)  # tibble(), mutate(), str_replace_all(), %>%
library(refinr)

df <- tibble(Address = c("1 Main Street, Country A, World",
                         "1 Main St, Country A, World",
                         "1 Maine St, Country A, World",
                         "2 Side Street, Country A, World",
                         "2 Side St. Country A, World",
                         "3 Side Rd. Country A, World",
                         "3 Side Road Country B World"))

df2 <- df %>%
  # Expand common abbreviations before merging
  mutate(address_fix = str_replace_all(Address, "St\\.|St\\,|St\\s", "Street"),
         address_fix = str_replace_all(address_fix, "Rd\\.|Rd\\,|Rd\\s", "Road")) %>%
  # Merge near-duplicates using unigram fingerprints
  mutate(address_merge = n_gram_merge(address_fix, numgram = 1))

df2$address_merge
[1] "1 Main Street Country A, World"
[2] "1 Main Street Country A, World"
[3] "1 Main Street Country A, World"
[4] "2 Side Street Country A, World"
[5] "2 Side Street Country A, World"
[6] "3 Side Road Country A, World"
[7] "3 Side Road Country B World"
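The two hard-coded replacements for "St" and "Rd" can be generalized with a lookup table. The following is a minimal base R sketch; the table contents and the name expand_suffixes are assumptions for illustration, and a real table would be filled in from a list such as the USPS one linked in the answer.

```r
# Hypothetical lookup table: abbreviation (lowercase) -> full word.
suffixes <- c(st = "Street", rd = "Road", blvd = "Boulevard", ave = "Avenue")

expand_suffixes <- function(x, table = suffixes) {
  for (abbr in names(table)) {
    # \\b stops "St" from matching inside "Street";
    # the optional \\. absorbs a trailing period.
    pattern <- paste0("\\b", abbr, "\\b\\.?")
    x <- gsub(pattern, table[[abbr]], x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

expanded <- expand_suffixes(c("1 Main St, Country A, World",
                              "2 Side St. Country A, World",
                              "3 Side Rd. Country A, World"))
expanded
# [1] "1 Main Street, Country A, World"
# [2] "2 Side Street Country A, World"
# [3] "3 Side Road Country A, World"
```

Unlike the patterns in the answer, this version keeps a comma that follows the abbreviation. It cannot tell "St" for "Street" from "St" for "Saint", so the result is worth spot-checking before geocoding.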

This post on grouping similar addresses together in R is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/63836432/
