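I am trying to use agrep to find the best precision for fuzzy string matching between two string names.
However, because the number of strings is huge, I need to pick a single "max.distance" precision and apply it to every string I am trying to match. Choosing the best "max.distance" value separately for each string is not feasible.
For example, suppose I use "max.distance" values of "0.2", "0.1" and "0.05" for both "BANK OF AMERICA CORP" and "1st Capital Bank".
First, here are the results for "BANK OF AMERICA CORP" with "max.distance" set to "0.2", "0.1" and "0.05":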
> agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.2)
[1] "BANK OF AMERICA/PRIVATE BANK WEST" "BANK OF AMERICA SECURITIES"
[3] "BANK OF AMERICA SEC LLC" "BANK OF AMERICA SECURITIES LLC"
[5] "BANK OF AMERICA NT & SA" "BANK OF AMERICA CORP"
[7] "ALLIANZ OF AMERICA CORP" "Bank of America Securities/Vice Pre"
[9] "Bank of America Securities/Investme" "Bank of America/President"
[11] "Bank of America Securities LLC/Prin" "Bank of America Securities LLC/Mana"
[13] "Bank of America Securities LLC/Inve" "Bank of America Securities/Principa"
[15] "Bank of America Securities LLC/Bank" "Bank of America Sec/Investment Bank"
[17] "Bank Of America Securities/Managing" "Bank of America/Chairman--Midwest A"
[19] "Bank of America Securities LLC/Vice" "Bank of America Corporation/Sales C"
[21] "Bank of America Securities/Broker" "Bank of America Corporation/Banker"
[23] "Bank of America Corporation/Senior" "Bank of America Securities/Equity R"
[25] "Bank of America Corporation/Vice Ch" "BANK OF AMERICA CORPORATION"
[27] "BANK OF AMERICA HEADQUARTERS" "BANK OF AMERICA ADMINISTRATION"
[29] "BANK OF AMERICA N A" "Bank of America/Commercial Banking"
[31] "Bank of America Sec./Investment Ban"
>
> agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.1)
[1] "BANK OF AMERICA CORP" "ALLIANZ OF AMERICA CORP"
[3] "Bank of America Corporation/Sales C" "Bank of America Corporation/Banker"
[5] "Bank of America Corporation/Senior" "Bank of America Corporation/Vice Ch"
[7] "BANK OF AMERICA CORPORATION"
>
> agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.05)
[1] "BANK OF AMERICA CORP" "Bank of America Corporation/Sales C"
[3] "Bank of America Corporation/Banker" "Bank of America Corporation/Senior"
[5] "Bank of America Corporation/Vice Ch" "BANK OF AMERICA CORPORATION"
And here are the results for "1st Capital Bank" with "max.distance" set to "0.2", "0.1" and "0.05":
> agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.2)
[1] "HURST CAPITAL PARTNERS"
[2] "SOY CAPITAL BANK"
[3] "FIRST CAPITOL BANK OF VICTOR"
[4] "OSTERWEIS CAPITAL MANAGEMENT"
[5] "1ST NATIONAL BANK"
[6] "FIRST CAPITAL BANK"
[7] "SEATTLE 1ST NAT'L BANK"
[8] "FIELD POINT CAPITAL MANAGEMENT"
[9] "SUMMERSET CAPITAL MANAGEMENT"
[10] "AMERIQUEST CAPITAL ASSOC"
[11] "BB&T CAPITAL MARKETS"
[12] "HUGHES CAPITAL MANAGEMENT"
[13] "WELLS CAPITAL MANAGEMENT"
[14] "SUPERIOR ST CAPITAL ADVISORS"
[15] "ORMES CAPITAL MARKETS INC"
[16] "1ST NAT'L BANK OF IL"
[17] "ADVENT CAPITAL MANAGEMENT"
[18] "1ST CAPITOL BANK"
[19] "BIONDI REISS CAPITAL MANAGEMENT"
[20] "CCYBYS CAPITAL MARKETS"
[21] "SEACOAST CAPITAL PARTNERS"
[22] "DOUGLAS CAPITAL MANAGEMENT"
[23] "HIGHFIELDS CAPITAL MANAGEMENT"
[24] "PRECEPT CAPITAL MANAGEMENT LP"
[25] "AUGUST CAPITAL MANAGEMENT"
[26] "SAKSA CAPITAL MANAGEMENT"
[27] "IMS CAPITAL MANAGEMENT"
[28] "TRENT CAPITAL MANAGEMENT"
[29] "Ormes Capital Management"
[30] "GARNET CAPITAL MANAGEMENT LLC"
[31] "INTERFASE CAPITAL MANAGERS"
[32] "RJS CAPITAL MANAGEMENT INC"
[33] "1ST NATIONAL BANK OF DE KALB"
[34] "1ST NAT'L BANK OF PHILLIPS CO"
[35] "1ST NAT'L BANK OF OKLAHOMA"
[36] "PROGRESS CAPITAL MANAGEMENT INC"
[37] "CAPITAL BANK & TRUST"
[38] "1ST NATL BANK"
[39] "ASB Capital Management/Real Estate"
[40] "Sears Capital Management"
[41] "Osterweis Capital Management/Invest"
[42] "Cerberus Capital Management LP/Asse"
[43] "LVS Capital Management/President"
[44] "1st Central Bank/Banker"
[45] "Summit Capital Management"
[46] "Orwes Capital Markets/Stockbroker"
[47] "Ormes Capital Management/Investment"
[48] "Nevis Capital Management/Investment"
[49] "Duncan Hurst Capital Management"
[50] "Progress Capital Management/Preside"
[51] "Cerberus Capital Management LP"
[52] "Wit Capital/Banker"
[53] "Ormes Capital Markets Inc."
[54] "Ormes Capital Markets/President & C"
[55] "Berents & Hess Capital Management"
[56] "Progress Capital Management/Venture"
[57] "First Capital Bank of KY"
[58] "Foothill Capital/Banker"
[59] "Pequot Capital Management/Equity Re"
[60] "First Dominion Capital/Banking"
[61] "Greenwhich Capital/Banker"
[62] "Veritas Capital Management/Banker"
[63] "Veritas Capital Management/Investme"
[64] "Lesese Capital Management/Investmen"
[65] "Douglas Capital Management/Investme"
[66] "FIRST NATINAL BANK OF AMARILLO"
[67] "NEVIS CAPITAL MANAGEMENT"
[68] "VERITAS CAPITAL MANAGEMENT"
[69] "SIEBERT CAPITAL MARKETS"
[70] "HOURGLASS CAPITAL MANAGEMENT"
[71] "1ST NATIONAL BANK DALHART"
[72] "TEXAS CAPITAL BANK"
[73] "NICHOLAS CAPITAL MANAGEMENT"
[74] "CERBUS CAPITAL MANAGEMENT"
[75] "CROESUS CAPITAL MANAGEMENT"
[76] "EAST WEST CAPITAL ASSOCIATES INC"
[77] "PRENDERGAST CAPITAL MANAGEMENT"
[78] "NANTUCKET CAPITAL MANAGEMENT"
[79] "1ST NATIONAL BANK TEMPLE"
[80] "ENTRUST CAPITAL INC"
[81] "1ST NATIONAL BANK OF IL"
[82] "SIMMS CAPITAL MANAGEMENT"
[83] "FIRST CAPITAL ADVISORS"
[84] "FIRST CAPITAL MANAGEMENT LTD"
[85] "1ST NATIONAL BANK & TRUST"
[86] "PENTECOST CAPITAL MANAGEMENT INC"
[87] "EAST-WEST CAPITAL ASSOCIATES"
[88] "1ST NAT'L BANK OF JOLIET"
[89] "FIRST CAPITOL BANK OF VICTO"
[90] "FIRST CAPITAL FINANCIAL"
[91] "PACIFIC COAST CAPITAL PARTNERS"
[92] "FIRST CAPITOL BANK"
[93] "FIRST CAPITAL ENGINEERING"
[94] "MIDWEST CAPITOL MANAGEMENT"
[95] "PEQUOT CAPITAL MANAGEMENT"
[96] "AGGOTT CAPITAL MANAGEMENT"
[97] "SIMMS CAPITAL MANAGEMENT INC"
[98] "PHILLIPS CAPITAL MANAGEMENT LLC"
[99] "1ST NATIONAL BANK OF COLD SP"
[100] "SOY CAPITOL BANK"
>
> agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.1)
[1] "FIRST CAPITOL BANK OF VICTOR" "FIRST CAPITAL BANK"
[3] "1ST CAPITOL BANK" "First Capital Bank of KY"
[5] "TEXAS CAPITAL BANK" "FIRST CAPITOL BANK OF VICTO"
[7] "FIRST CAPITOL BANK"
>
> agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.05)
[1] "FIRST CAPITAL BANK" "1ST CAPITOL BANK"
[3] "First Capital Bank of KY"
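As you can see, it is really hard to find one common "max.distance" value that works for every string (for example, both "BANK OF AMERICA CORP" and "1st Capital Bank"). I also have many more company names besides these two, which is why I am struggling to find a common precision value and command for fuzzy string matching.
The original data file for C1999_0 is too large to attach, so I assume the output shown above is enough to reproduce the problem.
I know there are several sub-options that can be tuned, such as costs, substitutions, insertions and so on, but adjusting them makes little difference compared with just changing the "max.distance" value itself.
I would greatly appreciate any help with this!
Best answer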
One issue with agrep is that it behaves like grep. As stated in help("agrep"):
"Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements. See also adist in package utils, which optionally returns the offsets of the matched substrings."
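To make the substring behaviour concrete, here is a small sketch in base R. The three-element vector x is a hypothetical subset of the names in the question. With an edit budget of zero, agrep still reports every element that merely contains the pattern, while adist measures the distance to the whole element unless partial = TRUE is set.

```r
# Hypothetical mini data set: three names taken from the question's output
x <- c("CAPITAL BANK & TRUST", "TEXAS CAPITAL BANK", "1ST NATIONAL BANK")

# agrep matches substrings: even with max.distance = 0 it returns the index
# of every element that contains the pattern somewhere
hits <- agrep("CAPITAL BANK", x, max.distance = 0, fixed = TRUE)

# adist compares against whole elements by default ...
whole <- drop(utils::adist("CAPITAL BANK", x))
# ... and against the best-matching substring when partial = TRUE
part <- drop(utils::adist("CAPITAL BANK", x, partial = TRUE))

print(hits)
print(whole)
print(part)
```

The first two names contain the pattern verbatim, so agrep and the partial distance treat them as perfect matches even though the whole-string distances are 8 and 6 edits.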
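This seems to be the problem in your latter example, since many of your names contain "capital", "bank", or both. What I would do is compute the Levenshtein distance (which is what agrep does, or rather a generalized version of it that is applied only to substrings) and keep the names with the smallest distance. For example,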
C1999 <- c(
  "HURST CAPITAL PARTNERS", "SOY CAPITAL BANK", "FIRST CAPITOL BANK OF VICTOR",
  "OSTERWEIS CAPITAL MANAGEMENT", "1ST NATIONAL BANK", "FIRST CAPITAL BANK",
  "SEATTLE 1ST NAT'L BANK", "FIELD POINT CAPITAL MANAGEMENT",
  "SUMMERSET CAPITAL MANAGEMENT", "AMERIQUEST CAPITAL ASSOC", "BB&T CAPITAL MARKETS",
  "HUGHES CAPITAL MANAGEMENT", "WELLS CAPITAL MANAGEMENT", "SUPERIOR ST CAPITAL ADVISORS",
  "ORMES CAPITAL MARKETS INC", "1ST NAT'L BANK OF IL", "ADVENT CAPITAL MANAGEMENT",
  "1ST CAPITOL BANK", "BIONDI REISS CAPITAL MANAGEMENT", "CCYBYS CAPITAL MARKETS",
  "SEACOAST CAPITAL PARTNERS", "DOUGLAS CAPITAL MANAGEMENT", "HIGHFIELDS CAPITAL MANAGEMENT",
  "PRECEPT CAPITAL MANAGEMENT LP", "AUGUST CAPITAL MANAGEMENT", "SAKSA CAPITAL MANAGEMENT",
  "IMS CAPITAL MANAGEMENT", "TRENT CAPITAL MANAGEMENT", "Ormes Capital Management",
  "GARNET CAPITAL MANAGEMENT LLC", "INTERFASE CAPITAL MANAGERS", "RJS CAPITAL MANAGEMENT INC",
  "1ST NATIONAL BANK OF DE KALB", "1ST NAT'L BANK OF PHILLIPS CO", "1ST NAT'L BANK OF OKLAHOMA",
  "PROGRESS CAPITAL MANAGEMENT INC", "CAPITAL BANK & TRUST", "1ST NATL BANK",
  "ASB Capital Management/Real Estate", "Sears Capital Management",
  "Osterweis Capital Management/Invest", "Cerberus Capital Management LP/Asse",
  "LVS Capital Management/President", "1st Central Bank/Banker", "Summit Capital Management",
  "Orwes Capital Markets/Stockbroker", "Ormes Capital Management/Investment",
  "Nevis Capital Management/Investment", "Duncan Hurst Capital Management",
  "Progress Capital Management/Preside", "Cerberus Capital Management LP", "Wit Capital/Banker",
  "Ormes Capital Markets Inc.", "Ormes Capital Markets/President & C",
  "Berents & Hess Capital Management", "Progress Capital Management/Venture",
  "First Capital Bank of KY", "Foothill Capital/Banker", "Pequot Capital Management/Equity Re",
  "First Dominion Capital/Banking", "Greenwhich Capital/Banker",
  "Veritas Capital Management/Banker", "Veritas Capital Management/Investme",
  "Lesese Capital Management/Investmen", "Douglas Capital Management/Investme",
  "FIRST NATINAL BANK OF AMARILLO", "NEVIS CAPITAL MANAGEMENT", "VERITAS CAPITAL MANAGEMENT",
  "SIEBERT CAPITAL MARKETS", "HOURGLASS CAPITAL MANAGEMENT", "1ST NATIONAL BANK DALHART",
  "TEXAS CAPITAL BANK", "NICHOLAS CAPITAL MANAGEMENT", "CERBUS CAPITAL MANAGEMENT",
  "CROESUS CAPITAL MANAGEMENT", "EAST WEST CAPITAL ASSOCIATES INC",
  "PRENDERGAST CAPITAL MANAGEMENT", "NANTUCKET CAPITAL MANAGEMENT", "1ST NATIONAL BANK TEMPLE",
  "ENTRUST CAPITAL INC", "1ST NATIONAL BANK OF IL", "SIMMS CAPITAL MANAGEMENT",
  "FIRST CAPITAL ADVISORS", "FIRST CAPITAL MANAGEMENT LTD", "1ST NATIONAL BANK & TRUST",
  "PENTECOST CAPITAL MANAGEMENT INC", "EAST-WEST CAPITAL ASSOCIATES", "1ST NAT'L BANK OF JOLIET",
  "FIRST CAPITOL BANK OF VICTO", "FIRST CAPITAL FINANCIAL", "PACIFIC COAST CAPITAL PARTNERS",
  "FIRST CAPITOL BANK", "FIRST CAPITAL ENGINEERING", "MIDWEST CAPITOL MANAGEMENT",
  "PEQUOT CAPITAL MANAGEMENT", "AGGOTT CAPITAL MANAGEMENT", "SIMMS CAPITAL MANAGEMENT INC",
  "PHILLIPS CAPITAL MANAGEMENT LLC", "1ST NATIONAL BANK OF COLD SP", "SOY CAPITOL BANK"
)
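# Return the element(s) of y closest to x in Levenshtein distance.
# tol widens the selection to everything within (minimum distance + tol).
func <- function(x, y, tol = 0L) {
  # requires the stringdist package: install.packages("stringdist")
  dista <- stringdist::stringdist(x, y, method = "lv")
  min_dista <- min(dista)
  y[dista <= min_dista + tol]
}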
func("1st Capital Bank", C1999)
#R [1] "Wit Capital/Banker"
func("1st Capital Bank", C1999, 4L)
#R [1] "Wit Capital/Banker" "First Capital Bank of KY"
func("1st Capital Bank", C1999, 10L)
#R [1] "SOY CAPITAL BANK" "1ST NATIONAL BANK"
#R [3] "FIRST CAPITAL BANK" "1ST CAPITOL BANK"
#R [5] "Ormes Capital Management" "1ST NATL BANK"
#R [7] "Sears Capital Management" "1st Central Bank/Banker"
#R [9] "Summit Capital Management" "Wit Capital/Banker"
#R [11] "Ormes Capital Markets Inc." "First Capital Bank of KY"
#R [13] "Foothill Capital/Banker" "Greenwhich Capital/Banker"
#R [15] "TEXAS CAPITAL BANK" "FIRST CAPITOL BANK"
#R [17] "SOY CAPITOL BANK"
# ignoring case: lower-case both sides before computing the distance
func <- function(x, y, tol = 0L) {
  # requires the stringdist package: install.packages("stringdist")
  dista <- stringdist::stringdist(tolower(x), tolower(y), method = "lv")
  min_dista <- min(dista)
  y[dista <= min_dista + tol]
}
func("1st Capital Bank", C1999, 0L)
#R [1] "1ST CAPITOL BANK"
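The tol argument of func controls whether to also include entries whose distance lies within tol of the minimum edit distance. I realize I have not answered exactly what you asked (how to get an exact common "max.distance" value for fuzzy string matching with agrep?), but I think this may be what you are looking for.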
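As background to the original question: according to the agrep documentation, a fractional max.distance is multiplied by the pattern length (times the maximal transformation cost, which is 1 with default costs) and rounded up to an integer. The sketch below reproduces that arithmetic for the two patterns in the question. edit_budget is a hypothetical helper name, not part of R.

```r
# Hypothetical helper: integer edit budget implied by a fractional max.distance,
# assuming the default costs (insertions, deletions, substitutions all cost 1)
edit_budget <- function(pattern, frac) ceiling(frac * nchar(pattern))

fracs <- c(0.2, 0.1, 0.05)
boa <- edit_budget("BANK OF AMERICA CORP", fracs)  # pattern has 20 characters
cap <- edit_budget("1st Capital Bank", fracs)      # pattern has 16 characters

print(rbind(fracs, boa, cap))
```

Under these assumptions both patterns get the same budgets of 4, 2 and 1 edits. This suggests that the very different result counts in the question come from substring matching against common words, not from the fraction itself.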
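I use stringdist::stringdist rather than adist because the former seems to be faster. It may still be somewhat slow, and I wish there were an R package that lets you set a maximum distance, but I have not come across one. A cap like that could make computing the (then bounded) Levenshtein distance faster.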
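One way to approximate a capped distance in base R is to skip candidates that cannot be within the cap before computing any distances. The sketch below relies on the fact that the difference in string lengths is a lower bound on the Levenshtein distance. closest_within is a hypothetical helper name, and it uses utils::adist instead of stringdist only to avoid a package dependency.

```r
# Hypothetical helper: elements of y within max_dist edits of x (case-insensitive)
closest_within <- function(x, y, max_dist) {
  # At least |nchar(x) - nchar(y)| edits are always needed, so candidates
  # whose length differs by more than max_dist can be skipped outright
  candidate <- abs(nchar(y) - nchar(x)) <= max_dist
  d <- rep(NA_real_, length(y))
  if (any(candidate)) {
    d[candidate] <- drop(utils::adist(x, y[candidate], ignore.case = TRUE))
  }
  y[!is.na(d) & d <= max_dist]
}

res <- closest_within(
  "1st Capital Bank",
  c("1ST CAPITOL BANK", "FIRST CAPITAL BANK", "BIONDI REISS CAPITAL MANAGEMENT"),
  max_dist = 2L
)
print(res)
```

This does not speed up the distance computation itself. It only reduces how many pairs are compared, so the benefit depends on how much the name lengths vary.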
Regarding "r - How to get an exact common "max.distance" value for fuzzy string matching with agrep?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52273813/