gpt4 book ai didi

pattern-matching - 根据人名匹配记录

转载 作者:行者123 更新时间:2023-12-03 14:51:48 30 4
gpt4 key购买 nike

是否有任何工具或方法可用于在两个不同数据源之间通过人名进行匹配?

这些系统没有其他共同信息,而且在许多情况下输入的名称不同。

非精确匹配示例:

King Jr., Martin Luther = King, Martin (不包括后缀)
Erving, J. 博士 = Erving, J.(不包括前缀)
奥巴马,巴拉克侯赛因 = 奥巴马,巴拉克(不包括中间名)
Pufnstuf, H.R. = Pufnstuf, Haibane Renmei(匹配缩写)
Tankengine, Thomas = Tankengine, Tom(匹配常用昵称)
Flair, Rick “the Natureboy” = Flair, Natureboy(昵称匹配)

最佳答案

我不得不使用建议的各种技术。感谢为我指明正确的方向。希望以下内容可以帮助其他人解决此类问题。

删除多余字符

CREATE FUNCTION [dbo].[fn_StripCharacters]
(
@String NVARCHAR(MAX),
@MatchExpression VARCHAR(255)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
SET @MatchExpression = '%['+@MatchExpression+']%'

WHILE PatIndex(@MatchExpression, @String) > 0
SET @String = Stuff(@String, PatIndex(@MatchExpression, @String), 1, '')

RETURN @String

END

用法:
--remove all non-alphanumeric and non-white space  
dbo.fn_StripCharacters(@Value, , '^a-z^0-9 ')

将名称拆分为多个部分
CREATE FUNCTION [dbo].[SplitTable] (@sep char(1), @sList StringList READONLY)
RETURNS @ResultList TABLE
(
[ID] VARCHAR(MAX),
[Val] VARCHAR(MAX)
)
AS
BEGIN

declare @OuterCursor cursor
declare @ID varchar(max)
declare @Val varchar(max)

set @OuterCursor = cursor fast_forward for (SELECT * FROM @sList) FOR READ ONLY

open @OuterCursor

fetch next from @OuterCursor into @ID, @Val

while (@@FETCH_STATUS=0)
begin

INSERT INTO @ResultList (ID, Val)
select @ID, split.s from dbo.Split(@sep, @Val) as split
where len(split.s) > 0

fetch next from @OuterCursor into @ID, @Val
end

close @OuterCursor
deallocate @OuterCursor

CREATE FUNCTION [dbo].[Split] (@sep char(1), @s varchar(8000))
RETURNS table
AS
RETURN (
WITH Pieces(pn, start, stop) AS (
SELECT 1, 1, CHARINDEX(@sep, @s)
UNION ALL
SELECT pn + 1, stop + 1, CHARINDEX(@sep, @s, stop + 1)
FROM Pieces
WHERE stop > 0
)
SELECT pn,
LTRIM(RTRIM(SUBSTRING(@s, start,
CASE WHEN stop > 0
THEN stop-start
ELSE 8000
END))) AS s
FROM Pieces
)

RETURN

用法:
--create split name list
DECLARE @NameList StringList

INSERT INTO @NameList (ID, Val)
SELECT id, firstname FROM dbo.[User] u
WHERE PATINDEX('%[^a-z]%', u.FirstName) > 0

----remove split dups
select u.ID, COUNT(*)
from dbo.import_SplitTable(' ', @NameList) splitList
INNER JOIN dbo.[User] u
ON splitList.id = u.id

常用昵称:

我根据 this list 创建了一个表并用它来加入通用名称等价物。

用法:
SELECT u.id
, u.FirstName
, u_nickname_maybe.Name AS MaybeNickname
, u.LastName
, c.ID AS ContactID from
FROM dbo.[User] u
INNER JOIN nickname u_nickname_match
ON u.FirstName = u_nickname_match.Name
INNER JOIN nickname u_nickname_maybe
ON u_nickname_match.relatedid = u_nickname_maybe.id
LEFT OUTER JOIN
(
SELECT c.id, c.LastName, c.FirstName,
c_nickname_maybe.Name AS MaybeFirstName
FROM dbo.Contact c
INNER JOIN nickname c_nickname_match
ON c.FirstName = c_nickname_match.Name
INNER JOIN nickname c_nickname_maybe
ON c_nickname_match.relatedid = c_nickname_maybe.id
WHERE c_nickname_match.Name <> c_nickname_maybe.Name
) as c
ON c.AccountHolderID = ah.ID
AND u_nickname_maybe.Name = c.MaybeFirstName AND u.LastName = c.LastName
WHERE u_nickname_match.Name <> u_nickname_maybe.Name

语音算法 (Jaro Winkler):

惊人的文章, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server ,显示如何安装和使用 SimMetrics库到 SQL Server。该库可让您找到字符串之间的相对相似性,并包含许多算法。我最终主要使用 Jaro Winkler以匹配名称。

用法:
SELECT
u.id AS UserID
,c.id AS ContactID
,u.FirstName
,c.FirstName
,u.LastName
,c.LastName
,maxResult.CombinedScores
from
(
SELECT
u.ID
,
max(
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))
) AS CombinedScores
FROM dbo.[User] u, dbo.[Contact] c
WHERE u.ContactID IS NULL
GROUP BY u.id
) AS maxResult
INNER JOIN dbo.[User] u
ON maxResult.id = u.id
INNER JOIN dbo.[Contact] c
ON maxResult.CombinedScores =
dbo.JaroWinkler(lower(u.FirstName), lower(c.FirstName))
* dbo.JaroWinkler(LOWER(u.LastName), LOWER(c.LastName))

关于pattern-matching - 根据人名匹配记录,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/988050/

30 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com