
r - How to perform approximate (fuzzy) name matching in R

Reposted. Author: 行者123. Updated: 2023-12-04 12:04:39

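I have a large dataset on biological journals that was compiled by different people over a long period, so the data is not in a single format. For example, in the "Author" column the same person may appear as John Smith, Smith John, Smith J, and so on. I can't even perform the simplest operations; for instance, I can't tell which authors have written the most articles.

Is there a way in R to treat different names as the same element when most of their characters are identical?

Best answer

There are packages that can help with this, and some of them are already listed in the comments. If you would rather not use them, though, I have tried to write something in R that may help. The code below matches "John Smith" against "J Smith", "John Smith", "Smith John", and "John S". At the same time, it does not match something like "John Sally".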

# generate some example names
names = c(
  "John Smith",
  "Wigberht Ernust",
  "Samir Henning",
  "Everette Arron",
  "Erik Conor",
  "Smith J",
  "Smith John",
  "John S",
  "John Sally"
);

# split those names and get all ways to write each name
split_names = lapply(
  X = names,
  FUN = function(x){
    print(x);
    # split by a space
    c_split = unlist(x = strsplit(x = x, split = " "));
    # get both orderings of c_split to compensate for order
    c_splits = list(c_split, rev(x = c_split));
    # return c_splits
    c_splits;
  }
)

# suppose we're looking for John Smith
search_for = "John Smith";

# split it by " " and then find all ways to write that name
# (the split must be applied to search_for, not to an undefined x)
search_for_split = unlist(x = strsplit(x = search_for, split = " "));
search_for_split = list(search_for_split, rev(x = search_for_split));

# initialise a vector recording whether search_for was matched in names
match_statuses = c();

# for each name that's been split
for(i in 1:length(x = names)){

  # the match status for the current name
  match_status = FALSE;

  # the current split name
  c_split_name = split_names[[i]];

  # for each element in search_for_split
  for(j in 1:length(x = search_for_split)){

    # the current ordering of the searched name
    c_search_for_split_names = search_for_split[[j]];

    # for each element in c_split_name
    for(k in 1:length(x = c_split_name)){

      # the current ordering of the current split name
      c_c_split_name = c_split_name[[k]];

      # there is a match if the length of grep's result is greater than zero
      # (grep is a pattern-finding function)
      if(
        # is c_search_for_split_names first element in c_c_split_name first
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[1],
            x = c_c_split_name[1]
          )
        ) > 0 &&
        # is c_search_for_split_names second element in c_c_split_name second
        # element
        length(
          x = grep(
            pattern = c_search_for_split_names[2],
            x = c_c_split_name[2]
          )
        ) > 0 ||
        # or, is c_c_split_name first element in c_search_for_split_names first
        # element
        length(
          x = grep(
            pattern = c_c_split_name[1],
            x = c_search_for_split_names[1]
          )
        ) > 0 &&
        # is c_c_split_name second element in c_search_for_split_names second
        # element
        length(
          x = grep(
            pattern = c_c_split_name[2],
            x = c_search_for_split_names[2]
          )
        ) > 0
      ){
        # if this is the case, update match status to TRUE
        match_status = TRUE;
      } else {
        # otherwise, don't update match status
      }
    }
  }

  # append match_status to the match_statuses vector
  match_statuses = c(match_statuses, match_status);
}

search_for;

[1] "John Smith"

cbind(names, match_statuses);

names match_statuses
[1,] "John Smith" "TRUE"
[2,] "Wigberht Ernust" "FALSE"
[3,] "Samir Henning" "FALSE"
[4,] "Everette Arron" "FALSE"
[5,] "Erik Conor" "FALSE"
[6,] "Smith J" "TRUE"
[7,] "Smith John" "TRUE"
[8,] "John S" "TRUE"
[9,] "John Sally" "FALSE"

Hopefully this code can serve as a starting point; you may want to adapt it to handle names of arbitrary length.
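The question also asks about names in which most characters coincide, for example a misspelled "Jonh Smith", which the token comparison above does not address. As a hedged sketch that is not part of the original answer, base R's adist() computes the edit distance (insertions, deletions, substitutions) between strings. The helper name fuzzy_match and the threshold max_dist = 2 are assumptions chosen for illustration, and the right threshold depends on your data.

```r
# Hypothetical helper (not from the original answer): flag every entry of
# `names` whose edit distance to `search_for` is at most `max_dist`.
fuzzy_match <- function(search_for, names, max_dist = 2) {
  # adist() returns a 1 x length(names) matrix; flatten it to a vector
  distances <- as.vector(adist(search_for, names))
  distances <= max_dist
}

candidates <- c("John Smith", "Jonh Smith", "John Smyth", "Erik Conor")
data.frame(
  name     = candidates,
  distance = as.vector(adist("John Smith", candidates)),
  matched  = fuzzy_match("John Smith", candidates)
)
```

Edit distance handles typos but not reordered or abbreviated names ("Smith J"), so it complements the token comparison rather than replacing it.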

Some notes:

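  • for loops in R can be slow. If you are working with a large number of names, take a look at Rcpp.
  • You may want to wrap this in a function. You can then apply it to other names by changing search_for.
  • This example has time-complexity issues; depending on the size of your data, you may want or need to rework it.

Regarding r - how to perform approximate (fuzzy) name matching in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22894265/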
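The notes about wrapping the logic in a function and avoiding explicit loops can be sketched as follows. This is an illustrative rewrite under stated assumptions, not the answer's own code: the helper names is_prefix_pair, name_matches and match_names are hypothetical, it only handles names with exactly two space-separated parts, and it uses prefix tests with startsWith() in place of grep(), so it can differ from the loop version on some inputs (for example, it is case-sensitive prefix matching rather than substring matching).

```r
# TRUE if one token is a prefix of the other, e.g. "J" and "John"
is_prefix_pair <- function(a, b) startsWith(a, b) || startsWith(b, a)

# Hypothetical helper: does `candidate` look like the same two-part name
# as `search_for`, in either word order?
name_matches <- function(search_for, candidate) {
  s    <- strsplit(search_for, " ", fixed = TRUE)[[1]]
  cand <- strsplit(candidate, " ", fixed = TRUE)[[1]]
  # assumption: this sketch only supports names with exactly two parts
  if (length(s) != 2 || length(cand) != 2) return(FALSE)
  same_order <- is_prefix_pair(s[1], cand[1]) && is_prefix_pair(s[2], cand[2])
  swapped    <- is_prefix_pair(s[1], cand[2]) && is_prefix_pair(s[2], cand[1])
  same_order || swapped
}

# Apply the comparison to a whole vector of names without an explicit for loop
match_names <- function(search_for, names) {
  vapply(
    names,
    function(nm) name_matches(search_for, nm),
    logical(1),
    USE.NAMES = FALSE
  )
}

authors <- c("John Smith", "Wigberht Ernust", "Smith J", "Smith John",
             "John S", "John Sally")
authors[match_names("John Smith", authors)]
```

Changing the first argument of match_names() applies the same comparison to another author. Note that vapply() still loops internally, so for very large inputs the cost per searched name remains linear in the number of names.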
