
regex - Using gsub/regular expressions correctly in R?

Reposted. Author: 行者123. Updated: 2023-12-04 13:52:36

I have a long list of strings, such as this machine-readable example:

A <- list(c("Biology","Cell Biology","Art","Humanities, Multidisciplinary; Psychology, Experimental","Astronomy & Astrophysics; Physics, Particles & Fields","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods","Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science","Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic","Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"))  

So it looks like this:
> A  
[[1]]
[1] "Biology"
[2] "Cell Biology"
[3] "Art"
[4] "Humanities, Multidisciplinary; Psychology, Experimental"
[5] "Astronomy & Astrophysics; Physics, Particles & Fields"
[6] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods"
[7] "Geriatrics & Gerontology"
[8] "Gerontology"
[9] "Management"
[10] "Operations Research & Management Science"
[11] "Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic"
[12] "Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability"

I want to edit these terms and eliminate duplicates, to get the following result:
[1] "Science"
[2] "Science"
[3] "Arts & Humanities"
[4] "Arts & Humanities; Social Sciences"
[5] "Science"
[6] "Social Sciences; Science"
[7] "Science"
[8] "Social Sciences"
[9] "Social Sciences"
[10] "Science"
[11] "Science"
[12] "Social Sciences; Science"

So far I only have this:
stringedit <- function(A)  
{
A <-gsub("Biology", "Science", A)
A <-gsub("Cell Biology", "Science", A)
A <-gsub("Art", "Arts & Humanities", A)
A <-gsub("Humanities, Multidisciplinary", "Arts & Humanities", A)
A <-gsub("Psychology, Experimental", "Social Sciences", A)
A <-gsub("Astronomy & Astrophysics", "Science", A)
A <-gsub("Physics, Particles & Fields", "Science", A)
A <-gsub("Economics", "Social Sciences", A)
A <-gsub("Mathematics", "Science", A)
A <-gsub("Mathematics, Applied", "Science", A)
A <-gsub("Mathematics, Interdisciplinary Applications", "Science", A)
A <-gsub("Social Sciences, Mathematical Methods", "Social Sciences", A)
A <-gsub("Geriatrics & Gerontology", "Science", A)
A <-gsub("Gerontology", "Social Sciences", A)
A <-gsub("Management", "Social Sciences", A)
A <-gsub("Operations Research & Management Science", "Science", A)
A <-gsub("Computer Science, Artificial Intelligence", "Science", A)
A <-gsub("Computer Science, Information Systems", "Science", A)
A <-gsub("Engineering, Electrical & Electronic", "Science", A)
A <-gsub("Statistics & Probability", "Science", A)
}
B <- lapply(A, stringedit)

But it does not work properly:
> B  
[[1]]
[1] "Science"
[2] "Cell Science"
[3] "Arts & Humanities"
[4] "Arts & Humanities; Social Sciences"
[5] "Science; Science"
[6] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences"
[7] "Science"
[8] "Social Sciences"
[9] "Social Sciences"
[10] "Operations Research & Social Sciences Science"
[11] "Computer Science, Arts & Humanitiesificial Intelligence; Science; Science"
[12] "Social Sciences; Science, Interdisciplinary Applications; Social Sciences; Science"

How can I get the correct output shown above?
Thank you very much for your consideration!

Best answer

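Let me start with an example. You have the string "Cell Biology". The first replacement, A <- gsub("Biology", "Science", A), turns it into "Cell Science". After that, the rule for "Cell Biology" no longer matches, so the value is never replaced.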
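The failure can be reproduced in isolation. The following is a minimal sketch (the variable names step1, step2 and step3 are only for illustration) showing that gsub matches substrings anywhere in the input, which explains both "Cell Science" and the mangled item 11 in the output above:

```r
# Each gsub() call matches substrings anywhere in the input,
# so an early rule can consume part of a longer term.
x <- "Cell Biology"
step1 <- gsub("Biology", "Science", x)
print(step1)   # "Cell Science"

# The later rule for the full term no longer finds anything to replace.
step2 <- gsub("Cell Biology", "Science", step1)
print(step2)   # "Cell Science"

# The same effect explains item 11: "Art" matches inside "Artificial".
step3 <- gsub("Art", "Arts & Humanities", "Artificial Intelligence")
print(step3)   # "Arts & Humanitiesificial Intelligence"
```

Reordering the rules so that longer terms come first would hide some of these cases, but not the "Art" inside "Artificial" one, so the order of the rules is not the real problem.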

Since you are not really using regular expressions, I would rather use some kind of hash for the replacement:

myhash <- c( "Science", "Science", "Arts & Humanities", "Arts & Humanities", "Social Sciences", 
"Science", "Science", "Social Sciences", "Science", "Science", "Science", "Social Sciences",
"Science", "Social Sciences", "Social Sciences", "Science", "Science", "Science", "Science",
"Science" )

names( myhash ) <- c( "Biology", "Cell Biology", "Art", "Humanities, Multidisciplinary",
"Psychology, Experimental", "Astronomy & Astrophysics", "Physics, Particles & Fields", "Economics",
"Mathematics", "Mathematics, Applied", "Mathematics, Interdisciplinary Applications",
"Social Sciences, Mathematical Methods", "Geriatrics & Gerontology", "Gerontology", "Management",
"Operations Research & Management Science", "Computer Science, Artificial Intelligence",
"Computer Science, Information Systems", "Engineering, Electrical & Electronic",
"Statistics & Probability" )

Now, given a string such as "Biology", you can quickly look up its category:
myhash[ "Biology" ]
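As a small illustration of how such a lookup behaves, here is a sketch using a shortened table. The name small_hash and its three entries are assumptions made for this example only; the real table is myhash above.

```r
# A shortened lookup table (three of the twenty entries), for illustration only.
small_hash <- c("Biology"     = "Science",
                "Art"         = "Arts & Humanities",
                "Gerontology" = "Social Sciences")

small_hash["Biology"]            # named element: name "Biology", value "Science"
unname(small_hash["Biology"])    # just the value
small_hash[c("Art", "Biology")]  # several lookups in one call

# A term that is not in the table yields NA instead of an error.
missing_lookup <- small_hash["Cell Biology"]
is.na(missing_lookup)
```

Indexing a named character vector by name is an exact match on the whole name, which is why "Biology" and "Cell Biology" can no longer interfere with each other.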

I am not sure why you use a list rather than a character vector, so I will simplify your case:
A <- c("Biology","Cell Biology","Art",
"Humanities, Multidisciplinary; Psychology, Experimental",
"Astronomy & Astrophysics; Physics, Particles & Fields",
"Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods",
"Geriatrics & Gerontology","Gerontology","Management","Operations Research & Management Science",
"Computer Science, Artificial Intelligence; Computer Science, Information Systems; Engineering, Electrical & Electronic",
"Economics; Mathematics, Interdisciplinary Applications; Social Sciences, Mathematical Methods; Statistics & Probability")

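The hash lookup does not work for compound strings (those containing ";"). However, you can split them with strsplit. You can then use unique to avoid repeated terms, and the paste function to put them back together.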
stringedit <- function( x ) { 
# first, split into subterms
a.all <- unlist( strsplit( x, "; *" ) ) ;
paste( unique( myhash[ a.all ] ), collapse= "; " )
}

unlist( lapply( A, stringedit ) )

The result is as desired:
[1] "Science" "Science" "Arts & Humanities" "Arts & Humanities; Social Sciences"
[5] "Science" "Social Sciences; Science" "Science" "Social Sciences"
[9] "Social Sciences" "Science" "Science" "Social Sciences; Science"
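One caveat with stringedit: a subterm that is missing from the table looks up as NA, and paste turns that into the literal string "NA". If that matters for your data, a possible guard is sketched below. The function stringedit_safe, its lookup argument and the term "Zoology" are hypothetical names introduced for this example, and keeping unmapped terms unchanged is only one of several reasonable policies.

```r
# Hypothetical variant of stringedit(): unmapped subterms are kept as-is
# instead of turning into "NA". `lookup` plays the role of myhash.
stringedit_safe <- function(x, lookup) {
  terms   <- unlist(strsplit(x, "; *"))   # split into subterms
  mapped  <- unname(lookup[terms])        # exact-name lookup, NA if absent
  missing <- is.na(mapped)
  mapped[missing] <- terms[missing]       # fall back to the original term
  paste(unique(mapped), collapse = "; ")
}

# A two-entry table, for illustration only.
lookup <- c("Biology" = "Science", "Economics" = "Social Sciences")

res_known   <- stringedit_safe("Economics; Biology", lookup)
res_unknown <- stringedit_safe("Biology; Zoology", lookup)
print(res_known)    # "Social Sciences; Science"
print(res_unknown)  # "Science; Zoology"
```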

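Of course, you can also call *apply several times, like this:
# split each entry into its subterms (one character vector per entry)
a.spl <- lapply( A, function( x ) strsplit( x, "; *" )[[ 1 ]] )
# look up each subterm and drop repeated categories
a.spl <- lapply( a.spl, function( x ) unique( myhash[ x ] ) )
# glue the categories of each entry back together
sapply( a.spl, paste, collapse = "; " )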

This is no more efficient than the previous code.

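Yes, you could achieve the same thing with regular expressions. But first, it would still involve splitting the strings, and then using regexes such as ^Biology$ to make sure they match "Biology" but not "Cell Biology" and so on, unless you want to use constructs such as ".* Biology". Finally, you would have to get rid of the duplicates anyway. All in all, I think the regex approach would be (i) more verbose (= more error-prone) and (ii) not worth the effort.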
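For comparison, this is roughly what the anchored-regex route could look like. It is only a sketch: the names rules and regex_edit are made up for this example, and the table holds three rules rather than the full twenty.

```r
# Anchored patterns match a whole subterm only:
# "^Biology$" does not touch "Cell Biology".
rules <- c("^Biology$"      = "Science",
           "^Cell Biology$" = "Science",
           "^Art$"          = "Arts & Humanities")

regex_edit <- function(x, rules) {
  terms <- unlist(strsplit(x, "; *"))          # splitting is still needed
  for (i in seq_along(rules)) {
    terms <- sub(names(rules)[i], rules[[i]], terms)
  }
  paste(unique(terms), collapse = "; ")        # and so is deduplication
}

res_regex <- regex_edit("Cell Biology; Art; Biology", rules)
print(res_regex)   # "Science; Arts & Humanities"
```

The split, the loop over rules and the deduplication are all still there, so the anchors mostly re-implement the exact-name lookup that the named vector gives for free.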

Regarding "regex - Using gsub/regular expressions correctly in R?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13009761/
