r - Extracting multiple substrings that follow certain characters in a string using stringi in R

Reposted. Author: 行者123. Updated: 2023-12-04 10:36:46
I have a large data frame in R with a column that looks like this, where each sentence is one row:

data <- data.frame(
  datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies based on voluntary institutions",
               "these are often described as wiki/stateless_society although several authors have defined them more specifically as institutions based on non- wiki/hierarchy or wiki/free_association_(communism_and_anarchism)",
               "anarchism holds the wiki/state_(polity) to be undesirable unnecessary and harmful",
               "while wiki/anti-statism is central anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations"),
  stringsAsFactors = FALSE)

I want to extract every word that follows "wiki/" and put the results in another column.

So the first row should read "political_philosophy self-governance", the second row should look like "hierarchy free_association_(communism_and_anarchism)", the third row should be "state_(polity)", and the fourth row should be "anti-statism".

I definitely want to use stringi because the data frame is huge. Thanks in advance for your help.

I tried:

stri_extract_all_fixed(data$datalist, "wiki")[[1]]

but this only extracts the word "wiki".

Best answer

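You can do this with a regular expression. By using stri_match_ instead of stri_extract_, we can use parentheses to create capture groups, which let us pull out just part of what the regex matched. In the result below, each row of the data frame yields one list item containing a character matrix: the first column is the whole match, and each following column is one capture group: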

library(stringi)

# The capture group ( ... ) holds the text that follows "wiki/"
match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
match

[[1]]
     [,1]                        [,2]
[1,] "wiki/political_philosophy" "political_philosophy"
[2,] "wiki/self-governance"      "self-governance"

[[2]]
     [,1]                                              [,2]
[1,] "wiki/stateless_society"                          "stateless_society"
[2,] "wiki/hierarchy"                                  "hierarchy"
[3,] "wiki/free_association_(communism_and_anarchism)" "free_association_(communism_and_anarchism)"

[[3]]
     [,1]                  [,2]
[1,] "wiki/state_(polity)" "state_(polity)"

[[4]]
     [,1]                [,2]
[1,] "wiki/anti-statism" "anti-statism"
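To make the difference between the two function families concrete, here is a small side-by-side sketch on a shortened version of the first sentence. The variable names `x` and `m` are only illustrative, and the pattern is a slight variation on the one in the answer: the hyphen is escaped and `+` is used so that a bare "wiki/" is not matched.

```r
library(stringi)

# Shortened version of the first row from the question
x <- "anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies"

# stri_extract_all_regex returns whole matches, so the "wiki/" prefix is kept
whole <- stri_extract_all_regex(x, "wiki/[\\w()\\-]+")[[1]]
whole

# stri_match_all_regex returns a character matrix per input string:
# column 1 is the whole match, column 2 is the first capture group
m <- stri_match_all_regex(x, "wiki/([\\w()\\-]+)")[[1]]
m[, 2]
```

Only the second call gives the terms without the prefix, which is what the question asks for.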

You can then use an apply function to turn the data into whatever shape you want:

match <- stri_match_all_regex(data$datalist, "wiki/([\\w-()]*)")
# Take the capture-group column of each matrix and join it into one string per row
sapply(match, function(x) paste(x[,2], collapse = " "))

[1] "political_philosophy self-governance"
[2] "stateless_society hierarchy free_association_(communism_and_anarchism)"
[3] "state_(polity)"
[4] "anti-statism"
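Since the question asks for the result in a new column, the step above could be written back to the data frame roughly as follows. This is a sketch, not part of the original answer: the column name `wiki_terms`, the helper `collapse_terms`, and the third example row (a sentence with no "wiki/" in it) are assumptions added for illustration. By default a row without any match produces a matrix of NA, which `paste()` would turn into the string "NA"; passing `omit_no_match = TRUE` and checking for zero rows avoids that.

```r
library(stringi)

data <- data.frame(
  datalist = c("anarchism is a wiki/political_philosophy that advocates wiki/self-governance societies",
               "anarchism holds the wiki/state_(polity) to be undesirable",
               "a sentence with no link at all"),  # added row, not in the question
  stringsAsFactors = FALSE)

# omit_no_match = TRUE gives a zero-row matrix for rows without "wiki/"
match <- stri_match_all_regex(data$datalist, "wiki/([\\w()\\-]+)",
                              omit_no_match = TRUE)

# Hypothetical helper: join the capture-group column, or return "" if no match
collapse_terms <- function(m) {
  if (nrow(m) == 0) return("")
  paste(m[, 2], collapse = " ")
}

# vapply guarantees a character vector with exactly one element per row
data$wiki_terms <- vapply(match, collapse_terms, character(1))
data$wiki_terms
```

`vapply` is used instead of `sapply` here only because it fails loudly if any row returns something other than a single string, which can be useful on a very large data frame.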

Regarding "r - Extracting multiple substrings that follow certain characters in a string using stringi in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50241313/
