r - I'm trying to use stringr, specifically regular expressions, to split "MA: Bristol County (25005)"

Reposted. Author: 行者123. Updated: 2023-12-02 19:06:31

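I'm trying to take a single variable column and split it into several columns. The values follow a basic pattern, but the county names vary in length and format.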

State-county :
[1] "MA: Bristol County (25005)"
[2] "LA: St. Tammany Parish (22103)"
[3] "CA: Ventura County (06111)"
[4] "CA: San Mateo County (06081)"

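I need state, county name, and county code columns that I can add back to the data.frame. I've been trying to figure out how to do this with str_extract. Ideally this is what I would end up with, but I'll take any help I can get.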

  state:    county:            county code: 
[1] "MA" Bristol County 25005
[2] "LA" St. Tammany Parish 22103
[3] "CA" Ventura County 06111
[4] "CA" San Mateo County 06081

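I was able to get the county code with some code I found, str_extract_all( "(?<=\\().+?(?=\\))") (thanks to Nettle). The closest I've come for the state abbreviation is str_extract_all( h,"..:"), which is close but includes the ":". I also tried str_extract_all( "(?<=\\:").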

Apologies if this isn't the best format; I tried to be as clear as possible, following the style I've seen.

Best answer

Use str_match_all:

library(tidyverse)  # loads stringr, dplyr, tidyr, tibble and purrr, all used below

str_match_all(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")

as_tibble(df) %>%
  mutate(matches = str_match_all(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")) %>%
  unnest_wider(matches) %>%
  select(-2) %>%  # drop the full-match column, keep only the capture groups
  set_names("State_county", "State", "County", "County_code")
# A tibble: 4 x 4
  State_county                   State County             County_code
  <fct>                          <chr> <chr>              <chr>
1 MA: Bristol County (25005)     MA    Bristol County     25005
2 LA: St. Tammany Parish (22103) LA    St. Tammany Parish 22103
3 CA: Ventura County (06111)     CA    Ventura County     06111
4 CA: San Mateo County (06081)   CA    San Mateo County   06081

### OR with str_match as we're only using a single pattern
## this saves us from the warning caused by unnest_wider
as_tibble(df) %>%
  mutate(
    matches = str_match(State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)"),
    State = matches[, 2],
    County = matches[, 3],
    County_code = matches[, 4],
    matches = NULL  # drop the helper matrix column
  )
# A tibble: 4 x 4
  State_county                   State County             County_code
  <fct>                          <chr> <chr>              <chr>
1 MA: Bristol County (25005)     MA    Bristol County     25005
2 LA: St. Tammany Parish (22103) LA    St. Tammany Parish 22103
3 CA: Ventura County (06111)     CA    Ventura County     06111
4 CA: San Mateo County (06081)   CA    San Mateo County   06081
### Another way
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)") %>%
as.data.frame %>% set_names("State_county", "State", "County", "County_code")
State_county State County County_code
1 MA: Bristol County (25005) MA Bristol County 25005
2 LA: St. Tammany Parish (22103) LA St. Tammany Parish 22103
3 CA: Ventura County (06111) CA Ventura County 06111
4 CA: San Mateo County (06081) CA San Mateo County 06081
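
A further option, not part of the original answer: tidyr's extract() applies a regex with capture groups to a column and writes each group to a new column in a single step. The sketch below reuses the same pattern; the sample data frame and the name `out` are made up for illustration, and the column is assumed to be character rather than factor.

```r
library(tidyr)

# Sample data mirroring the question (hypothetical reconstruction of df)
df <- data.frame(
  State_county = c(
    "MA: Bristol County (25005)",
    "LA: St. Tammany Parish (22103)",
    "CA: Ventura County (06111)",
    "CA: San Mateo County (06081)"
  ),
  stringsAsFactors = FALSE
)

# One capture group per new column; remove = FALSE keeps the original column.
# convert defaults to FALSE, so the codes stay character and keep leading zeros.
out <- tidyr::extract(
  df, State_county,
  into = c("State", "County", "County_code"),
  regex = "([A-Z]+): ([^()]+) \\((\\d+)\\)",
  remove = FALSE
)

print(out)
```

Rows that do not match the pattern get NA in the new columns, so it may be worth checking for NA values afterwards.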

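Explanation:

str_match returns the captured groups (sub-patterns written in unescaped parentheses, such as ([A-Z]+)) together with the full string matched by the whole pattern.

  • [A-Z]+ : matches the state abbreviation.
  • [^()]+ : matches anything that is not a parenthesis. The county name.
  • \\((\\d+)\\) : matches a literal opening parenthesis \\( and a closing parenthesis \\), with a capture group extracting the digits between them. The county code.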
str_match(df$State_county, "([A-Z]+): ([^()]+) \\((\\d+)\\)")
[,1] [,2] [,3] [,4]
[1,] "MA: Bristol County (25005)" "MA" "Bristol County" "25005"
[2,] "LA: St. Tammany Parish (22103)" "LA" "St. Tammany Parish" "22103"
[3,] "CA: Ventura County (06111)" "CA" "Ventura County" "06111"
[4,] "CA: San Mateo County (06081)" "CA" "San Mateo County" "06081"
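
For comparison, the lookaround approach tried in the question can also be completed, one str_extract call per field. This is only a sketch: the vector `x` and the data frame `parts` are illustrative names, and the patterns assume every value has the exact "ST: Name (digits)" shape.

```r
library(stringr)

# Sample values from the question
x <- c(
  "MA: Bristol County (25005)",
  "LA: St. Tammany Parish (22103)",
  "CA: Ventura County (06111)",
  "CA: San Mateo County (06081)"
)

# Lookarounds assert the surrounding text without including it in the match
state  <- str_extract(x, "^[A-Z]+(?=:)")         # capitals right before the colon
county <- str_extract(x, "(?<=: ).+?(?= \\()")   # text between ": " and " ("
code   <- str_extract(x, "(?<=\\()\\d+(?=\\))")  # digits inside the parentheses

parts <- data.frame(state, county, code, stringsAsFactors = FALSE)
print(parts)
```

The single str_match call above is usually easier to maintain, because one pattern describes the whole string and a malformed value yields NA in every column at once.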

Regarding r - I'm trying to use stringr, specifically regular expressions, to split "MA: Bristol County (25005)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64979357/
