
r - Extract cities from a character vector in R

Reposted. Author: 行者123. Updated: 2023-12-01 12:15:50

My dataset db has a column, say db$affiliation, that looks like this:

**db$affiliation**
[1] "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA"
[2] "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS."
[3] "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND."
[4] ...

I want to create a column in the same dataset that contains only the city names from db$affiliation, for example:

 **db$cities**
[1] LOS ANGELES
[2] TWENTE
[3] BANGKOK
[4] ...

If several city names are available, I want the command to return only the last one; if no city name is available, I want NA. How can I do this?

I thought I could use world.cities$name from data(world.cities) in the maps package, but I can't figure out how.

I even tried splitting the db$affiliation column, for example:

library(splitstackshape) # provides cSplit()
db$affiliation <- gsub("\\[[^\\]]*\\]", "", db$affiliation, perl=TRUE) # remove content within brackets
db$affiliation[2] # check the separator
db <- cSplit(db, 'affiliation', sep=c(", "), type.convert=FALSE) # split after comma

The result (truncated after affiliation_3):

    affiliation_1             affiliation_2                 affiliation_3
[1] UNIV CALIF LOS ANGELES    DEPT GEOG                     LOS ANGELES
[2] UNIV TWENTE               DEPT WATER ENGN & MANAGEMENT  DRIENERLOLAAN
[3] CHULALONGKORN UNIV        FAC ARCHITECTURE              BANGKOK

Then I tried:

db$cities <- lapply(db$affiliation_1, function(x)x[which(x %in% world.cities$name)])

But I get an empty column.

Thanks for your help!
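
A likely reason the attempt above returns an empty column: `%in%` only matches whole strings and is case-sensitive, while the affiliation fields are upper case and the city names in world.cities$name are mixed case (e.g. "Los Angeles"). In addition, affiliation_1 holds the institution, not the city. The sketch below illustrates this with a small stand-in vector, `city_names`, which is an assumption made for illustration and not the real maps data.

```r
# Stand-in for world.cities$name (assumption: names are stored in mixed case)
city_names <- c("Los Angeles", "Bangkok", "Enschede")

field <- "LOS ANGELES"

# %in% compares whole strings and is case-sensitive
case_sensitive_hit <- field %in% city_names           # FALSE
same_case_hit      <- field %in% toupper(city_names)  # TRUE

# A whole field such as affiliation_1 is an institution, not a city name
institution_hit <- "UNIV CALIF LOS ANGELES" %in% toupper(city_names)  # FALSE
```

Comparing against `toupper(city_names)` and looking in the field that actually holds the city (affiliation_3 in the first and third rows) would be one way to get non-empty matches.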

Best answer

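The sample strings match a lot of city names, so you may want to reconsider whether you still want "the last city" when several cities are found in the affiliation column.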

library(maps)
data(world.cities)

# sample data
df <- data.frame(affiliation = c("[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
"[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
"[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
"Prem"), stringsAsFactors = FALSE)

# fetch each city and its respective country from the 'affiliation' column
cities_country <- lapply(gsub("\\[|\\]|[,;]|\\.", "", df$affiliation), function(x) {
  # TRUE for every city name found anywhere in x (case-insensitive substring match)
  hit <- sapply(world.cities$name, grepl, x, ignore.case = TRUE)
  paste(as.character(world.cities$name[hit]),
        as.character(world.cities$country.etc[hit]),
        sep = "_")
})
# replace empty results with NA
df$cities_country <- lapply(cities_country, function(x) if (identical(x, character(0))) NA_character_ else x)
df

The output is:

affiliation
1 [SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA
2 [VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.
3 [ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.
4 Prem
cities_country
1 Al_Norway, Alle_Switzerland, Allen_Philippines, Allen_USA, Angeles_Costa Rica, Angeles_Philippines, Cali_Colombia, Cot_Costa Rica, Li_Norway, Los Angeles_Chile, Los Angeles_USA, Os_Kyrgyzstan, Os_Norway, U_Micronesia, Usa_Japan
2 Ae_Marshall Islands, Ede_Netherlands, Ede_Nigeria, Enschede_Netherlands, Hede_China, Ine_Marshall Islands, Laa_Austria, Lola_Guinea, Man_Ivory Coast, Mana_French Guiana, Manage_Belgium, Nagem_Luxembourg, Ob_Russia, Ola_Panama, Po_Burkina Faso, U_Micronesia, Van_Turkey, Wa_Ghana, We_New Caledonia
3 Aila_Estonia, Al_Norway, Anan_Japan, Ba_Fiji, Bangkok_Thailand, Hit_Iraq, Ila_Nigeria, Ilan_Taiwan, Long_Thailand, Nan_Thailand, Tsu_Japan, U_Micronesia, Ula_Turkey
4 NA

(Note that the output above keeps every matching city and, for convenience, appends the respective country to each one.)
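
Many of the hits above (Al, Li, U, Os) are substrings of longer words. One possible refinement, not part of the original answer, is to match city names only as whole words and then keep the match that ends last in the string, which is what the question asks for. The sketch below assumes a hypothetical helper `last_city()` and a small stand-in vector `city_names`; world.cities$name could be passed in its place.

```r
# Hypothetical helper: return the city that occurs last in each string as a
# whole word, or NA when no city name matches.
last_city <- function(affiliation, city_names) {
  padded <- paste0(" ", toupper(city_names), " ")
  vapply(affiliation, function(x) {
    x <- gsub("\\[[^]]*\\]", " ", x)            # drop the bracketed author list
    x <- toupper(gsub("[[:punct:]]", " ", x))   # punctuation -> spaces
    x <- paste0(" ", gsub("[[:space:]]+", " ", trimws(x)), " ")
    # start of the last occurrence of each padded name; -1 when absent
    start <- vapply(padded, function(nm) {
      max(gregexpr(nm, x, fixed = TRUE)[[1]])
    }, numeric(1), USE.NAMES = FALSE)
    found <- start > 0
    if (!any(found)) return(NA_character_)
    end <- start + nchar(padded)
    # latest end position first; ties go to the longer name,
    # so "Los Angeles" is preferred over "Angeles"
    best <- order(!found, -end, -nchar(padded))[1]
    city_names[best]
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in for world.cities$name (assumption, for illustration only)
city_names <- c("Los Angeles", "Angeles", "Bangkok", "Enschede", "Al", "U")

affiliation <- c(
  "[SCOTT, ALLEN J.] UNIV CALIF LOS ANGELES, DEPT GEOG, LOS ANGELES, CA 90095 USA",
  "[VAN DUINEN, RIANNE; VAN DER VEEN, ANNE] UNIV TWENTE, DEPT WATER ENGN & MANAGEMENT, DRIENERLOLAAN 5,POB 217, NL-7500 AE ENSCHEDE, NETHERLANDS.",
  "[ANANTSUKSOMSRI, SUTEE] CHULALONGKORN UNIV, FAC ARCHITECTURE, BANGKOK, THAILAND.",
  "Prem")

cities <- last_city(affiliation, city_names)
# cities is c("Los Angeles", "Enschede", "Bangkok", NA)
```

With the full world.cities table the result can differ from this toy run: the answer's output shows that "Usa" is a city in Japan, so a trailing "USA" would still count as a whole-word match. Restricting the lookup to larger cities or to a known country may be needed in practice.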

Regarding "r - Extract cities from a character vector in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48224385/
