python - New column in a Pandas DataFrame based on existing column values and condition lists

Reposted. Author: 太空宇宙. Updated: 2023-11-04 11:10:13

Following up on this question: New column in pandas dataframe based on existing column values

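I have a DataFrame with a column named 'Country' that lists several countries of the world. I need to create another column with a region specifier such as 'Europe'. I have three lists of countries belonging to several regions, so if the country in df['Country'] matches one in the 'Europe' list, the 'Europe' specifier is inserted into the new column df['Region'].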

My data is here: https://sendeyo.com/up/d/2acd2eb849

The problem is that the solutions given in the linked question work on the example DataFrame but not on my data. My DataFrame looks like this:

Year    Country Population  GDP 
1870 Austria 4,520 8,419
1870 Belgium 5,096 13,716
1870 Denmark 1,888 3,782
1870 Finland 1,754 1,999
1870 France 38,440 72,100

My lists:

Europa = ["Austria", "Belgium", "Denmark"]

RamasOccidentales = ["Australia","New Zealand","Canada","United States"]

Latinoamerica = ["Brazil","Chile","Uruguay"]

Asia = ["Indonesia","Japan","Sri Lanka"]

Expected result:

Year    Country Population  GDP Region
1870 Austria 4,520 8,419 Europa
1870 Belgium 5,096 13,716 Europa
1870 Denmark 1,888 3,782 Europa
1870 Finland 1,754 1,999 Europa
1870 France 38,440 72,100 Europa

This is the code I tried:

def Continent(country):
    return "Europa" if country in Europa else "Unknown"

df['Region'] = df['Country'].apply(Continent)
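
A minimal reproduction of this attempt on the five sample rows suggests that the function itself behaves as written: Finland and France come back as "Unknown" only because they are missing from the Europa list. The DataFrame below is rebuilt by hand for illustration and contains only the Country column.

```python
import pandas as pd

# List exactly as given in the question (no Finland, no France)
Europa = ["Austria", "Belgium", "Denmark"]

def Continent(country):
    # Membership test against the Europa list only
    return "Europa" if country in Europa else "Unknown"

# Hand-built copy of the sample rows (an assumption, not the linked data file)
df = pd.DataFrame({"Country": ["Austria", "Belgium", "Denmark", "Finland", "France"]})
df["Region"] = df["Country"].apply(Continent)
print(df)
```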

Thanks.

Best answer

Use Series.map with a dictionary created from the lists:

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]

d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica,'Asia':Asia}

#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}

df['Region'] = df['Country'].map(d1)

print (df)
Year Country Population GDP Region
0 1870 Austria 4,520 8,419 Europa
1 1870 Belgium 5,096 13,716 Europa
2 1870 Denmark 1,888 3,782 Europa
3 1870 Finland 1,754 1,999 Europa
4 1870 France 38,440 72,100 Europa

print (d1)

{'Austria': 'Europa', 'Belgium': 'Europa', 'Denmark': 'Europa',
'France': 'Europa', 'Finland': 'Europa',
'Australia': 'RamasOccidentales',
'New Zealand': 'RamasOccidentales',
'Canada': 'RamasOccidentales',
'United States': 'RamasOccidentales',
'Brazil': 'Latinoamerica', 'Chile': 'Latinoamerica',
'Uruguay': 'Latinoamerica', 'Indonesia': 'Asia',
'Japan': 'Asia', 'Sri Lanka': 'Asia'}
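
Countries that appear in none of the lists have no key in the inverted dictionary, so Series.map leaves NaN for them. A minimal sketch of filling those gaps, assuming the placeholder label "Unknown" (borrowed from the question's attempt) and a hypothetical extra row for "Norway" that is not in the original data:

```python
import pandas as pd

d = {
    "Europa": ["Austria", "Belgium", "Denmark", "France", "Finland"],
    "Asia": ["Indonesia", "Japan", "Sri Lanka"],
}

# Invert region -> countries into country -> region
d1 = {country: region for region, countries in d.items() for country in countries}

# "Norway" is a made-up row used only to show the unmatched case
df = pd.DataFrame({"Country": ["Austria", "Japan", "Norway"]})

# map() yields NaN for keys missing from d1; fillna() replaces them
df["Region"] = df["Country"].map(d1).fillna("Unknown")
print(df)
```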

For 10k rows, Series.map is about 2.58 times faster than the np.select alternatives:

import numpy as np
import pandas as pd

np.random.seed(2019)

Europa = ["Austria", "Belgium", "Denmark",'France','Finland']
RamasOccidentales = ["Australia","New Zealand","Canada","United States"]
Latinoamerica = ["Brazil","Chile","Uruguay"]
Asia = ["Indonesia","Japan","Sri Lanka"]

d = {'Europa':Europa,'RamasOccidentales':RamasOccidentales,
'Latinoamerica':Latinoamerica,'Asia':Asia}

d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
df = pd.DataFrame({'Country': np.random.choice(list(d1.keys()), size=10000)})

In [280]: %%timeit
...: d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
...:
...: df['Region'] = df['Country'].map(d1)
...:
3.04 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [281]: %%timeit
...: classification_countries={'Europa':Europa,
...: 'RamasOccidentales':RamasOccidentales,
...: 'Latinoamerica':Latinoamerica ,
...: 'Asia':Asia}
...:
...: cond=[df['Country'].isin(classification_countries[key]) for key in classification_countries]
...: values=[ key for key in classification_countries]
...:
...: df['Region']=np.select(cond,values)
...:
7.86 ms ± 56.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [282]: %%timeit
...: cond=[df['Country'].isin(Europa),df['Country'].isin(RamasOccidentales),df['Country'].isin(Latinoamerica),df['Country'].isin(Asia)]
...: values=['Europa','RamasOccidentales','Latinoamerica','Asia']
...: df['Region']=np.select(cond,values)
...:
7.96 ms ± 281 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [293]: %%timeit
...: classification_countries={'Europa':Europa,
...: 'RamasOccidentales':RamasOccidentales,
...: 'Latinoamerica':Latinoamerica ,
...: 'Asia':Asia}
...:
...: dict_cond_values= {key:df['Country'].isin(classification_countries[key]) for key in classification_countries}
...:
...:
...: df['Region']=np.select(dict_cond_values.values(),dict_cond_values.keys())
...:
8.54 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
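
Outside IPython, the comparison can be sketched with the standard timeit module. This is not the original benchmark: the helper names with_map and with_select are made up for illustration, default="Unknown" is passed to np.select so that the fallback value is a string, and the two results are checked for equality before timing. Absolute timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(2019)

d = {
    "Europa": ["Austria", "Belgium", "Denmark", "France", "Finland"],
    "RamasOccidentales": ["Australia", "New Zealand", "Canada", "United States"],
    "Latinoamerica": ["Brazil", "Chile", "Uruguay"],
    "Asia": ["Indonesia", "Japan", "Sri Lanka"],
}
d1 = {country: region for region, countries in d.items() for country in countries}
df = pd.DataFrame({"Country": np.random.choice(list(d1.keys()), size=10000)})

def with_map():
    # Dictionary lookup per value
    return df["Country"].map(d1)

def with_select():
    # One boolean mask per region, then pick the first matching label
    cond = [df["Country"].isin(countries) for countries in d.values()]
    values = list(d.keys())
    return pd.Series(np.select(cond, values, default="Unknown"), index=df.index)

# Both approaches should label every row identically
same = bool((with_map() == with_select()).all())
print("same result:", same)
print("map   :", timeit.timeit(with_map, number=20))
print("select:", timeit.timeit(with_select, number=20))
```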

Regarding "python - New column in a Pandas DataFrame based on existing column values and condition lists", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/58451464/
