
python - Extract full country names from strings and use them as a DataFrame column

Reposted. Author: 行者123. Updated: 2023-11-30 22:42:20

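I have data like the following. How can I convert it into a DataFrame? I need the country name (some country names contain a comma) as the first column, and each of the other values in its own column.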

The input is a txt file containing multiple countries:

Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10, 9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485

The output should be a DataFrame with the country name as the first column:

Czech Republic  22  22  22  21  21  21  21  21  19  18  16  14  13  12  11  11  10  9

Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354 402 509 477 482 511 485

Congo, Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643 697 708 710 702 692 666

Best answer

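You can first use read_csv (a .txt file is no problem) with a separator that does not occur in the values, such as |, so that each line is read into a Series. Then use str.extract and str.strip to pull the country names into one column, and str.split to split the remaining values on , into the other columns: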

import pandas as pd
# pandas.compat.StringIO is not available in current pandas; use io.StringIO
from io import StringIO

temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
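# after testing, replace StringIO(temp) with the file name, e.g. 'filename.csv'
# "|" does not occur in the data, so each line is read as a single field;
# [0] selects that one column as a Series (squeeze=True was removed in pandas 2.0)
s = pd.read_csv(StringIO(temp), sep="|", header=None)[0]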
print (s)
0 Czech Republic,22,22,22,21,21,21,21,21,19,18,1...
1 Congo,Dem.Rep.,275,306,327,352,376,411,420,466...
2 Congo,Rep.,209,222,231,243,255,269,424,457,367...
Name: 0, dtype: object

df = s.str.extract('([A-Za-z ,.]+)([0-9,]+)', expand=True)
df[0] = df[0].str.strip(',')
df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None).reset_index()
# reset column names to 0, 1, 2, ...
df.columns = range(len(df.columns))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354

13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485

If you need the country names as the index, use this as the last step instead (right after extract and strip):

df = df.set_index(0)[1].str.split(',', expand=True).rename_axis(None)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 \
Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354

12 13 14 15 16 17
Czech Republic 13 12 11 11 10 9
Congo,Dem.Rep. 697 708 710 702 692 666
Congo,Rep. 402 509 477 482 511 485
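
Note that str.split returns strings, so the value columns produced by the extract/split approach above have object dtype rather than a numeric one. The following is a minimal sketch of one way to convert them, assuming a shortened two-row sample and the hypothetical variable names parts, names and values, none of which come from the original answer:

```python
import pandas as pd

# Hypothetical shortened sample in the same layout as the question's data.
s = pd.Series([
    "Czech Republic,22,22,21",
    "Congo,Dem.Rep.,275,306,327",
])

# Group 1: the name (letters, spaces, commas, dots); group 2: the numbers.
parts = s.str.extract(r"([A-Za-z ,.]+)([0-9,]+)", expand=True)
names = parts[0].str.strip(",")

# One string column per value after splitting on the comma.
values = parts[1].str.split(",", expand=True)

# Convert every value column from string to integer.
values = values.astype(int)

# Put the country name in front of the numeric columns.
result = pd.concat([names.rename("country"), values], axis=1)
print(result)
```

astype(int) raises an error if any row has fewer values than the others (the missing cells become None after the split), so this sketch only fits data where every line has the same number of values.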

Another solution uses the regex from another answer. It can be passed directly as the sep parameter; you only need to add engine='python' to avoid a warning:

import pandas as pd
# pandas.compat.StringIO is not available in current pandas; use io.StringIO
from io import StringIO


temp=u"""Czech Republic,22,22,22,21,21,21,21,21,19,18,16,14,13,12,11,11,10,9
Congo,Dem.Rep.,275,306,327,352,376,411,420,466,472,528,592,643,697,708,710,702,692,666
Congo,Rep.,209,222,231,243,255,269,424,457,367,545,313,354,402,509,477,482,511,485"""
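# after testing, replace StringIO(temp) with the file name, e.g. 'filename.csv'
# split only on commas that are followed by a digit (raw string avoids an invalid-escape warning)
df = pd.read_csv(StringIO(temp), sep=r",(?=\d)", header=None, engine='python')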

print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 Czech Republic 22 22 22 21 21 21 21 21 19 18 16 14
1 Congo,Dem.Rep. 275 306 327 352 376 411 420 466 472 528 592 643
2 Congo,Rep. 209 222 231 243 255 269 424 457 367 545 313 354

13 14 15 16 17 18
0 13 12 11 11 10 9
1 697 708 710 702 692 666
2 402 509 477 482 511 485
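
To see why the lookahead separator keeps the country names intact, the same pattern can be tried on a single line with re.split from the standard library. This is only an illustrative sketch; the names pattern, line and fields are hypothetical and the sample line is shortened:

```python
import re

# Same pattern as the sep argument: a comma only when a digit follows it.
pattern = re.compile(r",(?=\d)")

line = "Congo,Dem.Rep.,275,306,327"
fields = pattern.split(line)

# The comma inside the name is followed by a letter, so it is not a split point.
print(fields)  # ['Congo,Dem.Rep.', '275', '306', '327']
```

The lookahead only checks the character after the comma, so a name in which a comma is directly followed by a digit would be split in the wrong place.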

Regarding python - extracting full country names from strings and using them as a DataFrame column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/42174698/
