
python - Pandas: DataFrames won't merge

Reposted. Author: 行者123. Last updated: 2023-11-28 16:25:48

I have the two DataFrames below (they can be found here and here):

df= pd.read_csv('Thesis/ExternalData/naics_conversion_data/SIC2CRPCats.csv', \
engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

I only give the code for reading df because it has some unusual formatting problems.

df.dtypes

SICcode object
Catcode object
Category object
SICname object
MultSIC object
dtype: object

merged.dtypes

2012 NAICS Code float64
2002to2007 NAICS float64
SICcode object
dtype: object

df.columns.tolist()
['SICcode', 'Catcode', 'Category', 'SICname', 'MultSIC']

merged.columns.tolist()
['2012 NAICS Code', '2002to2007 NAICS', 'SICcode']

df.head(3)

   SICcode Catcode                                Category SICname MultSIC
0      111   A1500    Wheat, corn, soybeans and cash grain   Wheat       X
1      112   A1600  Other commodities (incl rice, peanuts)    Rice       X
2      115   A1500    Wheat, corn, soybeans and cash grain    Corn       X

merged.sort_values('SICcode')

2012 NAICS Code 2002to2007 NAICS SICcode
89 212210 212210 1011
93 212234 212234 1021
92 212231 212231 1031
90 212221 212221 1041
91 212222 212222 1044
96 212299 212299 1061
94 212234 212234 1061
119 213114 213114 1081
1770 541360 541360 1081
233 238910 238910 1081
95 212291 212291 1094
97 212299 212299 1099
3 111140 111140 111
6 111160 111160 112
4 111150 111150 115
0 111110 111110 116

I am trying to merge them with the following code: merged=pd.merge(merged,df, how='right', on='SICcode')

The result is:

2012 NAICS Code        0
2002to2007 NAICS 0
SICcode 1007
Catcode 991
Category 1007
SICname 1007
MultSIC 906
dtype: int64

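I suspect the problem lies in the formatting of df, but I don't know how to describe it (I have heard the term "white space"; maybe it applies here) or how to fix it. Does anyone have any ideas?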

Best answer

I believe this is the cause of your problem:

In [47]: merged[merged.SICcode == 'Aux']
Out[47]:
2012 NAICS Code 2002to2007 NAICS SICcode
1828 551114.0 551114.0 Aux

which leads to different dtypes for the key column:

In [61]: df.dtypes
Out[61]:
SICcode int64
Catcode object
Category object
SICname object
MultSIC object
dtype: object

In [62]: merged.dtypes
Out[62]:
2012 NAICS Code float64
2002to2007 NAICS float64
SICcode object
dtype: object

In [63]: df.SICcode.unique()
Out[63]: array([ 111, 112, 115, ..., 9711, 9721, 9999], dtype=int64)

In [64]: merged.SICcode.head(10).unique()
Out[64]: array(['116', '119', '111', '115', '112', '139'], dtype=object)
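The diagnosis above can be reproduced on toy data. The following is only a sketch: the frames `left` and `right` are hypothetical stand-ins for df and merged, built with a few values taken from the output above. It shows one way to locate the non-numeric key values and to confirm that string keys never match integer keys.

```python
import pandas as pd

# Hypothetical stand-ins for the two frames; dtype=object mimics a key column
# that pandas could not parse as numbers because of a stray label.
left = pd.DataFrame({"SICcode": [111, 112, 115]})
right = pd.DataFrame(
    {"SICcode": pd.Series(["111", "112", "Aux"], dtype=object)}
)

# Values that cannot be parsed as numbers become NaN under errors="coerce",
# so the mask below picks out the offending rows.
as_number = pd.to_numeric(right["SICcode"], errors="coerce")
bad_rows = right[as_number.isna()]
print(bad_rows)

# Inspect the Python types actually stored in the object column.
right_types = set(right["SICcode"].map(type))
print(right_types)

# Count how many right-hand keys appear on the left, before and after cleaning.
overlap_raw = int(right["SICcode"].isin(left["SICcode"]).sum())
overlap_clean = int(as_number.isin(left["SICcode"]).sum())
print(overlap_raw, overlap_clean)
```

If the raw overlap is zero and the cleaned overlap is not, the key dtype is the likely culprit rather than white space.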

So you can do this:

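import pandas as pd

# The python engine is required because the separator is a regex
# (two or more whitespace characters)
url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/SIC2CRPCats.csv'
df = pd.read_csv(url, engine='python', sep=r'\s{2,}', encoding='utf-8_sig')

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/test.merge'
merged = pd.read_csv(url, index_col=0)

# cleaning data: non-numeric keys such as 'Aux' become NaN
merged.SICcode = pd.to_numeric(merged.SICcode, errors='coerce')

# left join on df, so every row of df is kept
mrg = df.merge(merged, on='SICcode', how='left')

mrg.head()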

Output:

In [51]: mrg.head()
Out[51]:
SICcode Catcode Category \
0 111 A1500 Wheat, corn, soybeans and cash grain
1 112 A1600 Other commodities (incl rice, peanuts, honey)
2 115 A1500 Wheat, corn, soybeans and cash grain
3 116 A1500 Wheat, corn, soybeans and cash grain
4 119 A1500 Wheat, corn, soybeans and cash grain

SICname MultSIC 2012 NAICS Code 2002to2007 NAICS
0 Wheat X 111140.0 111140.0
1 Rice X 111160.0 111160.0
2 Corn X 111150.0 111150.0
3 Soybeans X 111110.0 111110.0
4 Cash grains, NEC X 111120.0 111120.0
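
To check that a merge like this really matched rows, one option is the `indicator` parameter of `merge`. The sketch below is not part of the original answer: `df_small` and `merged_small` are hypothetical miniature tables assembled from a few values shown above, and the cleaning step mirrors the `pd.to_numeric(..., errors='coerce')` call in the answer.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables from the question.
df_small = pd.DataFrame(
    {"SICcode": [111, 112, 115], "Catcode": ["A1500", "A1600", "A1500"]}
)
merged_small = pd.DataFrame(
    {
        "2012 NAICS Code": [111140.0, 111160.0, 551114.0],
        "SICcode": pd.Series(["111", "112", "Aux"], dtype=object),
    }
)

# Same cleaning step as in the answer: non-numeric labels become NaN.
merged_small["SICcode"] = pd.to_numeric(merged_small["SICcode"], errors="coerce")

# indicator=True adds a "_merge" column recording where each row came from.
check = df_small.merge(merged_small, on="SICcode", how="left", indicator=True)
print(check)

# Rows tagged "both" found a partner; "left_only" rows did not.
matched = int((check["_merge"] == "both").sum())
unmatched = int((check["_merge"] == "left_only").sum())
print(matched, unmatched)
```

A count of zero for "both" would reproduce the symptom in the question, where every NAICS column came back empty.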

This post is based on a similar question found on Stack Overflow, "python - Pandas: DataFrames won't merge": https://stackoverflow.com/questions/36808357/
