
python - Adding a new column from existing columns with pd.Series() creates NaN values

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:33:15

I want to add a new column to a DataFrame based on existing columns. The new column is simply a tuple of the values of three columns:

df0.shape
# (5410185, 17)
new_col = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])))
new_col.shape
# (5410185,)
new_col.isnull().sum()
# 0
df0['abc'] = new_col
df0['abc'].isnull().sum()
# 14334

I tried the same approach on a sample df, and it works as expected:

test = pd.DataFrame(np.random.randint(0,1000,100000000).reshape(1000000,100))
test['new'] = pd.Series(list(zip(test[1], test[2], test[3])))
test['new'].isnull().sum()
# 0

Using assign gives the same result:

df0 = df0.assign(new_col2 = pd.Series(list(zip(df0['a'], df0['b'], df0['c']))))
df0['new_col2'].isnull().sum()
# 14334

I found two similar questions, this and this. I suspect my problem is also related to the index. Only 89 index labels match position by position:

np.sum(df0.index == new_col.index)
# 89
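
The position-by-position comparison above can be complemented with checks that do not depend on order. The following is a minimal sketch on a small hypothetical frame (the names df, new_col and the data are illustrative, not the asker's df0), assuming the root cause is that pd.Series(list) receives a fresh 0..n-1 index:

```python
import pandas as pd

# Hypothetical frame whose index is not 0..n-1 (for example after filtering rows).
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]}, index=[0, 2, 5, 7])

# A Series built from a plain list gets a default RangeIndex 0..n-1.
new_col = pd.Series(list(zip(df["a"], df["b"])))

# Are the two indexes identical (same labels, same order)?
same_index = df.index.equals(new_col.index)

# Labels of df that have no counterpart in new_col.
missing = df.index.difference(new_col.index)

# Number of rows of df whose label is absent from new_col;
# these are the rows that would receive NaN on assignment.
n_unmatched = int((~df.index.isin(new_col.index)).sum())

print(same_index, list(missing), n_unmatched)
```

If same_index is False, a label-aligned assignment such as df['abc'] = new_col can leave NaN in every row counted by n_unmatched.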

Assigning a Series that has the same index as df0:

df0.index = new_col
df0['abc'] = df0.index
df0['abc'].isnull().sum()
# 0

Update: here are some benchmarks of @jezreal's solutions:

%time df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])), index=df0.index)
Wall time: 2.32 s

%time df0['abc'] = df0[['a','b','c']].apply(tuple, axis=1)
Wall time: 1min 42s

%time df0['abc'] = df0.set_index(['a','b','c']).index.values
Wall time: 8.68 s

%time df0['abc'] = pd.Series([tuple(x) for x in df0[['a','b','c']].values.tolist()], index=df0.index)
Wall time: 9.83 s
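
Before comparing timings, it can be worth confirming that the four variants build the same column. The sketch below does this on a small frame with a duplicated, non-default index; the frame and the names variants, expected and results are illustrative assumptions, and no timings are claimed for it:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": list("abcdef"), "b": [4, 5, 4, 5, 5, 4], "c": [7, 8, 9, 4, 2, 3]},
    index=[1, 1, 2, 2, 9, 10],
)

cols = ["a", "b", "c"]

# The four variants from the benchmark, each wrapped as a Series on df.index.
variants = {
    "zip + index": pd.Series(list(zip(df["a"], df["b"], df["c"])), index=df.index),
    "apply(tuple)": df[cols].apply(tuple, axis=1),
    "set_index": pd.Series(df.set_index(cols).index.values, index=df.index),
    "values.tolist": pd.Series(
        [tuple(x) for x in df[cols].values.tolist()], index=df.index
    ),
}

# Reference result: row-wise tuples in positional order.
expected = list(zip(df["a"], df["b"], df["c"]))

# True for every variant that reproduces the reference tuples.
results = {name: list(s) == expected for name, s in variants.items()}
print(results)
```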

Best answer

I think the new Series needs the same index as df0 so that the data aligns:

df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])), index=df0.index)

Or use apply:

df0['abc'] = df0[['a','b','c']].apply(tuple, axis=1)
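
A further option, not part of the answer above, is to skip the Series wrapper altogether: a plain list is assigned by position, so no label alignment takes place. The sketch below compares it with the index=df.index variant on a hypothetical frame (df, tuples and the column names from_list and from_series are illustrative):

```python
import pandas as pd

# Hypothetical frame with a duplicated, non-default index.
df = pd.DataFrame({"a": list("xyz"), "b": [1, 2, 3]}, index=[10, 10, 30])

tuples = list(zip(df["a"], df["b"]))

# A plain list is assigned positionally; its length must equal len(df).
df["from_list"] = tuples

# A Series is aligned on labels, so it is given df's own index explicitly.
df["from_series"] = pd.Series(tuples, index=df.index)

nan_count = int(df[["from_list", "from_series"]].isna().sum().sum())
print(df)
print(nan_count)
```

Both columns should hold the same tuples; the list form simply avoids building an intermediate Series.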

Sample:

df0 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}, index=[1,1,2,2,9,10])

print (df0)
    D  E  F  a  b  c
1   1  5  a  a  4  7
1   3  3  a  b  5  8
2   5  6  a  c  4  9
2   7  9  b  d  5  4
9   1  2  b  e  5  2
10  0  4  b  f  4  3

df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])))

print (df0)
    D  E  F  a  b  c        abc
1   1  5  a  a  4  7  (b, 5, 8)
1   3  3  a  b  5  8  (b, 5, 8)
2   5  6  a  c  4  9  (c, 4, 9)
2   7  9  b  d  5  4  (c, 4, 9)
9   1  2  b  e  5  2        NaN
10  0  4  b  f  4  3        NaN
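
The output above is what label alignment produces: the new Series is indexed 0..5, so each row of df0 looks up its own label in it. A minimal sketch of that lookup, assuming the assignment behaves like an explicit reindex on df0.index (the names s and aligned are illustrative):

```python
import pandas as pd

df0 = pd.DataFrame(
    {"a": list("abcdef"), "b": [4, 5, 4, 5, 5, 4], "c": [7, 8, 9, 4, 2, 3]},
    index=[1, 1, 2, 2, 9, 10],
)

# Default index 0..5: label 1 holds ('b', 5, 8), label 2 holds ('c', 4, 9).
s = pd.Series(list(zip(df0["a"], df0["b"], df0["c"])))

# Look up each label of df0 in s; labels 9 and 10 do not exist in s.
aligned = s.reindex(df0.index)

# Assigning the Series directly aligns on labels in the same way.
df0["abc"] = s

print(aligned)
print(df0["abc"].isna().sum())
```

Both rows labelled 1 receive the tuple stored at label 1 of s, both rows labelled 2 receive the tuple at label 2, and labels 9 and 10 find nothing and become NaN.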

# Either of the following assigns the correct tuples:
df0['abc'] = pd.Series(list(zip(df0['a'], df0['b'], df0['c'])), index=df0.index)
df0['abc'] = df0[['a','b','c']].apply(tuple, axis=1)


print (df0)
    D  E  F  a  b  c        abc
1   1  5  a  a  4  7  (a, 4, 7)
1   3  3  a  b  5  8  (b, 5, 8)
2   5  6  a  c  4  9  (c, 4, 9)
2   7  9  b  d  5  4  (d, 5, 4)
9   1  2  b  e  5  2  (e, 5, 2)
10  0  4  b  f  4  3  (f, 4, 3)

Regarding python - Adding a new column from existing columns with pd.Series() creates NaN values, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47126647/
