
python - Confusing behaviour of Pandas crosstab() function with dataframe containing NaN values

Reposted. Author: 太空狗. Updated: 2023-10-30 02:18:55

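I'm using Python 3.4.1 with numpy 0.10.1 and pandas 0.17.0. I have a large dataframe that lists the species and gender of individual animals. It's a real-world dataset, so there are inevitably missing values, represented by NaN. A simplified version of the data can be generated as: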

import numpy as np
import pandas as pd
tempDF = pd.DataFrame({ 'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
'species': ["dog","dog",np.nan,"dog","dog","cat","cat","cat","dog","cat","cat","dog","dog","dog","dog",np.nan,"cat","cat","dog","dog"],
'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"]})

Printing the dataframe gives:

    gender  id species
0 male 1 dog
1 female 2 dog
2 female 3 NaN
3 male 4 dog
4 male 5 dog
5 female 6 cat
6 female 7 cat
7 NaN 8 cat
8 male 9 dog
9 male 10 cat
10 female 11 cat
11 male 12 dog
12 female 13 dog
13 female 14 dog
14 male 15 dog
15 female 16 NaN
16 male 17 cat
17 female 18 cat
18 NaN 19 dog
19 male 20 dog

I want to generate a crosstab showing the number of males and females of each species, using the following:

pd.crosstab(tempDF['species'],tempDF['gender'])

This produces the following table:

gender   female  male
species
cat           4     2
dog           3     7

This is what I'd expect. However, if I include the margins=True option, it produces:

pd.crosstab(tempDF['species'],tempDF['gender'],margins=True)

gender   female  male  All
species
cat           4     2    7
dog           3     7   11
All           9     9   20

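As you can see, the marginal totals appear to be incorrect, presumably because of the missing data in the dataframe. Is this the intended behaviour? To my mind it seems confusing. Surely the marginal totals should be the totals of the rows and columns that appear in the table, and should not include any missing data that the table doesn't show. Including dropna=False does not affect the outcome.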
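To make the expected totals concrete, here is a minimal sketch that builds the margins by hand from the cells the table actually displays. The variable name `manual` is only illustrative, and the `id` column is built with `range` for brevity; the data is otherwise the same as above.

```python
import numpy as np
import pandas as pd

tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"]})

# Crosstab without margins: rows with NaN in either column are left out.
table = pd.crosstab(tempDF['species'], tempDF['gender'])

# Add totals computed only from the displayed cells.
manual = table.copy()
manual['All'] = manual.sum(axis=1)      # row totals: cat 6, dog 10
manual.loc['All'] = manual.sum(axis=0)  # column totals: female 7, male 9, All 16
print(manual)
```

These hand-computed totals (6, 10 and 16) are what I would have expected margins=True to report, rather than 7, 11 and 20.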

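I can delete any rows containing NaN before creating the table, but this seems like a lot of extra work and a lot of extra things to think about when doing an analysis. Should I report this as a bug?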
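The drop-first approach could look like the sketch below. It assumes that only NaNs in the two columns being tabulated matter, hence the `subset` argument to `dropna`; the name `complete` is illustrative.

```python
import numpy as np
import pandas as pd

tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"]})

# Keep only rows where both tabulated columns are present (16 of the 20 rows).
complete = tempDF.dropna(subset=['species', 'gender'])

# With no NaN left, the margins can only reflect the displayed cells.
table = pd.crosstab(complete['species'], complete['gender'], margins=True)
print(table)
```

This gives totals that agree with the cells, but the four incomplete records disappear from the analysis entirely, which is exactly the extra bookkeeping I was hoping to avoid.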

Best answer

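I suppose one workaround would be to convert the NaNs to 'missing' before creating the table; the crosstab will then include columns and rows specifically for the missing values: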

pd.crosstab(tempDF['species'].fillna('missing'),tempDF['gender'].fillna('missing'),margins=True)

gender   female  male  missing  All
species
cat           4     2        1    7
dog           3     7        1   11
missing       2     0        0    2
All           9     9        2   20

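Personally, I would like to see this as the default behaviour, so that I wouldn't have to remember to replace all the NaNs in every crosstab calculation.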
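Until then, one way to avoid repeating the fillna calls is a small wrapper. This is only a sketch: `crosstab_with_missing` and its `label` parameter are hypothetical names, not part of pandas, and it assumes both inputs are Series that can hold a string label.

```python
import numpy as np
import pandas as pd

def crosstab_with_missing(rows, cols, label="missing", **kwargs):
    """Hypothetical helper: label NaN values so crosstab counts them explicitly."""
    return pd.crosstab(rows.fillna(label), cols.fillna(label), **kwargs)

tempDF = pd.DataFrame({
    'id': list(range(1, 21)),
    'species': ["dog", "dog", np.nan, "dog", "dog", "cat", "cat", "cat", "dog", "cat",
                "cat", "dog", "dog", "dog", "dog", np.nan, "cat", "cat", "dog", "dog"],
    'gender': ["male", "female", "female", "male", "male", "female", "female", np.nan,
               "male", "male", "female", "male", "female", "female", "male", "female",
               "male", "female", np.nan, "male"]})

# Same call as above, with the NaN replacement handled inside the helper.
table = crosstab_with_missing(tempDF['species'], tempDF['gender'], margins=True)
print(table)
```

Extra keyword arguments such as margins=True are passed straight through to pd.crosstab, so the helper can be used wherever the plain call would be.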

Regarding python - Confusing behaviour of Pandas crosstab() function with dataframe containing NaN values, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33303314/
