python - Descriptive statistics in Python/with Pandas, with std in parentheses


Reposted by: 行者123. Updated: 2023-12-03 14:38:56

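This question is about best practice for producing descriptive statistics in Python with output formatted like the tables found in academic publications: the mean, with its standard deviation in parentheses underneath. The end goal is to export the table in LaTeX tabular format (or another format such as HTML).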

Example (Deucherta & Eugster (2018)):

Pandas :

The classic way to get descriptive statistics in Pandas is the describe() method of DataFrame.

import numpy as np
import pandas as pd 

# Generate a DataFrame to have an example 
df = pd.DataFrame(
    {"Age" : np.random.normal(20,15,5), 
     "Income": np.random.pareto(1,5)*20_000 }
    )
# The describe method to get means and stds
df.describe().loc[["mean", "std"]].T
>>>
                mean            std
Age        15.322797      13.449727
Income  97755.733510  143683.686484
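
As a side note, the same two columns can be requested directly instead of slicing the full describe() output. The following is a minimal sketch, assuming a seeded generator (the seed is arbitrary and not part of the original question) so that the numbers are reproducible:

```python
import numpy as np
import pandas as pd

# Seeded generator so the example is reproducible (the seed is an assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"Age": rng.normal(20, 15, 5),
     "Income": rng.pareto(1, 5) * 20_000}
)

# Ask only for the two statistics of interest; transpose to get one row per variable.
summary = df.agg(["mean", "std"]).T
print(summary)
```

Both approaches use the sample standard deviation, so the values should agree with describe().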

What I want is the following output:

Age        15.32 
          (13.44)
Income  97755.73  
        (143683.68)

A solution that also works for multi-index DataFrames would be nice:

df2 = pd.DataFrame(
    {"Age" : np.random.normal(20,15,5), 
     "Income": np.random.pareto(1,5)*20_000 }
    )
df_c = pd.concat([df,df2], keys = ["A", "B"])
>>>

and get

                A           B
Age          23.15       21.33
            (11.62)      (9.34)
Income    68415.53    46619.51
         (95612.40)  (64596.10)

My current solution:

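idx = pd.IndexSlice
# Describe each group (first index level), then keep only mean and std
df_desc = df_c.groupby(level=0).describe()
df_desc = df_desc.loc[idx[:], idx[:, ["mean", "std"]]].T
# Wrap the std rows in parentheses
# (applymap is deprecated in favor of DataFrame.map since pandas 2.1)
df_desc.loc[idx[:, ["std"]], idx[:]] = df_desc.loc[idx[:, ["std"]], idx[:]].applymap(
    lambda x: "({:.2f})".format(x))
print(df_desc)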

>>>
                     A           B
Age    mean     23.1565     21.3359
      std      (11.62)      (9.34)
Income mean     68415.5     46619.5
      std   (95612.40)  (64596.10)

Problem 1:

I have not found a way to hide the second index column [mean, std, mean, std].

I then want to export my df to LaTeX:

df_desc.to_latex()

>>>
\begin{tabular}{llll}
\toprule
       &     &            A &           B \\
\midrule
Age & mean &       5.5905 &     29.5894 \\
       & std &      (16.41) &     (13.03) \\
Income & mean &       531970 &     72653.7 \\
       & std &  (875272.44) &  (79690.18) \\
\bottomrule
\end{tabular}

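Problem 2:
The & characters of the table are not aligned, which makes editing a bit tedious (I use a VSCode extension to align the &).

Overall, I find this solution tedious and inelegant.

Solution?

I don't know how to get the desired result without complicated string manipulation.

I have looked at Pandas styling, but I don't think it is the best solution.

There is also Statsmodels Tables, but I did not find a simple way to solve my problem with it. Statsmodels Tables seems to be the most promising solution, but I don't know how to implement it. There are some descriptive statistics functions in StatsModels, but I read on GitHub that they are more or less deprecated.

So what is the best way to produce these tables?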

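Best answer

I just ran into a similar problem and found your post, so here is how I dealt with the issues you mention.
Problem 1: hiding the second index column
I prefer solution b), but I am leaving a) here for illustration.
a) droplevel & set_index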

df_desc.index.droplevel(level=1)

>>>
Index(['Age', 'Age', 'Income', 'Income'], dtype='object')

Use this expression together with set_index:

df_desc.set_index(df_desc.index.droplevel(level=1), inplace=True)

This results in:

print(df_desc)

>>>
                  A           B
Age         17.1527     23.9678
Age         (19.73)     (12.01)
Income       293271     12178.8
Income  (400059.27)  (14483.35)

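This does not look satisfactory, because the index values Age and Income each appear twice.
That is why I came up with the following.
b) Create a new DataFrame from pd.DataFrame.values and assign the index manually
First, reset the index: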

df_desc = df_desc.reset_index(drop=True)

print(df_desc)

>>>
            A           B
0      17.306      11.425
1     (14.40)     (16.67)
2     88016.7     67280.4
3  (73054.44)  (54953.69)

Second, create the new DataFrame, specifying the index and column names manually.
Note that I pass df_desc.values as the data argument (first position).

df_new = pd.DataFrame(df_desc.values, index=["Age", "", "Income", ""], columns=["A", "B"])

print(df_new)

>>>
                 A           B
Age        27.7039     20.8031
           (13.99)     (13.92)
Income     20690.7     7370.44
        (29470.03)  (13279.10)
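
Approach b) hard-codes the index and column labels. A minimal sketch of how the same layout could be built for any number of variables and groups follows. It is not the answerer's code: the helper name mean_std_table, its decimals parameter, the make_sample helper, and the seed are all assumptions made for illustration, and the group label is assumed to be the first index level, as in df_c.

```python
import numpy as np
import pandas as pd


def mean_std_table(data, decimals=2):
    """Build a string table: mean on one row, "(std)" on the row below.

    Assumes the group label is the first index level of `data`.
    `mean_std_table` and `decimals` are illustrative names, not pandas API.
    """
    grouped = data.groupby(level=0)
    means = grouped.mean().T  # rows: variables, columns: groups
    stds = grouped.std().T
    fmt = "{:." + str(decimals) + "f}"

    rows, labels = [], []
    for var in means.index:
        rows.append([fmt.format(v) for v in means.loc[var]])
        labels.append(var)
        rows.append(["(" + fmt.format(v) + ")" for v in stds.loc[var]])
        labels.append("")  # blank label under the variable name
    return pd.DataFrame(rows, index=labels, columns=means.columns)


rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only


def make_sample():
    # Same shape of data as in the question: 5 draws per variable.
    return pd.DataFrame({"Age": rng.normal(20, 15, 5),
                         "Income": rng.pareto(1, 5) * 20_000})


df_c = pd.concat([make_sample(), make_sample()], keys=["A", "B"])
table = mean_std_table(df_c)
print(table)
```

Formatting the means as strings too keeps every cell at two decimals, at the cost of the table no longer holding numeric values.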

Problem 2: aligning the LaTeX table
Note that running

df_new.to_latex()

does indeed produce a somewhat messy str output:

>>> 
'\\begin{tabular}{lll}\n\\toprule\n{} &           A &           B \\\\\n\\midrule\nAge    &     27.7039 &     20.8031 \\\\\n       &     (13.99) &     (13.92) \\\\\nIncome &     20690.7 &     7370.44 \\\\\n       &  (29470.03) &  (13279.10) \\\\\n\\bottomrule\n\\end{tabular}\n'

However, wrapping it in a print statement produces the desired output:

print(df_new.to_latex())

>>>
\begin{tabular}{lll}
\toprule
{} &           A &           B \\
\midrule
Age    &     27.7039 &     20.8031 \\
       &     (13.99) &     (13.92) \\
Income &     20690.7 &     7370.44 \\
       &  (29470.03) &  (13279.10) \\
\bottomrule
\end{tabular}
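
The difference between the two outputs comes from how an interactive session displays a string, not from to_latex() itself. A small sketch, using a short hand-written LaTeX fragment as a stand-in for the real return value (the fragment is an assumption for illustration):

```python
# A short LaTeX fragment standing in for the string returned by to_latex().
latex_source = (
    "\\begin{tabular}{ll}\n\\toprule\nAge & 27.70 \\\\\n"
    "\\bottomrule\n\\end{tabular}\n"
)

# An interactive session echoes repr(): a single line with escapes shown literally.
echoed = repr(latex_source)
print(echoed)

# print() writes the characters themselves, so each "\n" becomes a line break.
print(latex_source)
```

The string is the same in both cases, so writing it to a file gives the aligned, multi-line version.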

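Exporting the table to a LaTeX document is also fairly simple.
As you pointed out yourself, to_latex() already creates a tabular, so you only need to write it to a file and use \input in your LaTeX document.
Following the example here, do the following:
i) Save the table as a text file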

with open('mytable.tex','w') as tf:
    tf.write(df_new.to_latex())

ii) Use the exported table in the LaTeX document

\documentclass{article}
\usepackage{booktabs}
\begin{document}
\input{mytable}
\end{document}

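This example assumes that mytable.tex and the LaTeX document are in the same folder. The booktabs package is required because to_latex() uses booktabs commands for the table rules.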
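
The export step and the booktabs requirement can be sketched together as follows. This is only an illustration: write_table and needs_booktabs are hypothetical helper names, the LaTeX fragment stands in for the output of df_new.to_latex(), and a temporary directory is used so the sketch leaves no files behind.

```python
import tempfile
from pathlib import Path

# The rule commands that come from the booktabs package.
BOOKTABS_COMMANDS = ("\\toprule", "\\midrule", "\\bottomrule")


def write_table(latex_source, path):
    """Write LaTeX table source to `path` and return the path (hypothetical helper)."""
    path.write_text(latex_source, encoding="utf-8")
    return path


def needs_booktabs(latex_source):
    """Return True if the source uses any booktabs rule command."""
    return any(cmd in latex_source for cmd in BOOKTABS_COMMANDS)


# Stand-in for df_new.to_latex(); the content is made up for illustration.
latex_source = (
    "\\begin{tabular}{ll}\n\\toprule\nAge & 27.70 \\\\\n"
    "\\bottomrule\n\\end{tabular}\n"
)

with tempfile.TemporaryDirectory() as tmp:
    out = write_table(latex_source, Path(tmp) / "mytable.tex")
    written = out.read_text(encoding="utf-8")

print(needs_booktabs(written))
```

In a real project the path would point next to the LaTeX document, so that \input{mytable} finds the file.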
The final PDF output looks like this:

Regarding python - Descriptive statistics in Python/with Pandas, with std in parentheses, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54466196/
