
python - Iterating over the columns of a pandas DataFrame to compute values and insert new columns between the existing ones

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:51:25

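I'm new to Python and to programming in general, and I can't seem to find a solution to my problem. I have a DataFrame imported from an Excel sheet with 15 rows of species and their counts, and 3 columns (the stations where they were found). It is a species-by-station matrix: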

              A1    A2    A3
Species 1   1259   600   151
Species 2    912  1820   899
Species 3   1288  1491   631
Species 4     36   609  1946
Species 5   1639   819  1864
Species 6   1989   748   843
Species 7    688   271  1206
Species 8   1031   341   756
Species 9   1517  1164   138
Species 10  1290   669   811
Species 11    16   409  1686
Species 12   329   521   954
Species 13  1782   958  1727
Species 14   464  1804  1105
Species 15  1002  1483   109
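
To try the snippets on this page without the original Excel file, the matrix above can be typed in by hand. This is only a sketch: it assumes the species names form the index (which the `reset_index()` calls in the question suggest), and the variable names `data` and `index` are made up for illustration.

```python
import pandas as pd

# Counts copied from the species-by-station matrix above, one list per station.
data = {
    "A1": [1259, 912, 1288, 36, 1639, 1989, 688, 1031, 1517, 1290, 16, 329, 1782, 464, 1002],
    "A2": [600, 1820, 1491, 609, 819, 748, 271, 341, 1164, 669, 409, 521, 958, 1804, 1483],
    "A3": [151, 899, 631, 1946, 1864, 843, 1206, 756, 138, 811, 1686, 954, 1727, 1105, 109],
}

# Assumption: species names are the row index, "Species 1" .. "Species 15".
index = [f"Species {i}" for i in range(1, 16)]
df = pd.DataFrame(data, index=index)

print(df.shape)  # rows x columns
print(df.sum())  # column totals, used as the denominator for pct
```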

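For each column, I want to compute the top 10 species (the index), their values, the percentage of the column total, and the cumulative percentage, then insert these new columns after each existing column and return everything in a single DataFrame.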

This is the result I'm looking for (shown for the first two columns):

      Species    A1  pct  cum_pct     Species    A2  pct  cum_pct
0   Species 6  1989   13       13   Species 2  1820   13       13
1  Species 13  1782   11       24  Species 14  1804   13       26
2   Species 5  1639   10       35   Species 3  1491   10       37
3   Species 9  1517    9       45  Species 15  1483   10       48
4  Species 10  1290    8       53   Species 9  1164    8       56
5   Species 3  1288    8       62  Species 13   958    6       63
6   Species 1  1259    8       70   Species 5   819    5       69
7   Species 8  1031    6       77   Species 6   748    5       75
8  Species 15  1002    6       83  Species 10   669    4       79
9   Species 2   912    5       89   Species 4   609    4       84

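I managed to do this by processing each column separately, creating a new DataFrame for each one, and finally merging them with concat, using the following code: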

import pandas as pd

df = pd.read_excel(r"")  # local Excel file

#extract first column and remove others
df = df.drop(df.columns[1:], axis=1)

# create column which has percentage for each element: divide value by total sum
df["pct"] = 100*(df.iloc[:, 0] /df.iloc[:, 0].sum())

#sort by value in Column 1 (0) return only top n (10) values
df = df.sort_values(by=df.columns[0], ascending=False).head(10)

# Create column with cumulative sum
df["cum_pct"] = df.pct.cumsum()

#make index as column and change name to Species
df = df.reset_index()

df = df.rename(index=str, columns={"index": "Species"})


# For column 2
df1 = pd.read_excel(r"") #local excel file

df1 = df1.drop(df1.columns[2:], axis=1)
df1 = df1.drop(df1.columns[0], axis=1)

# create column which has percentage for each element: divide value by total sum
df1["pct"] = 100*(df1.iloc[:, 0] /df1.iloc[:, 0].sum())


#sort by value in Column 1 (0) return only top n (10) values
df1 = df1.sort_values(by=df1.columns[0], ascending=False).head(10)

# Create column with cumulative sum
df1["cum_pct"] = df1.pct.cumsum()

# set index as first column
df1 = df1.reset_index()

df1 = df1.rename(index=str, columns={"index": "Species"})


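# concatenate all dataframes (join_axes was removed in pandas 1.0; both frames
# already share the same index here, so a plain concat gives the same result)
result = pd.concat([df, df1], axis=1)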

#convert numbers to int, exception = ignore
result = result.astype(int, errors="ignore")

print(result)

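This code works, but my real dataset is much larger, usually more than 50 columns, so I'd like to know whether I can iterate over every column and produce the same DataFrame as shown above. Sorry for the long read.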

Best answer

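Use a for loop with Series.nlargest and DataFrame.assign (with lambda functions) to compute pct and cum_pct, then pandas.concat to merge everything into the final output frame. Here pct is a fraction between 0 and 1, computed against the total of the whole column rather than the top 10 only: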

frames = []
for col in df:
    frames.append(df[col].nlargest(10).to_frame()
                  .assign(pct=lambda x: x[col] / df[col].sum(),
                          cum_pct=lambda x: x['pct'].cumsum())
                  .rename_axis('Species').reset_index())


df_new = pd.concat(frames, axis=1)

[Output]

      Species    A1       pct   cum_pct     Species    A2       pct   cum_pct  \
0   Species 6  1989  0.130495  0.130495   Species 2  1820  0.132779  0.132779
1  Species 13  1782  0.116914  0.247408  Species 14  1804  0.131612  0.264390
2   Species 5  1639  0.107532  0.354940   Species 3  1491  0.108777  0.373167
3   Species 9  1517  0.099528  0.454468  Species 15  1483  0.108193  0.481360
4  Species 10  1290  0.084635  0.539102   Species 9  1164  0.084920  0.566280
5   Species 3  1288  0.084503  0.623606  Species 13   958  0.069891  0.636171
6   Species 1  1259  0.082601  0.706207   Species 5   819  0.059750  0.695922
7   Species 8  1031  0.067642  0.773849   Species 6   748  0.054571  0.750492
8  Species 15  1002  0.065739  0.839588  Species 10   669  0.048807  0.799300
9   Species 2   912  0.059835  0.899423   Species 4   609  0.044430  0.843729

      Species    A3       pct   cum_pct
0   Species 4  1946  0.131256  0.131256
1   Species 5  1864  0.125725  0.256981
2  Species 13  1727  0.116485  0.373466
3  Species 11  1686  0.113719  0.487185
4   Species 7  1206  0.081344  0.568528
5  Species 14  1105  0.074531  0.643059
6  Species 12   954  0.064346  0.707406
7   Species 2   899  0.060637  0.768043
8   Species 6   843  0.056860  0.824902
9  Species 10   811  0.054701  0.879603
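
With 50 or more columns it may be convenient to wrap the loop in a function. The sketch below is one possible way to do that, not part of the original answer: `top_n_summary`, its `n` and `label_blocks` parameters, and the small `toy` frame are all hypothetical names and data chosen for illustration. Because the plain result repeats the column names Species, pct and cum_pct, the optional `keys` argument of `pd.concat` is used to place each block under its source column name.

```python
import pandas as pd


def top_n_summary(df, n=10, label_blocks=False):
    """Hypothetical helper: per-column top-n table with pct and cum_pct."""
    frames = []
    for col in df.columns:
        total = df[col].sum()  # total of the whole column, not only the top n
        top = df[col].nlargest(n).to_frame()
        top["pct"] = top[col] / total
        top["cum_pct"] = top["pct"].cumsum()
        frames.append(top.rename_axis("Species").reset_index())
    # With label_blocks=True each block sits under its source column name.
    keys = list(df.columns) if label_blocks else None
    return pd.concat(frames, axis=1, keys=keys)


# Small made-up frame for illustration only (not the data from the question).
toy = pd.DataFrame(
    {"A1": [50, 30, 20], "A2": [10, 60, 30]},
    index=["Species 1", "Species 2", "Species 3"],
)

summary = top_n_summary(toy, n=2)                       # flat, repeated column names
labelled = top_n_summary(toy, n=2, label_blocks=True)   # MultiIndex columns

print(summary)
print(labelled["A2"])  # only the block computed from column A2
```

Selecting `summary["pct"]` on the flat result returns every column with that name, so the labelled variant is easier to work with when the table is processed further rather than just displayed.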

If you need the computed fields pct and cum_pct formatted as int (whole-number percentages), use this instead:

frames = []
for col in df:
    frames.append(df[col].nlargest(10).to_frame()
                  .assign(pct=lambda x: x[col] / df[col].sum(),
                          cum_pct=lambda x: x['pct'].cumsum())
                  .assign(pct=lambda x: x['pct'].mul(100).astype(int),
                          cum_pct=lambda x: x['cum_pct'].mul(100).astype(int))
                  .rename_axis('Species').reset_index())


df_new = pd.concat(frames, axis=1)

[Output]

      Species    A1  pct  cum_pct     Species    A2  pct  cum_pct     Species  \
0   Species 6  1989   13       13   Species 2  1820   13       13   Species 4
1  Species 13  1782   11       24  Species 14  1804   13       26   Species 5
2   Species 5  1639   10       35   Species 3  1491   10       37  Species 13
3   Species 9  1517    9       45  Species 15  1483   10       48  Species 11
4  Species 10  1290    8       53   Species 9  1164    8       56   Species 7
5   Species 3  1288    8       62  Species 13   958    6       63  Species 14
6   Species 1  1259    8       70   Species 5   819    5       69  Species 12
7   Species 8  1031    6       77   Species 6   748    5       75   Species 2
8  Species 15  1002    6       83  Species 10   669    4       79   Species 6
9   Species 2   912    5       89   Species 4   609    4       84  Species 10

     A3  pct  cum_pct
0  1946   13       13
1  1864   12       25
2  1727   11       37
3  1686   11       48
4  1206    8       56
5  1105    7       64
6   954    6       70
7   899    6       76
8   843    5       82
9   811    5       87
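
One detail worth knowing about the integer version: `astype(int)` drops the fractional part rather than rounding, which is why Species 13 shows 11 in A1 although its share is about 11.69%. That matches the table requested in the question. If rounding to the nearest whole percent is preferred instead, a minimal sketch of the difference (using two fractions copied from the A1 output above; the variable names are illustrative) could look like this:

```python
import pandas as pd

# Two fractions taken from the A1 output above (Species 13 and Species 2).
pct = pd.Series([0.116914, 0.059835])

truncated = pct.mul(100).astype(int)          # drops the fraction: 11.69 -> 11
rounded = pct.mul(100).round().astype(int)    # nearest integer:    11.69 -> 12

print(truncated.tolist())
print(rounded.tolist())
```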

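The original question, "Iterating over the columns of a pandas DataFrame to compute values and insert new columns between the existing ones", can be found on Stack Overflow: https://stackoverflow.com/questions/56289734/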
