python - Look up values by corresponding column header in Pandas 1.2.0 or newer


Reposted by: 行者123 · Updated: 2023-12-04 16:23:15

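The operation pandas.DataFrame.lookup is "Deprecated since version 1.2.0" (and was removed entirely in pandas 2.0), which has invalidated many previous answers.
This post attempts to serve as a canonical resource for looking up corresponding row/column pairs in pandas 1.2.0 and newer.
Some previous answers to this type of question (they rely on the now-deprecated method):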

Vectorized lookup on a pandas dataframe

Python Pandas Match Vlookup columns based on header values

Using DataFrame.lookup to get rows where columns names are a subset of a string

Python: pandas: match row value to column name/ key's value

Some current answers to this question:

Reference DataFrame value corresponding to column header

Pandas/Python: How to create new column based on values from other columns and apply extra condition to this new column

Standard lookup values with the default RangeIndex
Given the following DataFrame:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
  Col  A  B
0   B  1  5
1   A  2  6
2   A  3  7
3   B  4  8

For each row, I would like to look up the value in the column named by Col, so that the result looks like:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

Standard lookup values with a non-default index
Non-contiguous range index
Given the following DataFrame:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]}, 
                  index=[0, 2, 8, 9])

  Col  A  B
0   B  1  5
2   A  2  6
8   A  3  7
9   B  4  8

I would like to preserve the index but still find the correct corresponding value:

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

MultiIndex

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

    Col  A  B
C E   B  1  5
  F   A  2  6
D E   A  3  7
  F   B  4  8

I would like to preserve the index but still find the correct corresponding value:

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8

Lookup with a default for unmatched/not-found values
Given the following DataFrame:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

  Col  A  B
0   B  1  5
1   A  2  6
2   A  3  7
3   C  4  8  # Column C does not correspond with any column

I would like to look up the corresponding value if one exists; otherwise it should default to 0:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0  # Default value 0 since C does not correspond

Lookup with missing values in the lookup Col
Given the following DataFrame:

   Col  A  B
0    B  1  5
1    A  2  6
2    A  3  7
3  NaN  4  8  # <- Missing Lookup Key

I would like any NaN value in Col to result in a NaN value in Val:

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # NaN to indicate missing

Best answer

Standard lookup values with any index
The documentation on Looking up values by index/column labels recommends NumPy indexing via factorize and reindex as the replacement for the deprecated DataFrame.lookup.

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df

  Col  A  B  Val
0   B  1  5    5
2   A  2  6    2
8   A  3  7    3
9   B  4  8    8

factorize is used to encode the column's values as an "enumerated type": each distinct value is assigned an integer code.

idx, col = pd.factorize(df['Col'])
# idx = array([0, 1, 1, 0], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
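
To make the encoding concrete, here is a minimal sketch showing that the codes returned by factorize are simply positions into the array of unique values. The names keys, codes, uniques and rebuilt are illustrative and do not come from the original answer.

```python
import pandas as pd

# The lookup keys from the example DataFrame's 'Col' column
keys = pd.Series(['B', 'A', 'A', 'B'])

# codes: one integer per row; uniques: distinct values in order of first appearance
codes, uniques = pd.factorize(keys)

# Taking the uniques at the positions given by codes rebuilds the original keys
rebuilt = uniques.take(codes)

print(codes.tolist())   # [0, 1, 1, 0]
print(list(uniques))    # ['B', 'A']
print(list(rebuilt))    # ['B', 'A', 'A', 'B']
```

Because the codes are positions into uniques, reordering the DataFrame's columns to match uniques makes each code a valid column position.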

Notice that B corresponds to 0 and A corresponds to 1. reindex is used to ensure that the columns appear in the same order as the enumeration:

df.reindex(columns=col)

   B  A  # B appears first (location 0), A appears second (location 1)
0  5  1
2  6  2
8  7  3
9  8  4

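We need to create an appropriate range indexer that is compatible with NumPy indexing.
The standard approach is to use np.arange based on the length of the DataFrame: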

np.arange(len(df))

[0 1 2 3]

Now NumPy indexing can be used to select the values from the DataFrame:

df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

[5 2 3 8]
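
The selection step relies on NumPy's pairwise integer-array indexing, which can be seen in isolation with plain arrays. This is a small sketch; values, rows, cols, picked and block are illustrative names, and the array is typed in by hand to mirror the reindexed DataFrame above.

```python
import numpy as np

# Same numbers as df.reindex(columns=col).to_numpy(): columns ordered B, A
values = np.array([[5, 1],
                   [6, 2],
                   [7, 3],
                   [8, 4]])

rows = np.arange(len(values))    # [0, 1, 2, 3], one position per row
cols = np.array([0, 1, 1, 0])    # the codes produced by factorize

# Two integer arrays are paired element-wise: (0,0), (1,1), (2,1), (3,0)
picked = values[rows, cols]
print(picked.tolist())           # [5, 2, 3, 8]

# A slice for the rows would NOT pair up; it selects whole columns instead
block = values[:, cols]
print(block.shape)               # (4, 4)
```

This is why an explicit row indexer is needed: only two integer arrays of equal length yield one value per row.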

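*Note: this approach works regardless of the index type, because the row indexer is built from positions rather than index labels.

MultiIndex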

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

    Col  A  B  Val
C E   B  1  5    5
  F   A  2  6    2
D E   A  3  7    3
  F   B  4  8    8
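
Since the same two lines are used for every index type, they can be wrapped in a small function. The following is a sketch only: lookup_by_header is a hypothetical helper name, not a pandas API, and it simply packages the factorize/reindex steps shown above.

```python
import numpy as np
import pandas as pd


def lookup_by_header(df, key_col):
    """Hypothetical helper: for each row, return the value found in the
    column whose name is stored in df[key_col]."""
    idx, col = pd.factorize(df[key_col])
    values = df.reindex(columns=col).to_numpy()
    # Positional row indexer, so the index labels never matter
    return values[np.arange(len(df)), idx]


data = {'Col': ['B', 'A', 'A', 'B'],
        'A': [1, 2, 3, 4],
        'B': [5, 6, 7, 8]}

frames = {
    'default range index': pd.DataFrame(data),
    'non-contiguous index': pd.DataFrame(data, index=[0, 2, 8, 9]),
    'MultiIndex': pd.DataFrame(
        data, index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']])),
}

results = {name: lookup_by_header(frame, 'Col').tolist()
           for name, frame in frames.items()}
for name, vals in results.items():
    print(name, vals)   # each line ends with [5, 2, 3, 8]
```

The result can be assigned back with df['Val'] = lookup_by_header(df, 'Col'), since the returned array has one element per row in row order.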

Why use np.arange and not df.index directly?
Standard contiguous range index

import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

In this case only, there is no error, because the result of np.arange is the same as df.index.
df:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   B  4  8    8

Non-contiguous range index error
The following raises an IndexError:

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: index 8 is out of bounds for axis 0 with size 4

MultiIndex error

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=pd.MultiIndex.from_product([['C', 'D'], ['E', 'F']]))

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

Raises IndexError:

df['Val'] = df.reindex(columns=col).to_numpy()[df.index, idx]

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
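
The underlying reason is that to_numpy() discards the index labels, so the resulting array only understands positions. Below is a minimal sketch contrasting the two indexers on the non-contiguous index; label_indexing_failed and positional are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'B'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]},
                  index=[0, 2, 8, 9])

idx, col = pd.factorize(df['Col'])
values = df.reindex(columns=col).to_numpy()   # plain 4x2 array, labels are gone

try:
    values[df.index, idx]       # labels 8 and 9 are not valid row positions
    label_indexing_failed = False
except IndexError:
    label_indexing_failed = True

# Positions 0..3 always exist, whatever the labels are
positional = values[np.arange(len(df)), idx]

print(label_indexing_failed)    # True
print(positional.tolist())      # [5, 2, 3, 8]
```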

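Lookup with a default for unmatched/not-found values
There are a few approaches.
First, let's look at what happens by default when a value does not correspond to any column: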

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
#   Col  A  B
# 0   B  1  5
# 1   A  2  6
# 2   A  3  7
# 3   C  4  8

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

  Col  A  B  Val
0   B  1  5  5.0
1   A  2  6  2.0
2   A  3  7  3.0
3   C  4  8  NaN  # NaN Represents the Missing Value in C

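If we look at why the NaN values are introduced, we find that when factorize goes through the column it enumerates every group present, regardless of whether it corresponds to a column or not.
For this reason, when we reindex the DataFrame we end up with the following result: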

idx, col = pd.factorize(df['Col'])
df.reindex(columns=col)

idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col)
   B  A   C
0  5  1 NaN
1  6  2 NaN
2  7  3 NaN
3  8  4 NaN  # Reindex adds the missing column with the Default `NaN`

If we want to specify a default value, we can pass the fill_value argument of reindex, which lets us change the behaviour for missing columns:

idx, col = pd.factorize(df['Col'])
df.reindex(columns=col, fill_value=0)

idx = array([0, 1, 1, 2], dtype=int64)
col = Index(['B', 'A', 'C'], dtype='object')
df.reindex(columns=col, fill_value=0)
   B  A  C
0  5  1  0
1  6  2  0
2  7  3  0
3  8  4  0  # Notice reindex adds missing column with specified value `0`

This means that we can do:

idx, col = pd.factorize(df['Col'])
df['Val'] = df.reindex(
    columns=col, 
    fill_value=0  # Default value for Missing column values
).to_numpy()[np.arange(len(df)), idx]

df:

  Col  A  B  Val
0   B  1  5    5
1   A  2  6    2
2   A  3  7    3
3   C  4  8    0

*Notice that the dtype of the column is int: since NaN was never introduced, the column type did not change.
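
The dtype difference can be checked directly. This is a small sketch comparing the two variants on the same data; with_nan and with_default are illustrative names, and only the kind of dtype (float versus integer) is inspected because the exact integer width can vary by platform.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', 'C'],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])
rows = np.arange(len(df))

# Missing column 'C' is filled with NaN, which forces a float array
with_nan = df.reindex(columns=col).to_numpy()[rows, idx]

# Missing column 'C' is filled with 0, so every column stays integer
with_default = df.reindex(columns=col, fill_value=0).to_numpy()[rows, idx]

print(with_nan.dtype.kind, with_default.dtype.kind)   # f i
print(with_default.tolist())                          # [5, 2, 3, 0]
```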

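Lookup with missing values in the lookup Col
factorize has a default na_sentinel=-1, meaning that when NaN values appear in the column being factorized, the resulting idx value is -1. (In pandas 2.0 and newer the na_sentinel parameter has been replaced by use_na_sentinel=True, which keeps the same -1 default.)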

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})
#    Col  A  B
# 0    B  1  5
# 1    A  2  6
# 2    A  3  7
# 3  NaN  4  8  # <- Missing Lookup Key

idx, col = pd.factorize(df['Col'])
# idx = array([ 0,  1,  1, -1], dtype=int64)
# col = Index(['B', 'A'], dtype='object')
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]
#    Col  A  B  Val
# 0    B  1  5    5
# 1    A  2  6    2
# 2    A  3  7    3
# 3  NaN  4  8    4 <- Value From A

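This -1 means that, by default, we pull from the last column of the reindexed values, because NumPy treats -1 as "last position". Notice that col still only contains the values B and A, so we end up with the value from A in Val for the last row.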
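
The wrap-around is ordinary NumPy negative indexing and can be reproduced without pandas. In this short sketch the array is typed in by hand to mirror the reindexed columns B, A; values and last_row_value are illustrative names.

```python
import numpy as np

# Columns ordered B, A, as produced by df.reindex(columns=col)
values = np.array([[5, 1],
                   [6, 2],
                   [7, 3],
                   [8, 4]])

# Column -1 is the last column (A), so row 3 silently yields A's value
last_row_value = values[3, -1]

print(last_row_value)                   # 4
print(last_row_value == values[3, 1])   # True
```

No error is raised, which is why the missing key goes unnoticed unless it is handled explicitly.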
The easiest way to handle this is to fillna Col with some value that cannot be found among the column headers.
Here I use the empty string '':

idx, col = pd.factorize(df['Col'].fillna(''))
# idx = array([0, 1, 1, 2], dtype=int64)
# col = Index(['B', 'A', ''], dtype='object')

Now, when I reindex, the '' column will contain NaN values, meaning the lookup produces the desired result:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'].fillna(''))
df['Val'] = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

df:

   Col  A  B  Val
0    B  1  5  5.0
1    A  2  6  2.0
2    A  3  7  3.0
3  NaN  4  8  NaN  # Missing as expected
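
If no placeholder can be guaranteed to be absent from the column headers, one possible alternative is to mask the rows where factorize returned -1. This variation is not part of the original answer; it is a sketch that assumes NaN is the desired result for missing keys, and picked and val are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col': ['B', 'A', 'A', np.nan],
                   'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8]})

idx, col = pd.factorize(df['Col'])   # the missing key is encoded as -1
picked = df.reindex(columns=col).to_numpy()[np.arange(len(df)), idx]

# Overwrite the values that came from the wrap-around at idx == -1
val = np.where(idx == -1, np.nan, picked)
df['Val'] = val

print(val[:3].tolist())   # [5.0, 2.0, 3.0]
print(np.isnan(val[3]))   # True
```

The trade-off is one extra pass over the data in exchange for not having to pick a sentinel string.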

This article on "python - Look up values by corresponding column header in Pandas 1.2.0 or newer" is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/69352472/
