
python - Replacing values in a 2D numpy array based on a pandas DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-03 11:25:44

>>> arr
array([[ 0., 10.,  0., ...,  0.,  0.,  0.],
       [ 0.,  4.,  0., ...,  6.,  0.,  9.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  2.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  3.,  0.]])

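In the numpy array above, I want to replace every value that matches an entry in the country_codes column of a DataFrame (df_A) with the corresponding value from the continent_codes column of df_A. df_A looks like: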

   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5

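Right now I loop over the DataFrame and do the replacement with numpy indexing notation. Given that iterrows() tends to be slow, is there a more direct/vectorized way to do this?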

for index, row in self.df_A.iterrows():
    arr[arr == row['country_codes']] = row['continent_codes']
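For reference, the loop can be run as a self-contained snippet. This is only a minimal sketch: the small `arr` and the standalone `df_A` (used here instead of `self.df_A`) are made-up stand-ins, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the asker's array and DataFrame
arr = np.array([[0., 10., 0., 24.],
                [0., 4., 8., 9.],
                [16., 0., 12., 0.]])
df_A = pd.DataFrame({'country_codes': [4, 8, 12, 16, 24],
                     'continent_codes': [4, 3, 5, 6, 5]})

# Each iteration scans the whole array once for a single country code
for index, row in df_A.iterrows():
    arr[arr == row['country_codes']] = row['continent_codes']

print(arr)
```

Because the loop updates arr in place, a value written by an earlier row would be matched again by a later row if a continent code happened to equal a later country code. That does not occur with this data, but it is worth keeping in mind when comparing the loop against a one-pass vectorized replacement.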

Best answer

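Approach #1: A vectorized approach using np.searchsorted and np.in1d is listed below. Note that np.searchsorted requires the country_codes column to be sorted in ascending order, as it is in the sample data -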

# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])

# Mask of elements to be changed
mask = np.in1d(arr,oldval)

# Indices for each match from oldval in arr
idx = np.searchsorted(oldval,arr.ravel()[mask])

# Using the mask put selective elements from continent_codes column into arr
arr.ravel()[mask] = newval[idx]
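The same idea can be wrapped in a function that also copes with unsorted codes. This is a sketch under a few assumptions that are not part of the answer: the name `replace_with_lookup` is hypothetical, the argsort step is added so that np.searchsorted always sees sorted keys, np.isin is used in place of np.in1d (NumPy's documentation recommends it for new code, and it returns a mask with the shape of arr), and the function returns a copy rather than updating arr in place.

```python
import numpy as np

def replace_with_lookup(arr, oldval, newval):
    """Return a copy of arr with each oldval[i] replaced by newval[i]."""
    oldval = np.asarray(oldval)
    newval = np.asarray(newval)

    # np.searchsorted needs sorted keys, so reorder both columns together
    order = np.argsort(oldval)
    oldval, newval = oldval[order], newval[order]

    out = arr.copy()
    # Boolean mask, same shape as arr, of the elements found in oldval
    mask = np.isin(out, oldval)
    # Position of each matching element within the sorted oldval
    idx = np.searchsorted(oldval, out[mask])
    out[mask] = newval[idx]
    return out

arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11],
                [16, 24, 15, 4, 10],
                [4, 16, 10, 8, 1]])
# Same code pairs as the sample DataFrame, deliberately shuffled
result = replace_with_lookup(arr, [16, 4, 24, 8, 12], [6, 4, 5, 3, 5])
print(result)
```

Indexing with a same-shape boolean mask also sidesteps a subtlety of `arr.ravel()[mask] = ...`: ravel() returns a view only when it can, so on a non-contiguous array the assignment would land in a temporary copy instead of arr.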

Sample run -

>>> arr   # Original 2D array
array([[23,  4, 23,  5,  8],
       [ 3,  6,  8,  5, 11],
       [16, 24, 15,  4, 10],
       [ 4, 16, 10,  8,  1]])
>>> df
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5

>>> oldval = np.array(df['country_codes'])
>>> newval = np.array(df['continent_codes'])
>>> mask = np.in1d(arr,oldval)
>>> idx = np.searchsorted(oldval,arr.ravel()[mask])
>>> arr.ravel()[mask] = newval[idx]

>>> mask.reshape(arr.shape) # Mask array depicting which elements were updated
array([[False,  True, False, False,  True],
       [False, False,  True, False, False],
       [ True,  True, False,  True, False],
       [ True,  True, False,  True, False]], dtype=bool)
>>> arr # Updated 2D array
array([[23,  4, 23,  5,  3],
       [ 3,  6,  3,  5, 11],
       [ 6,  5, 15,  4, 10],
       [ 4,  6, 10,  3,  1]])

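Approach #2: As a variant, you could also find the matches by comparing np.searchsorted(oldval,arr,'left') with np.searchsorted(oldval,arr,'right'), as described in the solution to this question, and then reuse np.searchsorted(oldval,arr,'left') to put the values into arr for a more efficient solution, like so -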

# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])

# Left and right indices for each match from oldval in arr
left_idx = np.searchsorted(oldval,arr,'left')
right_idx = np.searchsorted(oldval,arr,'right')

# Mask of elements to be changed
mask = left_idx!=right_idx

# Using the mask put selective elements from continent_codes column into arr
arr[mask] = newval[left_idx[mask]]
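To see why comparing the two insertion points identifies the matches, here is a minimal sketch on the first two rows of the sample array, with plain arrays standing in for the DataFrame columns. A value that is present in the sorted oldval has its 'left' insertion point before the match and its 'right' insertion point after it, so the two differ; for an absent value both searches return the same position.

```python
import numpy as np

oldval = np.array([4, 8, 12, 16, 24])   # must be sorted in ascending order
newval = np.array([4, 3, 5, 6, 5])
arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11]])

left_idx = np.searchsorted(oldval, arr, side='left')
right_idx = np.searchsorted(oldval, arr, side='right')
print(left_idx)
print(right_idx)

# The insertion points differ only where the element occurs in oldval
mask = left_idx != right_idx
print(mask)

# For matches, left_idx is exactly the position of the element in oldval
arr[mask] = newval[left_idx[mask]]
print(arr)
```

Elements larger than every code get an insertion point equal to len(oldval) from both searches, so they are masked out and never used to index newval.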

Runtime tests and output verification

Function definitions -

def original_app(arr,df):
    for index, row in df.iterrows():
        arr[arr == row['country_codes']] = row['continent_codes']

def vectorized_app1(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    mask = np.in1d(arr,oldval)
    idx = np.searchsorted(oldval,arr.ravel()[mask])
    arr.ravel()[mask] = newval[idx]

def vectorized_app2(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    left_idx = np.searchsorted(oldval,arr,'left')
    right_idx = np.searchsorted(oldval,arr,'right')
    mask = left_idx!=right_idx
    arr[mask] = newval[left_idx[mask]]

Verify outputs -

In [195]: # Input array
...: arr = np.random.randint(0,100000,(1000,1000))
...:
...: # Setup input dataframe
...: N = 1000
...: oldvals = np.unique(np.random.randint(0,100000,N))
...: newvals = np.random.randint(0,9,(oldvals.size))
...: df=pd.DataFrame({'country_codes':oldvals,'continent_codes':newvals})
...: df = df.reindex(columns=sorted(df.columns)[::-1])  # reindex_axis is no longer available in newer pandas versions
...:
...: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:

In [196]: # Verify outputs
...: original_app(arrc1,df)
...: vectorized_app1(arrc2,df)
...: vectorized_app2(arrc3,df)
...:

In [197]: np.allclose(arrc1,arrc2)
Out[197]: True

In [198]: np.allclose(arrc1,arrc3)
Out[198]: True
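The same check can be reproduced outside IPython. The following is a small sketch, not the answer's exact setup: it uses plain arrays instead of a DataFrame, a loop over zip() as a stand-in for iterrows(), hypothetical function names, and np.array_equal (an exact comparison, which suits integer data). The country codes are drawn from 10 upward so that no replacement value (0 to 8) is itself a code; otherwise the in-place loop could replace an already-replaced value a second time and legitimately disagree with the one-pass version.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 1000, size=(50, 50))
# np.unique returns sorted, de-duplicated codes, as np.searchsorted needs
oldvals = np.unique(rng.integers(10, 1000, size=100))
newvals = rng.integers(0, 9, size=oldvals.size)

def loop_replace(arr, oldvals, newvals):
    # One full scan of arr per (old, new) pair, updating arr in place
    for old, new in zip(oldvals, newvals):
        arr[arr == old] = new

def searchsorted_replace(arr, oldvals, newvals):
    # Single pass: matches are where left and right insertion points differ
    left_idx = np.searchsorted(oldvals, arr, side='left')
    right_idx = np.searchsorted(oldvals, arr, side='right')
    mask = left_idx != right_idx
    arr[mask] = newvals[left_idx[mask]]

# Each function gets its own copy because both modify their input
a1, a2 = arr.copy(), arr.copy()
loop_replace(a1, oldvals, newvals)
searchsorted_replace(a2, oldvals, newvals)

same = np.array_equal(a1, a2)
print(same)
```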

Timings -

In [199]: # Make copies for input array for funcs to update them
...: arrc1 = arr.copy()
...: arrc2 = arr.copy()
...: arrc3 = arr.copy()
...:

In [200]: %timeit original_app(arrc1,df)
1 loops, best of 3: 2.79 s per loop

In [201]: %timeit vectorized_app1(arrc2,df)
1 loops, best of 3: 360 ms per loop

In [202]: %timeit vectorized_app2(arrc3,df)
1 loops, best of 3: 213 ms per loop
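Outside IPython, a comparable measurement can be taken with the standard timeit module. This is a rough sketch with assumed sizes and a hypothetical function name, so its numbers are not comparable to the timings above. One deliberate difference: the array is copied inside the timed call, so every run starts from unreplaced data, at the cost of including the copy in the measurement.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 100000, size=(200, 200))
oldvals = np.unique(rng.integers(0, 100000, size=1000))
newvals = rng.integers(0, 9, size=oldvals.size)

def searchsorted_replace(arr, oldvals, newvals):
    # Same logic as vectorized_app2, taking arrays instead of a DataFrame
    left_idx = np.searchsorted(oldvals, arr, side='left')
    right_idx = np.searchsorted(oldvals, arr, side='right')
    mask = left_idx != right_idx
    arr[mask] = newvals[left_idx[mask]]

number = 5
# repeat=3 gives three independent measurements of `number` calls each
elapsed = timeit.repeat(
    lambda: searchsorted_replace(arr.copy(), oldvals, newvals),
    repeat=3, number=number)
best = min(elapsed) / number
print(f"best of 3: {best * 1e3:.3f} ms per call")
```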

Regarding python - replacing values in a 2D numpy array based on a pandas DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34321025/
