
python - Parallel permutations in a DataFrame (pandas or dask)


I need to compute all possible permutations of row-by-row differences, column by column, in a pandas DataFrame.

Using itertools permutations works, but for the problem size I actually need to solve it takes far too long. When I tried multiprocessing I got an error. Assuming that error can be fixed, is multiprocessing the best approach here, or does dask offer a way to handle the scale?
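For a rough sense of the scale involved, here is a back-of-the-envelope estimate (my own, assuming the ~100-row by ~5000-column frame used in the multiprocessing attempt below):

# Hypothetical sizing check, not part of the attempts below.
n_rows, n_cols = 99, 4999                # index 1..99, columns 1..4999
n_pairs = n_rows * (n_rows - 1)          # ordered pairs: 9702
print(n_pairs * n_cols * 8 / 1e6)        # ~388 MB of float64 output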

#My naive approach

import pandas as pd
import numpy as np
from itertools import permutations

columns = list(range(1,50))
index = list(range(1,10))
df = pd.DataFrame(index=index, columns=columns,
                  data=np.random.randn(len(index),len(columns)))
count_perm = list(permutations(df.index,2))

comparison_df = pd.DataFrame(columns=df.columns)

for a,b in permutations(df.index,2):
    comparison_df.loc['({} {})'.format(a,b)] = df.loc[a] - df.loc[b]

#My multiprocessing attempt

import pandas as pd
import numpy as np
from itertools import permutations
from multiprocessing.dummy import Pool as ThreadPool

columns = list(range(1,5000))
index = list(range(1,100))
df = pd.DataFrame(index=index, columns=columns,
                  data=np.random.randn(len(index),len(columns)))
count_perm = list(permutations(df.index,2))

pool = ThreadPool(4)  # Number of threads

comparison_df = pd.DataFrame(columns=df.columns)
aux_val = [(a, b) for a,b in permutations(df.index,2)]

def op(tupx):
    comparison_df.loc["('{}', '{}')".format(tupx[0],tupx[1])] = (df.loc[tupx[0]] - df.loc[tupx[1]])

pool.map(op, aux_val)

The error:

Traceback (most recent call last):
  File "<ipython-input-69-20c917ebefd7>", line 30, in <module>
    pool.map(op, aux_val)
  File "/home/justaguy/anaconda3/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/justaguy/anaconda3/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/home/justaguy/anaconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/justaguy/anaconda3/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-69-20c917ebefd7>", line 26, in op
    comparison_df.loc["('{}', '{}')".format(tupx[0],tupx[1])] = (df.loc[tupx[0]] - df.loc[tupx[1]])
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 190, in __setitem__
    self._setitem_with_indexer(indexer, value)
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py", line 451, in _setitem_with_indexer
    self.obj._data = self.obj.append(value)._data
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 6692, in append
    sort=sort)
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 229, in concat
    return op.get_result()
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 426, in get_result
    copy=self.copy)
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 2065, in concatenate_block_managers
    return BlockManager(blocks, axes)
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 114, in __init__
    self._verify_integrity()
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 311, in _verify_integrity
    construction_error(tot_items, block.shape[1:], self.axes)
  File "/home/justaguy/anaconda3/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 1691, in construction_error
    passed, implied))

ValueError: Shape of passed values is (604, 4999), indices imply (602, 4999)

Best Answer

As I suggested in a comment, you might consider using combinations instead of permutations; that halves the amount of computation. Disclaimer: my code computes differences between columns, not between index rows as in your example.
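A quick count makes the halving concrete (a minimal check, using the same 9-row index as the example frame below):

from itertools import permutations, combinations

n = 9                                          # rows in the example frame below
print(len(list(permutations(range(n), 2))))    # 72 ordered pairs
print(len(list(combinations(range(n), 2))))    # 36 unordered pairs -- half the work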

import pandas as pd
import numpy as np
from itertools import permutations, combinations
import os
import multiprocessing as mp

# generate data
columns = list(range(1,50))

## I don't think you should start index at 1
index = list(range(1,10))

df = pd.DataFrame(index=index,
                  columns=columns,
                  data=np.random.randn(len(index),len(columns)))

Single-threaded

%%timeit -n 10
df1 = pd.DataFrame()
for a,b in permutations(df.index,2):
    df1["{}-{}".format(a,b)] = df[a]-df[b]
# 37.1 ms ± 726 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df1 = pd.DataFrame()
for a,b in permutations(df.index,2):
    df1["{}-{}".format(a,b)] = df[a].values-df[b].values

df1.index = df1.index+1
# 25.6 ms ± 1.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Single-threaded - using combinations

%%timeit -n 10
df1 = pd.DataFrame()
for a,b in combinations(df.index,2):
    df1["{}-{}".format(a,b)] = df[a]-df[b]
# 18.6 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df1 = pd.DataFrame()
for a,b in combinations(df.index,2):
    df1["{}-{}".format(a,b)] = df[a].values-df[b].values

df1.index = df1.index+1
# 13.2 ms ± 819 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Multiprocessing

In this case it will not be any faster, but you might consider the pattern for other applications.

def parallelize(fun, vec, cores):
    with mp.Pool(cores) as p:
        res = p.map(fun, vec)
    return res

def fun(v):
    a,b = v
    cols = ["{}-{}".format(a,b)]
    df_out = pd.DataFrame(data=df[a].values-df[b].values,
                          columns=cols)
    return df_out

vec = [(a,b) for a,b in permutations(df.index,2)]
cores = os.cpu_count()

%%timeit -n 10
df1 = parallelize(fun, vec, cores)
df1 = pd.concat(df1, axis=1)
# 260 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
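If the real goal is scale rather than parallelism, a minimal vectorized sketch using NumPy broadcasting may help (an assumption on my part; it computes row differences as in the question, not the column differences used above):

import numpy as np
import pandas as pd
from itertools import permutations

# Same small frame as in the question's naive approach.
index = list(range(1,10))
columns = list(range(1,50))
df = pd.DataFrame(index=index, columns=columns,
                  data=np.random.randn(len(index),len(columns)))

vals = df.values                                # shape (n_rows, n_cols)
diffs = vals[:, None, :] - vals[None, :, :]     # shape (n_rows, n_rows, n_cols): every pairwise row difference

pairs = list(permutations(range(len(index)), 2))
a_idx, b_idx = map(list, zip(*pairs))
comparison_df = pd.DataFrame(diffs[a_idx, b_idx],
                             index=['({} {})'.format(index[a], index[b]) for a, b in pairs],
                             columns=df.columns)

Note that the intermediate (n_rows, n_rows, n_cols) array is roughly 390 MB of float64 for the 99 x 4999 case, so at that size it would likely need to be built in chunks (or with a chunked-array library such as dask.array).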

Regarding python - Parallel permutations in a DataFrame (pandas or dask), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56229659/
