python pandas/numpy: a fast way to replace all values according to a mapping scheme

Reposted. Author: 太空狗. Updated: 2023-10-30 02:56:41

Suppose I have a huge Pandas dataframe/numpy array in which each element is an ordered list of values:

sequences = np.array([[12431253, 123412531, 12341234, 12431253, 145345],
                      [5463456, 1244562, 23452],
                      [243524, 141234, 12431253, 456367],
                      [456345, 253451],
                      [75635, 14145, 12346, 12431253]], dtype=object)

Or,

sequences = pd.DataFrame({'sequence': [[12431253, 123412531, 12341234,12431253, 145345],
[5463456, 1244562, 23452],
[243524, 141234, 456367,12431253],
[456345, 253451],
[75635, 14145, 12346,12431253]]})

I want to replace them with another set of identifiers starting from 0, so I designed a mapping like this:

from itertools import chain
values = list(set(chain.from_iterable(sequences['sequence'])))  # distinct values across all rows
mapping = pd.DataFrame({'v0': values, 'v1': range(len(values))})

……

So the result I am looking for is:

sequences = np.array([[1, 2, 3, 1, 4], [5, 6, 7], [8, 9, 10, 1], [11, 12], [13, 14, 15, 1]], dtype=object)

How can I scale this to a huge dataframe/numpy array of sequences?

Thank you very much for any guidance!

Best answer

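This approach flattens everything into a 1D array, uses np.unique to assign a unique ID to each element, and then splits the result back into a list of arrays: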

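lens = np.array([len(s) for s in sequences])       # length of each sequence
seq_arr = np.concatenate(sequences)                # flatten into one 1D array
ids = np.unique(seq_arr, return_inverse=True)[1]   # 0-based ID for every element
out = np.split(ids, lens[:-1].cumsum())            # cut back into the original lengths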

Sample run:

In [391]: sequences = np.array([[12431253, 123412531, 12341234,12431253, 145345],
...: [5463456, 1244562, 23452],
...: [243524, 141234,12431253, 456367],
...: [456345, 12431253],
...: [75635, 14145, 12346,12431253]])

In [392]: out
Out[392]:
[array([12, 13, 11, 12, 5]),
array([10, 9, 2]),
array([ 6, 4, 12, 8]),
array([ 7, 12]),
array([ 3, 1, 0, 12])]

In [393]: np.array(map(list,out)) # If you need NumPy array as final o/p
Out[393]:
array([[12, 13, 11, 12, 5], [10, 9, 2], [6, 4, 12, 8], [7, 12],
[3, 1, 0, 12]], dtype=object)
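
The question also asked about a DataFrame column and a v0/v1 mapping table, so here is a minimal Python 3 sketch of the same flatten, np.unique, split steps applied to one. The helper name relabel_sequences and the new column name sequence_id are hypothetical, chosen only for illustration. Because np.unique returns the distinct values in sorted order, IDs follow that sorted order, and the array of distinct values doubles as the lookup table back to the original values.

```python
import numpy as np
import pandas as pd


def relabel_sequences(sequences):
    """Hypothetical helper: replace every value with a 0-based integer ID.

    Returns (relabelled, uniques): a list of ID arrays with the same
    lengths as the input sequences, and the sorted distinct values,
    where uniques[i] is the original value that received ID i.
    """
    lens = np.array([len(s) for s in sequences])
    flat = np.concatenate([np.asarray(s) for s in sequences])
    uniques, ids = np.unique(flat, return_inverse=True)
    relabelled = np.split(ids, lens[:-1].cumsum())
    return relabelled, uniques


df = pd.DataFrame({'sequence': [[12431253, 123412531, 12341234, 12431253, 145345],
                                [5463456, 1244562, 23452],
                                [243524, 141234, 12431253, 456367],
                                [456345, 12431253],
                                [75635, 14145, 12346, 12431253]]})

relabelled, uniques = relabel_sequences(df['sequence'])

# Store the IDs next to the original column as plain Python lists.
df['sequence_id'] = pd.Series([ids.tolist() for ids in relabelled], index=df.index)

# Rebuild a v0 -> v1 mapping table like the one in the question.
mapping = pd.DataFrame({'v0': uniques, 'v1': np.arange(len(uniques))})

# Indexing uniques with the IDs recovers the original values.
restored_first_row = uniques[relabelled[0]].tolist()
```

The work is done in a single sort inside np.unique rather than in a per-element Python lookup, which is what makes this approach attractive for large inputs. The sketch assumes at least one sequence and values that NumPy can sort together, such as all integers.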

For more on a fast way to replace all values according to a mapping scheme in python pandas/numpy, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/39773480/
