
python - Extracting items from a sparse matrix

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:01:20

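I am working with a series of text corpora, for which I need to build a co-occurrence matrix. I am still writing and testing my code, so I get a different matrix on every run (because list(set()) is unordered). I have built a sparse matrix with scipy.sparse.coo_matrix() and would like to work with the coordinates and values that this construction produces. I assume that would be the fastest and most memory-efficient approach. When I try to access the values, I see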

[<1x16 sparse matrix of type '<class 'numpy.float32'>'
with 10 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 4 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'
with 7 stored elements in Compressed Sparse Row format>, <1x16 sparse matrix of type '<class 'numpy.float32'>'

When I print the sparse matrix I get the following:

(0, 1) 0.5
(0, 4) 1.0
(0, 6) 0.5
(1, 7) 1.0
(1, 11) 1.0
(1, 12) 1.0
(1, 13) 0.5
(2, 14) 0.5
...
(15, 6) 1.0
(15, 9) 0.5
(15, 15) 3.0
(15, 0) 2.0
(15, 1) 0.5
(15, 6) 0.5
(15, 14) 1.5

I would like to retrieve these values as they appear, if possible.

For the example above, I extracted the following instances:

row = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 
4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8,
9, 9, 9, 9, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13,
13, 13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15,
15, 15, 15, 15, 15, 15, 15]

column = [1, 4, 6, 7, 11, 12, 13, 14, 15, 0, 4, 9, 12, 13, 14, 15, 4, 5, 12, 13,
4, 9, 13, 14, 0, 1, 2, 3, 5, 8, 10, 12, 13, 14, 2, 4, 12, 13, 0, 14,
15, 0, 8, 11, 13, 4, 7, 10, 11, 1, 3, 12, 14, 4, 8, 11, 13, 0, 7, 8,
10, 0, 1, 2, 4, 5, 9, 13, 0, 1, 2, 3, 4, 5, 7, 10, 12, 0, 1, 3, 4, 6,
9, 15, 0, 1, 6, 14]

values = [0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5,
1.0, 0.5, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0, 0.5,
0.5, 1.0, 0.5, 0.5, 1.0, 1.0, 1.5, 2.0, 1.0, 2.5, 1.0, 3.0, 1.0, 0.5,
1.5, 2.0, 1.0, 1.0, 2.0, 0.5, 1.0, 0.5, 2.0, 2.0, 0.5, 4.0, 0.5, 0.5,
0.5, 1.0, 1.0, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 0.5, 0.5, 2.5, 1.0,
4.0, 1.0, 1.0, 1.5, 1.0, 1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.5, 3.0,
2.0, 0.5, 0.5, 1.5]

import numpy as np
from scipy import sparse

sps_array = sparse.coo_matrix((values, (row, column)), shape=(16, 16))

At the moment I can convert sps_array with sps_array.toarray() and then create lists with

list1 = list(np.nonzero(sps_array > 0)[0])
list2 = list(np.nonzero(sps_array > 0)[1])

and then use the following for loop to rebuild the coordinates

index = 0
sps_coordinates = []

for i in range(token_size):
    for j in range(list1_count[i]):
        sps_coordinates.append((list1[index+j], list2[index+j]))
    index += list1_count[i]

I retrieve the values with

list(sps_array[sps_array > 0])

Is there a more efficient way to get the coordinates and values than what I am doing?

Best answer

By copy-pasting, I constructed your sps_array:

In [2126]: sps_array
Out[2126]:
<16x16 sparse matrix of type '<class 'numpy.float64'>'
with 88 stored elements in COOrdinate format>

The coo format stores its values in 3 attributes, each of which is an array (derived from the 3 input lists):

In [2127]: sps_array.data
Out[2127]:
array([ 0.5, 1. , 0.5, 1. , 1. , 1. , 0.5, 0.5, 1. , 1. , 0.5,
0.5, 1. , 0.5, 1. , 0.5, 1. , 0.5, 1. , 0.5, 0.5, 1. ,
0.5, 1. , 1. , 1. , 1. , 0.5, 0.5, 1. , 0.5, 0.5, 1. ,
1. , 1.5, 2. , 1. , 2.5, 1. , 3. , 1. , 0.5, 1.5, 2. ,
1. , 1. , 2. , 0.5, 1. , 0.5, 2. , 2. , 0.5, 4. , 0.5,
0.5, 0.5, 1. , 1. , 0.5, 0.5, 1. , 0.5, 1. , 1. , 0.5,
0.5, 0.5, 2.5, 1. , 4. , 1. , 1. , 1.5, 1. , 1. , 1. ,
0.5, 1. , 0.5, 1. , 1. , 0.5, 3. , 2. , 0.5, 0.5, 1.5])
In [2128]: sps_array.row
Out[2128]:
array([ 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 7, 7, 7, 7, 8, 8, 8, 8, 9, 9, 9, 9, 10,
10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13,
13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15], dtype=int32)
In [2129]: sps_array.col
Out[2129]:
array([ 1, 4, 6, 7, 11, 12, 13, 14, 15, 0, 4, 9, 12, 13, 14, 15, 4,
5, 12, 13, 4, 9, 13, 14, 0, 1, 2, 3, 5, 8, 10, 12, 13, 14,
2, 4, 12, 13, 0, 14, 15, 0, 8, 11, 13, 4, 7, 10, 11, 1, 3,
12, 14, 4, 8, 11, 13, 0, 7, 8, 10, 0, 1, 2, 4, 5, 9, 13,
0, 1, 2, 3, 4, 5, 7, 10, 12, 0, 1, 3, 4, 6, 9, 15, 0,
1, 6, 14], dtype=int32)
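
As a minimal sketch of reading coordinates and values straight from those three attributes, the snippet below uses a small made-up matrix in place of the 16x16 one from the question (the numbers and the 3x5 shape are illustrative assumptions, not the asker's data):

```python
import numpy as np
from scipy import sparse

# Small stand-in for the matrix in the question (values are made up).
row = [0, 0, 1, 2]
col = [1, 4, 2, 0]
values = [0.5, 1.0, 1.0, 2.5]
sps_array = sparse.coo_matrix((values, (row, col)), shape=(3, 5))

# Coordinates and values come directly from the stored arrays,
# so no dense toarray() conversion is involved.
coords = list(zip(sps_array.row.tolist(), sps_array.col.tolist()))
vals = sps_array.data.tolist()

for (r, c), v in zip(coords, vals):
    print((r, c), v)
```

The three arrays are aligned by position, so coords[k] and vals[k] describe the same stored entry. If the same (row, col) pair was passed more than once, coo keeps each entry separately until sps_array.sum_duplicates() is called.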

A sparse matrix has a nonzero method, whose code is:

A = self.tocoo()
nz_mask = A.data != 0
return (A.row[nz_mask], A.col[nz_mask])

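It makes sure the matrix is in coo format, makes sure there are no "hidden" zeros in the data, and returns the row and col attributes (filtered by that mask).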

If your matrix is already in coo format this step isn't needed, but it is needed if the matrix is in csr format.

So you don't need to go through the dense toarray and the np.nonzero function. np.nonzero(sps_array) does work, though, because it delegates the task to sps_array.nonzero().
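
To illustrate the "hidden" zero point, here is a small sketch with made-up data in which one zero is stored explicitly. The variable names and values are assumptions for illustration only:

```python
import numpy as np
from scipy import sparse

# Made-up data with one explicitly stored zero at (0, 2).
row = [0, 0, 1, 2]
col = [1, 2, 0, 2]
values = [0.5, 0.0, 1.0, 2.0]
coo = sparse.coo_matrix((values, (row, col)), shape=(3, 3))

stored = coo.nnz        # counts stored entries, including the explicit zero
r, c = coo.nonzero()    # the stored zero is masked out here

# The same call works on csr; it converts to coo internally.
csr = coo.tocsr()
r_csr, c_csr = csr.nonzero()

print(stored, len(r), len(r_csr))
```

Reading coo.row and coo.col directly would still include the (0, 2) entry, which is why nonzero() applies the data != 0 mask first.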

Applying transpose to the nonzero result gives an array that may be what you want:

In [2136]: np.transpose(np.nonzero(sps_array))
Out[2136]:
array([[ 0, 1],
[ 0, 4],
[ 0, 6],
[ 1, 7],
[ 1, 11],
[ 1, 12],
....

In fact, there is an np function that does this (for any array); look at its code or docs:

np.argwhere(sps_array)

(You don't need to use nonzero(sps_array>0), unless you are worried about negative values.)
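
If negative values are a concern, one possible approach (a sketch, not something taken from the answer) is to mask the stored data directly instead of comparing the whole matrix with 0. The data below is made up and includes one negative entry:

```python
import numpy as np
from scipy import sparse

# Made-up data that includes a negative value at (1, 0).
row = [0, 1, 2]
col = [1, 0, 2]
values = [0.5, -1.0, 2.0]
m = sparse.coo_matrix((values, (row, col)), shape=(3, 3))

# All nonzero coordinates as an (n, 2) array, one (row, col) pair per line.
all_coords = np.transpose(m.nonzero())

# Only strictly positive entries, selected with a mask on the stored data.
mask = m.data > 0
pos_coords = np.column_stack((m.row[mask], m.col[mask]))
pos_values = m.data[mask]

print(all_coords)
print(pos_coords, pos_values)
```

This relies on the matrix being in coo format; for a csr matrix, calling tocoo() first would give the same row, col and data attributes.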

Regarding python - Extracting items from a sparse matrix, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40313886/
