
python - How can I handle huge matrices?

Reposted. Author: 太空狗. Updated: 2023-10-30 01:06:51

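I am doing topic detection with supervised learning. However, my matrices are very large (202180 x 15000), and I cannot fit them into the models I want to use. Most of the matrix entries are zeros. Only logistic regression works. Is there a way to keep using the same matrices but make them work with the models I want? For example, could I create my matrices in a different way?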

Here is my code:

import numpy as np
import subprocess
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression

from sklearn import metrics

def run(command):
    output = subprocess.check_output(command, shell=True)
    return output

Load the vocabulary

f = open('/Users/win/Documents/wholedata/RightVo.txt','r')
vocab_temp = f.read().split()
f.close()
col = len(vocab_temp)
print("Training column size:")
print(col)

Create the training matrix

row = run('cat '+'/Users/win/Documents/wholedata/X_tr.txt'+" | wc -l").split()[0]
print("Training row size:")
print(row)
matrix_tmp = np.zeros((int(row),col), dtype=np.int64)
print("Train Matrix size:")
print(matrix_tmp.size)

label_tmp = np.zeros((int(row)), dtype=np.int64)
f = open('/Users/win/Documents/wholedata/X_tr.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_temp:
            continue
        matrix_tmp[count][vocab_temp.index(word)] = 1
    count = count + 1
f.close()
print("Train matrix is:\n ")
print(matrix_tmp)
print(label_tmp)
print("Train Label size:")
print(len(label_tmp))

f = open('/Users/win/Documents/wholedata/RightVo.txt','r')
vocab_tmp = f.read().split()
f.close()
col = len(vocab_tmp)
print("Test column size:")
print(col)

Create the test matrix

row = run('cat '+'/Users/win/Documents/wholedata/X_te.txt'+" | wc -l").split()[0]
print("Test row size:")
print(row)
matrix_tmp_test = np.zeros((int(row),col), dtype=np.int64)
print("Test matrix size:")
print(matrix_tmp_test.size)

label_tmp_test = np.zeros((int(row)), dtype=np.int64)

f = open('/Users/win/Documents/wholedata/X_te.txt','r')
count = 0
for line in f:
    line_tmp = line.split()
    #print(line_tmp)
    for word in line_tmp[0:]:
        if word not in vocab_tmp:
            continue
        matrix_tmp_test[count][vocab_tmp.index(word)] = 1
    count = count + 1
f.close()
print("Test Matrix is: \n")
print(matrix_tmp_test)
print(label_tmp_test)

print("Test Label Size:")
print(len(label_tmp_test))

# Test labels (despite the variable name, these are read from Y_te.txt)
xtrain = []
with open("/Users/win/Documents/wholedata/Y_te.txt") as filer:
    for line in filer:
        xtrain.append(line.strip().split())
xtrain = np.ravel(xtrain)
label_tmp_test = xtrain

# Training labels, read from Y_tr.txt
ytrain = []
with open("/Users/win/Documents/wholedata/Y_tr.txt") as filer:
    for line in filer:
        ytrain.append(line.strip().split())
ytrain = np.ravel(ytrain)
label_tmp = ytrain

Load the supervised model

model = LogisticRegression()
model = model.fit(matrix_tmp, label_tmp)
#print(model)
print("Entered 1")
y_train_pred = model.predict(matrix_tmp_test)
print("Entered 2")
print(metrics.accuracy_score(label_tmp_test, y_train_pred))

Best answer

You can use a dedicated data structure from the scipy package called a sparse matrix: http://docs.scipy.org/doc/scipy/reference/sparse.html
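As a rough illustration of how this might apply to the bag-of-words setup in the question, the sketch below collects only the positions of the non-zero entries and hands them to `scipy.sparse.csr_matrix`. The helper `build_sparse_matrix` and the toy vocabulary, documents, and labels are assumptions made for the example; they stand in for the contents of RightVo.txt, X_tr.txt, and Y_tr.txt. It also swaps the `vocab_temp.index(word)` lookup for a dictionary, which behaves the same only if the vocabulary has no duplicate words.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression


def build_sparse_matrix(lines, vocab):
    """Hypothetical helper: binary document-term matrix in CSR format."""
    # Dictionary lookup replaces the linear scan done by list.index().
    word_to_col = {word: j for j, word in enumerate(vocab)}
    rows, cols = [], []
    for i, line in enumerate(lines):
        # set() removes repeated words; csr_matrix would otherwise sum
        # duplicate (row, col) entries instead of keeping a 1.
        for word in set(line.split()):
            j = word_to_col.get(word)
            if j is not None:  # skip words that are not in the vocabulary
                rows.append(i)
                cols.append(j)
    data = np.ones(len(rows), dtype=np.int8)
    return csr_matrix((data, (rows, cols)), shape=(len(lines), len(vocab)))


# Toy data (assumed for illustration only).
vocab = ["cat", "dog", "fish", "bird"]
docs = ["cat dog cat", "fish", "bird unknown dog", "fish fish cat"]
labels = ["a", "b", "a", "b"]

X = build_sparse_matrix(docs, vocab)
model = LogisticRegression().fit(X, labels)  # sparse input is accepted
pred = model.predict(X)
```

Many scikit-learn estimators, including `LogisticRegression` and `SGDClassifier`, accept scipy sparse input directly, so the dense `np.zeros` array never has to be allocated. Whether a particular model supports sparse input should be checked in its documentation.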

According to the definition:

A sparse matrix is simply a matrix with a large number of zero values. In contrast, a matrix where many or most entries are non-zero is said to be dense. There are no strict rules for what constitutes a sparse matrix, so we'll say that a matrix is sparse if there is some benefit to exploiting its sparsity. Additionally, there are a variety of sparse matrix formats which are designed to exploit different sparsity patterns (the structure of non-zero values in a sparse matrix) and different methods for accessing and manipulating matrix entries.
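To make the point about formats and memory concrete: a dense `int64` array of shape 202180 x 15000 needs 202180 * 15000 * 8 bytes, roughly 24 GB, regardless of how many entries are zero. The sketch below uses a much smaller stand-in shape and an assumed density of five non-zero entries per row (both chosen only for illustration). It fills a `lil_matrix`, a format suited to incremental assignment, then converts it to CSR, which is generally the better format for arithmetic and for passing to estimators.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Small stand-in for the 202180 x 15000 matrix in the question.
n_rows, n_cols = 1000, 500

# LIL format supports cheap element-by-element assignment while building.
X_lil = lil_matrix((n_rows, n_cols), dtype=np.int8)
rng = np.random.default_rng(0)
for i in range(n_rows):
    # Assumed sparsity pattern: five distinct non-zero columns per row.
    for j in rng.choice(n_cols, size=5, replace=False):
        X_lil[i, j] = 1

# Convert once, after construction, to CSR for computation.
X_csr = X_lil.tocsr()

# Memory a dense int64 array of the same shape would need.
dense_bytes = n_rows * n_cols * np.dtype(np.int64).itemsize
# Memory held by the three arrays that make up a CSR matrix.
sparse_bytes = X_csr.data.nbytes + X_csr.indices.nbytes + X_csr.indptr.nbytes

print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

The CSR footprint grows with the number of non-zero entries rather than with rows times columns, which is where the benefit of exploiting sparsity comes from.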

Regarding "python - How can I handle huge matrices?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34824782/
