python - Efficiently construct sparse biadjacency matrix in Numpy

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:11:59

I'm trying to load this CSV file into a sparse numpy matrix that represents the biadjacency matrix of this user-to-subreddit bipartite graph: http://figshare.com/articles/reddit_user_posting_behavior/874101

Here is a sample:

603,politics,trees,pics
604,Metal,AskReddit,tattoos,redditguild,WTF,cocktails,pics,funny,gaming,Fitness,mcservers,TeraOnline,GetMotivated,itookapicture,Paleo,trackers,Minecraft,gainit
605,politics,IAmA,AdviceAnimals,movies,smallbusiness,Republican,todayilearned,AskReddit,WTF,IWantOut,pics,funny,DIY,Frugal,relationships,atheism,Jeep,Music,grandrapids,reddit.com,videos,yoga,GetMotivated,bestof,ShitRedditSays,gifs,technology,aww

There are 876,961 lines (one per user), 15,122 subreddits, and 8,495,597 user-to-subreddit associations in total.

Here is the code I have right now, which takes 20 minutes to run on my MacBook Pro:

import numpy as np
from scipy.sparse import csr_matrix

row_list = []
entry_count = 0
all_reddits = set()
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for x in f:
        pieces = x.rstrip().split(",")
        user = pieces[0]
        reddits = pieces[1:]
        entry_count += len(reddits)
        for r in reddits: all_reddits.add(r)
        row_list.append(np.array(reddits))

reddits_list = np.array(list(all_reddits))

# 5s to get this far

rows = np.zeros((entry_count,))
cols = np.zeros((entry_count,))
data = np.ones((entry_count,))
i=0
user_idx = 0
for row in row_list:
    for reddit_idx in np.nonzero(np.in1d(reddits_list,row))[0]:
        cols[i] = user_idx
        rows[i] = reddit_idx
        i+=1
    user_idx+=1
adj = csr_matrix( (data,(rows,cols)), shape=(len(reddits_list), len(row_list)) )

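It seems hard to believe that this is as fast as it can go. Loading the 82MB file into a list of lists takes 5 seconds, but building the sparse matrix takes 200 times as long. What can I do to speed this up? Is there a file format that I could convert this CSV into in less than 20 minutes and that would import more quickly? Is there some obviously expensive operation I'm doing here? I've tried building a dense matrix, and I've tried creating a lil_matrix and a dok_matrix and assigning the 1s one at a time, and that's no faster.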
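On the file-format question, one possibility is to build the matrix once and cache it in SciPy's own `.npz` format with `scipy.sparse.save_npz`, then reload it with `scipy.sparse.load_npz` on later runs. This is only a sketch and assumes SciPy 0.19 or later. The tiny matrix and the cache path are made up for illustration and are not part of the original code.

```python
import os
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Tiny stand-in for the real matrix (rows: subreddits, columns: users).
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 0, 0, 1])
adj = csr_matrix((np.ones(4), (rows, cols)), shape=(3, 2))

# Hypothetical cache location; any writable path ending in .npz would do.
cache_path = os.path.join(tempfile.mkdtemp(), "reddit_biadjacency.npz")

save_npz(cache_path, adj)        # write once, after the slow CSV parse
restored = load_npz(cache_path)  # later runs can start from here
```

This does not make the first parse any faster. It only avoids repeating it.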

Best answer

Couldn't sleep, so I tried one last thing... In the end I was able to get it down to 10 seconds this way:

import numpy as np
from scipy.sparse import csr_matrix

user_ids = []
subreddit_ids = []
subreddits = {}
i=0
with open("reddit_user_posting_behavior.csv", 'r') as f:
    for line in f:
        for sr in line.rstrip().split(",")[1:]:
            # give each subreddit the next free index the first time it appears
            if sr not in subreddits:
                subreddits[sr] = len(subreddits)
            user_ids.append(i)
            subreddit_ids.append(subreddits[sr])
        i+=1

adj = csr_matrix(
    ( np.ones((len(user_ids),)), (np.array(subreddit_ids),np.array(user_ids)) ),
    shape=(len(subreddits), i) )
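The same idea can be sketched as a small function and checked on a couple of rows cut down from the question's sample. The name `build_biadjacency` and its signature are illustrative, not part of the answer above. The key point is that a dict lookup replaces the per-row `np.in1d` scan over all subreddits.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_biadjacency(lines):
    """Build a subreddit-by-user sparse matrix from CSV-style lines.

    `lines` is any iterable of strings shaped like "user,sub1,sub2,...".
    The function name and signature are illustrative only.
    """
    user_ids = []
    subreddit_ids = []
    subreddits = {}  # subreddit name -> row index, in order of first appearance
    n_users = 0
    for line in lines:
        for sr in line.rstrip().split(",")[1:]:
            # setdefault stores the next free index the first time a name is seen
            subreddit_ids.append(subreddits.setdefault(sr, len(subreddits)))
            user_ids.append(n_users)
        n_users += 1
    adj = csr_matrix(
        (np.ones(len(user_ids)), (np.array(subreddit_ids), np.array(user_ids))),
        shape=(len(subreddits), n_users),
    )
    return adj, subreddits


# Two rows shortened from the sample in the question.
sample = [
    "603,politics,trees,pics",
    "604,Metal,AskReddit,pics",
]
adj, subreddits = build_biadjacency(sample)
```

Returning the `subreddits` dict alongside the matrix makes it possible to map row indices back to subreddit names later.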

Regarding python - Efficiently construct sparse biadjacency matrix in Numpy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/27160867/
