
arrays - What is an efficient way to split an array into training and test sets in Julia?

Reposted. Author: 行者123. Updated: 2023-11-30 08:54:46

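I am running a machine learning algorithm in Julia on a machine with limited spare memory, and I have noticed a fairly large bottleneck in the code I am using from a repository. Randomly splitting the array takes longer than reading the file from disk, which suggests the splitting code is inefficient. Any tricks to speed up this function would be greatly appreciated. The original function can be found here. Since it is a short function, I will also post it below.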

# Split a list of ratings into a training and test set, with at most
# target_percentage * length(ratings) in the test set. The property we want to
# preserve is: any user in some rating in the original set of ratings is also
# in the training set and any item in some rating in the original set of ratings
# is also in the training set. We preserve this property by iterating through
# the ratings in random order, adding a rating to the test set only if we
# haven't already hit target_percentage and we've already seen both the user
# and the item in some other ratings.
function split_ratings(ratings::Array{Rating,1},
                       target_percentage=0.10)
    seen_users = Set()
    seen_items = Set()
    training_set = (Rating)[]
    test_set = (Rating)[]
    shuffled = shuffle(ratings)
    for rating in shuffled
        if in(rating.user, seen_users) && in(rating.item, seen_items) && length(test_set) < target_percentage * length(shuffled)
            push!(test_set, rating)
        else
            push!(training_set, rating)
        end
        push!(seen_users, rating.user)
        push!(seen_items, rating.item)
    end
    return training_set, test_set
end

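As mentioned, I would appreciate any way to speed this up. I should also point out that I do not actually need to keep the ability to remove duplicates, although it would be a nice feature. Also, if this is already implemented in a Julia library, I would be glad to know about it. Bonus points for any solution that takes advantage of Julia's parallelism!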

Best answer

In terms of memory, this is the most efficient code I could come up with.

using Random  # shuffle! lives in the Random standard library since Julia 0.7

function splitratings(ratings::Array{Rating,1}, target_percentage=0.10)
    N = length(ratings)
    splitindex = round(Integer, target_percentage * N)
    shuffle!(ratings) # Shuffles in place, which avoids allocating another array.
    # Views instead of copies of the original array (`view` was called `sub` before Julia 0.5).
    return view(ratings, splitindex+1:N), view(ratings, 1:splitindex)
end
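
The two claims in the comments of that function (in-place shuffle, no copying) can be checked on a small array. This is a minimal sketch for Julia 1.x, not part of the original answer: the name `split_views` and the integer test data are hypothetical stand-ins for the `Rating` array.

```julia
using Random

# Hypothetical helper: shuffle `data` in place, then return (training, test)
# as views into the same storage. Assumes Julia 1.x.
function split_views(data::AbstractVector, target_percentage=0.10)
    n = length(data)
    splitindex = round(Int, target_percentage * n)
    shuffle!(data)  # reorders `data` itself, no second array is allocated
    return view(data, splitindex+1:n), view(data, 1:splitindex)
end

data = collect(1:20)
training, test = split_views(data, 0.25)
println(length(training), " training, ", length(test), " test")

# Writing through a view changes the parent array, so no copy was made.
test[1] = -1
println(data[1] == -1)
```

The trade-off is that the caller's array is reordered, and both views keep the full parent array alive for as long as either one is referenced.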

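However, Julia's extremely slow file IO is now the bottleneck. This algorithm takes about 20 seconds to run on an array with 170 million elements, so I would say it performs fairly well.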
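
If the property from the question still matters (every user and item in the test set also appears in the training set), one plausible contributor to the original slowdown is the untyped `Set()`, which stores its elements as `Any`. The following is a minimal sketch under assumptions, not a benchmarked fix: the `Rating` layout (integer `user` and `item` fields plus a `value` field) is hypothetical because the original type is not shown, and the name `split_ratings_typed` is made up for this example.

```julia
using Random

# Hypothetical stand-in for the question's Rating type (its real fields are not shown).
struct Rating
    user::Int
    item::Int
    value::Float32
end

# Same greedy rule as the question's split_ratings, with concretely typed
# containers and an in-place shuffle. Returns (training_set, test_set).
function split_ratings_typed(ratings::Vector{Rating}, target_percentage=0.10)
    n = length(ratings)
    max_test = target_percentage * n   # computed once instead of per iteration
    seen_users = Set{Int}()
    seen_items = Set{Int}()
    training_set = Rating[]
    test_set = Rating[]
    sizehint!(training_set, n)         # reserve capacity up front
    shuffle!(ratings)                  # note: reorders the caller's array
    for r in ratings
        if r.user in seen_users && r.item in seen_items && length(test_set) < max_test
            push!(test_set, r)
        else
            push!(training_set, r)
        end
        push!(seen_users, r.user)
        push!(seen_items, r.item)
    end
    return training_set, test_set
end

# Usage on synthetic data: 50 users x 20 items = 1000 ratings.
ratings = [Rating(u, i, 1.0f0) for u in 1:50 for i in 1:20]
training_set, test_set = split_ratings_typed(ratings, 0.10)
println(length(training_set), " training, ", length(test_set), " test")
```

Unlike the view-based version, this one copies each rating into a new array, so it uses more memory in exchange for keeping the user and item guarantee.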

Regarding "arrays - What is an efficient way to split an array into training and test sets in Julia?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37036757/
