
python - Maintaining a ratio when splitting data in a Python function

Reposted. Author: 太空狗. Last updated: 2023-10-29 18:07:17

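I have some data that I want to split into smaller groups that all keep the same ratio. I wrote a function that takes two arrays as input, calculates the size ratio, and then tells me the options for how many groups the data can be split into (if all groups are the same size). Here is the function: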

def cross_validation_group(train_data, test_data):
    import numpy as np
    from calculator import factors
    test_length = len(test_data)
    train_length = len(train_data)
    total_length = test_length + train_length
    ratio = test_length/float(total_length)
    possibilities = factors(total_length)
    print possibilities
    print possibilities[len(possibilities)-1] * ratio
    super_count = 0
    for i in possibilities:
        if i < len(possibilities)/2:
            pass
        else:
            attempt = float(i * ratio)
            if attempt.is_integer():
                print str(i) + " is an option for total size with " + str(attempt) + " as test size and " + str(i - attempt) + " as train size! This is with " + str(total_length/i) + " folds."
            else:
                pass
    folds = int(raw_input("So how many folds would you like to use? If no possibilities were given that would be sufficient, type 0: "))
    if folds != 0:
        total_size = total_length/folds
        test_size = float(total_size * ratio)
        train_size = total_size - test_size
        columns = train_data[0]
        columns= len(columns)
        groups = np.empty((folds,(test_size + train_size),columns))
        i = 0
        a = 0
        b = 0
        for j in range (0,folds):
            test_size_new = test_size * (j + 1)
            train_size_new = train_size * j
            total_size_new = (train_size + test_size) * (j + 1)
            cut_off = total_size_new - train_size
            p = 0
            while i < total_size_new:
                if i < cut_off:
                    groups[j,p] = test_data[a]
                    a += 1
                else:
                    groups[j,p] = train_data[b]
                    b += 1
                i += 1
                p += 1
        return groups
    else:
        print "This method cannot be used because the ratio cannot be maintained with equal group sizes other than for the options you were givens"

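So my question is: how can I give the function a third input, the number of folds, and change it so that instead of iterating to make sure every group has the same size at the right ratio, each group simply has the right ratio but the sizes may vary?

Addition for @JamesHolderness

Your method is almost perfect, but there is one problem:

With lengths 357 and 143 and 9 folds, this is the list returned:

[(39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]

Now, when you add up the columns, you get: 351 144

The 351 is fine because it is less than 357, but the 144 does not work because it is greater than 143! The reason is that 357 and 143 are the lengths of the arrays, so the 144th row of that array does not exist...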
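The column sums above can be reproduced with a short check. This is only an illustrative sketch; the names `groups`, `available`, `used`, and `overrun` are made up here and do not come from the original post.

```python
# Fold sizes reported above, one (first, second) pair per fold.
groups = [(39, 16)] * 9
# Lengths of the two input arrays, in the same order as the pairs.
available = (357, 143)

# Sum each column and compare it with the rows that actually exist.
used = tuple(sum(column) for column in zip(*groups))
overrun = tuple(max(0, u - a) for u, a in zip(used, available))

print(used)     # (351, 144)
print(overrun)  # (0, 1): the second column asks for one row too many
```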

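Best answer

Here is an algorithm that I think may work for you.

You take test_length and train_length and divide them by their GCD to get the ratio as a simple fraction. You take the numerator and the denominator and add them together, and that is the size factor for your groups.

For example, if the ratio is 3:2, the size of each group must be a multiple of 5.

You then divide total_length by the number of folds to get the ideal size for the first group, which may well be a floating-point number. You find the largest multiple of 5 that is less than or equal to that value, and that is your first group.

Subtract that value from the total, and divide by folds-1 to get the ideal size for the next group. Again find the largest multiple of 5, subtract it from the total, and continue until you have calculated all the groups.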
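To make these steps concrete, here is a minimal Python 3 sketch of just the greedy sizing step, using hypothetical numbers: 60 test rows and 40 train rows (a 3:2 ratio, so the size factor is 5) split into 3 folds. The function name `greedy_group_totals` is an illustration and is not part of the original answer.

```python
def greedy_group_totals(total_length, folds, multiple):
    """Return one total size per group, each a multiple of `multiple`."""
    totals = []
    remaining = total_length
    for folds_left in range(folds, 0, -1):
        ideal = remaining / folds_left            # usually not a whole number
        size = int(ideal // multiple) * multiple  # largest multiple <= ideal
        totals.append(size)
        remaining -= size
    return totals


totals = greedy_group_totals(100, 3, 5)
print(totals)  # [30, 35, 35]

# Split each total 3:2 into (test_size, train_size).
print([(t * 3 // 5, t * 2 // 5) for t in totals])
# [(18, 12), (21, 14), (21, 14)]
```

In this example every row is used. When the total is not a multiple of the size factor, a few rows are simply left over.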

Some example code:

from math import gcd  # Python 3.5+; on Python 2 use: from fractions import gcd

# test_length, train_length and folds are assumed to be defined already.
total_length = test_length + train_length
divisor = gcd(test_length, train_length)
test_multiple = test_length // divisor    # // keeps these as integers
train_multiple = train_length // divisor
total_multiple = test_multiple + train_multiple

# Adjust the ratio if there isn't enough data for the requested folds
if total_length // total_multiple < folds:
    total_multiple = total_length // folds
    test_multiple = int(round(float(test_length)*total_multiple/total_length))
    train_multiple = total_multiple - test_multiple

groups = []
for i in range(folds, 0, -1):
    float_size = float(total_length)/i
    # largest multiple of total_multiple that fits in the ideal size
    int_size = int(float_size/total_multiple)*total_multiple
    test_size = int_size*test_multiple // total_multiple
    train_size = int_size*train_multiple // total_multiple
    test_length -= test_size    # keep track of the test data used
    train_length -= train_size  # keep track of the train data used
    total_length -= int_size
    groups.append((test_size, train_size))

# If the test_length or train_length are negative, we need to adjust the groups
# to "give back" some of the data.
distribute_overrun(groups, test_length, 0)
distribute_overrun(groups, train_length, 1)

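This has been updated to keep track of how much of each set (test and train) has been used, without worrying if we use too much at first.

Then, at the end, if there is any overrun (i.e. test_length or train_length has gone negative), we distribute that overrun back into the groups by decrementing the appropriate side of the ratio in as many groups as it takes to bring the overrun back to zero.

The distribute_overrun function is included below.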

def distribute_overrun(groups,overrun,part):
    i = 0
    while overrun < 0:
        group = list(groups[i])
        group[part] -= 1
        groups[i] = tuple(group)
        overrun += 1
        i += 1

At the end, groups will be a list of tuples containing the test_size and train_size for each group.
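For reference, the pieces above can be folded into one self-contained function. The following is a Python 3 sketch rather than the answer's exact code: the name `split_sizes`, the `ValueError` guard, and the wrap-around index in the give-back loop are additions made here. Note also that `round()` resolves exact .5 ties differently in Python 2 and Python 3, which could change the adjusted ratio in rare cases.

```python
from math import gcd


def split_sizes(test_length, train_length, folds):
    """Return a list of (test_size, train_size) tuples, one per fold."""
    total_length = test_length + train_length
    if not 0 < folds <= total_length:  # guard added for this sketch
        raise ValueError("folds must be between 1 and the total row count")

    divisor = gcd(test_length, train_length)
    test_multiple = test_length // divisor
    train_multiple = train_length // divisor
    total_multiple = test_multiple + train_multiple

    # Relax the ratio if there isn't enough data for the requested folds.
    if total_length // total_multiple < folds:
        total_multiple = total_length // folds
        test_multiple = int(round(test_length * total_multiple / total_length))
        train_multiple = total_multiple - test_multiple

    groups = []
    for i in range(folds, 0, -1):
        # Largest multiple of total_multiple that fits in the ideal size.
        int_size = int(total_length / i / total_multiple) * total_multiple
        test_size = int_size * test_multiple // total_multiple
        train_size = int_size * train_multiple // total_multiple
        test_length -= test_size
        train_length -= train_size
        total_length -= int_size
        groups.append([test_size, train_size])

    # Give back over-allocated rows, one row per group at a time.
    for part, overrun in enumerate((test_length, train_length)):
        i = 0
        while overrun < 0:
            groups[i % folds][part] -= 1
            overrun += 1
            i += 1
    return [tuple(group) for group in groups]


print(split_sizes(357, 143, 9))
# [(39, 15), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16), (39, 16)]
```

With the lengths from the question, the second column now sums to exactly 143, so no group asks for a row that does not exist.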

If this sounds like the kind of thing you want, but you need me to expand on the code example, let me know.
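One possible way to consume such a list of sizes is to walk through both data sets with running offsets and slice out each fold. This is a hedged sketch and not part of the original answer: `build_folds` is a hypothetical helper, and plain lists stand in for the real arrays (the same slicing should also work on NumPy arrays).

```python
def build_folds(test_data, train_data, groups):
    """Slice both sequences into folds using (test_size, train_size) pairs."""
    folds = []
    test_start = train_start = 0
    for test_size, train_size in groups:
        test_end = test_start + test_size
        train_end = train_start + train_size
        folds.append((test_data[test_start:test_end],
                      train_data[train_start:train_end]))
        test_start, train_start = test_end, train_end
    return folds


# Toy data standing in for the real arrays.
test_rows = list(range(6))
train_rows = list(range(100, 104))
for test_part, train_part in build_folds(test_rows, train_rows, [(3, 2), (3, 2)]):
    print(test_part, train_part)
# [0, 1, 2] [100, 101]
# [3, 4, 5] [102, 103]
```

Each fold keeps the 3:2 ratio, and no index ever exceeds the length of its source as long as the sizes in each column sum to no more than that length.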

Regarding python - Maintaining a ratio when splitting data in a Python function, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16094099/
