
machine-learning - How to split a dataset into training and test sets using the hash-code method

Reposted. Author: 行者123. Updated: 2023-12-05 06:22:58

I am following the code in Hands-On Machine Learning with Scikit-Learn and TensorFlow, 2nd edition. In the section on creating the training and test sets, the authors build them as follows:

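from zlib import crc32

import numpy as np  # needed for np.int64 below

def test_set_check(identifier, test_ratio):
    # True when the 32-bit CRC of the id falls in the lowest `test_ratio` share of the hash range
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    # Rows whose id hashes below the threshold form the test set; the rest form the training set
    return data.loc[~in_test_set], data.loc[in_test_set]

# `housing` is the DataFrame loaded earlier in the book
housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")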

According to the author:

You can compute a hash of each instance's identifier and put that instance in the test set if the hash is lower than or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
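The consistency property described in the quote can be checked with a small sketch. It is only an illustration under stated assumptions: the helper name `in_test_set` and the id ranges are made up, and the ids are hashed through `int.to_bytes` as a NumPy-free stand-in for `np.int64` (which should yield the same 8 bytes on little-endian machines).

```python
from zlib import crc32


def in_test_set(identifier: int, test_ratio: float) -> bool:
    # Hypothetical helper: hash the 8-byte little-endian form of the id.
    # Assumption: this stands in for crc32(np.int64(identifier)) in the question.
    data = int(identifier).to_bytes(8, "little", signed=True)
    return (crc32(data) & 0xffffffff) < test_ratio * 2**32


old_ids = range(1000)   # the original dataset (made-up ids)
new_ids = range(1500)   # the same ids plus 500 newly added instances

old_test = {i for i in old_ids if in_test_set(i, 0.2)}
new_test = {i for i in new_ids if in_test_set(i, 0.2)}

# Each id is judged only by its own hash, so refreshing the dataset
# never moves an old instance from one set to the other.
old_part_of_new_test = {i for i in new_test if i < 1000}
print(old_part_of_new_test == old_test)
print(len(old_test), len(new_test))
```

Because membership depends only on the id itself, the split does not change with row order or dataset size, which is what a random shuffle with a fresh seed cannot guarantee.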

So I would like to understand what this line of code does: crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

Any help is much appreciated!

Best answer

This may be a bit late, but in case you are still looking for an answer, here is what the documentation for the crc32 function says:

Changed in version 3.0: Always returns an unsigned value. To generate the same numeric value across all Python versions and platforms, use crc32(data) & 0xffffffff.

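So, in essence, the & 0xffffffff mask is only there to make sure the function returns the same value no matter whether it is run under Python 2 or Python 3. The masked hash, which lies between 0 and 2**32 - 1, is then compared with test_ratio * 2**32, so an instance goes into the test set when its hash falls in the lowest test_ratio fraction of that range.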
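A short sketch can make the two parts of that expression concrete. It is not from the book: `to_signed32` is a hypothetical helper that only emulates the signed value older Python versions could return, and the id 42 is arbitrary.

```python
from zlib import crc32


def to_signed32(value: int) -> int:
    # Hypothetical helper: reinterpret an unsigned 32-bit value as signed,
    # imitating the signed results mentioned in the crc32 documentation.
    return value - 2**32 if value >= 2**31 else value


data = (42).to_bytes(8, "little", signed=True)  # arbitrary example id
h = crc32(data)

# On Python 3 the result is already unsigned, so the mask changes nothing.
mask_is_noop = (h & 0xffffffff) == h

# The mask maps a signed value back to the same unsigned number.
recovered = to_signed32(h) & 0xffffffff
minus_one_masked = -1 & 0xffffffff  # all 32 bits set

# `&` binds more tightly than `<`, so the original line needs no parentheses:
test_ratio = 0.2
threshold = test_ratio * 2**32
without_parens = h & 0xffffffff < threshold
with_parens = (h & 0xffffffff) < threshold

print(mask_is_noop, recovered == h, without_parens == with_parens)
```

In other words, the line reads as "(hash masked to 32 bits) is below test_ratio times the size of the hash range", and on Python 3 alone the mask could be dropped without changing the result.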

Regarding "machine-learning - How to split a dataset into training and test sets using the hash-code method", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58811081/
