
python - Pandas: get a subset of the data without looping

Reposted. Author: 行者123. Updated: 2023-11-30 08:58:22

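I am trying to split my training data into train and test sets by customer_id (several rows in the dataframe can share the same customer_id). Can the "build df_test" and "drop from df_train" parts below be done without loops, in a more pandas-native way?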

import random

import numpy as np
import pandas as pd

# Split data for train / test split

df_train = pd.read_csv('data/train.csv')
print('df_train.shape', df_train.shape)

df_train = df_train.replace(np.nan, 'nan', regex=True)

train_customer_id_set = df_train.customer_id.unique()
print('len(train_customer_id_set)', len(train_customer_id_set))

#Split train data to train/test by customer_id
n = 1000
test_customer_id_set = list(train_customer_id_set)
random.shuffle(test_customer_id_set)
test_customer_id_set = test_customer_id_set[:n]

# Q: how to do this without a loop?

#build df_test
df_list = []
for customer_id in test_customer_id_set:
    df = df_train[df_train['customer_id'] == customer_id]
    df_list.append(df)
df_test = pd.concat(df_list)

#drop from df_train
for customer_id in test_customer_id_set:
    df_train = df_train.drop(df_train[df_train.customer_id == customer_id].index)

train_customer_id_set = df_train.customer_id.unique()

print('df_train.shape', df_train.shape)
print('df_test.shape', df_test.shape)

Best answer

Starting from the point where you compute test_customer_id_set, what you are doing appears to be equivalent to:

df_test = df_train[df_train.customer_id.isin(test_customer_id_set)]
df_train = df_train[~df_train.customer_id.isin(test_customer_id_set)]
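
As a self-contained illustration of the same idea, here is a minimal sketch on a toy dataframe. The data, the `amount` column, the helper name `split_by_customer` and its `seed` parameter are all hypothetical and not part of the original question; sampling the ids with `Series.sample` is a swapped-in alternative to the question's `random.shuffle`, chosen only to make the split reproducible.

```python
import pandas as pd

# Hypothetical toy data; the question reads its data from data/train.csv.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "amount": [10, 20, 30, 40, 50, 60, 70],
})


def split_by_customer(df, n_test_customers, seed=0):
    """Split df so that every row of a sampled customer goes to the test set."""
    # Pick n distinct customer ids (reproducible for a fixed seed).
    test_ids = (
        df["customer_id"]
        .drop_duplicates()
        .sample(n=n_test_customers, random_state=seed)
    )
    # A single boolean mask replaces both loops from the question.
    is_test = df["customer_id"].isin(test_ids)
    return df[~is_test], df[is_test]


df_train, df_test = split_by_customer(df, n_test_customers=2)
print("df_train.shape", df_train.shape)
print("df_test.shape", df_test.shape)
```

Computing the mask once and negating it with `~` keeps the two subsets complementary by construction, so no customer can end up in both.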

A similar question about "python - Pandas: get a subset of the data without looping" can be found on Stack Overflow: https://stackoverflow.com/questions/50258944/
