
How do I GroupShuffleSplit a parquet dataframe lazily?




I have a parquet dataset that looks like this (I'm using polars, but any dataframe library is fine):



import polars as pl

df = pl.DataFrame(
    {
        "match_id": [
            1, 1, 1,
            2, 2, 2, 2,
            3, 3, 3, 3,
        ],
        "team_id": [
            1, 2, 2,
            1, 1, 2, 2,
            1, 2, 2, 3,
        ],
        "player_name": [
            "kevin", "james", "kelly",
            "john", "jenny", "jim", "josh",
            "alice", "kevin", "lilly", "erica",
        ],
    }
)


I would like to group by match_id and do a train/test split such that 80% of the matches end up in the training set and the rest in the test set. So something like this:



group_df = df.group_by(["match_id"])
train, test = group_split(group_df, test_size=0.20)

I need a Python solution, preferably with Dask, pandas, or another dataframe library. Pandas doesn't support lazy evaluation, and the dataset is quite large, so using pandas seems out of the question. Dask, on the other hand, doesn't support any of the sklearn.model_selection splitters since it has no integer-based indexing support.



Ideally a simple GroupShuffleSplit working with dask is all I need. Is there any other library that supports this? If so, how do I do this with parquet in a lazy way?

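For reference, this is roughly the kind of split I mean; a minimal sketch with scikit-learn's GroupShuffleSplit on an eager pandas frame, which assumes the data fits in memory (exactly what I can't afford here):

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

pdf = df.to_pandas()  # the small example frame above; not feasible for the real dataset

gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42)
train_idx, test_idx = next(gss.split(pdf, groups=pdf["match_id"]))

train, test = pdf.iloc[train_idx], pdf.iloc[test_idx]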


More replies

Could you clarify what sort of output you want? GroupShuffleSplit returns indices - would you be happy with the actual rows?


What is the group you want to shuffle over? Please help us a bit by constructing a sample input dataframe with at least a few groups, and ideally also an output dataframe explaining what it could look like. It’s great if you add this as pl.DataFrame(…). If you have a dataframe already, you can generate the dict to pass to this by df.head().to_dict(as_series=False). Make sure you have more than one group though.


@TomNorway Added an example dataframe; I assume in Polars it would be a LazyFrame with streaming enabled. And yes, rows are fine as long as it works on low-RAM machines.

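For context, a lazy Polars setup like the one the comment refers to would look roughly like this (a minimal sketch; data.parquet is a hypothetical path):

import polars as pl

# Lazily scan the parquet file; nothing is loaded into memory yet
lf = pl.scan_parquet("data.parquet")

# Downstream operations stay lazy; collect(streaming=True) executes them in batches
n_matches = lf.select(pl.col("match_id").n_unique()).collect(streaming=True)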

Recommended answers

Maybe something like this will work for you.



However, it is not a perfect answer; it just tries to tackle the problem of the large data size.



In this solution, GroupShuffleSplit works on each partition of the data rather than on the whole dataset, and because match_id.unique() is computed per partition, the resulting train/test split may not be exactly 80/20.



Solution


import dask.dataframe as dd
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# df is a dask dataframe, e.g. df = dd.read_parquet('test.parquet')
train = []
test = []
gss = GroupShuffleSplit(n_splits=1, test_size=0.20, random_state=42) # Adjust random_state as needed

for i in range(df.npartitions):
    part = df.partitions[i]
    # unique match_ids present in this partition
    groups = part.match_id.unique().compute().to_numpy()
    # GroupShuffleSplit yields positional indices into `groups`
    train_idx, test_idx = next(gss.split(groups, groups=groups))
    train_ids, test_ids = groups[train_idx], groups[test_idx]
    train += [part[part.match_id.isin(train_ids)]]
    test += [part[part.match_id.isin(test_ids)]]


# now the `test` list holds dask dataframes;
# to fetch the data from them, just concat and compute

dd.concat(test).shape[0].compute()  # gives 282_111_648 in my case
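One way to materialize these splits without pulling them into memory is to write them straight back to parquet (a minimal sketch; the output paths are placeholders):

# dask writes partition by partition, so the splits never sit fully in memory
dd.concat(train).to_parquet("train_split.parquet")
dd.concat(test).to_parquet("test_split.parquet")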


Solution tested with this data


import polars as pl
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

pl.build_info().get('version')
# '0.19.2'

n_rows = 10**6  # 1_000_000 rows
df = pl.DataFrame([
    pl.Series('match_id', np.random.choice(range(10**3), size=n_rows)),  # 1_000 matches
    pl.Series('team_id', np.random.choice(range(10**2), size=n_rows)),   # 100 teams
    pl.Series('player_name', np.random.choice([
        "kevin", "james", "kelly",
        "john", "jenny", "jim", "josh",
        "alice", "kevin", "lilly", "erica",
    ], size=n_rows)),
]).lazy()
df = pl.concat([df] * 1_000)  # 1_000_000_000 rows
df.collect(streaming=True).write_parquet('test.parquet')  # ~5GB


import dask.dataframe as dd
from sklearn.model_selection import GroupShuffleSplit

ddf = dd.read_parquet('your_dataset.parquet')

def dask_group_shuffle_split(df, groups, test_size=0.2, random_state=None):
    # Get the unique group values; this is the only piece that gets materialized
    unique_groups = df[groups].drop_duplicates().compute().to_numpy()

    # Perform the GroupShuffleSplit over the unique groups, not over the rows
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=random_state)
    train_idx, test_idx = next(splitter.split(unique_groups, groups=unique_groups))

    # Map the positional indices back to group values
    train_groups = unique_groups[train_idx]
    test_groups = unique_groups[test_idx]

    # Filter the original (still lazy) DataFrame based on group membership
    train = df[df[groups].isin(train_groups)]
    test = df[df[groups].isin(test_groups)]

    return train, test, train_groups, test_groups

train, test, train_groups, test_groups = dask_group_shuffle_split(
    ddf, groups='match_id', test_size=0.2, random_state=42
)
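As a quick sanity check, the achieved test fraction can be inspected like this (a minimal sketch; note it triggers full row counts over the data):

# count rows in the full dataset and in the lazy test split
n_total = ddf.shape[0].compute()
n_test = test.shape[0].compute()
print(f"achieved test fraction: {n_test / n_total:.3f}")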


If your primary objective is to split your data while preserving groups and you want to work with a lazy computation engine, Dask is indeed a good choice.



import dask.dataframe as dd
import numpy as np
import pandas as pd

# For this example, I used the 'from_pandas' method (for my environment)
# In your actual use-case, you need to use the dd.read_parquet() method.

pdf = pd.DataFrame(
    {
        "match_id": [
            1, 1, 1,
            2, 2, 2, 2,
            3, 3, 3, 3,
        ],
        "team_id": [
            1, 2, 2,
            1, 1, 2, 2,
            1, 2, 2, 3,
        ],
        "player_name": [
            "kevin", "james", "kelly",
            "john", "jenny", "jim", "josh",
            "alice", "kevin", "lilly", "erica",
        ],
    }
)

ddf = dd.from_pandas(pdf, npartitions=2)

# Here you need the unique match_ids
unique_matches = ddf['match_id'].unique().compute()

# Here you need to shuffle the unique matches
shuffled_matches = np.random.permutation(unique_matches)

# Here you need to split indices for train-test
split_idx = int(0.8 * len(shuffled_matches))

train_matches = shuffled_matches[:split_idx]
test_matches = shuffled_matches[split_idx:]

# Then filter out the records based on match_id
train_df = ddf[ddf['match_id'].isin(train_matches)]
test_df = ddf[ddf['match_id'].isin(test_matches)]

# The above operations are still lazy. You can compute to get the actual dataframes.
train_df_computed = train_df.compute()
test_df_computed = test_df.compute()

print(train_df_computed)
print(test_df_computed)

My approach is manual and not as elegant as using GroupShuffleSplit, but it serves the purpose. If your data is already in a Parquet file, you can use dd.read_parquet() to read it directly into a Dask DataFrame.
The .compute() method is what triggers the actual computation in Dask; before that, everything is just a lazy operation.

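Concretely, that swap would look roughly like this (a minimal sketch; the file path is a placeholder):

import dask.dataframe as dd

# replace the dd.from_pandas(...) call above with a lazy parquet read
ddf = dd.read_parquet("path/to/your/file.parquet")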



If you want to use the GroupBy approach only, here's a way you can do that.



import math
import pandas as pd

def group_split(grouped_data, test_size=0.2):
    ngroups = grouped_data.ngroups
    train_size = ngroups - math.ceil(ngroups * test_size)

    group_names = list(grouped_data.groups.keys())
    train_data = pd.concat((grouped_data.get_group(group_id) for group_id in group_names[:train_size]), ignore_index=True)
    test_data = pd.concat((grouped_data.get_group(group_id) for group_id in group_names[train_size:]), ignore_index=True)
    return train_data, test_data

Sample output:



group_split(grouped, 0.2)

(   match_id  team_id player_name
 0         1        1       kevin
 1         1        2       james
 2         1        2       kelly
 3         2        1        john
 4         2        1       jenny
 5         2        2         jim
 6         2        2        josh,
    match_id  team_id player_name
 0         3        1       alice
 1         3        2       kevin
 2         3        2       lilly
 3         3        3       erica)

You can add shuffling and other tweaks as well by working with the group_names variable; that's not included here for brevity, but see the sketch below.

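For example, the shuffling could look roughly like this (a minimal sketch that varies the group_split function above; the seed parameter is a hypothetical addition for reproducibility):

import math
import random

import pandas as pd

def group_shuffle_split(grouped_data, test_size=0.2, seed=None):
    ngroups = grouped_data.ngroups
    train_size = ngroups - math.ceil(ngroups * test_size)

    group_names = list(grouped_data.groups.keys())
    random.Random(seed).shuffle(group_names)  # shuffle the group order before slicing
    train_data = pd.concat((grouped_data.get_group(g) for g in group_names[:train_size]), ignore_index=True)
    test_data = pd.concat((grouped_data.get_group(g) for g in group_names[train_size:]), ignore_index=True)
    return train_data, test_data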



You can use dask.dataframe to perform the grouping and splitting operation on your Parquet dataset. Here's an example code snippet that should accomplish what you described:



import dask.dataframe as dd
from sklearn.model_selection import GroupShuffleSplit

# Load your Parquet file into a dask dataframe (stays lazy)
dd_df = dd.read_parquet('path/to/your/file.parquet')

# Split the unique match_ids into train and test groups
match_ids = dd_df['match_id'].drop_duplicates().compute().to_numpy()
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(match_ids, groups=match_ids))

# Filter rows by match_id; the resulting train and test dataframes are still lazy dask dataframes
train = dd_df[dd_df['match_id'].isin(match_ids[train_idx])]
test = dd_df[dd_df['match_id'].isin(match_ids[test_idx])]

print(train.head())
print(test.head())

This code uses dask.dataframe to load your Parquet file lazily and perform the grouping and splitting operation. Scikit-learn's GroupShuffleSplit class is applied to the unique match_id values to split them into training and testing groups, and the rows are then filtered by group membership. The random_state parameter is set to 42 to ensure reproducibility.
The resulting train and test dataframes will be dask dataframes, which you can then use for further processing or modeling tasks. Note that since you're working with a large dataset, it's advisable to work with dask dataframes instead of pandas dataframes to avoid memory constraints.



More replies

"out of the question to use pandas" from OP

