python - How do I select the row with the longest sentence in a specific column and combine the results into a new DataFrame in Python?

Reposted. Author: 行者123. Updated: 2023-12-04 10:06:34

The dataset I am working with looks like this. It is a video captioning dataset, with the captions stored in the "Description" column.

Video_ID       Description
mv89psg6zh4 A bird is bathing in a sink.
mv89psg6zh4 A faucet is running while a bird stands and is taking bath under it.
mv89psg6zh4 A bird gets washed.
mv89psg6zh4 A parakeet is taking a shower in a sink.
mv89psg6zh4 The bird is taking a bath under the faucet.
mv89psg6zh4 A bird is standing in a sink drinking water.
l7x8uIdg2XU A woman is pouring ingredients into a bowl and then eating it.
l7x8uIdg2XU A woman is adding milk to some pasta.
l7x8uIdg2XU A person adds ingredients to pasta.
l7x8uIdg2XU the girls are doing the cooking.

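However, the number of captions varies from video to video.

For each unique Video_ID, I want to extract the one row with the longest "Description" (that is, the highest word count) and combine these rows into a new DataFrame.

The result I want should look like this:

Required DataFrame: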
Video_ID       Description
mv89psg6zh4 A faucet is running while a bird stands and is taking bath under it.
l7x8uIdg2XU A woman is pouring ingredients into a bowl and then eating it.

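In effect, rows are taken from the existing DataFrame to form a new one that contains the longest sentence for each video in the original dataset.

I tried the following code: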
s = df.index.to_series().groupby(df['Video_ID']).apply(lambda x: len(x['Description']).max())

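But this does not seem to work. Can you suggest the correct approach?

Best answer

Use Series.str.len to get the lengths, then SeriesGroupBy.idxmax to get the index label of the maximum value in each group, and finally select those rows with DataFrame.loc: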

df1 = df.loc[df['Description'].str.len().groupby(df['Video_ID'], sort=False).idxmax()]
print (df1)
Video_ID Description
1 mv89psg6zh4 A faucet is running while a bird stands and is...
6 l7x8uIdg2XU A woman is pouring ingredients into a bowl and...
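
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The rows are copied from the sample in the question; the names `rows`, `lengths`, and `longest_idx` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample rows copied from the question.
rows = [
    ("mv89psg6zh4", "A bird is bathing in a sink."),
    ("mv89psg6zh4", "A faucet is running while a bird stands and is taking bath under it."),
    ("mv89psg6zh4", "A bird gets washed."),
    ("mv89psg6zh4", "A parakeet is taking a shower in a sink."),
    ("mv89psg6zh4", "The bird is taking a bath under the faucet."),
    ("mv89psg6zh4", "A bird is standing in a sink drinking water."),
    ("l7x8uIdg2XU", "A woman is pouring ingredients into a bowl and then eating it."),
    ("l7x8uIdg2XU", "A woman is adding milk to some pasta."),
    ("l7x8uIdg2XU", "A person adds ingredients to pasta."),
    ("l7x8uIdg2XU", "the girls are doing the cooking."),
]
df = pd.DataFrame(rows, columns=["Video_ID", "Description"])

# Character length of every caption.
lengths = df["Description"].str.len()

# Index label of the longest caption within each Video_ID group;
# sort=False keeps the groups in order of first appearance.
longest_idx = lengths.groupby(df["Video_ID"], sort=False).idxmax()

# Pick those rows out of the original frame.
df1 = df.loc[longest_idx]
print(df1.to_string())
```

`to_string()` is used here only so that the full captions are printed instead of being truncated with "...".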

Details:
print (df['Description'].str.len())
0 28
1 68
2 19
3 40
4 43
5 44
6 62
7 37
8 35
9 32
Name: Description, dtype: int64

print (df['Description'].str.len().groupby(df['Video_ID'], sort=False).idxmax())
Video_ID
mv89psg6zh4 1
l7x8uIdg2XU 6
Name: Description, dtype: int64
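
Note that Series.str.len counts characters, while the question speaks of the maximum number of words. On the sample data both measures pick the same rows, but they can disagree. The following is a rough sketch of a word-count variant, assuming that words are simply whitespace-separated tokens; the small DataFrame and the names `word_counts`, `by_words`, and `by_chars` are made up for illustration.

```python
import pandas as pd

# Hypothetical data in which the longest caption by characters
# is not the longest caption by words.
df = pd.DataFrame({
    "Video_ID": ["v1", "v1", "v2", "v2"],
    "Description": [
        "A parakeet is showering.",   # 4 words, more characters
        "A bird is in a sink.",       # 6 words, fewer characters
        "A woman cooks pasta.",
        "Cooking.",
    ],
})

# Assumption: a "word" is any whitespace-separated token.
# str.split() yields a list per row, and str.len() then gives the list length.
word_counts = df["Description"].str.split().str.len()

# Longest caption per video measured in words.
by_words = df.loc[word_counts.groupby(df["Video_ID"], sort=False).idxmax()]

# Longest caption per video measured in characters, for comparison.
char_counts = df["Description"].str.len()
by_chars = df.loc[char_counts.groupby(df["Video_ID"], sort=False).idxmax()]

print(by_words)
print(by_chars)
```

Which measure is appropriate depends on what "longest" should mean for the captions; punctuation-aware tokenization would need a different splitting rule.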

To get the rows that were not selected, use Index.isin with the mask inverted by ~ in boolean indexing:
df2 = df[~df.index.isin(df1.index)]
print (df2)
Video_ID Description
0 mv89psg6zh4 A bird is bathing in a sink.
2 mv89psg6zh4 A bird gets washed.
3 mv89psg6zh4 A parakeet is taking a shower in a sink.
4 mv89psg6zh4 The bird is taking a bath under the faucet.
5 mv89psg6zh4 A bird is standing in a sink drinking water.
7 l7x8uIdg2XU A woman is adding milk to some pasta.
8 l7x8uIdg2XU A person adds ingredients to pasta.
9 l7x8uIdg2XU the girls are doing the cooking.
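
As a possible alternative for the remaining rows, DataFrame.drop with the selected index labels should give the same result, assuming the index labels are unique. This is only a sketch on a four-row subset of the sample data; `df2_drop` and `df2_isin` are illustrative names.

```python
import pandas as pd

# Four rows taken from the sample in the question.
df = pd.DataFrame({
    "Video_ID": ["mv89psg6zh4", "mv89psg6zh4", "l7x8uIdg2XU", "l7x8uIdg2XU"],
    "Description": [
        "A bird is bathing in a sink.",
        "A faucet is running while a bird stands and is taking bath under it.",
        "A woman is pouring ingredients into a bowl and then eating it.",
        "A woman is adding milk to some pasta.",
    ],
})

# Rows with the longest caption per video, as in the answer.
lengths = df["Description"].str.len()
df1 = df.loc[lengths.groupby(df["Video_ID"], sort=False).idxmax()]

# Remaining rows, once with drop and once with the inverted isin mask.
# Assumption: index labels are unique, otherwise drop removes every
# row that shares a selected label.
df2_drop = df.drop(df1.index)
df2_isin = df[~df.index.isin(df1.index)]

print(df2_drop)
```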

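EDIT: The solution above returns only one row per group, namely the first row with the maximum length. (It gives the same result here because each group in the sample data has only one row of maximum length.)

If you want to keep every row that ties for the maximum in its group, compare each length against the group maximum computed with GroupBy.transform: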
s = df['Description'].str.len()
mask = s.groupby(df['Video_ID'], sort=False).transform('max').eq(s)
df1 = df[mask]
print (df1)
Video_ID Description
1 mv89psg6zh4 A faucet is running while a bird stands and is...
6 l7x8uIdg2XU A woman is pouring ingredients into a bowl and...

df2 = df[~mask]
print (df2)
Video_ID Description
0 mv89psg6zh4 A bird is bathing in a sink.
2 mv89psg6zh4 A bird gets washed.
3 mv89psg6zh4 A parakeet is taking a shower in a sink.
4 mv89psg6zh4 The bird is taking a bath under the faucet.
5 mv89psg6zh4 A bird is standing in a sink drinking water.
7 l7x8uIdg2XU A woman is adding milk to some pasta.
8 l7x8uIdg2XU A person adds ingredients to pasta.
9 l7x8uIdg2XU the girls are doing the cooking.

Details:
print (s.groupby(df['Video_ID'], sort=False).transform('max'))
0 68
1 68
2 68
3 68
4 68
5 68
6 62
7 62
8 62
9 62
Name: Description, dtype: int64
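
Because the sample data contains no ties, the difference between the two approaches is not visible there. The sketch below uses made-up captions in which two rows of the same video share the maximum length; the data and the names `first_only` and `all_ties` are hypothetical.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 have the same length within video "v1".
df = pd.DataFrame({
    "Video_ID": ["v1", "v1", "v1", "v2"],
    "Description": [
        "A bird bathes.",
        "A bird drinks.",
        "A bird.",
        "A woman cooks.",
    ],
})

s = df["Description"].str.len()
grouped = s.groupby(df["Video_ID"], sort=False)

# idxmax keeps only the first row that reaches the group maximum.
first_only = df.loc[grouped.idxmax()]

# transform('max') broadcasts the group maximum back to every row,
# so comparing it with s keeps all rows that tie for the maximum.
mask = grouped.transform("max").eq(s)
all_ties = df[mask]

print(first_only)
print(all_ties)
```

With this data, `first_only` keeps one row for "v1" while `all_ties` keeps both tied rows.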

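Regarding "python - How do I select the row with the longest sentence in a specific column and combine the results into a new DataFrame in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61558247/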
