
python - Split the text in a cell and create additional rows for the tokens

Reposted. Author: 太空宇宙. Updated: 2023-11-04 11:19:00

Suppose I have the following in a pandas DataFrame:

id  text
1 I am the first document and I am very happy.
2 Here is the second document and it likes playing tennis.
3 This is the third document and it looks very good today.

I want to split the text for each id into tokens of 3 words, so in the end I want the following:

id  text
1 I am the
1 first document and
1 I am very
1 happy
2 Here is the
2 second document and
2 it likes playing
2 tennis
3 This is the
3 third document and
3 it looks very
3 good today

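Keep in mind that my DataFrame may have other columns besides these two, and they should simply be copied to the new DataFrame in the same way as id above.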

What is the most efficient way to do this?

I think the solution to my problem is quite close to the one given here: Tokenise text and create more rows for each row in dataframe.

This may also help: Python: Split String every n word in smaller Strings.

Best answer

You can use something like this:

def divide_chunks(l, n):
    # Step through l in strides of n and yield each n-sized slice
    for i in range(0, len(l), n):
        yield l[i:i + n]
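
As a quick check of what the generator yields, here is a minimal sketch applied to the first sample sentence. The names `words` and `chunks` are only for illustration and are not part of the original answer.

```python
def divide_chunks(l, n):
    # Step through l in strides of n and yield each n-sized slice
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Split the sentence on whitespace, then regroup the words three at a time
words = "I am the first document and I am very happy.".split()
chunks = [" ".join(c) for c in divide_chunks(words, 3)]
print(chunks)
# ['I am the', 'first document and', 'I am very', 'happy.']
```

The last chunk is simply shorter when the word count is not a multiple of 3, and punctuation stays attached to its word because the split is on whitespace only.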

Then use unnesting (a user-defined helper function, not a pandas built-in):

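# Build a list of 3-word chunks for every row
df['text_new'] = df.text.apply(lambda x: list(divide_chunks(x.split(), 3)))
# One row per chunk; drop the original text column (keyword form works on current pandas)
df_new = unnesting(df, ['text_new']).drop(columns='text')
# Join each chunk's words back into a string
df_new.text_new = df_new.text_new.apply(' '.join)
print(df_new)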

              text_new  id
0 I am the 1
0 first document and 1
0 I am very 1
0 happy. 1
1 Here is the 2
1 second document and 2
1 it likes playing 2
1 tennis. 2
2 This is the 3
2 third document and 3
2 it looks very 3
2 good today. 3
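
The `unnesting` helper has to be defined separately before the code above will run. As an alternative sketch, assuming pandas 0.25 or later, the built-in `DataFrame.explode` can do the same expansion and copies every other column along with `id`. The `lang` column here is a hypothetical extra column added only to show that behaviour.

```python
import pandas as pd

def divide_chunks(l, n):
    # Step through l in strides of n and yield each n-sized slice
    for i in range(0, len(l), n):
        yield l[i:i + n]

df = pd.DataFrame({
    "id": [1, 2, 3],
    "text": [
        "I am the first document and I am very happy.",
        "Here is the second document and it likes playing tennis.",
        "This is the third document and it looks very good today.",
    ],
    "lang": ["en", "en", "en"],  # hypothetical extra column, not in the question
})

df_new = (
    # Replace each text with a list of 3-word strings
    df.assign(text=df["text"].apply(
        lambda s: [" ".join(c) for c in divide_chunks(s.split(), 3)]))
      # One row per list element; the other columns are repeated
      .explode("text")
      .reset_index(drop=True)
)
print(df_new)
```

This keeps the original column names and order, so no renaming is needed afterwards.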

Edit:

m = (pd.DataFrame(df.text.apply(lambda x: list(divide_chunks(x.split(), 3))).values.tolist())
     .unstack().sort_index(level=1).apply(' '.join).reset_index(level=1))
m.columns = df.columns
print(m)

   id                 text
0 0 I am the
1 0 first document and
2 0 I am very
3 0 happy.
0 1 Here is the
1 1 second document and
2 1 it likes playing
3 1 tennis.
0 2 This is the
1 2 third document and
2 2 it looks very
3 2 good today.
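
Note that the `id` column in the output above holds the row positions 0 to 2 rather than the original ids 1 to 3. The `' '.join` step would also fail on the `None` padding that appears if the rows produce different numbers of chunks. A possible variation on the same `unstack` idea, sketched under the assumption that `id` is unique per row, carries `id` in the index and drops the padding. The sample texts are made up so that the chunk counts differ.

```python
import pandas as pd

def divide_chunks(l, n):
    # Step through l in strides of n and yield each n-sized slice
    for i in range(0, len(l), n):
        yield l[i:i + n]

# Hypothetical data: the two rows yield 2 chunks and 1 chunk
df = pd.DataFrame({"id": [1, 2],
                   "text": ["I am the first document", "Short one"]})

# Series of chunk-string lists, indexed by the real id
chunks = df.set_index("id")["text"].apply(
    lambda s: [" ".join(c) for c in divide_chunks(s.split(), 3)])

# Wide frame: one column per chunk position, padded with None
wide = pd.DataFrame(chunks.tolist(), index=chunks.index)

m = (wide.unstack()                    # Series indexed by (chunk position, id)
         .dropna()                     # remove the None padding
         .sort_index(level=[1, 0])     # order by id, then chunk position
         .reset_index(level=0, drop=True)
         .rename("text")
         .reset_index())               # bring id back as a column
print(m)
```

Unlike the `explode` route, this version only keeps `id` and `text`, so any other columns would have to be merged back on `id`.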

Regarding python - Split the text in a cell and create additional rows for the tokens, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56395681/
