
python - How do I create a new DataFrame by tokenizing the contents of an existing DataFrame?

Reposted. Author: 行者123. Updated: 2023-12-01 07:34:29

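I am very new to python/pandas and need the community's help. Here is what I am trying to do.

I have read a JSON file that contains the following data:

  1. Content (of the article)
  2. ID (unique identifier)
  3. Title (of the article)

using this code: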

import pandas as pd
df = pd.read_json(path_to_file, lines=True)
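
For readers who want to reproduce the setup without the original file, here is a minimal sketch of what such input might look like. The field names (`id`, `title`, `content`) and the two records are assumptions made for illustration; the original file is not shown in the question. With `lines=True`, `pd.read_json` expects one JSON object per line.

```python
import io

import pandas as pd

# Hypothetical JSON Lines input: one JSON object per line.
# Field names and values are assumed, not taken from the original file.
raw = (
    '{"id": "A1", "title": "First", "content": "Hello there. How are you?"}\n'
    '{"id": "A2", "title": "Second", "content": "Just one sentence."}\n'
)

# A file-like object stands in for path_to_file here.
df = pd.read_json(io.StringIO(raw), lines=True)
print(df)
```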

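Desired output: I want to create a new DataFrame with two columns:

  1. ID (unique identifier)
  2. Sentence (the content column of df split into sentences)

What I have managed so far:

I found the sentence tokenizer in nltk and worked out how to pass it to the apply function: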

  result = df["content"].apply(sent_tokenize) 

My question is how to get the result into the desired format described above.

Best answer

Iterate over the DataFrame with itertuples:

import pandas as pd
from nltk.tokenize import sent_tokenize  # needs NLTK's sentence tokenizer data installed

df = pd.DataFrame([['hi how are you. i am fine. hope this help you', 'ABC']], columns=['sent', 'ID'])

df
                                            sent   ID
0  hi how are you. i am fine. hope this help you  ABC

new_sent = []
for row in df.itertuples():
    # row[0] is the index, row[1] the text, row[2] the ID
    for sent in sent_tokenize(row[1]):
        new_sent.append((sent, row[2]))

# create a DataFrame from new_sent
df_new = pd.DataFrame(new_sent, columns=['tokenized_sent', 'ID'])
# output

       tokenized_sent   ID
0     hi how are you.  ABC
1          i am fine.  ABC
2  hope this help you  ABC

Explanation

for row in df.itertuples():
    print(row)

# output
Pandas(Index=0, sent='hi how are you. i am fine. hope this help you', ID='ABC')

print(row[0])
0

print(row[1])
'hi how are you. i am fine. hope this help you'

print(row[2])
'ABC'
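
As a side note, the tuples produced by `itertuples` are namedtuples, so the same values can also be read by column name, which tends to be easier to follow than positional indexes. The sketch below reuses the sample DataFrame from this answer; it only works this way when the column names are valid Python identifiers.

```python
import pandas as pd

df = pd.DataFrame(
    [["hi how are you. i am fine. hope this help you", "ABC"]],
    columns=["sent", "ID"],
)

for row in df.itertuples():
    # Each field is available by name as well as by position.
    print(row.Index, row.sent, row.ID)

# Grab the first row to compare named and positional access.
first = next(df.itertuples())
print(first.sent == first[1], first.ID == first[2])
```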

Now we tokenize the second element (the text) and append each sentence, together with its ID, to new_sent:

new_sent = []
for sent in sent_tokenize(row[1]):
    new_sent.append((sent, row[2]))
    print((sent, row[2]))

# output
('hi how are you.', 'ABC')
('i am fine.', 'ABC')
('hope this help you', 'ABC')

# now create dataframe with this new_sent
df_new = pd.DataFrame(new_sent, columns = ['tokenized_sent', 'ID'])
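
An alternative that avoids the explicit loop is `DataFrame.explode` (available in pandas 0.25 and later), which turns each element of a list-valued column into its own row. This is a different technique from the answer above, offered only as a sketch. To keep it runnable without NLTK data, it uses a hypothetical regex-based `split_sentences` helper as a stand-in for `sent_tokenize`; in practice you would pass `sent_tokenize` to `apply` instead.

```python
import re

import pandas as pd


def split_sentences(text):
    # Hypothetical stand-in for nltk's sent_tokenize: split on whitespace
    # that follows a '.', '!' or '?'. Much cruder than a real tokenizer.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


df = pd.DataFrame(
    [["hi how are you. i am fine. hope this help you", "ABC"]],
    columns=["sent", "ID"],
)

df_new = (
    df.assign(tokenized_sent=df["sent"].apply(split_sentences))  # list of sentences per row
      .explode("tokenized_sent")                                 # one row per sentence
      .loc[:, ["tokenized_sent", "ID"]]                          # keep the two wanted columns
      .reset_index(drop=True)
)
print(df_new)
```

The ID is repeated for every sentence automatically, because `explode` copies the other columns of the row it expands.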

Regarding "python - How do I create a new DataFrame by tokenizing the contents of an existing DataFrame?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57052998/
