
python-3.x - Reading a .txt file into a Pandas DataFrame with newlines as the delimiter

Reposted. Author: 行者123. Updated: 2023-12-04 07:44:11

I want to extract some data from a text file into a DataFrame.
The text file looks like this:

URL: http://www.nytimes.com/2016/06/30/sports/baseball/washington-nationals-max-scherzer-baffles-mets-completing-a-sweep.html

WASHINGTON — Stellar .... stretched thin.
“We were going t......e do anything.”
Wednesday’s ... starter.
“We’re n... work.”
The Mets did not scor....their 40-37 record.

URL: http://www.nytimes.com/2016/06/30/nyregion/mayor-de-blasios-counsel-to-leave-next-month-to-lead-police-review-board.html

Mayor Bill de .... Department.
The move.... April.
A civil ... conversations.
More... administration.

URL: http://www.nytimes.com/2016/06/30/nyregion/three-men-charged-in-killing-of-cuomo-administration-lawyer.html

In the early..., the Folk Nation.
As hundreds ... wounds.
For some...residents.
On Wednesd...killing.
One ...murder.

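It contains URLs and text from New York Times articles, and I want to create a 2-column DataFrame with the URL in the first column and the text in the second.
My problem is that I can't handle the delimiter: each URL is separated from its text by two newlines, but the text itself also contains single newlines.
I tried the code below, but instead of a 2-column DataFrame I got a single column with a new row for every newline, so the article text was also split into separate paragraphs. I'm using dask, by the way: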
df_csv = dd.read_csv(filename,sep="\n\n",header=None,engine='python')
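
One plausible explanation for the single-column result, stated here as an assumption rather than a confirmed diagnosis: a CSV-style reader consumes the input one line at a time, so a two-newline separator never appears inside any single record. The sketch below uses only plain Python and a made-up sample (the `example.com` URLs are placeholders) to contrast line-by-line reading with splitting the whole text at blank lines.

```python
import io

# Hypothetical sample in the same layout as the file in the question
sample = (
    "URL: http://example.com/a\n"
    "\n"
    "First paragraph.\n"
    "Second paragraph.\n"
    "\n"
    "URL: http://example.com/b\n"
    "\n"
    "Only paragraph.\n"
)

# Reading line by line: every physical line becomes its own record,
# and no record can ever contain the two-newline separator.
line_records = [line.rstrip("\n") for line in io.StringIO(sample)]
has_double_newline = any("\n\n" in line for line in io.StringIO(sample))

# Splitting the whole text at blank lines keeps each article body together.
blocks = sample.strip().split("\n\n")

print(len(line_records), has_double_newline, len(blocks))
```

Here the line-by-line view yields one record per physical line, while the whole-text split yields alternating URL and text blocks, which is the structure the desired DataFrame needs.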

Best answer

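import pandas as pd

# Read the whole file into one string
with open('ny.txt', encoding="utf8") as f:
    file = f.read()

url = []
text = []

# Split the text at every blank line (two consecutive newlines).
# Elements at even indices (0, 2, ...) are the 'URL: ...' lines;
# elements at odd indices (1, 3, ...) are the article text.
# strip() guards against a stray empty chunk from trailing blank lines.
for ind, line in enumerate(file.strip().split('\n\n')):
    if ind % 2 == 0:
        url.append(line)
    else:
        text.append(line)

# Save to a dataframe
df = pd.DataFrame({'url': url, 'text': text})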
df
url text
0 URL: http://www.nytimes.com/2016/06/30/sports/... WASHINGTON — Stellar .... stretched thin.\n“We...
1 URL: http://www.nytimes.com/2016/06/30/nyregio... Mayor Bill de .... Department.\nThe move.... A...
2 URL: http://www.nytimes.com/2016/06/30/nyregio... In the early..., the Folk Nation.\nAs hundreds...

# ADDITIONAL: remove the literal prefix 'URL: ' from each entry
# (regex=False makes the match a plain string match in every pandas version)
df['url'] = df['url'].str.replace('URL: ', '', regex=False)
df
url text
0 http://www.nytimes.com/2016/06/30/sports/baseb... WASHINGTON — Stellar .... stretched thin.\n“We...
1 http://www.nytimes.com/2016/06/30/nyregion/may... Mayor Bill de .... Department.\nThe move.... A...
2 http://www.nytimes.com/2016/06/30/nyregion/thr... In the early..., the Folk Nation.\nAs hundreds...
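
The even/odd split above relies on the file strictly alternating between URL blocks and text blocks. If an article body ever contains a blank line of its own, the two lists drift out of step. As a hedged alternative (not part of the original answer), the sketch below groups lines under the most recent line that starts with `URL: `. The function name `parse_articles` and the sample data are hypothetical, chosen only for illustration.

```python
import pandas as pd


def parse_articles(raw: str) -> pd.DataFrame:
    """Group every line under the most recent 'URL: ' line."""
    prefix = "URL: "
    records = []
    current_url = None
    current_lines = []

    def flush():
        # Store the article collected so far, if there is one
        if current_url is not None:
            body = "\n".join(current_lines).strip()
            records.append({"url": current_url, "text": body})

    for line in raw.splitlines():
        if line.startswith(prefix):
            flush()
            current_url = line[len(prefix):].strip()
            current_lines = []
        elif current_url is not None:
            current_lines.append(line)
    flush()

    return pd.DataFrame(records, columns=["url", "text"])


# Made-up sample; note the blank line inside the first article body
sample = (
    "URL: http://example.com/a\n\n"
    "First paragraph.\n\nSecond paragraph.\n\n"
    "URL: http://example.com/b\n\n"
    "Only paragraph.\n\n"
)
df_sample = parse_articles(sample)
print(df_sample)
```

With a real file, the same function could be called on the string returned by `open('ny.txt', encoding="utf8").read()`. This variant also removes the `URL: ` prefix while parsing, so the separate replace step is not needed.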

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/67280726/
