
python - Scraping news with BeautifulSoup


I'm trying to scrape news articles from a website. I'm only interested in articles whose <span class="news_headline"> contains the text "Transfers". From those articles I want to extract the text of the spans inside <div class="news_text">. The result should end up in a CSV file that looks like this:

R.Wolf; wechselt für 167.000 von Computer zu; Hauke  
Weiner; wechselt für 167.000 von Computer zu; Hauke
Gonther; wechselt für 770.000 von Computer zu; Hauke

3378; wechselt für 167.000 von Computer zu; 514102  
3605; wechselt für 167.000 von Computer zu; 514102
1197; wechselt für 770.000 von Computer zu; 514102

I'm very new to programming, so I hope someone can help.

<div class="single_news_container">
<div class="news_body_right single_news_content">
<div class="cont_news_headlines">
<span class="wrapper_news_headlines">
<span class="news_headline">Transfers</span>
</span>
</div>
<div class="news_text">
<div>
<p>
<span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
</p>
</div>
</div>
</div>
</div>

Best Answer

First, look at the nesting structure of the HTML. You'll see that the data you want is not contained in the div you mentioned; instead, everything is wrapped in <div class="news_body_right single_news_content">. So run find_all on that div, then loop over the results (find_all returns a list) and check whether the news headline inside each of those divs contains "Transfers". Only then extract the data, for example by filling an empty list, loading it into pandas, and saving it to CSV:

from bs4 import BeautifulSoup
import pandas as pd

html='''<div class="single_news_container">
<div class="news_body_right single_news_content">
<div class="cont_news_headlines">
<span class="wrapper_news_headlines">
<span class="news_headline">Transfers</span>
</span>
</div>
<div class="news_text">
<div>
<p>
<span><a href="/2.bundesliga/players/R. Wolf-3378">R. Wolf</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Weiner-3605">Weiner</a> wechselt für 167.000 von Computer zu <a href="/users/514102">Hauke</a></span>
</p>
<p>
<span><a href="/2.bundesliga/players/Gonther-1197">Gonther</a> wechselt für 770.000 von Computer zu <a href="/users/514096">Christoph</a></span>
</p>
</div>
</div>
</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

data = []

# find_all returns a list of all matching container divs
for news in soup.find_all("div", class_="news_body_right single_news_content"):
    # only keep articles whose headline contains "Transfers"
    if 'Transfers' in news.find("span", class_="news_headline").get_text():
        for i in news.find("div", class_="news_text").find_all('span'):
            # the first link is the transferred player, the second the receiving user
            subject = i.find_all('a')[0].get_text()
            # the amount sits between 'für ' and ' von'; normalize '167.000' to '167000'
            amount = i.get_text().split('für ', 1)[1].split(' von')[0].replace('.', '').replace(',', '.')
            from_player = i.get_text().split('von ', 1)[1].split(' zu')[0]
            to_player = i.find_all('a')[1].get_text()
            data.append({'subject': subject, 'amount': amount,
                         'from_player': from_player, 'to_player': to_player})

df = pd.DataFrame(data)
df.to_csv('output.csv')
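
As written, to_csv produces a comma-separated file with an index column and a header row. If you want semicolon-separated output like the sample in the question, pandas' standard sep, index and header parameters cover that:

# write semicolon-separated values without index or header,
# closer to the sample output in the question
df.to_csv('output.csv', sep=';', index=False, header=False)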

Result:

   subject  amount from_player  to_player
0  R. Wolf  167000    Computer      Hauke
1   Weiner  167000    Computer      Hauke
2  Gonther  770000    Computer  Christoph
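
The second sample output in the question uses the numeric IDs (3378, 514102, ...) rather than names. Those IDs appear only in the href attributes of the links, so, assuming the href patterns shown in the sample HTML, a minimal sketch for the inner loop would be:

links = i.find_all('a')
# player href looks like '/2.bundesliga/players/R. Wolf-3378' -> ID after the last '-'
subject_id = links[0]['href'].rsplit('-', 1)[-1]
# user href looks like '/users/514102' -> ID after the last '/'
to_player_id = links[1]['href'].rsplit('/', 1)[-1]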

Regarding python - scraping news with BeautifulSoup, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/68240016/
