
python - Scraping articles from the Wall Street Journal with BeautifulSoup in Python 3.7?

Reposted. Author: 行者123. Updated: 2023-11-28 22:10:42

I'm trying to scrape articles from The Wall Street Journal using BeautifulSoup in Python. The code runs without any errors (exit code 0) but produces no output, and I don't understand why it doesn't give the expected result.

I even have a paid subscription.

I know something is wrong, but I can't find the problem.

import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
      '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32
for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".items.hedSumm li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text
        title = sauce.select_one("h1.wsj-article-headline").text
        content = [elem.text for elem in sauce.select("p.article-content")]
        print(f'{date}\n {tag}\n {title}\n {content}\n')
        time.sleep(3)

As written in the code, I'm trying to scrape the date, title, tag, and content of every article. It would be very helpful to get advice on what my mistake is and what I should do to get the desired result.

Best Answer

Replace this line of your code:

resp = requests.get(item.get("href"))

with:

_href = item.get("href")

try:
    resp = requests.get(_href)
except Exception as e:
    try:
        resp = requests.get("https://www.wsj.com" + _href)
    except Exception as e:
        continue

because most item.get("href") calls do not return a full website URL. For example, the hrefs you get look like this:

/news/types/national-security
/public/page/news-financial-markets-stock.html
https://www.wsj.com/news/world

Only https://www.wsj.com/news/world is a valid URL, so you need to concatenate the base URL with _href.
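As a side note, the try/except fallback works, but the standard library's urllib.parse.urljoin handles both cases without relying on exceptions: it prepends the base only when the href is relative and leaves absolute URLs untouched. A minimal sketch:

```python
from urllib.parse import urljoin

base = "https://www.wsj.com"

# Relative hrefs get the base prepended; absolute URLs pass through unchanged.
print(urljoin(base, "/news/types/national-security"))
# https://www.wsj.com/news/types/national-security
print(urljoin(base, "https://www.wsj.com/news/world"))
# https://www.wsj.com/news/world
```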

Update:

import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
      '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32

for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.find_all("a", {"class": "headline-image"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com" + _href)
            except Exception as e:
                continue

        sauce = BeautifulSoup(resp.text, "lxml")
        dateTag = sauce.find("time", {"class": "timestamp article__timestamp flexbox__flex--1"})
        tag = sauce.find("li", {"class": "article-breadCrumb"})
        titleTag = sauce.find("h1", {"class": "wsj-article-headline"})
        contentTag = sauce.find("div", {"class": "wsj-snippet-body"})

        date = None
        tagName = None
        title = None
        content = None

        if isinstance(dateTag, Tag):
            date = dateTag.get_text().strip()

        if isinstance(tag, Tag):
            tagName = tag.get_text().strip()

        if isinstance(titleTag, Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag, Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)
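The isinstance(..., Tag) checks in the updated code guard against find() returning None when a page (a paywalled article, for instance) lacks the expected element. The same guard can be factored into a small helper; safe_text is a hypothetical name for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def safe_text(node):
    # Return stripped text for a real Tag; None when find() matched nothing.
    return node.get_text().strip() if isinstance(node, Tag) else None

sauce = BeautifulSoup("<h1 class='wsj-article-headline'> Some Headline </h1>",
                      "html.parser")
print(safe_text(sauce.find("h1", {"class": "wsj-article-headline"})))  # Some Headline
print(safe_text(sauce.find("div", {"class": "wsj-snippet-body"})))     # None
```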

Output:

March 31, 2019 10:00 a.m. ET
Tech
Care.com Removes Tens of Thousands of Unverified Listings
The online child-care marketplace Care.com scrubbed its site of tens of thousands of unverified day-care center listings just before a Wall Street Journal investigation published March 8, an analysis shows. Care.com, the largest site in the U.S. for finding caregivers, removed about 72% of day-care centers, or about 46,594 businesses, listed on its site, a Journal review of the website shows. Those businesses were listed on the site as recently as March 1....

Updated March 29, 2019 6:08 p.m. ET
Politics
FBI, Retooling Once Again, Sets Sights on Expanding Cyber Threats
The FBI has launched its biggest transformation since the 2001 terror attacks to retrain and refocus special agents to combat cyber criminals, whose threats to lives, property and critical infrastructure have outstripped U.S. efforts to thwart them. The push comes as federal investigators grapple with an expanding range of cyber attacks sponsored by foreign adversaries against businesses or national interests, including Russian election interference and Chinese cyber thefts from American companies, senior bureau executives...

Regarding "python - Scraping articles from the Wall Street Journal with BeautifulSoup in Python 3.7?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56374425/
