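I'm trying to scrape articles from The Wall Street Journal using BeautifulSoup in Python. The code runs without any errors (exit code 0) but produces no results. I don't understand what is happening. Why doesn't this code give the expected output?
I even have a paid subscription.
I know something is wrong, but I can't find the problem.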
import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
      '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'
pages = 32

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select(".items.hedSumm li > a"):
        resp = requests.get(item.get("href"))
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com" + _href)
            except Exception as e:
                continue
        sauce = BeautifulSoup(resp.text,"lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text
        title = sauce.select_one("h1.wsj-article-headline").text
        content = [elem.text for elem in sauce.select("p.article-content")]
        print(f'{date}\n {tag}\n {title}\n {content}\n')
        time.sleep(3)
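As the code shows, I'm trying to scrape the date, title, tag, and content of every article. Any advice on my mistake, and on what I should do to get the desired result, would be very helpful.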
Best answer
Replace this line in your code:
    resp = requests.get(item.get("href"))
with:
    _href = item.get("href")
    try:
        resp = requests.get(_href)
    except Exception as e:
        try:
            resp = requests.get("https://www.wsj.com"+_href)
        except Exception as e:
            continue
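This is needed because most of the values returned by item.get("href") are not complete URLs. For example, you get hrefs like these:
    /news/types/national-security
    /public/page/news-financial-markets-stock.html
    https://www.wsj.com/news/world
Only https://www.wsj.com/news/world is a valid absolute URL, so for the others you need to prepend the base URL to _href.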
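The nested try/except works because requests.get raises an exception (MissingSchema) when the URL has no scheme, and the fallback then prepends the base URL. As an alternative sketch, not part of the original answer, the standard library's urllib.parse.urljoin can resolve both relative and absolute hrefs in one step. The names BASE_URL and absolute_url below are hypothetical and chosen only for illustration.

```python
from urllib.parse import urljoin

# Assumption: all relative links on the search page are relative to this host.
BASE_URL = "https://www.wsj.com"


def absolute_url(href, base=BASE_URL):
    """Resolve href against base; hrefs that are already absolute are returned unchanged."""
    return urljoin(base, href)


# The three example hrefs quoted in the answer above.
hrefs = [
    "/news/types/national-security",
    "/public/page/news-financial-markets-stock.html",
    "https://www.wsj.com/news/world",
]

resolved = [absolute_url(h) for h in hrefs]
for link in resolved:
    print(link)
```

With this approach each article needs only a single requests.get(absolute_url(_href)) call, and a try/except is left to handle genuine network errors rather than malformed URLs.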
Update:
import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
      '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'
pages = 32

for page in range(1, pages+1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text,"lxml")
    # Only follow the headline links of the search results
    for item in soup.find_all("a",{"class":"headline-image"},href=True):
        _href = item.get("href")
        # Try the href as-is first; if it is relative, retry with the base URL
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com"+_href)
            except Exception as e:
                continue
        sauce = BeautifulSoup(resp.text,"lxml")
        dateTag = sauce.find("time",{"class":"timestamp article__timestamp flexbox__flex--1"})
        tag = sauce.find("li",{"class":"article-breadCrumb"})
        titleTag = sauce.find("h1",{"class":"wsj-article-headline"})
        contentTag = sauce.find("div",{"class":"wsj-snippet-body"})
        date = None
        tagName = None
        title = None
        content = None
        # find() returns None when an element is missing, so check before reading text
        if isinstance(dateTag,Tag):
            date = dateTag.get_text().strip()
        if isinstance(tag,Tag):
            tagName = tag.get_text().strip()
        if isinstance(titleTag,Tag):
            title = titleTag.get_text().strip()
        if isinstance(contentTag,Tag):
            content = contentTag.get_text().strip()
        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)
Output:
March 31, 2019 10:00 a.m. ET
Tech
Care.com Removes Tens of Thousands of Unverified Listings
The online child-care marketplace Care.com scrubbed its site of tens of thousands of unverified day-care center listings just before a Wall Street Journal investigation published March 8, an analysis shows. Care.com, the largest site in the U.S. for finding caregivers, removed about 72% of day-care centers, or about 46,594 businesses, listed on its site, a Journal review of the website shows. Those businesses were listed on the site as recently as March 1....
Updated March 29, 2019 6:08 p.m. ET
Politics
FBI, Retooling Once Again, Sets Sights on Expanding Cyber Threats
The FBI has launched its biggest transformation since the 2001 terror attacks to retrain and refocus special agents to combat cyber criminals, whose threats to lives, property and critical infrastructure have outstripped U.S. efforts to thwart them. The push comes as federal investigators grapple with an expanding range of cyber attacks sponsored by foreign adversaries against businesses or national interests, including Russian election interference and Chinese cyber thefts from American companies, senior bureau executives...
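The isinstance checks in the updated code guard against find() returning None when an element is missing from a page. The following is a minimal sketch of the same idea in a more compact form. It assumes a hypothetical helper named text_or_none and uses sample markup invented for illustration (it is not real WSJ HTML). It also uses the built-in "html.parser" instead of "lxml" so that it runs without the lxml package.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, name, class_name):
    """Return the stripped text of the first matching tag, or None if there is no match."""
    # find() returns None when nothing matches, so guard before calling get_text().
    node = soup.find(name, {"class": class_name})
    return node.get_text().strip() if node is not None else None


# Sample markup invented for illustration; it only reuses class names from the answer.
sample_html = """
<h1 class="wsj-article-headline"> Example headline </h1>
<div class="wsj-snippet-body">Example summary text.</div>
"""

sauce = BeautifulSoup(sample_html, "html.parser")
title = text_or_none(sauce, "h1", "wsj-article-headline")
content = text_or_none(sauce, "div", "wsj-snippet-body")
# No <time> element exists in the sample, so this lookup yields None.
date = text_or_none(sauce, "time", "timestamp article__timestamp flexbox__flex--1")

print(f'{date}\n {title}\n {content}\n')
```

A missing element then shows up as None in the printed record instead of raising an AttributeError, which matches the behavior of the updated code.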
This post is based on a similar question found on Stack Overflow, "python - Scraping articles from the Wall Street Journal using Beautifulsoup in Python 3.7?": https://stackoverflow.com/questions/56374425/