
python - Scraping articles from the Wall Street Journal with BeautifulSoup in Python 3.7?

Reposted. Author: 行者123. Updated: 2023-11-28 22:10:42

I'm trying to scrape articles from The Wall Street Journal using BeautifulSoup in Python. The code runs without any errors (exit code 0) but produces no output, and I don't understand why it doesn't give the expected result.

I even have a paid subscription.

I know something is wrong, but I can't find the problem.

import time
import requests
from bs4 import BeautifulSoup

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
      '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32
for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".items.hedSumm li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select("time.timestamp.article__timestamp.flexbox__flex--1")
        date = date[0].text
        tag = sauce.select("li.article-breadCrumb span").text
        title = sauce.select_one("h1.wsj-article-headline").text
        content = [elem.text for elem in sauce.select("p.article-content")]
        print(f'{date}\n {tag}\n {title}\n {content}\n')
        time.sleep(3)

As written in the code, I'm trying to scrape the date, title, tag, and content of every article. It would be very helpful to get advice on what my mistake is and what I should do to get the desired result.

Best Answer

Replace this line of your code:

resp = requests.get(item.get("href"))

with:

_href = item.get("href")

try:
    resp = requests.get(_href)
except Exception as e:
    try:
        resp = requests.get("https://www.wsj.com" + _href)
    except Exception as e:
        continue

because most item.get("href") calls do not return a full website URL. For example, the hrefs you get look like this:

/news/types/national-security
/public/page/news-financial-markets-stock.html
https://www.wsj.com/news/world

Only https://www.wsj.com/news/world is a valid URL, so you need to concatenate the base URL with _href.
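As a side note, the try/except fallback works, but the standard library's urllib.parse.urljoin handles both cases without relying on exceptions: it prepends the base only when the href is relative and leaves absolute URLs untouched. A minimal sketch:

```python
from urllib.parse import urljoin

base = "https://www.wsj.com"

# Relative hrefs get the base prepended; absolute URLs pass through unchanged.
print(urljoin(base, "/news/types/national-security"))
# https://www.wsj.com/news/types/national-security
print(urljoin(base, "https://www.wsj.com/news/world"))
# https://www.wsj.com/news/world
```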

Update:

import time
import requests
from bs4 import BeautifulSoup
from bs4.element import Tag

url = 'https://www.wsj.com/search/term.html?KEYWORDS=cybersecurity&min-date=2018/04/01&max-date=2019/03/31' \
      '&isAdvanced=true&daysback=90d&andor=AND&sort=date-desc&source=wsjarticle,wsjpro&page={}'

pages = 32

for page in range(1, pages + 1):
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")

    for item in soup.find_all("a", {"class": "headline-image"}, href=True):
        _href = item.get("href")
        try:
            resp = requests.get(_href)
        except Exception as e:
            try:
                resp = requests.get("https://www.wsj.com" + _href)
            except Exception as e:
                continue

        sauce = BeautifulSoup(resp.text, "lxml")
        dateTag = sauce.find("time", {"class": "timestamp article__timestamp flexbox__flex--1"})
        tag = sauce.find("li", {"class": "article-breadCrumb"})
        titleTag = sauce.find("h1", {"class": "wsj-article-headline"})
        contentTag = sauce.find("div", {"class": "wsj-snippet-body"})

        date = None
        tagName = None
        title = None
        content = None

        if isinstance(dateTag, Tag):
            date = dateTag.get_text().strip()

        if isinstance(tag, Tag):
            tagName = tag.get_text().strip()

        if isinstance(titleTag, Tag):
            title = titleTag.get_text().strip()

        if isinstance(contentTag, Tag):
            content = contentTag.get_text().strip()

        print(f'{date}\n {tagName}\n {title}\n {content}\n')
        time.sleep(3)
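The isinstance(..., Tag) checks in the updated code guard against find() returning None when a page (a paywalled article, for instance) lacks the expected element. The same guard can be factored into a small helper; safe_text is a hypothetical name for illustration, not part of the original answer:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def safe_text(node):
    # Return stripped text for a real Tag; None when find() matched nothing.
    return node.get_text().strip() if isinstance(node, Tag) else None

sauce = BeautifulSoup("<h1 class='wsj-article-headline'> Some Headline </h1>",
                      "html.parser")
print(safe_text(sauce.find("h1", {"class": "wsj-article-headline"})))  # Some Headline
print(safe_text(sauce.find("div", {"class": "wsj-snippet-body"})))     # None
```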

Output:

March 31, 2019 10:00 a.m. ET
Tech
Care.com Removes Tens of Thousands of Unverified Listings
The online child-care marketplace Care.com scrubbed its site of tens of thousands of unverified day-care center listings just before a Wall Street Journal investigation published March 8, an analysis shows. Care.com, the largest site in the U.S. for finding caregivers, removed about 72% of day-care centers, or about 46,594 businesses, listed on its site, a Journal review of the website shows. Those businesses were listed on the site as recently as March 1....

Updated March 29, 2019 6:08 p.m. ET
Politics
FBI, Retooling Once Again, Sets Sights on Expanding Cyber Threats
The FBI has launched its biggest transformation since the 2001 terror attacks to retrain and refocus special agents to combat cyber criminals, whose threats to lives, property and critical infrastructure have outstripped U.S. efforts to thwart them. The push comes as federal investigators grapple with an expanding range of cyber attacks sponsored by foreign adversaries against businesses or national interests, including Russian election interference and Chinese cyber thefts from American companies, senior bureau executives...

Regarding "python - Scraping articles from the Wall Street Journal with BeautifulSoup in Python 3.7?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56374425/
