python - Scraping the New York Times Word of the Day - 6ren


Reposted by: 太空宇宙 · Updated: 2023-11-04 00:41:35


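I've only recently started working with Scrapy, and I picked the New York Times Word of the Day as my first test: https://www.nytimes.com/column/learning-word-of-the-day

I noticed they have an API, but for my exact case it doesn't offer anything I can use (I think). Basically, I want to go through every word of the day on that page and retrieve the word, its meaning, and the example paragraph.

This short piece of code should go through each url and retrieve at least the word, but I'm getting a lot of errors and I don't know why! I've been using SelectorGadget to get the CSS code I need, and this is my code so far: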

import scrapy

class NewYorkSpider(scrapy.Spider):
    name = "times"
    start_urls = [ "https://www.nytimes.com/column/learning-word-of-the-day" ]

    # entry point for the spider
    def parse(self,response):
        for href in response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]'):
            url = href.extract()
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        word = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "story-subheading", " " ))]//strong').extract()[0]

Thanks a lot!

Update with the errors (not exactly errors now, it just isn't scraping the expected information):

2017-01-18 01:13:48 [scrapy] DEBUG: Filtered duplicate request: <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20spawn%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2017-01-18 01:13:48 [scrapy] DEBUG: Crawled (404) <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20spawn%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> (referer: https://www.nytimes.com/column/learning-word-of-the-day)
2017-01-18 01:13:48 [scrapy] DEBUG: Crawled (404) <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20introvert%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> (referer: https://www.nytimes.com/column/learning-word-of-the-day)
2017-01-18 01:13:48 [scrapy] DEBUG: Crawled (404) <GET https://www.nytimes.com/column/%3Ch2%20class=%22headline%22%20itemprop=%22headline%22%3E%0A%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20Word%20+%20Quiz:%20funereal%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%20%3C/h2%3E> (referer: https://www.nytimes.com/column/learning-word-of-the-day)

Best answer

You are passing xpath expressions to the .css method, which is meant for css selector expressions.
Just replace .css with .xpath:

response.css('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]')
# to
response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "headline", " " ))]')

Regarding your second error - the extracted urls are not absolute urls, e.g. /some/sub/page.html. To convert them to absolute urls you can use the response.urljoin() function:

for href in response.xpath('...'):
    url = href.extract()
    full_url = response.urljoin(url)
    yield scrapy.Request(full_url)
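
To illustrate how the joining works, here is a minimal sketch using the standard library's urllib.parse.urljoin, which Scrapy's response.urljoin() wraps using the response URL as the base. The relative paths and the example.com URL are placeholders, not real pages.

```python
from urllib.parse import urljoin

# Base URL: the listing page from the question.
base = "https://www.nytimes.com/column/learning-word-of-the-day"

# A root-relative href replaces the whole path of the base URL.
root_relative = urljoin(base, "/some/sub/page.html")

# A bare relative href replaces only the last path segment, which is
# why the failed requests in the log all sit under /column/.
bare_relative = urljoin(base, "page.html")

# An already absolute URL is returned unchanged.
absolute = urljoin(base, "https://example.com/a")

print(root_relative)
print(bare_relative)
print(absolute)
```

Because absolute URLs pass through untouched, it is safe to call response.urljoin() on every extracted href without checking which kind it is.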

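Regarding your third error - your xpath is faulty here. It looks like you used some xpath generator, and those things rarely generate anything of value. All you are looking for here is an <a> node with the story-link class:

urls = response.xpath('//a[@class="story-link"]/@href').extract()
for url in urls:
    yield scrapy.Request(response.urljoin(url))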

For your word xpath, you can simply use the text under the <strong> node, which is under <h4>:

word = response.xpath("//h4/strong/text()").extract_first()

Regarding python - Scraping the New York Times Word of the Day, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41709564/
