
python - How to make sure every URL is parsed in my Scrapy spider


I'm trying to scrape every page of the recipe listing on a food blog, grab the recipe URLs on each page, and write them all to a single .txt file. As my code currently stands, it works, but only for the first URL listed in urls inside the start_requests method.

I added a .log() call to check that urls really does contain all of the correct URLs I'm trying to scrape. When I run Scrapy from the command prompt, I get the following confirmation:

2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=1
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=2
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=3
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=4
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=5
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=6
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=7
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=8
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=9
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=10
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=11
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=12
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=13
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=14
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=15
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=16
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=17
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=18
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=19
2019-01-31 22:16:17 [recipes] DEBUG: https://pinchofyum.com/recipes?fwp_paged=20

and so on.

My current code:

import scrapy
from bs4 import BeautifulSoup


class QuotesSpider(scrapy.Spider):
    name = "recipes"

    def start_requests(self):
        urls = []
        for i in range(1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            urls.append(curr_url)
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        soup = BeautifulSoup(response.body, "html.parser")
        page_links = soup.find_all(class_="post-summary")
        for link in page_links:
            with open("links.txt", "a") as f:
                f.write(link.a["href"] + "\n")

When I run the code above, the following output is written to links.txt:

https://pinchofyum.com/5-minute-vegan-yogurt
https://pinchofyum.com/red-curry-noodles
https://pinchofyum.com/15-minute-meal-prep-cauliflower-fried-rice-with-crispy-tofu
https://pinchofyum.com/5-ingredient-vegan-vodka-pasta
https://pinchofyum.com/lentil-greek-salads-with-dill-sauce
https://pinchofyum.com/coconut-oil-granola-remix
https://pinchofyum.com/quinoa-crunch-salad-with-peanut-dressing
https://pinchofyum.com/15-minute-meal-prep-cilantro-lime-chicken-and-lentils
https://pinchofyum.com/instant-pot-sweet-potato-tortilla-soup
https://pinchofyum.com/garlic-butter-baked-penne
https://pinchofyum.com/15-minute-meal-prep-creole-chicken-and-sausage
https://pinchofyum.com/lemon-chicken-soup-with-orzo
https://pinchofyum.com/brussels-sprouts-tacos
https://pinchofyum.com/14-must-bake-holiday-cookie-recipes
https://pinchofyum.com/how-to-cook-chicken

The links here are correct, but there should be 50-some more pages of them.

Any suggestions? What am I missing?

Best Answer

My understanding is that you want to make sure that every page in urls was successfully scraped and contains links. If so, see the code below:

import scrapy
from scrapy import signals
# scrapy.xlib.pydispatch is deprecated and removed in newer Scrapy releases;
# see the from_crawler sketch further below for the current way to hook up signals
from scrapy.xlib.pydispatch import dispatcher


class QuotesSpider(scrapy.Spider):
    name = "recipes"
    urls = []

    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def start_requests(self):
        for i in range(1, 60):
            curr_url = "https://pinchofyum.com/recipes?fwp_paged=%s" % i
            self.log(curr_url)
            self.urls.append(curr_url)
            yield scrapy.Request(url=curr_url, callback=self.parse)

    def parse(self, response):
        page_links = response.css(".post-summary")
        if len(page_links) > 0:
            # remove from self.urls to confirm that this page has been parsed
            self.urls.remove(response.url)
            for link in page_links:
                with open("links.txt", "a") as f:
                    f.write(link.css("a::attr(href)").get() + "\n")

    def spider_closed(self, spider):
        self.log("Following URLs were not parsed: %s" % (self.urls,))

What this does is append every URL that will be scraped to self.urls; once a URL has been scraped and it also contains links, it is removed from self.urls.

Note that there is also another method, spider_closed, which runs only when the scraper has finished, so it prints the URLs that were not scraped or that contained no links.
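
In newer Scrapy releases scrapy.xlib.pydispatch has been removed, so the same spider_closed hookup can instead be done through from_crawler and crawler.signals.connect. A minimal sketch of just the signal wiring (start_requests and parse stay the same as above):

import scrapy
from scrapy import signals


class QuotesSpider(scrapy.Spider):
    name = "recipes"
    urls = []

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # build the spider as usual, then register spider_closed as a signal handler
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        # runs once the crawl finishes; anything left in self.urls was never confirmed
        self.log("Following URLs were not parsed: %s" % (self.urls,))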

Also, why use BeautifulSoup? Just use Scrapy's Selector class instead.
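
For example, the parse method from the question could pull the recipe links with response.css alone, no BeautifulSoup needed. A rough equivalent, assuming the same .post-summary markup as above:

    def parse(self, response):
        # each .post-summary block wraps the recipe link in an <a> tag
        for href in response.css(".post-summary a::attr(href)").getall():
            with open("links.txt", "a") as f:
                f.write(href + "\n")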

Regarding python - How to make sure every URL is parsed in my Scrapy spider, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54474018/
