
python - Can't use dont_filter=True within a spider in the right way to avoid some unwanted behavior


I've created a spider to parse the links of different containers from the landing pages of the same site (fed from a text file), and then to use those links to scrape the title from each inner page. A few of the links have next-page buttons, which the spider handles accordingly.

The spider does parse the content, but it gets stuck in an infinite loop caused by dont_filter=True. If I leave that parameter out, the spider never reuses the links that initially failed to produce the desired response.
I've used the dont_filter=True parameter in three places:

  • the _retry() method in the middleware
  • the last line within the parse() method
  • the last line within the parse_content() method
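
dont_filter=True tells Scrapy's scheduler to skip its duplicate-request filter, so the same URL can be queued again and again. When a retried page keeps failing the markup check, the request is re-yielded with no upper bound, which is exactly the loop described above. A minimal illustration (the selector is a stand-in, not the spider's real one):

    # Illustration only: if the expected markup never shows up, this
    # callback re-yields the same request forever, because dont_filter=True
    # bypasses the scheduler's duplicate filter.
    def parse(self, response):
        if not response.css("h1"):
            yield scrapy.Request(response.url, callback=self.parse, dont_filter=True)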

The spider I've created:
    import scrapy
    from bs4 import BeautifulSoup
    from scrapy.crawler import CrawlerProcess


    class YelpSpider(scrapy.Spider):
        name = "yelpspidescript"

        # strip newlines and skip blank lines so the requests get clean URLs
        with open("all_urls.txt") as f:
            start_urls = [url.strip() for url in f if url.strip()]

        def start_requests(self):
            for url in self.start_urls:
                yield scrapy.Request(url, callback=self.parse, meta={"lead_link": url})

        def parse(self, response):
            # recover the original link, even after a redirect
            if response.meta.get("lead_link"):
                lead_link = response.meta.get("lead_link")
            elif response.meta.get("redirect_urls"):
                lead_link = response.meta.get("redirect_urls")[0]

            soup = BeautifulSoup(response.text, 'lxml')
            if soup.select("[class*='hoverable'] h4 a[href^='/biz/'][name]"):
                for item in soup.select("[class*='hoverable'] h4 a[href^='/biz/'][name]"):
                    lead_link = response.urljoin(item.get("href"))
                    yield scrapy.Request(lead_link, meta={"lead_link": lead_link}, callback=self.parse_content)

                next_page = soup.select_one("a[class*='next-link'][href^='/search?']")
                if next_page:
                    link = response.urljoin(next_page.get("href"))
                    yield scrapy.Request(link, meta={"lead_link": link}, callback=self.parse)

            else:
                # expected markup missing: retry the same page
                yield scrapy.Request(lead_link, meta={"lead_link": lead_link}, callback=self.parse, dont_filter=True)

        def parse_content(self, response):
            if response.meta.get("lead_link"):
                lead_link = response.meta.get("lead_link")
            elif response.meta.get("redirect_urls"):
                lead_link = response.meta.get("redirect_urls")[0]

            soup = BeautifulSoup(response.text, 'lxml')

            if soup.select_one("h1[class*='heading--inline__']"):
                try:
                    name = soup.select_one("h1[class*='heading--inline__']").get_text(strip=True)
                except AttributeError:
                    name = ""
                print(name)

            else:
                # expected markup missing: retry the same page
                yield scrapy.Request(lead_link, meta={"lead_link": lead_link}, callback=self.parse_content, dont_filter=True)


    if __name__ == "__main__":
        c = CrawlerProcess({
            'USER_AGENT': 'Mozilla/5.0',
            'LOG_LEVEL': 'ERROR',
        })
        c.crawl(YelpSpider)
        c.start()
The middleware:
    from fake_useragent import UserAgent
    from scrapy.utils.response import response_status_message


    RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 403, 401, 400, 404]

    class yelp_custom_Middleware(object):
        ua = UserAgent()

        def process_request(self, request, spider):
            # rotate the User-Agent on every request
            request.headers['User-Agent'] = self.ua.random

        def process_exception(self, request, exception, spider):
            return self._retry(request, exception, spider)

        def _retry(self, request, reason, spider):
            # re-issue a copy of the request, bypassing the dupefilter
            retryreq = request.copy()
            retryreq.dont_filter = True
            return retryreq

        def process_response(self, request, response, spider):
            if request.meta.get('dont_retry', False):
                return response
            if response.status in RETRY_HTTP_CODES:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response
            return response
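
For this middleware to take effect it also has to be registered in the downloader-middleware chain. A minimal sketch, assuming the class sits in a module named middleware.py (the module name and the priority 543 are placeholders):

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
        # enable the custom middleware; 543 is an arbitrary priority
        'DOWNLOADER_MIDDLEWARES': {'middleware.yelp_custom_Middleware': 543},
    })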
How can I keep the spider from getting stuck in this infinite loop?
Edit:
I'd like to include a few of the urls I'm trying with from the all_urls.txt file, in case it helps identify the problem better.

Best Answer

You can count the number of retries for each URL:

    from fake_useragent import UserAgent


    class yelp_custom_Middleware(object):
        ua = UserAgent()
        max_retries = 3
        retry_urls = {}  # per-URL retry counter

        def process_request(self, request, spider):
            request.headers['User-Agent'] = self.ua.random

        def process_exception(self, request, exception, spider):
            return self._retry(request, exception, spider)

        def _retry(self, request, reason, spider):
            retry_url = request.url
            if retry_url not in self.retry_urls:
                self.retry_urls[retry_url] = 1
            else:
                self.retry_urls[retry_url] += 1

            if self.retry_urls[retry_url] > self.max_retries:
                # Don't retry: give up on this URL
                return None
            else:
                # Retry: re-issue a copy of the request, bypassing the dupefilter
                retryreq = request.copy()
                retryreq.dont_filter = True
                return retryreq
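
For reference, Scrapy's stock RetryMiddleware already implements this kind of cap: it keeps a per-request counter in request.meta['retry_times'] and gives up after RETRY_TIMES attempts. A minimal sketch of leaning on it instead of a hand-rolled dict (the retry count and the code list here are illustrative):

    c = CrawlerProcess({
        'USER_AGENT': 'Mozilla/5.0',
        'LOG_LEVEL': 'ERROR',
        'RETRY_ENABLED': True,
        'RETRY_TIMES': 3,  # give up on a request after 3 retries
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 408, 429],
    })

One side effect of the URL-keyed dict above is that it is never cleared, so entries accumulate for the lifetime of the crawl; the built-in per-request counter avoids that.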

Regarding "python - Can't use dont_filter=True within a spider in the right way to avoid some unwanted behavior", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/63436737/
