
python - Scrapy - Overcoming start_uri redirects by dynamically adding allowed_urls - parse_start_url issue


Hello Stack Overflow community,

I have the following problem:

I am trying to scrape a long list of websites. Some of the sites in my start_urls list redirect (301). I want Scrapy to crawl the redirect targets of those start_urls as if they were also on the allowed_domains list (which they are not). For example, example.com is on my start_urls list and on the allowed_domains list, and example.com redirects to foo.com. I want to crawl foo.com.

DEBUG: Redirecting (301) to <GET http://www.foo.com/> from <GET http://www.example.com>

I noticed the answer to Scrapy Crawl all websites in start_url even if redirect, which offers a solution based on modifying the OffsiteMiddleware. I understand that part, but I am not sure how parse_start_url gets overridden. Here is the code I have so far:

import scrapy
import urllib.request
import urllib.parse
import json
from placementarchitect import bingapi
import tldextract

from spiderproject.items import DmozItem
from scrapy.crawler import CrawlerProcess


class GoodSpider(scrapy.Spider):
    name = "goodoldSpider"

    def __init__(self, input=None):
        self.searchterm = input
        self.urlstocrawl = bingapi.get_crawl_urls(self.searchterm)  # This returns a list of crawlable sites from the BingSearchAPI
        self.start_urls = self.urlstocrawl
        self.allowed_domains = []

    def parse_start_url(self, response):
        domain = tldextract.extract(str(response.request.url)).registered_domain
        if domain not in self.allowed_domains:
            self.allowed_domains.append(domain)
        else:
            return self.parse(response)

    def parse(self, response):
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[attribute::class="cat-item"]'):
            item = DmozItem()
            item['title'] = sel.xpath('a/div/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

        next_page = response.css(".cat-item>a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_dir_contents)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(GoodSpider, input='"good news"')
process.start()  # the script will block here until the crawling is finished
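
For context, the OffsiteMiddleware change that the linked answer relies on is only referenced above, not shown. The stock middleware compiles its allowed-domains regex once when the spider opens, so domains appended to allowed_domains at runtime are otherwise ignored. A minimal sketch of the idea, assuming Scrapy's stock scrapy.spidermiddlewares.offsite.OffsiteMiddleware and a hypothetical myproject.middlewares module (this is not the exact code from that answer):

from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class DynamicOffsiteMiddleware(OffsiteMiddleware):
    # Re-reads spider.allowed_domains on every check, so domains appended
    # at runtime (e.g. after a 301 redirect) take effect immediately.

    def should_follow(self, request, spider):
        # Rebuild the host regex from the spider's current allowed_domains
        # instead of relying on the copy built once in spider_opened().
        self.host_regex = self.get_host_regex(spider)
        return super().should_follow(request, spider)

It would then replace the stock middleware in settings.py, for example:

SPIDER_MIDDLEWARES = {
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
    'myproject.middlewares.DynamicOffsiteMiddleware': 500,
}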

The Scrapy documentation says very little about parse_start_url, so I am not sure how this is meant to be implemented. Consequently, my solution does not seem to work.

I am afraid this is due to how

def parse_start_url()

is implemented.

Any suggestions would be greatly appreciated.

Mike

Best answer

OK, I figured it out. There is actually no need for:

def parse_start_url(...)

Instead, I integrated the code that previously sat under def parse_start_url(...) into the spider's main def parse function:

import scrapy
import urllib.request
import urllib.parse
import json
from placementarchitect import bingapi
import tldextract

from spiderproject.items import DmozItem
from scrapy.crawler import CrawlerProcess


class GoodSpider(scrapy.Spider):
    name = "goodoldSpider"

    def __init__(self, input=None):
        self.searchterm = input
        self.urlstocrawl = bingapi.get_crawl_urls(self.searchterm)  # This returns a list of crawlable sites from the BingSearchAPI
        self.start_urls = self.urlstocrawl
        self.allowed_domains = []

        print("TEST >>> In Searchscraper.py: " + str(self.urlstocrawl))

    ## Commented this part out as it is not required anymore - code was integrated into def parse(..) below
    # def parse_start_url(self, response):
    #     domain = tldextract.extract(str(response.request.url)).registered_domain
    #     print(domain)
    #     if domain not in self.allowed_domains:
    #         self.allowed_domains.append(domain)
    #     return self.parse(response.url, callback=self.parse)

    def parse(self, response):
        domain = tldextract.extract(str(response.request.url)).registered_domain
        print(domain)
        if domain not in self.allowed_domains:
            self.allowed_domains.append(domain)
        for href in response.xpath("//a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        for sel in response.xpath('//div[attribute::class="cat-item"]'):
            item = DmozItem()
            item['title'] = sel.xpath('a/div/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item

        next_page = response.css(".cat-item>a::attr('href')")
        if next_page:
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse_dir_contents)


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(GoodSpider, input='"good news"')
process.start()  # the script will block here until the crawling is finished
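
The key piece in parse() above is tldextract's registered_domain, which strips any subdomain and keeps just the domain plus public suffix, i.e. the form allowed_domains expects. A quick illustration (assuming tldextract is installed; the URLs are only examples):

import tldextract

# registered_domain == domain + public suffix, with subdomains removed
print(tldextract.extract("http://www.foo.com/").registered_domain)         # foo.com
print(tldextract.extract("http://blog.example.co.uk/").registered_domain)  # example.co.uk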

This solution dynamically adds the domains that the initial start_urls redirect to into the allowed_domains list.

This means that requests to all other domains are filtered out, like so:

[scrapy] DEBUG: Filtered offsite request to 'www.pinterest.com': <GET http://www.pinterest.com/goodnewsnetwork/>
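
As for why the original parse_start_url override never ran: parse_start_url is a hook defined on CrawlSpider, not on the base scrapy.Spider, so a plain Spider subclass never calls it. A minimal sketch of the same idea on a CrawlSpider, for comparison (the spider name, start URL and callback names here are illustrative only):

import tldextract
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class RedirectAwareSpider(CrawlSpider):
    name = "redirect_aware"
    start_urls = ["http://www.example.com/"]
    allowed_domains = []
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_start_url(self, response):
        # CrawlSpider calls this for each response to a start_urls request,
        # including the page reached after a 301/302 redirect.
        domain = tldextract.extract(response.url).registered_domain
        if domain not in self.allowed_domains:
            self.allowed_domains.append(domain)
        return []

    def parse_item(self, response):
        pass  # item extraction would go here

Note that even on a CrawlSpider the dynamic append only takes effect if the offsite filtering re-reads allowed_domains, as in the middleware sketch earlier.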

On python - Scrapy - Overcoming start_uri redirects by dynamically adding allowed_urls - parse_start_url issue, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38162214/
