
python - Scrapy crawl finishes without crawling all start requests


I am trying to run a broad crawl with the scrapy library, in which I parse several million websites. The spider is connected to a PostgreSQL database. This is how I load the unprocessed URLs before starting the spider:

def get_unprocessed_urls(self, suffix):
    """
    Fetch unprocessed urls.
    """

    print(f'Fetching unprocessed urls for suffix {suffix}...')

    cursor = self.connection.cursor('unprocessed_urls_cursor', withhold=True)
    cursor.itersize = 1000
    cursor.execute(f"""
        SELECT su.id, su.url FROM seed_url su
        LEFT JOIN footer_seed_url_status fsus ON su.id = fsus.seed_url_id
        WHERE su.url LIKE '%.{suffix}' AND fsus.seed_url_id IS NULL;
    """)

    ID = 0
    URL = 1

    urls = [Url(url_row[ID], self.validate_url(url_row[URL])) for url_row in cursor]

    print('len urls:', len(urls))
    return urls
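
The Url objects built here are just small containers pairing the database id with the validated URL; their definition is not shown in the post, but a minimal sketch of what is assumed could look like this:

from collections import namedtuple

# Hypothetical container: the post only shows Url(id, url) being constructed.
Url = namedtuple('Url', ['id', 'url'])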

This is my spider:

class FooterSpider(scrapy.Spider):

    ...

    def start_requests(self):

        urls = self.handler.get_unprocessed_urls(self.suffix)

        for url in urls:

            yield scrapy.Request(
                url=url.url,
                callback=self.parse,
                errback=self.errback,
                meta={
                    'seed_url_id': url.id,
                }
            )

    def parse(self, response):

        try:

            seed_url_id = response.meta.get('seed_url_id')

            print(response.url)

            soup = BeautifulSoup(response.text, 'html.parser')

            footer = soup.find('footer')

            item = FooterItem(
                seed_url_id=seed_url_id,
                html=str(footer) if footer is not None else None,
                url=response.url
            )
            yield item
            print(f'Successfully processed url {response.url}')

        except Exception as e:
            print('Error while processing url', response.url)
            print(e)

            seed_url_id = response.meta.get('seed_url_id')

            cursor = self.handler.connection.cursor()
            cursor.execute(
                "INSERT INTO footer_seed_url_status(seed_url_id, status) VALUES(%s, %s)",
                (seed_url_id, str(e)))

            self.handler.connection.commit()

    def errback(self, failure):
        print(failure.value)

        try:

            error = repr(failure.value)
            request = failure.request

            seed_url_id = request.meta.get('seed_url_id')

            cursor = self.handler.connection.cursor()
            cursor.execute(
                "INSERT INTO footer_seed_url_status(seed_url_id, status) VALUES(%s, %s)",
                (seed_url_id, error))

            self.handler.connection.commit()

        except Exception as e:
            print(e)

These are my custom crawl settings (taken from the broad-crawl documentation page mentioned above):

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
CONCURRENT_REQUESTS = 100
CONCURRENT_ITEMS = 1000
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
COOKIES_ENABLED = False
DOWNLOAD_DELAY = 0.2

My problem: the spider does not crawl all of the URLs; it stops after only a few hundred (or a few thousand, the number seems to vary). No warnings or errors appear in the logs. Here are example logs after one such "finished" crawl:

{'downloader/exception_count': 2,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 1,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 1,
'downloader/request_bytes': 345073,
'downloader/request_count': 1481,
'downloader/request_method_count/GET': 1481,
'downloader/response_bytes': 1977255,
'downloader/response_count': 1479,
'downloader/response_status_count/200': 46,
'downloader/response_status_count/301': 791,
'downloader/response_status_count/302': 512,
'downloader/response_status_count/303': 104,
'downloader/response_status_count/308': 2,
'downloader/response_status_count/403': 2,
'downloader/response_status_count/404': 22,
'dupefilter/filtered': 64,
'elapsed_time_seconds': 113.895788,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 8, 3, 11, 46, 31, 889491),
'httpcompression/response_bytes': 136378,
'httpcompression/response_count': 46,
'log_count/ERROR': 3,
'log_count/INFO': 11,
'log_count/WARNING': 7,
'response_received_count': 43,
"robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
"robotstxt/exception_count/<class 'twisted.web._newclient.ResponseNeverReceived'>": 1,
'robotstxt/request_count': 105,
'robotstxt/response_count': 43,
'robotstxt/response_status_count/200': 21,
'robotstxt/response_status_count/403': 2,
'robotstxt/response_status_count/404': 20,
'scheduler/dequeued': 151,
'scheduler/dequeued/memory': 151,
'scheduler/enqueued': 151,
'scheduler/enqueued/memory': 151,
'start_time': datetime.datetime(2023, 8, 3, 11, 44, 37, 993703)}
2023-08-03 11:46:31 [scrapy.core.engine] INFO: Spider closed (finished)

Strangely, the problem only seems to occur on one of the two machines I am using for crawling. When I run the crawl locally on my own computer (Windows 11), it does not stop early. However, when I run the code on our company's server (a Microsoft Azure Windows 10 machine), the crawl stops prematurely as described above.

EDIT: The full logs can be found here. In that run, the process stops after only a handful of URLs.

Best Answer

I finally found the problem: Scrapy requires every start URL to have an HTTP(S) scheme. For example, stackoverflow.com does not work, but https://stackoverflow.com does.

I was using the following code to check whether a URL contains a scheme:

if not url.startswith("http"):
    url = 'http://' + url

However, this validation was flawed. My data contains millions of URLs, and some of them are apparently degenerate or unconventional (http.gay seems to be a valid redirect domain), for example:

httpsf52u5bids65u.xyz
httppollenmap.com
http.gay

These URLs pass my scheme check even though they contain no scheme at all, and they were breaking my crawl.
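
A likely reason the crawl simply "finishes" is that scrapy.Request raises a ValueError for a URL without a scheme, and an exception thrown while Scrapy is iterating start_requests stops it from pulling any further start requests (the requests already scheduled still complete, which matches the logs above). Below is a defensive sketch of start_requests, assuming that behaviour, which logs and skips the offending URL instead of letting it end the crawl:

    def start_requests(self):
        urls = self.handler.get_unprocessed_urls(self.suffix)

        for url in urls:
            try:
                request = scrapy.Request(
                    url=url.url,
                    callback=self.parse,
                    errback=self.errback,
                    meta={'seed_url_id': url.id},
                )
            except ValueError as e:
                # e.g. a missing-scheme error -- log it, skip it, keep yielding
                self.logger.warning('Skipping invalid url %s: %s', url.url, e)
                continue

            yield request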

I changed the validation to the following, and the problem went away:

if not (url.startswith("http://") or url.startswith('https://')):
    url = 'http://' + url
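
An equivalent but slightly more explicit check is to parse the URL and inspect its scheme. A minimal sketch using the standard library's urllib.parse (the helper name ensure_scheme is not from the original post):

from urllib.parse import urlparse

def ensure_scheme(url):
    """Prepend http:// when the URL has no usable scheme (e.g. 'httppollenmap.com')."""
    if urlparse(url).scheme not in ('http', 'https'):
        return 'http://' + url
    return url

With this, ensure_scheme('httppollenmap.com') returns 'http://httppollenmap.com', while 'https://stackoverflow.com' is returned unchanged.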

Regarding "python - Scrapy crawl finishes without crawling all start requests", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76827831/
