
python - scrapy-playwright: Downloader/handlers: scrapy.exceptions.NotSupported: AsyncioSelectorReactor

Reposted. Author: 行者123. Updated: 2023-12-05 04:40:25

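I am trying to use scrapy-playwright to extract data from a website whose content is loaded dynamically with JavaScript, but I am stuck at the very first step.

The problems come from the following part of my settings.py file:

# Playwright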

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

#TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
#ASYNCIO_EVENT_LOOP = 'uvloop.Loop'

When I add only the following scrapy-playwright download handlers:

DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

I get:

scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The installed reactor (twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)

When I then enable TWISTED_REACTOR:

TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'

I get:

 raise TypeError(
TypeError: SelectorEventLoop required, instead got: <ProactorEventLoop running=False closed=False debug=False>
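
The TypeError above indicates that the asyncio reactor was handed a ProactorEventLoop, which is the default asyncio loop on Windows since Python 3.8. A minimal diagnostic sketch for checking which loop class a given interpreter creates by default is shown here; the helper name `default_loop_info` is hypothetical and not part of Scrapy or scrapy-playwright.

```python
import asyncio


def default_loop_info():
    """Return (class name, is_selector_loop) for a freshly created default loop."""
    # Hypothetical helper: creates a throwaway loop, inspects it, then closes it.
    loop = asyncio.new_event_loop()
    try:
        return type(loop).__name__, isinstance(loop, asyncio.SelectorEventLoop)
    finally:
        loop.close()


if __name__ == "__main__":
    name, is_selector = default_loop_info()
    print(f"default loop: {name}, selector-based: {is_selector}")
```

If the second value is False, the reactor check that raised the error above would be expected to fail in the same way on that machine.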

After that, when I also enable ASYNCIO_EVENT_LOOP,

I get:

ModuleNotFoundError: No module named 'uvloop'

Finally, installing 'uvloop' fails:

pip install uvloop
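
ASYNCIO_EVENT_LOOP is an optional Scrapy setting, and uvloop is, to my knowledge, not supported on Windows, which would explain the failed install. One way to avoid the ModuleNotFoundError is to request uvloop only when it can actually be imported. The sketch below assumes this is acceptable for the project; the helper name `optional_uvloop_setting` is hypothetical.

```python
import importlib.util


def optional_uvloop_setting():
    """Return the ASYNCIO_EVENT_LOOP setting only if uvloop is importable."""
    # Hypothetical helper: find_spec returns None when the package is missing.
    if importlib.util.find_spec("uvloop") is not None:
        return {"ASYNCIO_EVENT_LOOP": "uvloop.Loop"}
    return {}


# Settings fragment: the reactor is always requested, uvloop only if available.
settings = {
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    **optional_uvloop_setting(),
}
```

When uvloop is absent, the fragment leaves ASYNCIO_EVENT_LOOP unset and Scrapy falls back to the default asyncio event loop.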

Script

import scrapy
from scrapy_playwright.page import PageCoroutine


class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                # Wait until the product listing has been rendered.
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        # parses content
        pass

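Best answer

The developers of scrapy_playwright recommend setting DOWNLOAD_HANDLERS and TWISTED_REACTOR directly in your script.

A similar comment was provided here.

Here is a working script that implements this: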

import scrapy
from scrapy_playwright.page import PageCoroutine
from scrapy.crawler import CrawlerProcess


class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            callback=self.parse,
            meta={
                'playwright': True,
                'playwright_include_page': True,
                # Wait until the product listing has been rendered.
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        # parses content
        container = response.xpath("(//div[@class='col-md-6'])[1]")
        for items in container:
            yield {
                'products': items.xpath("(//h3[@class='card-title'])[1]//text()").get()
            }


if __name__ == "__main__":
    # Pass the reactor and download handlers straight to the crawler process.
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI": 'Products.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(ProductSpider)
    process.start()

We get the following output:

{'products': 'Oxford Loafers'}

Regarding "python - scrapy-playwright: Downloader/handlers: scrapy.exceptions.NotSupported: AsyncioSelectorReactor", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70275302/
