python - Scrapy not processing all pages while crawling


I built a crawler with Scrapy and wrote a script to scrape a large number of pages.

Unfortunately, it does not scrape every page. Some runs return all of the pages, while others return only 23, or maybe 180 (the result differs for each URL).

import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        # each product grid contains one <li> per product
        for grid in response.css("ul[class='products row-grid']"):
            for product in grid.css('li'):
                yield {
                    'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                    'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                    'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                    'kota': product.css('div[class=user-city] a::text').extract(),
                    'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
                }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

Is the site blocking my HTTP requests, or is there a bug in my code?
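One way to tell the two apart is to slow the crawl down and let non-200 responses reach the callback so they show up in the log. A minimal sketch, assuming Scrapy's defaults otherwise (the spider name and setting values here are illustrative; the setting names are standard Scrapy):

import scrapy


class ThrottleCheckSpider(scrapy.Spider):
    name = "throttle-check"  # hypothetical name, for illustration only
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]
    custom_settings = {
        # be gentler with the server to avoid rate limiting
        'DOWNLOAD_DELAY': 1.0,
        'AUTOTHROTTLE_ENABLED': True,
        # pass 4xx/5xx responses to parse() instead of dropping them silently
        'HTTPERROR_ALLOW_ALL': True,
    }

    def parse(self, response):
        # a 429/503 here means the site is throttling you, not a selector bug
        self.logger.info("%s -> HTTP %s", response.url, response.status)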

Updated the code after Granitosaurus's edit.

Still failing: it returns a blank array.

import scrapy


class BotCrawl(scrapy.Spider):
    name = "crawl-bl2"
    start_urls = [
        'http://www.bukalapak.com/c/perawatan-kecantikan/perawatan-wajah?page=1&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93',
    ]

    def parse(self, response):
        products = response.css('article.product-display')
        for product in products:
            yield {
                'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
                'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
                'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
                'kota': product.css('div[class=user-city] a::text').extract(),
                'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
            }

        # next page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        last_url = "/c/perawatan-kecantikan/perawatan-wajah?page=100&search%5Bsort_by%5D=last_relist_at%3Adesc&utf8=%E2%9C%93"
        # note: `is not` tests object identity, not string equality -- see the note after this block
        if next_page_url is not last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)
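A side note on the stop condition: `is not` compares object identity, not value, so for two distinct string objects it is effectively always True and the crawl never stops at page 100. A value comparison, or simply stopping when there is no next link, is the usual fix. A sketch of the last few lines of parse() under the same assumptions as above:

        # stop when there is no next link, or when it points at the known last page
        next_page_url = response.css("div.pagination > a[class=next_page]::attr(href)").extract_first()
        if next_page_url is not None and next_page_url != last_url:
            yield scrapy.Request(response.urljoin(next_page_url), dont_filter=True)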

Thanks.

Best Answer

Your product XPath is a bit unreliable. Try selecting the product articles directly instead; the site makes this easy to do with CSS selectors:

products = response.css('article.product-display')
for product in products:
    yield {
        'judul': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::text').extract(),
        'penjual': product.css('h5[class=user__name] a::attr(href)').extract(),
        'link': product.css('a[class="product__name line-clamp--2 js-tracker-product-link"]::attr(href)').extract(),
        'kota': product.css('div[class=user-city] a::text').extract(),
        'harga': product.css('div[class=product-price]::attr(data-reduced-price)').extract()
    }
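Note that `.extract()` returns a list of matches for each field; if you want one scalar value per item, `extract_first()` returns the first match instead (or None when nothing matches).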

You can debug the response by inserting inspect_response:

def parse(self, response):
    products = response.css('article.product-display')
    if not products:
        from scrapy.shell import inspect_response
        inspect_response(response, self)
        # will open up a python shell here where you can check the `response` object
        # try `view(response)` to open it up in your browser and such.
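Once the shell opens, a few quick checks narrow the problem down (a sketch of a typical session, reusing the selector from above):

response.status                                 # anything but 200 points at blocking or throttling
len(response.css('article.product-display'))    # 0 means the received HTML differs from what the browser shows
view(response)                                  # open what Scrapy actually received in your browser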

Regarding python - Scrapy not processing all pages while crawling, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43270069/
