python - How to scrape the information I need by following links

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:46:39

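I need to collect the text and score of every review from product pages. So far I have managed this:

If I manually add a link to a page that holds the reviews of a single product, I can get all reviews and scores from that page, including its further review pages.

To speed this up, I want to start from the category page, go to each product page, collect all of its reviews and scores, and then move on to the next product.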

import scrapy


class ReviewAutoSpider(scrapy.Spider):
    name = 'automatic'

    start_urls = ['https://www.ceneo.pl/Gry_bez_pradu']

    def parse(self, response):
        # follow links to website with review
        for href in response.css('a.product-rewiews-link + a::attr(href)'):
            yield response.follow(href, self.parse_link)

        # follow pagination links
        #for href in response.css('li.arrow-next a::attr(href)'):
        #    yield response.follow(href, self.parse)

    def parse_link(self, response):
        # get all reviews+score on page
        for review in response.css('li.review-box'):
            yield {
                'score': review.css('span.review-score-count::text').get(),
                'text': review.css('p.product-review-body::text').getall(),
            }
        # follow pagination links
        for href in response.css('li.arrow-next a::attr(href)'):
            yield response.follow(href, callback=self.parse)

Best answer

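OK, the solution below should work. The links you extract contain only the second part of the URL, for example "/19838632", so you need to call response.urljoin('/19838632') to get the full link. Also, the way the spider is currently set up, it will send a large number of requests to the site at the same time, so I strongly recommend using a proxy service.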
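Alongside the proxy suggestion, the request rate itself can be lowered with Scrapy's built-in throttling settings. The following is only a sketch: the setting names are Scrapy's, but the dictionary name `POLITE_SETTINGS` and every value in it are illustrative assumptions, not something the answer specifies.

```python
# Hypothetical settings dictionary; the numbers are illustrative guesses,
# not values recommended by the original answer.
POLITE_SETTINGS = {
    # Limit how many requests hit the same domain in parallel.
    'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
    # Wait (in seconds) between consecutive requests to the same site.
    'DOWNLOAD_DELAY': 1.0,
    # Let Scrapy adapt the delay to the server's response times.
    'AUTOTHROTTLE_ENABLED': True,
    'AUTOTHROTTLE_START_DELAY': 1.0,
    'AUTOTHROTTLE_MAX_DELAY': 10.0,
}
```

One way to apply it would be to set `custom_settings = POLITE_SETTINGS` as a class attribute on the spider, so the limits affect only this spider rather than the whole project.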

```python
import scrapy


class ReviewAutoSpider(scrapy.Spider):
    name = 'automatic'

    start_urls = ['https://www.ceneo.pl/Gry_bez_pradu']

    def parse(self, response):
        # Follow links to the pages with reviews. The hrefs are relative
        # (e.g. '/19838632'), so join them with the page URL first.
        for href in response.css('a.product-rewiews-link + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_link)

        # Follow every product link listed on the category page.
        for href in response.css('.cat-prod-row-name a::attr(href)').extract():
            link = response.urljoin(href)
            yield scrapy.Request(link, callback=self.parse_link)

        # Follow the category pagination link, if the page has one.
        next_page_link = response.css('li[class="page-arrow arrow-next"] a::attr(href)').extract_first()
        if next_page_link:
            next_page_link = response.urljoin(next_page_link)
            yield scrapy.Request(next_page_link, callback=self.parse)

    def parse_link(self, response):
        # Get all reviews and scores on the page.
        for review in response.css('li.review-box'):
            yield {
                'score': review.css('span.review-score-count::text').get(),
                'text': review.css('p.product-review-body::text').getall(),
            }
        # Follow review pagination links and keep parsing reviews.
        for href in response.css('li.arrow-next a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_link)
```
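To see what the `response.urljoin` call does with a relative href, here is a minimal sketch using the standard library's `urllib.parse.urljoin`, which resolves a relative link against a base URL in the same way. The variable names are hypothetical, and the base URL is assumed to be the category page from `start_urls`; inside a spider, Scrapy supplies the base from the response itself.

```python
from urllib.parse import urljoin

# Assumed base: the category page the spider starts from.
page_url = 'https://www.ceneo.pl/Gry_bez_pradu'
# Relative href of the kind the answer mentions.
relative_href = '/19838632'

# A leading slash replaces the whole path of the base URL.
full_link = urljoin(page_url, relative_href)
print(full_link)  # https://www.ceneo.pl/19838632
```

Passing the bare '/19838632' string to `scrapy.Request` would fail because it has no scheme, which is why the href is joined with the page URL before the request is built.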

For "python - How to scrape the information I need by following links", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56546224/
