
python - Scraping ASINs from an Amazon search page

Reposted. Author: 太空狗. Updated: 2023-10-30 01:30:55

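I am trying to scrape ASIN numbers from Amazon. Note that this is not about product detail pages (as in: https://www.youtube.com/watch?v=qRVRIh3GZgI), but about the results you get when you search for a keyword ("trimmer" in this example; try https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The search returns many products, and I am able to scrape all of the titles.

What is not visible on the page is the ASIN (a unique Amazon identifier). When inspecting the HTML, I see a link (href) whose text contains the ASIN. In the example below, ASIN = B01MSHQ5IQ: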

<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&amp;qid=1554462204&amp;s=gateway&amp;sr=8-3">

So, to sum up my question: how can I retrieve all product titles and ASINs on the page? For example:

Number  Title                 ASIN
1       Braun, Beardtrimmer   B07JH1LLYR
2       TNT Pro Series Waist  B00R84J2PK
...     ...                   ...

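So far I am using Scrapy (but I am open to other Python solutions), and I am able to scrape the titles.

My code so far:

First, run this on the command line:

scrapy startproject tutorial

Then adjust the spider file (see Example 1) and items.py (see Example 2).

Example 1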

import scrapy

# Assumes the project was created with "scrapy startproject tutorial"
from tutorial.items import AmazonItem


class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"
    ]
    ## scrapy crawl AmazonDeals -o Asin_Titles.json

    def parse(self, response):
        items = AmazonItem()

        # All text nodes under elements with class "a-text-normal", as one list
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items

At @glhr's request, here is the items.py code:

Example 2

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Products = scrapy.Field()

Best answer

You can get the product links by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:

Link = response.css('.a-text-normal').css('a::attr(href)').extract()
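Since the question is open to other Python solutions, the same href extraction can be sketched with only the standard library. This is a minimal illustration, not a replacement for the Scrapy selector: the class name ProductLinkParser is hypothetical, the sample anchor is the one from the question with its query string shortened, and the link text is a placeholder.

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'a-text-normal'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "a-text-normal" in classes and href:
            self.hrefs.append(href)


# Sample anchor from the question (query string shortened, link text is a placeholder).
SAMPLE_HTML = (
    '<a class="a-link-normal a-text-normal" '
    'href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer">'
    "Philips Norelco</a>"
)

parser = ProductLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.hrefs)
```

Either way, the result is a list of relative URLs that still need to be reduced to the ASIN itself.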

You can then use a regular expression to extract the ASIN from each link:

(?<=dp/)[A-Z0-9]{10}

The regular expression above matches 10 characters (uppercase letters or digits) that are immediately preceded by dp/. See a demo here: https://regex101.com/r/mLMv3k/1
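To try the pattern outside of a spider, it can be applied to the sample href from the question. The helper name extract_asin is an assumption made for this sketch; it returns None for links that contain no dp/ segment rather than raising an error.

```python
import re

# Same pattern as above: 10 uppercase letters or digits right after "dp/".
ASIN_PATTERN = re.compile(r"(?<=dp/)[A-Z0-9]{10}")


def extract_asin(href):
    """Return the ASIN found after 'dp/' in href, or None if there is none."""
    match = ASIN_PATTERN.search(href)
    return match.group(0) if match else None


href = "/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer"
print(extract_asin(href))            # B01MSHQ5IQ
print(extract_asin("/s?k=trimmer"))  # None
```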

Here is a working implementation of the parse() method:

import re  # at the top of the spider module

def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()

    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        # extract ASIN from link; skip links that have no dp/<ASIN> segment
        match = re.search(r"(?<=dp/)[A-Z0-9]{10}", result[0])
        if match is None:
            continue
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        item['ASIN_Product'] = match.group(0)
        yield item

This requires extending AmazonItem with new fields:

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()

Example output:

{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
'body, face, nose, and ear hair trimmer, shaver, and clipper'}
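To get from items like these to the numbered table asked for in the question, something along the following lines could work. It is a sketch using plain dictionaries in place of AmazonItem objects; the function name format_table is hypothetical, and the two sample rows are the titles and ASINs listed in the question.

```python
def format_table(items):
    """Render items as an aligned text table with Number, Title and ASIN columns."""
    rows = [("Number", "Title", "ASIN")]
    for number, item in enumerate(items, start=1):
        rows.append((str(number), item["title_Product"], item["ASIN_Product"]))

    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    lines = []
    for row in rows:
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


# Sample rows taken from the table in the question.
sample_items = [
    {"title_Product": "Braun, Beardtrimmer", "ASIN_Product": "B07JH1LLYR"},
    {"title_Product": "TNT Pro Series Waist", "ASIN_Product": "B00R84J2PK"},
]

table = format_table(sample_items)
print(table)
```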

Demo: https://repl.it/@glhr/55534679-AmazonSpider

To write the output to a JSON file, just specify the feed export settings in the spider:

class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]
    start_urls = ["https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2"]
    custom_settings = {
        'FEED_URI': 'Asin_Titles.json',
        'FEED_FORMAT': 'json'
    }
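Note that Scrapy 2.1 and later deprecate FEED_URI and FEED_FORMAT in favor of a single FEEDS setting. Assuming such a version is installed, the equivalent custom_settings value would look roughly like this (same output file name as above):

```python
# Maps each output path to its export options; "format" selects the exporter.
custom_settings = {
    "FEEDS": {
        "Asin_Titles.json": {"format": "json"},
    },
}
```

The dictionary goes in the spider class body in place of the FEED_URI and FEED_FORMAT entries.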

Regarding "python - Scraping ASINs from an Amazon search page", the original question can be found on Stack Overflow: https://stackoverflow.com/questions/55534679/
