
python - Rejecting certain links in a Scrapy LinkExtractor

Reposted. Author: 行者123. Updated: 2023-11-28 17:58:10

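import re

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# One banned ASIN per line; splitlines() avoids a trailing empty entry
with open('/home/timmy/myamazon/bannedasins.txt') as f:
    banned_asins = f.read().splitlines()

class AmazonSpider(CrawlSpider):

    name = 'amazon'
    allowed_domains = ['amazon.com',]

    rules = (
        # Follow the "next page" link of the search results
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        # Rewrite each product link to its canonical /dp/{ASIN} form
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=lambda i: f"https://www.amazon.com/dp/{re.search('dp/(.*)/', i).groups()[0]}"),
             callback="parse_item"),
    )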

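I have the two rules above, which correctly extract Amazon product links. Now I want to exclude some ASINs from the search. re.search('dp/(.*)/', i).groups()[0] retrieves the ASIN, which is then put into the format https://www.amazon.com/dp/{ASIN}. What I want is this: if the ASIN is in banned_asins, do not extract the link.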

After reading the Scrapy Link Extractors doc, I believe this is done with deny_extensions, but I am not sure how to use it.

banned_asins= ['B07RTX74L7','B07D9JCH5X',......]

Best answer

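deny_extensions won't work. It refers to common file extensions that are not followed when they occur in links; see here for the default values used if it is not given.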
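To make the mismatch concrete, here is a rough plain-Python imitation of an extension check. It is not Scrapy's actual code, and the deny list and the manual.pdf URL are made up for illustration. The point is only that a check of this kind looks at the file extension of the URL path, so a product URL ending in an ASIN never matches it.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical deny list, for illustration only (not Scrapy's defaults)
DENIED_EXTENSIONS = {".jpg", ".pdf", ".zip"}

def has_denied_extension(url):
    """Return True if the URL path ends in one of the denied extensions."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in DENIED_EXTENSIONS

print(has_denied_extension("https://www.amazon.com/manual.pdf"))     # True
print(has_denied_extension("https://www.amazon.com/dp/B07RTX74L7"))  # False
```

A banned ASIN such as B07RTX74L7 has no file extension at all, which is why the filtering has to happen somewhere else.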

You can simply filter out the banned ASINs in your process_value function. If it returns None, the given link is ignored:

process_value (callable)

a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

So it should be:

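def process_value(i):
    # Capture only the ASIN segment after "dp/", stopping at the next "/" or "?"
    match = re.search(r'dp/([^/?]+)', i)
    if match is None or match.group(1) in banned_asins:
        return None  # None tells the LinkExtractor to ignore this link
    return f"https://www.amazon.com/dp/{match.group(1)}"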

....

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//li[@class="a-last"]/a')),
        Rule(LinkExtractor(restrict_xpaths='//h2/a[@class="a-link-normal a-text-normal"]',
                           process_value=process_value),
             callback="parse_item"),
    )

Regarding "python - Rejecting certain links in a Scrapy LinkExtractor", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57137698/
