
python - Ignoring already visited URLs in Scrapy


Here is my custom_filters.py file:

from scrapy.dupefilter import RFPDupeFilter  # note: module renamed to scrapy.dupefilters in Scrapy 1.0+

class SeenURLFilter(RFPDupeFilter):
    """Request filter that remembers every exact URL it has seen."""

    def __init__(self, path=None):
        self.urls_seen = set()
        RFPDupeFilter.__init__(self, path)

    def request_seen(self, request):
        # Returning a truthy value tells the scheduler to drop the request.
        if request.url in self.urls_seen:
            return True
        else:
            self.urls_seen.add(request.url)

I added the line:

   DUPEFILTER_CLASS = 'crawl_website.custom_filters.SeenURLFilter'

to settings.py.

When I check the generated CSV file, the same URL still appears multiple times. Is something wrong here?

Best Answer

From: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

Then add to your settings.py:

ITEM_PIPELINES = {
    'your_bot_name.pipelines.DuplicatesPipeline': 100
}
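The number 100 is the pipeline's order value: Scrapy runs all enabled item pipelines in ascending order of these integers (conventionally 0-1000), so a low value makes the duplicate check run early, and any item it drops never reaches the feed exporter that writes the CSV.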

Edit:

To check for duplicate URLs instead, use:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['url'])
            return item
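Note where each approach deduplicates: a DUPEFILTER_CLASS such as SeenURLFilter works at the scheduler level and prevents duplicate requests from being downloaded at all, while an item pipeline runs after parsing and drops duplicate items. If a single page yields several items sharing a URL, or requests are made with dont_filter=True, only the pipeline catches the resulting duplicates in the CSV.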

This requires a url = Field() in your items. Something like this (items.py):

from scrapy.item import Item, Field

class PageItem(Item):
    url = Field()
    scraped_field_a = Field()
    scraped_field_b = Field()
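For completeness, the pipeline above only works if the spider actually fills in the url field. A minimal sketch of what that could look like, assuming the project module name crawl_website from the question's settings; the spider name, start URL, and CSS selectors are hypothetical:

import scrapy

from crawl_website.items import PageItem  # module name taken from the question's settings

class PageSpider(scrapy.Spider):
    name = 'pages'                        # hypothetical spider name
    start_urls = ['http://example.com/']  # hypothetical start URL

    def parse(self, response):
        item = PageItem()
        item['url'] = response.url        # the field DuplicatesPipeline deduplicates on
        item['scraped_field_a'] = response.css('title::text').extract_first()
        item['scraped_field_b'] = response.css('h1::text').extract_first()
        yield item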

Regarding "python - Ignoring already visited URLs in Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20988113/
