python - How to skip processing an Item() in Python Scrapy if the spider has already seen it

Reposted by: 行者123  Updated: 2023-12-01 02:06:47


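I am trying to drop duplicate business_names while the spider is crawling. However, I still see duplicate business_names in the output.

I tried `if x != item['business_name']` to decide whether to keep parsing.

What I want is: if a business_name has not been seen yet, parse it; if it has already been seen, skip it or drop it from the results.

Instead, the code below behaves as if my if statement were not there. This is what I have so far.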

class Item(scrapy.Item):
    business_name = scrapy.Field()
    website = scrapy.Field()
    phone_number = scrapy.Field()

class QuotesSpider(scrapy.Spider):

    def parse(self, response):
        for business in response.css('div.info'):
            item = Item()
            item['business_name'] = business.css('span[itemprop="name"]::text').extract()
            for x in item['business_name']:
                if (x != item['business_name']):
                    if item['business_name']:
                        item['website']  = business.css('div.links  a::attr(href)').extract_first()
                        if item['website']:
                            item['phone_number'] = business.css('div.phones.phone.primary::text').extract()
                            yield item

Best answer

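The behavior comes from what is being compared. You set item['business_name'] to the result of .extract(), which is always a list (even when only one element matches the CSS selector).

The code then iterates over item['business_name'] and checks whether each element of the list is != item['business_name'], that is, whether a single string differs from the whole list.

A string is never equal to a list, so this is always True and every business passes the check.

It is equivalent to doing the following: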

numbers = [1, 2 , 3, 4]
for x in numbers:
    if x != numbers:
        print(x)

#output
1
2
3
4
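The same thing happens with strings, which is the situation in the spider. Here is a small sketch using a made-up business name in place of real scraped data; it shows that a str compared with a list is always unequal, and that a membership test is the check that was probably intended.

```python
# Hypothetical value standing in for what .extract() returns: always a list.
names = ["Acme Plumbing"]

for x in names:
    # x is a str and names is a list, so != is True on every iteration.
    print(x != names)

# A membership test asks the intended question: is this name in the list?
print("Acme Plumbing" in names)
```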

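Instead, initialize a list outside the for loop and check whether each value is already in that list. For example, something along these lines: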

def parse(self, response):
    # Create the list once, before the loop, so it persists across every business on this page.
    seen_business_names = []
    for business in response.css('div.info'):
        item = Item()
        item['business_name'] = business.css('span[itemprop="name"]::text').extract()
        for x in item['business_name']:
            if x not in seen_business_names:
                if item['business_name']:  # not sure why this is here unless you might be extracting empty strings
                    item['website'] = business.css('div.links  a::attr(href)').extract_first()
                    if item['website']:
                        item['phone_number'] = business.css('div.phones.phone.primary::text').extract()
                        # Remember the name so later duplicates are skipped.
                        seen_business_names.append(x)
                        yield item

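I don't have access to your HTML, so I can't guarantee that the code above works as-is, but given the code in your original post, the behavior you are seeing is expected.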
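The "seen" check can be tried without Scrapy or any HTML. The sketch below is an illustration under assumptions, not the answer's code: `unique_by_name` and the sample records are made up, and it swaps the list for a set, since membership tests on a set do not slow down as more names are collected.

```python
def unique_by_name(records):
    """Yield each record whose 'business_name' has not been seen yet."""
    seen = set()  # created once, before the loop
    for record in records:
        name = record["business_name"]
        if name in seen:
            continue  # duplicate: skip it
        seen.add(name)
        yield record


# Made-up sample data for illustration only.
records = [
    {"business_name": "Acme Plumbing", "website": "http://acme.example"},
    {"business_name": "Bolt Electric", "website": "http://bolt.example"},
    {"business_name": "Acme Plumbing", "website": "http://acme.example/2"},
]
result = list(unique_by_name(records))
print([r["business_name"] for r in result])
```

The first record for each name wins; later records with the same name are skipped even if their other fields differ.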

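Side note: the list in the solution above only persists for the duration of a single call to parse, in other words, for each start_url passed to parse. If you want each business_name to be extracted only once over the lifetime of the spider, across every page passed to parse, you can keep the list on the spider itself and check it the same way as the local one. Consider: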

import scrapy

class Item(scrapy.Item):
    business_name = scrapy.Field()
    website = scrapy.Field()
    phone_number = scrapy.Field()

class QuotesSpider(scrapy.Spider):
    # new code here; forward the arguments so the base Spider still initializes itself
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.seen_business_names = []

    def parse(self, response):
        for business in response.css('div.info'):
            item = Item()
            item['business_name'] = business.css('span[itemprop="name"]::text').extract()
            for x in item['business_name']:
                # new code here: check self.seen_business_names
                if x not in self.seen_business_names:
                    if item['business_name']:
                        item['website'] = business.css('div.links  a::attr(href)').extract_first()
                        if item['website']:
                            item['phone_number'] = business.css('div.phones.phone.primary::text').extract()
                            # new code here: record the name on the spider
                            self.seen_business_names.append(x)
                            yield item
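To see the difference between the two approaches without running a crawl, here is a plain-Python sketch. `FakeSpider`, its two methods, and the page contents are all hypothetical stand-ins rather than Scrapy classes or real data; one method keeps its "seen" collection local to the call, the other keeps it on the instance.

```python
class FakeSpider:
    """Stand-in for a spider; not a Scrapy class."""

    def __init__(self):
        self.seen = set()  # lives as long as the instance

    def parse_local(self, names):
        seen = set()  # recreated on every call
        for name in names:
            if name not in seen:
                seen.add(name)
                yield name

    def parse_shared(self, names):
        for name in names:
            if name not in self.seen:
                self.seen.add(name)
                yield name


spider = FakeSpider()
page1 = ["Acme Plumbing", "Bolt Electric"]  # made-up names from a first page
page2 = ["Bolt Electric", "Cora Bakery"]    # made-up names from a second page

local = list(spider.parse_local(page1)) + list(spider.parse_local(page2))
shared = list(spider.parse_shared(page1)) + list(spider.parse_shared(page2))
print(local)
print(shared)
```

With the local collection, a name that appears on both pages is emitted twice; with the instance-level collection it is emitted once.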

Cheers!

Regarding "python - How to skip processing an Item() in Python Scrapy if the spider has already seen it", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48978872/
