
python - Scrapy: following links with regular expressions


I want to scrape topics from a German forum: http://www.musiker-board.de/

The actual subforums live under http://www.musiker-board.de/forum

Subforums: musiker-board.de/forum/subforumname

The actual threads have addresses like musiker-board.de/threads/threadname

I want to follow all the links to the subforums and extract all of the threads in them, but the thread URLs no longer match the start URL.

However, if I choose "musiker-board.de/" as the start URL, it does not follow the links to all the subforums.

Here is the code:

allowed_domains = ["musiker-board.de"]
start_urls = ['http://www.musiker-board.de/forum/']

rules = (
    Rule(SgmlLinkExtractor(allow=[r'forum/\w+']), follow=True),
    Rule(SgmlLinkExtractor(allow=[r'threads/\w+']), callback='parse_item'),
)

def parse_item(self, response):
    # extract items...

What do I have to do so that it follows all musiker-board.de/forum/subforum links and extracts everything under musiker-board.de/threads/threadname?

Best answer

The following code (built from your snippet) seems to work fine. It replaces the deprecated SgmlLinkExtractor with scrapy.linkextractors.LinkExtractor; and because CrawlSpider applies its rules to every response it crawls (constrained only by allowed_domains), the threads/\w+ pattern still picks up thread URLs even though they do not sit under the /forum/ start URL:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class Scrapy1Spider(CrawlSpider):

    name = "musiker"
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/']

    rules = (
        Rule(LinkExtractor(allow=[r'forum/\w+']), follow=True),
        Rule(LinkExtractor(allow=[r'threads/\w+']), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('response.url=%s' % response.url)

At the very least it produces this output (truncated):

INFO: response.url=http://www.musiker-board.de/threads/peavey-ms-412-userthread.271458/
INFO: response.url=http://www.musiker-board.de/threads/peavey-5150-6505-etc-userthread.180295/
INFO: response.url=http://www.musiker-board.de/threads/marshall-ma-serie-user-thread.386428/
INFO: response.url=http://www.musiker-board.de/threads/h-k-metal-master-shredder-user-thread.250846/
INFO: response.url=http://www.musiker-board.de/threads/hughes-und-kettner-grandmeister-user-thread.553487/
INFO: response.url=http://www.musiker-board.de/threads/ibanez-userthread.190547/
INFO: response.url=http://www.musiker-board.de/threads/hughes-kettner-edition-blue-user-thread.209499/page-2
INFO: response.url=http://www.musiker-board.de/threads/fender-prosonic-userthread.239519/
INFO: response.url=http://www.musiker-board.de/threads/fender-prosonic-userthread.239519/page-5
INFO: response.url=http://www.musiker-board.de/threads/engl-steve-morse-signature-e656-user-thread.427802/page-2
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/page-20
INFO: response.url=http://www.musiker-board.de/threads/engl-steve-morse-signature-e656-user-thread.427802/
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/page-19
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/page-18
INFO: response.url=http://www.musiker-board.de/threads/engl-invader-user-thread.248090/page-5
INFO: response.url=http://www.musiker-board.de/threads/engl-sovereign-user-thread.136266/
INFO: response.url=http://www.musiker-board.de/threads/engl-invader-user-thread.248090/page-4
INFO: response.url=http://www.musiker-board.de/threads/engl-invader-user-thread.248090/page-3
INFO: response.url=http://www.musiker-board.de/threads/fender-cybertwin-userthread.305789/
INFO: response.url=http://www.musiker-board.de/threads/fenders-famose-farbwelten.454766/
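
If the goal is to extract thread data rather than only log URLs, parse_item can yield items directly. Below is a minimal sketch based on the spider above; the h1 title selector is an assumption about the thread page markup (it is not taken from the answer) and would need to be adapted to the real HTML:

# -*- coding: utf-8 -*-
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Scrapy1Spider(CrawlSpider):

    name = "musiker"
    allowed_domains = ["musiker-board.de"]
    start_urls = ['http://www.musiker-board.de/forum/']

    rules = (
        # Follow subforum listings without parsing them as threads.
        Rule(LinkExtractor(allow=[r'forum/\w+']), follow=True),
        # Hand every thread page to parse_item.
        Rule(LinkExtractor(allow=[r'threads/\w+']), callback='parse_item'),
    )

    def parse_item(self, response):
        # Placeholder selector -- adjust to the actual thread page structure.
        yield {
            'url': response.url,
            'title': response.css('h1::text').extract_first(),
        }

Run inside a Scrapy project with "scrapy crawl musiker -o threads.json" to write the yielded dictionaries to a JSON file (the output file name is arbitrary).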

Regarding python - Scrapy: following links with regular expressions, the original question can be found on Stack Overflow: https://stackoverflow.com/questions/32696774/
