
python - How can I scrape the links from all of my web pages?


So far I have this code, which uses Scrapy to extract the text from a page URL:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "dialpad"

    def start_requests(self):
        urls = [
            'https://help.dialpad.com/hc/en-us/categories/201278063-User-Support',
            'https://www.domo.com/',
            'https://www.zenreach.com/',
            'https://www.trendkite.com/',
            'https://peloton.com/',
            'https://ting.com/',
            'https://www.cedar.com/',
            'https://tophat.com/',
            'https://www.bambora.com/en/ca/',
            'https://www.hoteltonight.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # name the output file after the page's domain
        page = response.url.split("/")[2]
        filename = 'quotes-thing-{}.csv'.format(page)

        # write every text node in the body to the file
        with open(filename, 'w') as f:
            for selector in response.css('body').xpath('.//text()'):
                f.write(selector.extract())

How can I extract data from the links on those pages and write it to the file I'm creating?

Best Answer

You can use a CrawlSpider to extract every link and crawl the linked pages; your code could look like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QuotesSpider(CrawlSpider):
    name = "dialpad"

    start_urls = [
        'https://help.dialpad.com/hc/en-us/categories/201278063-User-Support',
        'https://www.domo.com/',
        'https://www.zenreach.com/',
        'https://www.trendkite.com/',
        'https://peloton.com/',
        'https://ting.com/',
        'https://www.cedar.com/',
        'https://tophat.com/',
        'https://www.bambora.com/en/ca/',
        'https://www.hoteltonight.com/'
    ]

    rules = [
        Rule(
            # follow every link that matches allow and does not match deny
            LinkExtractor(
                allow=(r'url patterns here to follow'),
                deny=(r'other url patterns to deny'),
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        # one output file per domain, holding the page's text nodes
        page = response.url.split("/")[2]
        filename = 'quotes-thing-{}.csv'.format(page)

        with open(filename, 'w') as f:
            for selector in response.css('body').xpath('.//text()'):
                f.write(selector.extract())

That said, I'd recommend creating a separate spider for each website and using the allow and deny parameters to choose which links to extract on each one.
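
For example, here is a minimal sketch of such a per-site spider, using the help.dialpad.com site from the question. The DialpadHelpSpider name and the allow/deny patterns are illustrative placeholders, not patterns verified against the real site:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DialpadHelpSpider(CrawlSpider):
    # hypothetical per-site spider for one of the sites in the question
    name = "dialpad_help"
    allowed_domains = ["help.dialpad.com"]
    start_urls = ['https://help.dialpad.com/hc/en-us/categories/201278063-User-Support']

    rules = [
        Rule(
            LinkExtractor(
                allow=(r'/hc/en-us/articles/',),  # placeholder: follow article pages only
                deny=(r'/hc/en-us/signin',),      # placeholder: skip sign-in links
            ),
            callback='parse_item',
            follow=True,
        ),
    ]

    def parse_item(self, response):
        page = response.url.split("/")[2]
        filename = 'quotes-thing-{}.csv'.format(page)
        # append, so pages from the same domain don't overwrite each other
        with open(filename, 'a') as f:
            for text in response.css('body').xpath('.//text()').extract():
                f.write(text)

Because a CrawlSpider visits many pages per domain, opening the file in append mode avoids each crawled page clobbering the previous one.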

Also, it would be better to use Scrapy Items.
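
A minimal sketch of what that could look like; the PageTextItem class and its fields are illustrative names, not part of the question:

import scrapy


class PageTextItem(scrapy.Item):
    # hypothetical item holding a page's URL and its extracted text
    url = scrapy.Field()
    text = scrapy.Field()

Then in parse_item, yield an item instead of writing a file by hand:

    def parse_item(self, response):
        yield PageTextItem(
            url=response.url,
            text=' '.join(response.css('body').xpath('.//text()').extract()),
        )

Yielded items can then be written out by Scrapy's feed exports (for example, scrapy crawl dialpad -o pages.csv) or handled in an item pipeline, rather than opening files inside the spider.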

Regarding "python - How can I scrape the links from all of my web pages?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49802769/
