python - How do I define a Scrapy LinkExtractor rule to follow all links ending in .css?

Reposted. Author: 行者123. Updated: 2023-11-28 19:19:58

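I am trying to follow all CSS stylesheets of a website, for example https://www.thomann.de/de/index.html

I subclass Scrapy's CrawlSpider and use LxmlLinkExtractor. The rule tells the extractor to look in all "link" tags for URLs containing the string "css", like this: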

from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from items import ShopCrawlerItem


class CSSSpider(CrawlSpider):
    # define unique name of spider
    name = "cssspider"

    # define spider specific settings
    custom_settings = {
        'DEPTH_LIMIT': 0,
        'FEED_FORMAT': 'json',
        'FEED_URI': 'data/interim/items_css.json',
    }

    def __init__(self, start_urls, *args, **kwargs):
        # load list of start urls
        self.start_urls = ["https://www.thomann.de/de/index.html"]

        # define rules to find css stylesheets
        self.rules = (Rule(LxmlLinkExtractor(tags="link", allow="css"), callback="parse_item", follow=True),)

        # CrawlSpider compiles self.rules in its constructor, so call it after the rules are set
        super().__init__(*args, **kwargs)

    def parse_item(self, response):
        """
        Function to parse crawl responses.
        """
        # initialize items
        item = ShopCrawlerItem()

        # store data as items
        item["shopurl"] = response.request.url
        item["html"] = response.body.decode("utf-8")

        return item

However, I only get 2 elements in my JSON file:

[
{"shopurl": "https://fonts.googleapis.com/css?family=Open+Sans:300,400,700,400i&subset=latin-ext,latin", "html": "xyz"},
{"shopurl": "https://fonts.googleapis.com/css?family=Lora", "html": "xyz"}
]

The elements that were found look like this in the HTML source:

<link href="https://fonts.googleapis.com/css?family=Open+Sans:300,400,700,400i&amp;subset=latin-ext,latin" rel="stylesheet" type="text/css">

This is despite the Chrome debugger showing a number of links that end in ".css", for example:

<link rel="stylesheet" href="/static/nc/css/oo__rev43.css" type="text/css" media="all">
<link rel="stylesheet" href="/static/tr/css/nc-fix__rev928.css" type="text/css" media="all">

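Does anyone have a clue why I am not finding all the CSS stylesheets here?

Best answer

You need to update the tags and attrs constructor parameters of your link extractor accordingly.

Their default values do not work for your use case: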

tags=('a', 'area'), attrs=('href',)

Regarding "python - How do I define a Scrapy LinkExtractor rule to follow all links ending in .css?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57528949/
