python - Not sure how to XPath to a specific website element

Reposted. Author: 行者123. Updated: 2023-12-01 04:26:10

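I'm currently trying to use Scrapy to crawl the Elite Dangerous subreddit and collect post titles, URLs, and vote counts. I have the first two working fine, but I'm not sure how to write an XPath expression to access the votes.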

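selector.xpath('//div[@class="score unvoted"]').extract() works, but it returns the vote counts for every post on the current page (rather than for each individual post). response.css('div.score.unvoted').extract() works for each individual post, but returns [u'<div class="score unvoted">1</div>'] rather than just 1. (I'd also love to know how to do this with XPath! :))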

The code is as follows:

class redditSpider(CrawlSpider):  # http://doc.scrapy.org/en/1.0/topics/spiders.html#scrapy.spiders.CrawlSpider
    name = "reddits"
    allowed_domains = ["reddit.com"]
    start_urls = [
        "https://www.reddit.com/r/elitedangerous",
    ]

    rules = [
        Rule(LinkExtractor(
            allow=['/r/EliteDangerous/\?count=\d*&after=\w*']),  # Looks for next page with RE
            callback='parse_item',  # What do I do with this? --- pass to self.parse_item
            follow=True),  # Tells spider to continue after callback
    ]

    def parse_item(self, response):
        selector_list = response.css('div.thing')  # Each individual little "box" with content

        for selector in selector_list:
            item = RedditItem()
            item['title'] = selector.xpath('div/p/a/text()').extract()
            item['url'] = selector.xpath('a/@href').extract()
            # item['votes'] = selector.xpath('//div[@class="score unvoted"]')
            item['votes'] = selector.css('div.score.unvoted').extract()
            yield item

Best answer

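You're on the right track. Your first approach needs just two things: a leading dot, so that the expression is evaluated relative to the current selector instead of the whole page, and text() at the end, so that it returns the element's text rather than its markup.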

Fixed version:

selector.xpath('.//div[@class="score unvoted"]/text()').extract()

Also, FYI, you can make your second option work by using the ::text pseudo-element:

response.css('div.score.unvoted::text').extract()

Regarding "python - Not sure how to XPath to a specific website element", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/33114155/
