
xpath - Scrapy can't find XPath content

Reposted · Author: 行者123 · Updated: 2023-12-03 16:58:01

I'm writing a web crawler with Scrapy to download the talkback text on a specific web page.

Here is the relevant part of the page source for one particular talkback:

<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
<div class="talkback-topic">
<a class="show-comment" data-ajax-url="/comments/71339.js?counter=97&num=57" href="/comments/71339?counter=97&num=57">57. talk back title here </a>
</div>
<div class="talkback-message"> blah blah blah talk-back message here </div>
....etc etc etc ......


When writing the XPath to get the message:

titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")


and later:

item["title"] = titles.xpath("div[@class='talkback-message']text()").extract()


There are no errors, but it doesn't work. Any ideas? I suppose I'm not writing the path correctly, but I can't find the mistake.

Thanks :)

The whole code:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        hxs = Selector(response)
        titles = hxs.xpath("//div[@class='site_comment site_comment-even large high-rank']")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.xpath("div[@class='talkback-message']text()").extract()
            items.append(item)
        return items
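As an editorial aside on the selector syntax: the expression in the question is missing a slash before `text()` (it should read `div[@class='talkback-message']/text()`). The attribute-predicate part of the idea can be sketched with the standard library's `xml.etree.ElementTree`, run against a simplified version of the snippet from the question (query strings removed to keep it valid XML):

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the snippet shown in the question.
snippet = """
<div id="site_comment_71339" class="site_comment site_comment-even large high-rank">
  <div class="talkback-topic">
    <a class="show-comment" href="/comments/71339">57. talk back title here</a>
  </div>
  <div class="talkback-message"> blah blah blah talk-back message here </div>
</div>
"""

root = ET.fromstring(snippet)
# Same predicate as the (corrected) relative XPath div[@class='talkback-message']
message = root.find("div[@class='talkback-message']")
print(message.text.strip())  # blah blah blah talk-back message here
```

Note that, as the answer below explains, fixing the slash alone is not enough on the live site, because the message div is filled in asynchronously.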

Best Answer

This is part of the HTML page for #site_comment_74240:

<div class="site_comment site_comment-even small normal-rank" id="site_comment_74240">
<div class="talkback-topic">
<a href="/comments/74240?counter=1&amp;num=144" class="show-comment" data-ajax-url="/comments/74240.js?counter=1&amp;num=144">144. מדיניות</a>
</div>

<div class="talkback-username">
<table><tr>
<td>קייזרמן פרדי&nbsp;</td>
<td>(01.11.2013)</td>
</tr></table>
</div>


First, the "talkback-message" div is not in the HTML page as first fetched; it is fetched asynchronously via an AJAX query when you click on a comment title, so you have to fetch it for each comment.

The comment blocks (your titles) in the snippet can be grabbed with an XPath like //div[starts-with(@id, "site_comment_")], i.e. all divs whose "id" attribute starts with the string "site_comment_".
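The stdlib's `xml.etree.ElementTree` XPath subset doesn't implement `starts-with()`, but the same "id begins with site_comment_" idea can be sketched in plain Python (the markup here is a stripped-down stand-in, not the real page):

```python
import xml.etree.ElementTree as ET

# Stand-in markup: two comment blocks and one unrelated div.
page = """
<body>
  <div id="site_comment_74240" class="site_comment site_comment-even small normal-rank"/>
  <div id="site_comment_74241" class="site_comment site_comment-odd small normal-rank"/>
  <div id="sidebar"/>
</body>
"""

root = ET.fromstring(page)
# Equivalent of //div[starts-with(@id, "site_comment_")]
comments = [div for div in root.iter("div")
            if div.get("id", "").startswith("site_comment_")]
print([div.get("id") for div in comments])
# ['site_comment_74240', 'site_comment_74241']
```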

You can also use CSS selectors with Selector.css(). In your case, you can grab the comment blocks either with the "id" approach (as I did above with XPath), so:

titles = sel.css("div[id^=site_comment_]")


or using the "site_comment" class alone, without the extra "site_comment-even", "site_comment-odd", "small", "normal-rank" or "high-rank" ones:

titles = sel.css("div.site_comment")


Then you would issue a new Request using the URL found at ./div[@class="talkback-topic"]/a[@class="show-comment"]/@data-ajax-url inside that comment div. Or, with CSS selectors, div.talkback-topic > a.show-comment::attr(data-ajax-url) (by the way, ::attr(...) is not standard, but a Scrapy extension to CSS selectors using pseudo-element functions).
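The data-ajax-url values are site-relative, so they need to be joined against the page URL before being turned into a Request (which is what urlparse.urljoin does in the Python 2 spider below). A small sketch with the Python 3 standard library, using the AJAX URL from the snippet above and a placeholder page URL:

```python
from urllib.parse import urljoin  # urlparse.urljoin in Python 2

page_url = "http://www.tbk.co.il/tag/foo/talkbacks"  # placeholder page URL
ajax_url = "/comments/74240.js?counter=1&num=144"    # from data-ajax-url above

full = urljoin(page_url, ajax_url)
print(full)  # http://www.tbk.co.il/comments/74240.js?counter=1&num=144
```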

What you get back from the AJAX call is some JavaScript code, and you want to grab what's inside the old.after(...) call:

var old = $("#site_comment_72765");
old.attr('id', old.attr('id') + '_small');
old.hide();
old.after("\n<div class=\"site_comment site_comment-odd large high-rank\" id=\"site_comment_72765\">\n <div class=\"talkback-topic\">\n <a href=\"/comments/72765?counter=42&amp;num=109\" class=\"show-comment\" data-ajax-url=\"/comments/72765.js?counter=42&amp;num=109\">109. ביבי - האדם הנכון בראש ממשלת ישראל(לת)<\/a>\n <\/div>\n \n <div class=\"talkback-message\">\n \n <\/div>\n \n <div class=\"talkback-username\">\n <table><tr>\n <td>ישראל&nbsp;<\/td>\n <td>(11.03.2012)<\/td>\n <\/tr><\/table>\n <\/div>\n <div class=\"rank-controllers\">\n <table><tr>\n \n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=up\"><img alt=\"\" src=\"/images/elements/thumbU.png?1376839523\" /><\/a><\/td>\n <td> | <\/td>\n <td class=\"rabk-link\"><a href=\"#\" data-thumb=\"/comments/72765/thumb?type=down\"><img alt=\"\" src=\"/images/elements/thumbD.png?1376839523\" /><\/a><\/td>\n \n <td> | <\/td>\n <td>11<\/td>\n \n <\/tr><\/table>\n <\/div>\n \n <div class=\"talkback-links\">\n <a href=\"/comments/new?add_to_root=true&amp;html_id=site_comment_72765&amp;sibling_id=72765\">תגובה חדשה<\/a>\n &nbsp;&nbsp;\n <a href=\"/comments/72765/comments/new?html_id=site_comment_72765\">הגיבו לתגובה<\/a>\n &nbsp;&nbsp;\n <a href=\"/i/offensive?comment_id=72765\" data-noajax=\"true\">דיווח תוכן פוגעני<\/a>\n <\/div>\n \n<\/div>");
var new_comment = $("#site_comment_72765");


This is the HTML data you need to parse again, using Selector(text=this_ajax_html_data), with the .//div[@class="talkback-message"]//text() XPath or the div.talkback-message ::text CSS selector.
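Since the argument of old.after(...) is a double-quoted JavaScript string whose escapes (\", \/, \n) all happen to be legal JSON, json.loads is a safer way to decode it than the eval() used in the spider below, and it removes the need for the "<\/tag>" cleanup pass. A sketch on a shortened, made-up line of the AJAX response:

```python
import json
import re

# Shortened stand-in for one line of the AJAX response.
line = r'old.after("\n<div class=\"talkback-message\">hello<\/div>\n");'

m = re.search(r'^old\.after\((?P<html>.+)\);$', line.strip())
# json.loads handles the \" , \/ and \n escapes in one go
html = json.loads(m.group("html"))
print(html)  # \n<div class="talkback-message">hello</div>\n (with real newlines)
```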

Here's a skeleton spider to get you going with these ideas:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from craigslist_sample.items import CraigslistSampleItem
import urlparse
import re


class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["tbk.co.il"]
    start_urls = ["http://www.tbk.co.il/tag/%D7%91%D7%A0%D7%99%D7%9E%D7%99%D7%9F_%D7%A0%D7%AA%D7%A0%D7%99%D7%94%D7%95/talkbacks"]

    def parse(self, response):
        sel = Selector(response)
        comments = sel.css("div.site_comment")
        for comment in comments:
            item = CraigslistSampleItem()
            # this probably has to be fixed
            #item["title"] = comment.xpath("div[@class='talkback-message']text()").extract()

            # issue an additional request to fetch the Javascript
            # data containing the comment text
            # and pass the incomplete item via meta dict
            for url in comment.css('div.talkback-topic > a.show-comment::attr(data-ajax-url)').extract():
                yield Request(url=urlparse.urljoin(response.url, url),
                              callback=self.parse_javascript_comment,
                              meta={"item": item})
                break

    # the line we are looking for begins with "old.after"
    # and we want everything inside the parentheses
    _re_comment_html = re.compile(r'^old\.after\((?P<html>.+)\);$')

    def parse_javascript_comment(self, response):
        item = response.meta["item"]
        # loop on Javascript content lines
        for line in response.body.split("\n"):
            matching = self._re_comment_html.search(line.strip())
            if matching:
                # what's inside the parentheses is a Javascript string
                # with escaped double-quotes;
                # a simple way to decode that into a Python string
                # is to use eval(),
                # then there are these "<\/tag>" we want to remove
                html = eval(matching.group("html")).replace(r"<\/", "</")

                # once we have the HTML snippet, decode it using Selector()
                decoded = Selector(text=html, type="html")

                # and save the message text in the item
                item["message"] = u''.join(decoded.css('div.talkback-message ::text').extract()).strip()
                # and return it
                return item


You can try it out with scrapy runspider tbkspider.py.

On the topic of "xpath - Scrapy can't find XPath content", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20224998/
