python - Scrapy website crawler returns Invalid XPath error

I'm new to Scrapy and am following the basic documentation.

I have a website I'm trying to scrape some links from, and then navigate to some of those links. Specifically, I want to get Cokelore, College, and Computers, and I'm using the code below:

import scrapy

class DmozSpider(scrapy.Spider):
    name = "snopes"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "http://www.snopes.com/info/whatsnew.asp"
    ]

    def parse(self, response):
        print response.xpath('//div[@class="navHeader"]/ul/')
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

This is the error I get:

2015-10-03 23:17:29 [scrapy] INFO: Enabled item pipelines: 
2015-10-03 23:17:29 [scrapy] INFO: Spider opened
2015-10-03 23:17:29 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-10-03 23:17:29 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-10-03 23:17:30 [scrapy] DEBUG: Crawled (200) <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
2015-10-03 23:17:30 [scrapy] ERROR: Spider error processing <GET http://www.snopes.com/info/whatsnew.asp> (referer: None)
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/Users/Gaby/Documents/Code/School/689/tutorial/tutorial/spiders/dmoz_spider.py", line 11, in parse
print response.xpath('//div[@class="navHeader"]/ul/')
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/http/response/text.py", line 109, in xpath
return self.selector.xpath(query)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/selector/unified.py", line 100, in xpath
raise ValueError(msg if six.PY3 else msg.encode("unicode_escape"))
ValueError: Invalid XPath: //div[@class="navHeader"]/ul/
2015-10-03 23:17:30 [scrapy] INFO: Closing spider (finished)
2015-10-03 23:17:30 [scrapy] INFO: Dumping Scrapy stats:

I think the error has to do with the /ul in my xpath(), but I don't understand why. //div[@class="navHeader"] on its own works fine; as soon as I start adding to it, it starts breaking.

The part of the site I'm trying to scrape is structured as follows:

<DIV CLASS="navHeader">CATEGORIES:</DIV>
<UL>
<LI><A HREF="/autos/autos.asp">Autos</A></LI>
<LI><A HREF="/business/business.asp">Business</A></LI>
<LI><A HREF="/cokelore/cokelore.asp">Cokelore</A></LI>
<LI><A HREF="/college/college.asp">College</A></LI>
<LI><A HREF="/computer/computer.asp">Computers</A></LI>
</UL>
<DIV CLASS="navSpacer"> &nbsp; </DIV>
<UL>
<LI><A HREF="/crime/crime.asp">Crime</A></LI>
<LI><A HREF="/critters/critters.asp">Critter Country</A></LI>
<LI><A HREF="/disney/disney.asp">Disney</A></LI>
<LI><A HREF="/embarrass/embarrass.asp">Embarrassments</A></LI>
<LI><A HREF="/photos/photos.asp">Fauxtography</A></LI>
</UL>

Best Answer

You simply need to remove the trailing /. Replace:

//div[@class="navHeader"]/ul/

with:

//div[@class="navHeader"]/ul

Note, though, that this XPath would not actually match anything on the page: the ul elements are siblings of the navigation header div, not children of it. Use following-sibling:

In [1]: response.xpath('//div[@class="navHeader"]/following-sibling::ul//li/a/text()').extract()
Out[1]:
[u'Autos',
u'Business',
u'Cokelore',
u'College',
# ...
u'Weddings']
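
Building on that selector, here is a minimal sketch of how the spider's parse() could follow just the Cokelore, College, and Computers links mentioned in the question. The wanted set and the parse_category callback are hypothetical names introduced for illustration, and it assumes response.urljoin() and extract_first() are available (Scrapy 1.0+):

import scrapy

class SnopesSpider(scrapy.Spider):
    name = "snopes"
    allowed_domains = ["snopes.com"]
    start_urls = [
        "http://www.snopes.com/info/whatsnew.asp"
    ]

    def parse(self, response):
        # Anchors inside the <ul> lists that follow the "CATEGORIES:" header
        links = response.xpath(
            '//div[@class="navHeader"]/following-sibling::ul//li/a')
        wanted = {'Cokelore', 'College', 'Computers'}  # hypothetical filter
        for link in links:
            text = link.xpath('text()').extract_first()
            href = link.xpath('@href').extract_first()
            if text in wanted and href:
                # Resolve the relative HREFs against the current page URL
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_category)

    def parse_category(self, response):
        # Hypothetical callback: save each category page to disk,
        # mirroring the original parse() in the question
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Filtering on the link text rather than hard-coding the category URLs keeps the spider working even if only the first <ul> (or additional ones) contains the categories you care about.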

Regarding "python - Scrapy website crawler returns Invalid XPath error", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32930118/
