
python - How to crawl an XML URL using Scrapy


Hi, I am using Scrapy to crawl an XML URL.

Suppose the following is my spider.py code:

from scrapy.spider import BaseSpider


class TestSpider(BaseSpider):
    name = "test"
    allowed_domains = ["www.example.com"]

    start_urls = [
        "https://example.com/jobxml.asp"
    ]

    def parse(self, response):
        print response, "??????????????????????"

Result:

2012-07-24 16:43:34+0530 [scrapy] INFO: Scrapy 0.14.3 started (bot: testproject)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Enabled item pipelines:
2012-07-24 16:43:34+0530 [test] INFO: Spider opened
2012-07-24 16:43:34+0530 [test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2012-07-24 16:43:34+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2012-07-24 16:43:36+0530 [testproject] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 1 times): 400 Bad Request
2012-07-24 16:43:37+0530 [test] DEBUG: Retrying <GET https://example.com/jobxml.asp> (failed 2 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Gave up retrying <GET https://example.com/jobxml.asp> (failed 3 times): 400 Bad Request
2012-07-24 16:43:38+0530 [test] DEBUG: Crawled (400) <GET https://example.com/jobxml.asp> (referer: None)
2012-07-24 16:43:38+0530 [test] INFO: Closing spider (finished)
2012-07-24 16:43:38+0530 [test] INFO: Dumping spider stats:
{'downloader/request_bytes': 651,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 504,
'downloader/response_count': 3,
'downloader/response_status_count/400': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2012, 7, 24, 11, 13, 38, 573931),
'scheduler/memory_enqueued': 3,
'start_time': datetime.datetime(2012, 7, 24, 11, 13, 34, 803202)}
2012-07-24 16:43:38+0530 [test] INFO: Spider closed (finished)
2012-07-24 16:43:38+0530 [scrapy] INFO: Dumping global stats:
{'memusage/max': 263143424, 'memusage/startup': 263143424}

Does Scrapy not work for XML scraping? If it does, can anyone give me an example of how to scrape XML tag data?

Thanks in advance.

Best Answer

There is a spider designed specifically for scraping XML feeds. This is from the Scrapy documentation:

XMLFeedSpider example

These spiders are pretty easy to use; let's look at an example:

from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = TestItem()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
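
The example imports TestItem from myproject.items; as a minimal sketch (the field names are assumptions that simply match the fields populated in parse_node above), that item class could be defined like this:

from scrapy.item import Item, Field


class TestItem(Item):
    # Hypothetical fields matching those filled in by parse_node above
    id = Field()
    name = Field()
    description = Field()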

Here is another way of doing it without Scrapy:

This is a function for downloading the XML from a given URL (it relies on urllib2 and sys, imported below), and it also prints a nice progress indicator while the XML file downloads.

import sys
import urllib2


def get_file(self, dir, url, name):
    s = urllib2.urlopen(url)
    f = open('xml/test.xml', 'w')
    meta = s.info()
    file_size = int(meta.getheaders("Content-Length")[0])
    print "Downloading: %s Bytes: %s" % (name, file_size)
    current_file_size = 0
    block_size = 4096
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        status = ("\r%10d [%3.2f%%]" %
                  (current_file_size, current_file_size * 100. / file_size))
        status = status + chr(8)*(len(status)+1)
        sys.stdout.write(status)
        sys.stdout.flush()
    f.close()
    print "\nDone getting feed"
    return 1
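
For reference, a minimal way to call it might look like this (the URL is just the one from the question; since the function is written as a method but never uses self or dir, placeholders work for a quick test):

# self and dir are unused by the function body; the output always goes to
# xml/test.xml, so make sure an xml/ directory exists first
get_file(None, 'xml', 'https://example.com/jobxml.asp', 'jobxml.asp')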

Then parse the XML file you downloaded and saved with iterparse, like so:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "properties":
        print elem.text

This is just an example of how to walk the XML tree.
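
If the feed is large, a slightly fuller sketch (the job/id/title tag and attribute names here are only assumptions; adjust them to the real feed) would also clear each element after handling it so memory stays flat:

from xml.etree.ElementTree import iterparse

for event, elem in iterparse('xml/test.xml'):
    if elem.tag == "job":                 # hypothetical element name
        job_id = elem.get("id")           # read an attribute
        title = elem.findtext("title")    # read a child element's text
        print job_id, title
        elem.clear()                      # free the element once processed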

Also, this is old code of mine, so you would be better off using with to open the file.
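
A minimal sketch of that change, reusing the names from get_file above, so the file gets closed automatically even if an exception is raised mid-download:

with open('xml/test.xml', 'w') as f:
    while True:
        buf = s.read(block_size)
        if not buf:
            break
        current_file_size += len(buf)
        f.write(buf)
        sys.stdout.write("\r%10d [%3.2f%%]" %
                         (current_file_size, current_file_size * 100. / file_size))
        sys.stdout.flush()
# no explicit f.close() needed; the with block closes the file on exit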

About python - How to crawl an XML URL using Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11629720/
