
python - Scraping data from multiple URLs

Reposted. Author: 行者123. Updated: 2023-11-30 23:23:03

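I want to scrape data from http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog, but the mid parameter in the URL is incremented to give the second, third, and subsequent URLs, up to 1000 URLs. How should I handle this? (I am new to Python and Scrapy, so please bear with the question.)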

Please also check the XPath expressions I use to extract the information. They return no output. Is there a basic mistake in the spider?

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from movie.items import MovieItem

class MySpider(BaseSpider):
    name = 'movie'
    allowed_domains = ["http://cbfcindia.gov.in/"]
    start_urls = ["http://cbfcindia.gov.in/html/SearchDetails.aspx?mid=1&Loc=Backlog"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//body")  # Check
        print titles
        items = []
        for titles in titles:
            print "in FOR loop"
            item = MovieItem()
            item["movie_name"] = hxs.xpath('//TABLE[@id="Table2"]/TR[2]/TD[2]/text()').extract()
            print "XXXXXXXXXXXXXXXXXXXXXXXXX movie name:", item["movie_name"]
            item["movie_language"] = hxs.xpath('//*[@id="lblLanguage"]/text()').extract()
            item["movie_category"] = hxs.xpath('//*[@id="lblRegion"]/text()').extract()
            item["regional_office"] = hxs.xpath('//*[@id="lblCertNo"]/text()').extract()
            item["certificate_no"] = hxs.xpath('//*[@id="Label1"]/text()').extract()
            item["certificate_date"] = hxs.xpath('//*[@id="lblCertificateLength"]/text()').extract()
            item["length"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()
            item["producer_name"] = hxs.xpath('//*[@id="lblProducer"]/text()').extract()

            items.append(item)

        print "this is ITEMS"
        return items

Here is the log:

log>
{'certificate_date': [],
'certificate_no': [],
'length': [],
'movie_category': [],
'movie_language': [],
'movie_name': [],
'producer_name': [],
'regional_office': []}
2014-06-11 23:20:44+0530 [movie] INFO: Closing spider (finished)
2014-06-11 23:20:44+0530 [movie] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 256,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6638,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 6, 11, 17, 50, 44, 54000),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 6, 11, 17, 50, 43, 681000)}

Best answer

In addition to @Talvalin's answer, the correct XPath should take the following form:

item["movie_name"] = hxs.xpath("//*[@id='lblMovieName']/font/text()").extract()

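For some reason, when the page loads, the text sits in a <font> tag nested inside the <span> tag (or whichever tag carries the id), so calling text() directly on the element with the id returns nothing. I have tested the XPath above and it works.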
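The effect described above can be reproduced without Scrapy. The sketch below uses the standard library's ElementTree on a made-up fragment; the actual page markup is not shown in this thread, so the snippet and the film title are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the real page markup is not shown in this thread.
SNIPPET = """
<table>
  <tr>
    <td><span id="lblMovieName"><font>Example Film</font></span></td>
  </tr>
</table>
""".strip()

root = ET.fromstring(SNIPPET)

# The element carrying the id has no text of its own...
span = root.find(".//*[@id='lblMovieName']")
direct_text = span.text

# ...because the text belongs to the nested <font> child.
nested_text = root.find(".//*[@id='lblMovieName']/font").text

print(repr(direct_text))
print(repr(nested_text))
```

This mirrors why an expression ending in /text() on the id'd element yields an empty list in the log, while adding the /font step reaches the text.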

A word of caution, though: the site is more or less protected against scraping. I tried to run a second scrape and it immediately threw a Runtime Error.
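For the incrementing mid parameter in the question, one option is to generate all the URLs up front. This is a minimal sketch in Python 3, assuming mid runs from 1 to 1000 as the question states; build_urls and BASE_URL are hypothetical names, not part of Scrapy or of the original spider.

```python
from urllib.parse import urlencode

# Base address taken from the question; the helper below is hypothetical.
BASE_URL = "http://cbfcindia.gov.in/html/SearchDetails.aspx"


def build_urls(first_mid=1, last_mid=1000):
    """Return one URL per mid value in [first_mid, last_mid]."""
    return [
        BASE_URL + "?" + urlencode({"mid": mid, "Loc": "Backlog"})
        for mid in range(first_mid, last_mid + 1)
    ]


urls = build_urls()
print(len(urls))
print(urls[0])
```

In a spider, the resulting list could be assigned to start_urls. Note also that allowed_domains expects bare domain names such as "cbfcindia.gov.in" rather than full URLs. Given the caution above, slowing the crawl with Scrapy's DOWNLOAD_DELAY setting may be worth trying, although whether it avoids the error on this site is untested here.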

Regarding "python - Scraping data from multiple URLs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24169630/
