
python - Extracting data from a gsmarena page with scrapy

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:50:04

I am trying to download data from a gsmarena page: "http://www.gsmarena.com/htc_one_me-7275.php".

However, the data is organized into tables and table rows. The data format is:

table header > td[@class='ttl'] > td[@class='nfo']

Edited code: thanks to help from Stack Exchange community members, I reformatted the code as follows. items.py file:

import scrapy


class gsmArenaDataItem(scrapy.Item):
    phoneName = scrapy.Field()     # text of the table's th cell
    phoneDetails = scrapy.Field()  # "title: value" pairs joined into one string

Spider file:

from scrapy import Spider
from gsmarena_data.items import gsmArenaDataItem


class testSpider(Spider):
    name = "mobile_test"
    allowed_domains = ["gsmarena.com"]
    start_urls = ('http://www.gsmarena.com/htc_one_me-7275.php',)

    def parse(self, response):
        # Each table under div#specs-list is one block of specs
        for tableRow in response.css("div#specs-list table"):
            # Fresh item per table, so earlier results are not overwritten
            phone = gsmArenaDataItem()
            phone['phoneName'] = tableRow.xpath(".//th/text()").extract()[0]
            details = []
            for ttl in tableRow.xpath(".//td[@class='ttl']"):
                ttl_value = " ".join(ttl.xpath(".//text()").extract())
                nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
                details.append(ttl_value + ": " + nfo_value)
            # Join every "title: value" pair of this table into one string
            phone['phoneDetails'] = ", ".join(details)
            yield phone

However, as soon as I try to load the page, even with scrapy shell, I get banned:

"http://www.gsmarena.com/htc_one_me-7275.php"

I even tried setting DOWNLOAD_DELAY = 3 in settings.py.
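For reference, a settings.py fragment along these lines might look like the sketch below. Only DOWNLOAD_DELAY = 3 comes from the attempt described here; the other lines are built-in Scrapy throttling settings added as assumptions, the USER_AGENT string is a hypothetical placeholder, and none of this is confirmed to lift the ban.

```python
# settings.py (sketch; values are assumptions, not verified against gsmarena.com)

DOWNLOAD_DELAY = 3                  # the value already tried in the question
AUTOTHROTTLE_ENABLED = True         # let Scrapy adapt the delay to server latency
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most one request in flight per domain
# Hypothetical identifying string; replace with your own project and contact URL
USER_AGENT = "gsmarena_data (+http://www.example.com)"
```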

Please advise what I should do.

Best answer

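The idea is to iterate over all table elements inside "specs-list", get the th element for the block name, then get every td element with class="ttl" together with its corresponding following td sibling with class="nfo".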

Demo from the shell:

In [1]: for scope in response.css("div#specs-list table"):
   ....:     scope_name = scope.xpath(".//th/text()").extract()[0]
   ....:
   ....:     for ttl in scope.xpath(".//td[@class='ttl']"):
   ....:         ttl_value = " ".join(ttl.xpath(".//text()").extract())
   ....:         nfo_value = " ".join(ttl.xpath("following-sibling::td[@class='nfo']//text()").extract())
   ....:
   ....:         print(scope_name, ttl_value, nfo_value)
   ....:
Network Technology GSM / HSPA / LTE
Network 2G bands GSM 850 / 900 / 1800 / 1900 - SIM 1 & SIM 2
...
Battery Stand-by Up to 598 h (2G) / Up to 626 h (3G)
Battery Talk time Up to 23 h (2G) / Up to 13 h (3G)
Misc Colors Meteor Grey, Rose Gold, Gold Sepia
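
Since the live page may refuse requests, the same traversal can be tried offline. The following is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy, so it runs without network access. The SAMPLE markup is a hypothetical, simplified stand-in for the page (not copied from the site), and extract_specs is an illustrative helper name. ElementTree has no following-sibling axis, so the sketch pairs the ttl and nfo cells found in the same row, and it only accepts well-formed markup, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample of the spec tables (assumed structure).
SAMPLE = """
<div id="specs-list">
  <table>
    <tr><th rowspan="2">Network</th>
        <td class="ttl">Technology</td><td class="nfo">GSM / HSPA / LTE</td></tr>
    <tr><td class="ttl">2G bands</td><td class="nfo">GSM 850 / 900</td></tr>
  </table>
  <table>
    <tr><th>Battery</th>
        <td class="ttl">Talk time</td><td class="nfo">Up to 23 h (2G)</td></tr>
  </table>
</div>
"""


def extract_specs(markup):
    """Return {block name: {title: value}} for every table in the markup."""
    root = ET.fromstring(markup)
    specs = {}
    for table in root.iter("table"):
        # The th cell carries the block name, e.g. "Network"
        block = table.find(".//th").text.strip()
        rows = {}
        for row in table.iter("tr"):
            ttl = row.find("td[@class='ttl']")
            nfo = row.find("td[@class='nfo']")
            if ttl is None or nfo is None:
                continue  # skip rows without a title/value pair
            title = "".join(ttl.itertext()).strip()
            rows[title] = "".join(nfo.itertext()).strip()
        specs[block] = rows
    return specs


specs = extract_specs(SAMPLE)
for block, rows in specs.items():
    for title, value in rows.items():
        print(block, title, value)
```

Collecting the pairs into a nested dict keeps one entry per block, which avoids overwriting values when several rows belong to the same table.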

Regarding python - Extracting data from a gsmarena page with scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30673602/
