
python - Scrapy: get data from links within a table

Repost. Author: 太空宇宙. Updated: 2023-11-03 16:35:29

I am trying to scrape data from an HTML table, Texas Death Row.

I am able to pull the existing data from the table using the spider script below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from texasdeath.items import DeathItem


class DeathSpider(BaseSpider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//table/tbody/tr')
        for site in sites:
            item = DeathItem()
            item['firstName'] = site.select('td[5]/text()').extract()
            item['lastName'] = site.select('td[4]/text()').extract()
            item['Age'] = site.select('td[7]/text()').extract()
            item['Date'] = site.select('td[8]/text()').extract()
            item['Race'] = site.select('td[9]/text()').extract()
            item['County'] = site.select('td[10]/text()').extract()
            yield item

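The problem is that the table also contains links. I am trying to follow each link and append the data from the linked page to my item.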

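The Scrapy Tutorial seems to have a guide on how to pull data from a directory. But I can't figure out how to get the data from the main page and also return the data from the links within the table.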

Best answer

Instead of yielding an item, yield a Request and pass the item inside meta. The Scrapy documentation covers this.

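Sample implementation of a spider that follows the "Offender Information" link if it leads to the offender "details" page (sometimes it leads to an image, in which case the spider outputs what it currently has):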

from urllib.parse import urljoin  # Python 3; on Python 2 this was "from urlparse import urljoin"

import scrapy


class DeathItem(scrapy.Item):
    firstName = scrapy.Field()
    lastName = scrapy.Field()
    Age = scrapy.Field()
    Date = scrapy.Field()
    Race = scrapy.Field()
    County = scrapy.Field()
    Gender = scrapy.Field()


class DeathSpider(scrapy.Spider):
    name = "death"
    allowed_domains = ["tdcj.state.tx.us"]
    start_urls = [
        "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"
    ]

    def parse(self, response):
        # One table row per offender.
        sites = response.xpath('//table/tbody/tr')
        for site in sites:
            item = DeathItem()

            item['firstName'] = site.xpath('td[5]/text()').extract()
            item['lastName'] = site.xpath('td[4]/text()').extract()
            item['Age'] = site.xpath('td[7]/text()').extract()
            item['Date'] = site.xpath('td[8]/text()').extract()
            item['Race'] = site.xpath('td[9]/text()').extract()
            item['County'] = site.xpath('td[10]/text()').extract()

            # Resolve the "Offender Information" link against the page URL.
            # Rows without a link are skipped here, because urljoin would
            # otherwise return the page URL itself, which ends in "html".
            href = site.xpath("td[2]/a/@href").extract_first()
            url = urljoin(response.url, href) if href else None
            if url and url.endswith("html"):
                # Details page: carry the partly filled item along in meta.
                yield scrapy.Request(url, meta={"item": item}, callback=self.parse_details)
            else:
                # Image link or no link: output what we have so far.
                yield item

    def parse_details(self, response):
        # Retrieve the item passed from parse() and add the detail field.
        item = response.meta["item"]
        item["Gender"] = response.xpath("//td[. = 'Gender']/following-sibling::td[1]/text()").extract()
        yield item
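
The decision to follow a link or yield the item right away depends on how the row's relative href resolves against the page URL. Here is a minimal sketch of that step using only the standard library. The two relative paths are made-up examples (assumptions, not values taken from the actual page), and `resolve_link` and `is_details_page` are hypothetical helper names used only for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.tdcj.state.tx.us/death_row/dr_executed_offenders.html"


def resolve_link(base_url, href):
    """Resolve a row's relative href against the page URL; None if there is no href."""
    if not href:
        return None
    return urljoin(base_url, href)


def is_details_page(url):
    """Mirror the spider's check: only URLs ending in 'html' are followed."""
    return url is not None and url.endswith("html")


# Hypothetical hrefs, for illustration only.
html_link = resolve_link(BASE_URL, "dr_info/doejohn.html")
image_link = resolve_link(BASE_URL, "dr_info/doejohn.jpg")

print(html_link)
print(is_details_page(html_link), is_details_page(image_link))
```

Note that `urljoin` returns the base URL unchanged when the href is empty, and the base URL here itself ends in "html". That is why the sketch returns None for rows without a link instead of passing them to the `endswith` check.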

Regarding "python - Scrapy: get data from links within a table", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37257870/
