Python Scrapy: extracting specific fields with XPath

Reposted. Author: 行者123. Updated: 2023-11-30 23:16:08

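I have the following structure (sample). I am using Scrapy to extract the details. I need to extract the "href" attribute and the link text, such as "Accounting". I am using the code below. I am new to XPath, so any help with extracting these specific fields would be appreciated.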

<div class = 'something'>
<ul>
<li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="1">Accounting</a></li>

<li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="2">Administrative</a></li>

<li><a href="http://jobsearch.about.com/od/job-titles/a/advertising-job-titles.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="3">Advertising</a></li>

<li><a href="http://jobsearch.about.com/od/job-title-samples/fl/airline-industry-jobs.htm" data-component="link" data-source="inlineLink" data-type="internalLink" data-ordinal="4">Airline</a></li>
</ul>
</div>

My code is:

from scrapy.spider import BaseSpider

from jobfetch.items import JobfetchItem

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose


class JobFetchSpider(BaseSpider):
    """Spider for regularly updated livingsocial.com site, San Francisco Page"""
    name = "Jobsearch"
    allowed_domains = ["jobsearch.about.com/"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        count = 0
        for sel in response.xpath('//*[@id="main"]/div/div[2]/div[1]/div/div[2]/article/div[2]/ul[1]'):
            item = JobfetchItem()
            item['title'] = sel.extract()
            item['link'] = sel.extract()
            count = count+1
            print item

        yield item

Best answer

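Problems with your code:

  • yield item should be inside the loop, since that is where you instantiate the item
  • the XPath you have is messy and not very reliable: it depends heavily on element positions inside their parent tags and starts almost at the top-level parent of the document
  • your XPath is not correct: it should go to the a elements inside ul -> li
  • sel.extract() only gives you the whole extracted ul element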
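
To make the last two points concrete, here is a minimal sketch that runs without a Scrapy project. It swaps in the standard library's ElementTree (which supports a small XPath subset) for Scrapy's selectors, and it uses a trimmed, hypothetical copy of the markup from the question. Inside a spider, the equivalent relative lookups would be sel.xpath('text()') and sel.xpath('@href').

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the markup from the question (two of the four links).
HTML = """
<div class="something">
  <ul>
    <li><a href="http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm">Accounting</a></li>
    <li><a href="http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm">Administrative</a></li>
  </ul>
</div>
"""

root = ET.fromstring(HTML)

# Selecting the <ul> yields a single node: the whole list, not individual fields.
lists = root.findall(".//ul")

# Selecting ul/li/a yields one node per link; each field is read relative to that node.
jobs = [
    {"title": a.text, "link": a.get("href")}
    for a in root.findall(".//ul/li/a")
]

for job in jobs:
    print(job)
```

The point is the shape of the result: one ul node versus one a node per job title, with the text and the href read from each a node rather than from the list as a whole.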

For example, use a CSS selector to reach the a elements inside the li tags:

import scrapy

from jobfetch.items import JobfetchItem


class JobFetchSpider(scrapy.Spider):
    name = "Jobsearch"
    # allowed_domains takes bare domain names, so the trailing slash is dropped
    allowed_domains = ["jobsearch.about.com"]
    start_urls = ['http://jobsearch.about.com/od/job-titles/fl/job-titles-a-z.htm']

    def parse(self, response):
        # One selector per <a>, located by stable attributes rather than by position
        for sel in response.css('article[itemprop="articleBody"] div.expert-content-text > ul > li > a'):
            item = JobfetchItem()
            item['title'] = sel.xpath('text()').extract()[0]
            item['link'] = sel.xpath('@href').extract()[0]
            yield item  # yielded inside the loop, once per link

Running the spider produces:

{'link': u'http://jobsearch.about.com/od/job-title-samples/a/accounting-job-titles.htm', 'title': u'Accounting'}
{'link': u'http://jobsearch.about.com/od/job-title-samples/a/admin-job-titles.htm', 'title': u'Administrative'}
...
{'link': u'http://jobsearch.about.com/od/job-title-samples/fl/yacht-job-titles.htm', 'title': u'Yacht Jobs'}

FYI, we can also do this with xpath():

//article[@itemprop="articleBody"]//div[@class="expert-content-text"]/ul/li/a

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/27854486/
