
xpath - How do I create a custom XPath query?

Reposted. Author: 行者123. Updated: 2023-12-03 17:14:57

This is my HTML file data:

<article class='course-box'>
<div class='row-fluid'>
<div class='span2'>
<div class='course-cover' style='width: 100%'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle'>
<a href='https://novoed.com/hc'>Hippocrates Challenge</a>
</h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>

</div>
By Jill Helms
<div class='university' style='margin-top:0px; font-style:normal;'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid' onclick="location.href='https://novoed.com/hc'">
<div class='span2'>
<div class='course-cover'>
<img alt='' src='https://d2d6mu5qcvgbk5.cloudfront.net/courses/cover_photos/c4f5fd2efb200e71d09014970cf0b8c86e1e7013.png?1375831955' style='width: 100%'>
</div>
</div>
<div class='span10'>
<h2 class='coursetitle' style='margin-top: 10px'>
<a href='https://novoed.com/hc'>
Hippocrates Challenge
</a>
</h2>
<p class='description' style='width: 70%'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine. The course focuses on teaching anatomy in an interactive way, students will learn about diagnosis and treatment planning while...
</p>
<div style='margin-right: 10px'>
<a class='btn action-btn novoed-primary' href='https://novoed.com/users/sign_up?class=hc'>
Sign Up
</a>

</div>
</div>
</div>

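From the code above I need to extract the values of the following tags/classes:
  • course name
  • course title link
  • pricetag
  • timeline inline-block
  • university
  • description
  • instructor name

However, coursetitle appears in two places, and I need it only once. Likewise, the instructor name is not wrapped in any specific tag.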

My XPath queries are:

    novoedData = HtmlXPathSelector(response)
    courseTitle = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/text()').extract()
    courseDetailLink = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/h2[re:test(@class, "coursetitle")]/a/@href').extract()
    courseInstructorName = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/text()').extract()
    coursePriceType = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/figure[re:test(@class, "pricetag")]/text()').extract()
    courseShortSummary = novoedData.xpath('//div[re:test(@class, "hovered row-fluid")]/div[re:test(@class, "span10")]/p[re:test(@class, "description")]/text()').extract()
    courseUniversity = novoedData.xpath('//div[re:test(@class, "row-fluid")]/div[re:test(@class, "span10")]/div[re:test(@class, "university")]/text()').extract()

But the number of values in each list variable is different:
    len(courseTitle) = 40 (two times because of repetition)
    len(courseDetailLink) = 40 (two times because of repetition)
    len(courseInstructorName) = 160 (some unwanted character is coming because no specific tag for this value)
    len(coursePriceType) = 20 (correct count no repetition)
    len(courseShortSummary)= 20 (correct count no repetition)
    len(courseUniversity) = 20 (correct count no repetition)

Please help me modify my XPath queries to solve this problem. Thanks in advance.

Best answer

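You don't need re:test. It performs a regular-expression search, so the pattern "row-fluid" also matches the class "hovered row-fluid", which is why each title comes back twice. Compare the class attribute exactly instead: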

    >>> rows = sel.xpath('//div[@class="row-fluid"]/div[@class="span10"]')
    >>> len(rows)
    1
    >>> s = rows[0]
    >>> s.xpath('h2[@class="coursetitle"]/a/@href').extract()
    [u'https://novoed.com/hc']

Also note that once s is positioned on the right node, you can keep querying relative to it.
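
To illustrate querying relative to a node, here is a minimal sketch that builds one record per course box, so every field ends up with the same number of values. It is not the answer author's code. It assumes a trimmed, well-formed stand-in for the markup in the question (the SAMPLE string is hypothetical) and uses only Python's standard xml.etree.ElementTree, which supports a limited XPath subset, so that it runs without Scrapy. The helper names text_of and parse_course are made up for this example. In Scrapy the same idea would be to loop over the course boxes and call .xpath() with relative paths on each one; the untagged instructor text could probably be reached there with a step such as text()[normalize-space()].

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for one course box from the question
# (hypothetical sample; the real page is HTML and needs an HTML-aware parser).
SAMPLE = """
<root>
<article class='course-box'>
<div class='row-fluid'>
<div class='span10'>
<h2 class='coursetitle'><a href='https://novoed.com/hc'>Hippocrates Challenge</a></h2>
<figure class='pricetag'>
Free
</figure>
<div class='timeline independent-text'>
<div class='timeline inline-block'>
Starting Spring 2014
</div>
</div>
By Jill Helms
<div class='university'>
Stanford University
</div>
</div>
</div>
<div class='hovered row-fluid'>
<div class='span10'>
<h2 class='coursetitle'><a href='https://novoed.com/hc'>Hippocrates Challenge</a></h2>
<p class='description'>
Hippocrates Challenge 2014 is a course designed for anyone with an interest in medicine.
</p>
</div>
</div>
</article>
</root>
"""


def text_of(node):
    """Return the stripped text of an element, or None if it is missing or empty."""
    return node.text.strip() if node is not None and node.text else None


def parse_course(article):
    """Extract one record from a course box using paths relative to that box."""
    # Exact class match: selects 'row-fluid' but not 'hovered row-fluid'.
    main = article.find("./div[@class='row-fluid']/div[@class='span10']")
    hover = article.find("./div[@class='hovered row-fluid']/div[@class='span10']")
    link = main.find("./h2[@class='coursetitle']/a")
    timeline = main.find("./div[@class='timeline independent-text']")
    return {
        "title": text_of(link),
        "link": link.get("href"),
        "price": text_of(main.find("./figure[@class='pricetag']")),
        "starts": text_of(timeline.find("./div[@class='timeline inline-block']")),
        # The instructor is bare text right after the timeline div (its "tail").
        "instructor": (timeline.tail or "").strip(),
        "university": text_of(main.find("./div[@class='university']")),
        "description": text_of(hover.find("./p[@class='description']")),
    }


root = ET.fromstring(SAMPLE)
courses = [parse_course(article) for article in root.iter("article")]
print(courses)
```

Because each record is built from a single course box, the title is read once from the non-hovered block, and the description is read from the hovered one, so the counts can no longer drift apart between fields.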

Regarding "xpath - How do I create a custom XPath query?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21730490/
