
Python: iterating over sections with lxml


I'm currently parsing a web page with BeautifulSoup, but it's very slow, so I decided to try lxml, since I've read that it's very fast.

Anyway, I'm struggling to get my code to iterate over the section I want. I'm not sure how to do this with lxml, and I can't find clear documentation on it.

Anyway, here's my code:

import urllib, urllib2
from lxml import etree

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''
    return outtxt

newUrl = 'http://www.tv3.ie/3player'

data = wgetUrl(newUrl)
parser = etree.HTMLParser()
tree = etree.fromstring(data, parser)

for elem in tree.iter("div"):
    print elem.tag, elem.attrib, elem.text

This returns all the DIVs, but how do I specify that I only want to iterate over the div with id='slider1'?

div {'style': 'position: relative;', 'id': 'slider1'} None

This doesn't work:

for elem in tree.iter("slider1"):

I know this is probably a stupid question, but I can't figure it out.

Thanks!

** EDIT **

With the help you've given, I added this code and now get the following output:

for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
    print elem[0].tag, elem[0].attrib, elem[0].text
    print elem[1].tag, elem[1].attrib, elem[1].text
    print elem[2].tag, elem[2].attrib, elem[2].text
    print elem[3].tag, elem[3].attrib, elem[3].text
    print elem[4].tag, elem[4].attrib, elem[4].text

Output:

a {'href': '/3player/show/392/57922/1/Tallafornia', 'title': '3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension'} None
h3 {} None
span {'id': 'gridcaption'} The Tallafornia crew are back, living in a beachside vill...
span {'id': 'griddate'} 11/01/2013
span {'id': 'gridduration'} 00:27:52

This is all great, but I'm missing part of what's inside the a tag above. Is the parser not handling the markup correctly?

I'm not getting the following:

<img alt="3player | Tallafornia, 11/01/2013. The Tallafornia crew are back, living in a beachside villa in Santa Ponsa, Majorca. As the crew settle in, the egos grow bigger than ever and cause tension" src='http://content.tv3.ie/content/videos/0378/tallaforniaep2_fri11jan2013_3player_1_57922_180x102.jpg' class='shadow smallroundcorner'></img>

Any idea why it isn't pulling this in?

Thanks again, very helpful posts.

Best Answer

You can use an XPath expression as follows:

for elem in tree.xpath("//div[@id='slider1']"):

Example:

>>> import urllib2
>>> import lxml.etree
>>> url = 'http://www.tv3.ie/3player'
>>> data = urllib2.urlopen(url)
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(data,parser)
>>> elem = tree.xpath("//div[@id='slider1']")
>>> elem[0].attrib
{'style': 'position: relative;', 'id': 'slider1'}
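
As an aside, iter() matches element tag names rather than id values, which is why tree.iter("slider1") returned nothing. If you would rather stick with iter() than XPath, a minimal sketch (assuming the same tree object built above) is to filter on the id attribute yourself:

# iter() walks elements by tag name, so check the id attribute on each div
for elem in tree.iter("div"):
    if elem.get("id") == "slider1":
        print elem.tag, elem.attrib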

You need to analyse the content of the page you're working with more carefully (a good way to do this is with Firefox and the Firebug add-on).

The <img> tag you're trying to get is actually a child of the <a> tag:

>>> for elem in tree.xpath("//div[@id='slider1']//div[@id='gridshow']"):
...     for elem_a in elem.xpath("./a"):
...         for elem_img in elem_a.xpath("./img"):
...             print '<A> HREF=%s' % (elem_a.attrib['href'])
...             print '<IMG> ALT="%s"' % (elem_img.attrib['alt'])
<A> HREF=/3player/show/392/58784/1/Tallafornia
<IMG> ALT="3player | Tallafornia, 01/02/2013. A fresh romance blossoms in the Tallafornia house. Marc challenges Cormac to a 'bench off' in the gym"
<A> HREF=/3player/show/46/58765/1/Coronation-Street
<IMG> ALT="3player | Coronation Street, 01/02/2013. Tyrone bumps into Kirsty in the street and tries to take Ruby from her pram"
../..
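
If you only need the image data, the three nested loops can also be collapsed into a single XPath that reaches the <img> elements directly; a minimal sketch, assuming the page structure shown in the output above:

# one XPath straight to the <img> elements; getparent() gives the enclosing <a>
for elem_img in tree.xpath("//div[@id='slider1']//div[@id='gridshow']/a/img"):
    print '<A> HREF=%s' % (elem_img.getparent().attrib['href'])
    print '<IMG> ALT="%s"' % (elem_img.attrib['alt'])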

Regarding Python iterating over sections with lxml, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14654417/
