
Reading data from XML in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:49:57

I am using Scrapy with Python.

I am trying to read my XPath expressions from an XML file, like this:

def getMasterContainers(self):
    containers = []
    containersFromXML = self.doc.findall('MasterPage/Containers/xpath')
    for oneXpath in containersFromXML:
        containers.append(oneXpath.text)
    return containers

The XML file is:

<Containers>
    <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>

When I print the result in cmd, I get this:

container = ''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

My problem

When I try sel.xpath(self.containers[0]) I get no results, but when I write the XPath directly in the code, as in sel.xpath('hand-written xpath'), I get the expected data.

Please help.

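Best answer

Update: Are you sure the problem is with this XPath? Have you confirmed that nothing fails before or after it? I'm not sure how to run a crawl with Scrapy, so I ran the XML parsing by hand and then ran the commands below against both the real document and a test document.

first.xml contains only the XPath and its parent structure: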

<websiteInformation>
    <MasterPage>
        <Containers>
            <xpath>.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']</xpath>
        </Containers>
    </MasterPage>
</websiteInformation>

Parsing first.xml:

from lxml import etree

# Parse the XML file that holds the XPath expressions.
doc = etree.parse('first.xml')

containers = []
containersFromXML = doc.findall('MasterPage/Containers/xpath')
for oneXpath in containersFromXML:
    print(oneXpath.text)
    containers.append(oneXpath.text)

Output:

.//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']

That looks fine.

test.html is:

<html>
<body>
    <div id="results-list">
        <div class="item paid-featured-item">
            <div class="listing-item">Found A</div>
        </div>
        <div class="item paid-featured-item">
            <div class="listing-item">Found B</div>
        </div>
    </div>
</body>
</html>

Searching it with:

from scrapy.selector import Selector

# `containers` is the list built while parsing first.xml.
sel = Selector(text=open('test.html').read())
for container in containers:
    print("Xpath: {}".format(container))
    result = sel.xpath(container)
    print("Container: {}".format(len(result)))
    for elem in result:
        print(elem)

Output:

Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 2
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found A</div>'>
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">Found B</div>'>

Output when searching the real URL, fetched with wget:

Xpath: .//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']
Container: 25
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n \n '>
# omitted 23
<Selector xpath=".//div[@id='results-list']/div[@class='item paid-featured-item']/div[@class='listing-item']" data=u'<div class="listing-item">\n \n '>

It looks like your XPath string has extra single quotes (') where there shouldn't be any. In the XML it looks like this:

<xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>

which, when parsed, becomes (as your printed output shows):

''.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]''

You don't want the surrounding single quotes. It should be:

.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]
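A quick way to confirm the diagnosis is to compare the loaded text with the hand-written expression. The sketch below is only an illustration: it uses the standard-library xml.etree.ElementTree instead of lxml, a shortened XPath, and the made-up names `loaded` and `handwritten`.

```python
import xml.etree.ElementTree as ET

# A shortened <xpath> element carrying the same stray quotes as the question:
# a literal ' plus an &apos; entity on each side.
element = ET.fromstring(
    "<xpath>'&apos;.//div[@id=\"results-list\"]&apos;'</xpath>"
)

loaded = element.text                      # what the parser hands back
handwritten = './/div[@id="results-list"]'  # what works when typed in code

# repr() makes the two extra quotes on each side visible.
print(repr(loaded))
print(loaded == handwritten)             # False
print(loaded.strip("'") == handwritten)  # True
```

The parser decodes each &apos; entity to a plain single quote, so the text ends up wrapped in two quote characters on each side, and it is no longer the same string as the hand-written XPath.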

If you can edit the XML file that contains your XPaths, remove the leading '&apos; and the trailing &apos;' from each <xpath>. So:

<Containers>
    <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
</Containers>

should become:

<Containers>
    <xpath>.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]</xpath>
</Containers>
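If the file holds many <xpath> entries, the same edit could be scripted rather than done by hand. This is a minimal sketch, assuming the standard-library xml.etree.ElementTree is acceptable in place of lxml; the helper name `clean_xpath_elements` and the shortened XPath are hypothetical.

```python
import xml.etree.ElementTree as ET


def clean_xpath_elements(xml_text):
    """Return the XML with wrapping single quotes removed from every <xpath>."""
    root = ET.fromstring(xml_text)
    for node in root.iter('xpath'):
        if node.text:
            # Drop surrounding whitespace first, then the stray quotes.
            node.text = node.text.strip().strip("'")
    return ET.tostring(root, encoding='unicode')


original = (
    "<Containers>"
    "<xpath>'&apos;.//div[@id=\"results-list\"]&apos;'</xpath>"
    "</Containers>"
)
cleaned = clean_xpath_elements(original)
print(cleaned)
# <Containers><xpath>.//div[@id="results-list"]</xpath></Containers>
```

The returned string can then be written back to the configuration file, so the spider code itself needs no change.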

But if for some reason you can't edit the XML file, strip the surrounding single quotes after you read the XPath text. So:

containers.append(oneXpath.text)

should become:

containers.append(oneXpath.text.strip("'"))
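To check the strip("'") fix end to end without Scrapy installed, it can be tried with only the standard library. The sketch below is an approximation under stated assumptions: xml.etree.ElementTree stands in for both lxml and the Scrapy Selector (it supports only a subset of XPath, which happens to cover this expression), the page is a trimmed copy of test.html, and `load_container_xpaths` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Same structure as first.xml, but with the stray quotes from the question.
CONFIG_XML = """<websiteInformation>
  <MasterPage>
    <Containers>
      <xpath>'&apos;.//div[@id="results-list"]/div[@class="item paid-featured-item"]/div[@class="listing-item"]&apos;'</xpath>
    </Containers>
  </MasterPage>
</websiteInformation>"""

# Trimmed copy of test.html; it is well-formed, so an XML parser can read it.
PAGE_HTML = """<html><body>
<div id="results-list">
  <div class="item paid-featured-item"><div class="listing-item">Found A</div></div>
  <div class="item paid-featured-item"><div class="listing-item">Found B</div></div>
</div>
</body></html>"""


def load_container_xpaths(xml_text):
    """Hypothetical helper: read every <xpath> and drop the wrapping quotes."""
    root = ET.fromstring(xml_text)
    xpaths = []
    for node in root.findall('MasterPage/Containers/xpath'):
        raw = node.text or ''
        xpaths.append(raw.strip().strip("'"))
    return xpaths


xpaths = load_container_xpaths(CONFIG_XML)
page = ET.fromstring(PAGE_HTML)
matches = page.findall(xpaths[0])
found = [m.text for m in matches]
print(found)  # ['Found A', 'Found B']
```

Note that strip("'") removes every single quote at either end of the string, so it would also trim an expression that legitimately begins or ends with one. That is not the case for the XPath in this question.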

Regarding reading data from XML in Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21790412/
