
Parse XML with namespace attribute changing in Python

Reposted. Author: bug小助手. Updated: 2023-10-28 09:47:09



I am making a request to a URL, and in the XML response the namespace in the xmlns attribute changes from time to time. As a result, finding an element returns None when I hardcode the namespace.


For instance, I get the following XML:


<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>

Now look at the xmlns attribute. The rest of the URI stays the same, but the date part ('2012/06') differs in certain responses. I have the following Python script; see the line ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}. I can't hardcode the namespace like that. Are there any alternatives, such as using regular expressions, to map the namespace? Only the date part changes: it is 2013/05 in some responses, 2012/04 in others, and so on.


import xml.etree.ElementTree as ET

import requests


# Method of a class that is not shown in the question.
def fetch_nuget_spec(self, versioned_package):
    name = versioned_package.package.name.lower()
    version = versioned_package.version.lower()
    url = f'https://api.nuget.org/v3-flatcontainer/{name}/{version}/{name}.nuspec'
    response = requests.get(url)
    metadata = ET.fromstring(response.content)
    # Hardcoded namespace: the lookups below return None whenever the date part differs.
    ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
    license = metadata.find('./nuspec:metadata/nuspec:license', ns)
    if license is None:
        license_url = metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
        if license_url is None:
            return {'license': 'Not Found'}
        return {'license': license_url.text}
    else:
        # license.text is None for an empty element, so avoid len() here.
        if not license.text:
            print('Empty license text')
        return {'license': license.text}



Comments

stackoverflow.com/questions/14853243/…

Recommended answers

If using lxml is an option, it can list the namespaces declared in the document:


from lxml import etree
doc = etree.parse("tmp.xml")
# get namespaces excluding the default 'xml'
ns = { ('nuspec' if t[0] is None else t[0]): t[1] for t in doc.xpath('/*/namespace::*[name()!="xml"]')}
print(ns)
# {'nuspec': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}

Using both lxml and xml.etree.ElementTree could mean that the document is parsed twice, so use only lxml if possible, since it has a more complete XML and XPath implementation.

If that's not possible, ET can be used on the result of the lxml parse:


>>> tree = ET.ElementTree(doc)
>>> tree.find('./nuspec:metadata/nuspec:licenseUrl', ns)
<Element {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}licenseUrl at 0x7fe019ea1cc8>

The xml.etree.ElementTree implementation lacks support for the XPath namespace axis.



Without another module, using only xml.etree.ElementTree:


import xml.etree.ElementTree as ET

tree = ET.parse('xml_str.xml')
root = tree.getroot()

ns = dict([node for _, node in ET.iterparse('xml_str.xml', events=['start-ns'])])
print(ns)

licenseUrl = root.find(".//licenseUrl", ns).text
print("LicenseUrl: ", licenseUrl)

Output:



{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
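
The snippet above reads the file twice (once with ET.parse, once with ET.iterparse). A possible single-pass variant, as a sketch only: iterparse exposes a root attribute once the source has been fully read, so one loop can collect the namespace map and still hand back the tree. The helper name parse_with_namespaces and the shortened SAMPLE document are made up for illustration, and the unprefixed lookup through the '' key assumes Python 3.8 or later.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the nuspec response body.
SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata>
    <id>SharpZipLib</id>
    <licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""


def parse_with_namespaces(source):
    """Parse `source` once and return (root element, {prefix: uri})."""
    nsmap = {}
    parser = ET.iterparse(source, events=["start-ns"])
    for _event, (prefix, uri) in parser:
        # Keep the first declaration seen for each prefix.
        nsmap.setdefault(prefix, uri)
    # The root attribute is populated once the iterator is exhausted.
    return parser.root, nsmap


# io.BytesIO lets the same helper work on bytes such as response.content.
root, ns = parse_with_namespaces(io.BytesIO(SAMPLE))
license_url = root.find("./metadata/licenseUrl", ns)
print(ns)
print(license_url.text)
```

Passing a filename instead of a file object should work as well, since ET.iterparse accepts both.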

Option 2, if parsing time is important:




import xml.etree.ElementTree as ET

nsmap = {}
for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):

    if event == 'start-ns':
        # For start-ns events, node is a (prefix, uri) tuple.
        ns, url = node
        nsmap[ns] = url
        print(nsmap)

    # For end events, node is the element that has just been closed.
    if event == 'end' and node.tag == f"{{{url}}}licenseUrl":
        print(node.text)

Output:




{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
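
Option 2 compares tags against the last namespace URI it saw, which works here because the document declares only one. A hedged sketch of a variant that matches on the local name instead and stops at the first hit; stream_find_text and SAMPLE are hypothetical names, not part of the answer.

```python
import io
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the nuspec document.
SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata>
    <licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
    <projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
  </metadata>
</package>"""


def stream_find_text(source, local_name):
    """Return the text of the first element named `local_name`, in any namespace."""
    for _event, element in ET.iterparse(source, events=["end"]):
        # Tags look like '{uri}local'; rpartition also copes with un-namespaced tags.
        if element.tag.rpartition("}")[2] == local_name:
            return element.text
    return None


print(stream_find_text(io.BytesIO(SAMPLE), "licenseUrl"))
```

Matching on the local name alone ignores the namespace entirely, so it would also pick up a licenseUrl element from an unrelated vocabulary; whether that matters depends on the documents being processed.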


Don't hardcode the namespace. You can extract it from the root tag with a regex:


import xml.etree.ElementTree as ET
import re

xml = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>"""

root = ET.fromstring(xml)

# Find namespace with regex
ns = re.match(r'{.*}', root.tag).group(0)
print("Namespace: ", ns)

licenseUrl = root.find(f".//{ns}licenseUrl").text
print("LicenseUrl: ", licenseUrl)

Output:



Namespace:  {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
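
The namespace can also be read off the root tag with plain string methods and then plugged into the lookup from the question. This is only a sketch: namespace_of and license_from_nuspec are made-up names, the two sample documents are hypothetical, and, unlike the original function, an empty license element falls through to licenseUrl.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI from an element's '{uri}local' tag, or ''."""
    if element.tag.startswith("{"):
        return element.tag[1:].partition("}")[0]
    return ""


def license_from_nuspec(xml_bytes):
    """Look for <license> first, then <licenseUrl>, as in the question."""
    root = ET.fromstring(xml_bytes)
    uri = namespace_of(root)
    # Qualify names only when the document actually uses a namespace.
    qualifier = f"{{{uri}}}" if uri else ""
    for name in ("license", "licenseUrl"):
        node = root.find(f"./{qualifier}metadata/{qualifier}{name}")
        if node is not None and node.text:
            return {"license": node.text}
    return {"license": "Not Found"}


# Hypothetical, shortened documents with two different namespace dates.
OLD = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><licenseUrl>https://example.com/LICENSE.txt</licenseUrl></metadata>
</package>"""
NEW = b"""<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata><license type="expression">MIT</license></metadata>
</package>"""

print(license_from_nuspec(OLD))
print(license_from_nuspec(NEW))
```

Inside fetch_nuget_spec this would be called as license_from_nuspec(response.content).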


You need to be aware that the reason they put the date in the namespace URI is that the format of the XML can change from one version to another, so if you're going to write code that works with any version, you need to make sure it is tested properly against multiple versions. (People generally advise against versioning namespace URIs, for exactly the reasons you are seeing, but not everyone follows that advice, and that appears to include Microsoft.)


My own preference when trying to handle multiple versions of an input document format is to insert a normalisation step into your processing pipeline: this should transform the incoming documents into a common format so that the rest of your processing doesn't need to worry about the variations. As well as changing the namespaces, this phase could handle any other differences you encounter in the formats.

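
As a rough illustration of such a normalisation step in Python, the sketch below rewrites every nuspec namespace variant to one canonical URI so that later lookups can use a fixed prefix map. It assumes that all variants share the http://schemas.microsoft.com/packaging/ prefix and the /nuspec.xsd suffix; normalise_nuspec and CANONICAL are hypothetical names, and picking 2013/05 as the canonical version is arbitrary.

```python
import xml.etree.ElementTree as ET

# Assumption: every variant has the form PREFIX + "YYYY/MM" + SUFFIX.
PREFIX = "http://schemas.microsoft.com/packaging/"
SUFFIX = "/nuspec.xsd"
# Arbitrary choice of the version that downstream code is written against.
CANONICAL = PREFIX + "2013/05" + SUFFIX


def normalise_nuspec(root):
    """Rewrite nuspec-namespaced tags in place to the CANONICAL namespace."""
    for element in root.iter():
        if not isinstance(element.tag, str) or not element.tag.startswith("{"):
            continue
        uri, _, local = element.tag[1:].partition("}")
        if uri.startswith(PREFIX) and uri.endswith(SUFFIX):
            element.tag = f"{{{CANONICAL}}}{local}"
    return root


# Hypothetical document using an older namespace date.
old = ET.fromstring(
    '<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">'
    "<metadata><licenseUrl>https://example.com/LICENSE.txt</licenseUrl></metadata>"
    "</package>"
)
ns = {"nuspec": CANONICAL}
node = normalise_nuspec(old).find("./nuspec:metadata/nuspec:licenseUrl", ns)
print(node.text)
```

This only renames the namespace. If the formats differ in structure between versions, those differences would need their own handling in the same step.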


My other preference is to do as much of the processing as possible in XSLT, and an XSLT step that normalizes the namespace is pretty easy to write, especially if you use XSLT 3.0.



Please don't follow the advice to process XML using regular expressions. It can only lead to tears. For example, if someone posts a nuspec document containing an older namespace commented out, it's very likely to throw your processing off completely.
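
If the goal is simply to ignore the namespace, the parser can do that without any regex. A minimal sketch, assuming Python 3.8 or later, where xml.etree.ElementTree accepts the {*} wildcard in paths; find_license_url and the sample document (with an older namespace commented out) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the comment mentions an older namespace on purpose.
SAMPLE = """<!-- <package xmlns="http://schemas.microsoft.com/packaging/2012/04/nuspec.xsd"> -->
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata>
    <licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""


def find_license_url(xml_text):
    """Return the licenseUrl text whatever namespace the document uses, or None."""
    root = ET.fromstring(xml_text)
    # {*} matches the element name in any namespace, or in none.
    node = root.find("./{*}metadata/{*}licenseUrl")
    return None if node is None else node.text


print(find_license_url(SAMPLE))
```

Because the comment is handled by the parser, the commented-out declaration has no effect on the lookup.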


Comments

The drawback here is that the document is parsed twice. A mitigation could be to write a function that handles the iterparse action and returns as soon as the namespace is found.


What I'm saying is that all events will be parsed until the end of the document, so the loop should break as soon as the namespace is found to avoid that.

ET.iterparse() doesn't know how many start-ns events it will find, so it will look at all events until the end of the document unless you break out of the loop. The difference might be negligible on small documents but significant if there are hundreds or thousands of elements. As a test, print something on all end events.

The problem is that the file is parsed twice: once by ET.parse(file) and once by ET.iterparse(). That's inefficient; it may be acceptable for small documents but is bad for medium to large ones.
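
A rough sketch of the mitigation suggested in these comments: a helper that returns as soon as the first start-ns event arrives instead of consuming every event. first_namespace is a hypothetical name; it takes a filename or a binary file object, as ET.iterparse does.

```python
import io
import xml.etree.ElementTree as ET


def first_namespace(source):
    """Return the first namespace URI declared in `source`, or None."""
    for _event, (_prefix, uri) in ET.iterparse(source, events=["start-ns"]):
        # Returning here abandons the iterator, so the remaining events
        # are never requested.
        return uri
    return None


# Hypothetical, shortened document.
doc = io.BytesIO(
    b'<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">'
    b"<metadata><id>SharpZipLib</id></metadata></package>"
)
print(first_namespace(doc))
```

This only helps when the namespace is all that is needed up front; the elements themselves still have to be parsed afterwards, so it reduces the cost of the second pass rather than removing it.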

Thanks for the idea, I will try it out.

That’s a very helpful hint!

