
Parse XML with namespace attribute changing in Python

Reposted. Author: bug小助手. Updated: 2023-10-28 09:47:09

I am making a request to a URL, and in the XML response the namespace in the xmlns attribute changes from time to time. As a result, finding an element returns None when I hardcode the namespace.


For instance I get the following XML:


<package xmlns="">
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>

Look at the xmlns attribute. The URI is otherwise identical across responses, but the date part (for example '2012/06') changes: in some responses it is 2013/05, in others 2012/04, and so on. I have the following Python script; see the line ns = {'nuspec': ''}. I can't hardcode the namespace like that. Are there any alternatives, such as using regular expressions, to map the namespace?


def fetch_nuget_spec(self, versioned_package):
    name =
    version = versioned_package.version.lower()
    url = f'{name}/{version}/{name}.nuspec'
    response = requests.get(url)
    metadata = ET.fromstring(response.content)
    ns = {'nuspec': ''}
    license = metadata.find('./nuspec:metadata/nuspec:license', ns)
    if license is None:
        license_url = metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
        if license_url is None:
            return {'license': 'Not Found'}
        return {'license': license_url.text}
    if len(license.text) == 0:
        return {'license': license.text}



If using lxml is an option, it can list the namespaces declared in the document, like this:


from lxml import etree
doc = etree.parse("tmp.xml")
# get namespaces excluding the default 'xml'
ns = { ('nuspec' if t[0] is None else t[0]): t[1] for t in doc.xpath('/*/namespace::*[name()!="xml"]')}
# {'nuspec': ''}

Using both lxml and xml.etree.ElementTree could mean parsing the document twice, so use only lxml if possible; it has a more complete XML and XPath implementation.

If that's not possible, ET can be used on the result of the lxml parse:


>>> tree = ET.ElementTree(doc)
>>> tree.find('./nuspec:metadata/nuspec:licenseUrl', ns)
<Element {}licenseUrl at 0x7fe019ea1cc8>

xml.etree.ElementTree implementation lacks namespace axis support.


Without another module, using only xml.etree.ElementTree:


import xml.etree.ElementTree as ET

tree = ET.parse('xml_str.xml')
root = tree.getroot()

ns = dict([node for _, node in ET.iterparse('xml_str.xml', events=['start-ns'])])

licenseUrl = root.find(".//licenseUrl", ns).text
print("LicenseUrl: ", licenseUrl)



{'': ''}

Option 2, if parsing time is important:


import xml.etree.ElementTree as ET

nsmap = {}
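for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):

    if event == 'start-ns':
        ns, url = node
        nsmap[ns] = url

    if event == 'end' and node.tag == f"{{{url}}}licenseUrl":
        # Assumed body (this branch is empty in the post): print the text as in the first option
        print("LicenseUrl: ", node.text)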



{'': ''}
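If the goal is only to find elements no matter which dated namespace a response uses, Python 3.8 and later also accept a `{*}` wildcard in `find()`/`findall()` paths, which avoids building a namespace map at all. A minimal sketch of that idea; the sample documents and the `http://example.com/...` namespace URIs are made up for illustration and are not the real nuspec URIs, and `find_license_url` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: same structure, different dated namespace URIs.
DOC_2012 = """<package xmlns="http://example.com/nuspec/2012/06">
  <metadata><licenseUrl>http://example.com/license-a</licenseUrl></metadata>
</package>"""
DOC_2013 = DOC_2012.replace("2012/06", "2013/05").replace("license-a", "license-b")


def find_license_url(xml_text):
    """Return the licenseUrl text, whatever namespace the document uses."""
    root = ET.fromstring(xml_text)
    # '{*}' matches the given tag name in any (or no) namespace (Python 3.8+).
    node = root.find("./{*}metadata/{*}licenseUrl")
    return None if node is None else node.text


print(find_license_url(DOC_2012))
print(find_license_url(DOC_2013))
```

The trade-off is that the wildcard matches any namespace, so it will not notice if a response arrives in a format version you never tested against.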

Don’t hardcode the namespace. With regex you can find it with:


import xml.etree.ElementTree as ET
import re

xml = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="">
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>

root = ET.fromstring(xml)

# Find namespace with regex
ns = re.match(r'{.*}', root.tag).group(0)
print("Namespace: ", ns)

licenseUrl = root.find(f".//{ns}licenseUrl").text
print("LicenseUrl: ", licenseUrl)



Namespace:  {}
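The same value can be read from the root tag with plain string operations, since ElementTree stores qualified names as `{uri}localname`. A small sketch, assuming a made-up sample document and a placeholder `http://example.com/...` namespace URI (not the real nuspec URI); `namespace_of` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace URI is a placeholder.
SAMPLE = """<package xmlns="http://example.com/nuspec/2012/06">
  <metadata><licenseUrl>http://example.com/license</licenseUrl></metadata>
</package>"""


def namespace_of(element):
    """Return the namespace URI of an element's tag, or '' if it has none."""
    if element.tag.startswith("{"):
        # Tags look like '{uri}localname'; take the part between the braces.
        return element.tag[1:].partition("}")[0]
    return ""


root = ET.fromstring(SAMPLE)
uri = namespace_of(root)

# Reuse the prefix-based paths from the question with the detected URI.
ns = {"nuspec": uri}
license_url = root.find("./nuspec:metadata/nuspec:licenseUrl", ns)
print(uri)
print(license_url.text)
```

This assumes the whole document uses the root element's namespace; the prefixed paths will not match if the root has no namespace at all.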

You need to be aware that the reason they put the date in the namespace URI is that the format of the XML can change from one version to another, so if you're going to write code that works with any version, you need to make sure it is tested properly against multiple versions. (Generally people advise against versioning namespace URIs, for exactly the reasons you are seeing, but not everyone follows that advice, and that appears to include Microsoft).


My own preference when trying to handle multiple versions of an input document format is to insert a normalisation step into your processing pipeline: this should transform the incoming documents into a common format so that the rest of your processing doesn't need to worry about the variations. As well as changing the namespaces, this phase could handle any other differences you encounter in the formats.
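One way to picture such a normalisation step in Python is to strip the namespace from every element tag right after parsing, so the rest of the code can use plain names. This is only a sketch under assumptions: the sample document and its `http://example.com/...` URI are invented, `strip_namespaces` is a hypothetical helper, and it deliberately ignores namespaced attributes and any structural differences between format versions:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace URI is a placeholder.
SAMPLE = """<package xmlns="http://example.com/nuspec/2013/05">
  <metadata><licenseUrl>http://example.com/license</licenseUrl></metadata>
</package>"""


def strip_namespaces(root):
    """Rewrite every element tag in place from '{uri}name' to plain 'name'."""
    for element in root.iter():
        tag = element.tag
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(tag, str) and tag.startswith("{"):
            element.tag = tag.partition("}")[2]
    return root


root = strip_namespaces(ET.fromstring(SAMPLE))
print(root.tag)
print(root.find("./metadata/licenseUrl").text)
```

After this step the lookups no longer depend on the date in the namespace URI, but the code still needs testing against each version it is expected to handle.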


My other preference is to do as much of the processing as possible in XSLT, and an XSLT step that normalizes the namespace is pretty easy to write, especially if you use XSLT 3.0.


Please don't follow the advice of processing XML using regular expressions. It can only lead to tears. For example if someone posts a nuspec document containing an older namespace commented out, it's very likely to throw your processing completely.



The drawback here is that the document is parsed twice. A mitigation could be to write a function that handles the iterparse action and returns as soon as the namespace is found.
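A rough sketch of such a function, assuming the default (unprefixed) namespace is the one of interest. The sample bytes, the `http://example.com/...` URI and the name `default_namespace` are all made up for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace URI is a placeholder.
SAMPLE = b"""<package xmlns="http://example.com/nuspec/2012/06">
  <metadata><licenseUrl>http://example.com/license</licenseUrl></metadata>
</package>"""


def default_namespace(source):
    """Return the URI of the first default-namespace declaration, or None."""
    # With only 'start-ns' requested, each item is a (prefix, uri) pair.
    for _, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        if prefix == "":
            return uri  # stop iterating as soon as the namespace is known
    return None


uri = default_namespace(io.BytesIO(SAMPLE))
print(uri)
```

Passing an already open file object (here a `BytesIO`) leaves closing it to the caller, which matters because the loop is abandoned before the parser reaches the end of the input.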


What I'm saying is that all events will be parsed until the end of the document so it should break out of the loop as soon as the namespace was found to avoid that.


ET.iterparse() doesn't know how many start-ns events it will find, so it looks at all events until the end of the document unless you break out of the loop. The difference might be negligible on small documents but significant if there are hundreds or thousands of elements. As a test, print something on all end events.


The problem is that the file is parsed twice: once by ET.parse(file) and once by ET.iterparse(). That's inefficient; it may be acceptable for small documents but is bad for medium to large ones.
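A possible way around the double parse is to let a single iterparse() pass collect both the namespace declarations and the root element. This is a sketch, not a tested drop-in: the sample bytes, the `http://example.com/...` URI and the name `parse_once` are assumptions, and the lookup through the `''` key needs Python 3.8 or later:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample; the namespace URI is a placeholder.
SAMPLE = b"""<package xmlns="http://example.com/nuspec/2012/06">
  <metadata><licenseUrl>http://example.com/license</licenseUrl></metadata>
</package>"""


def parse_once(source):
    """Parse `source` a single time; return (root element, prefix -> URI map)."""
    nsmap = {}
    root = None
    for event, item in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = item
            nsmap.setdefault(prefix, uri)  # keep the first declaration per prefix
        elif root is None:
            root = item  # the first 'start' event belongs to the root element
    # The loop ran to the end of the input, so the tree under root is complete.
    return root, nsmap


root, nsmap = parse_once(io.BytesIO(SAMPLE))
# The '' key is treated as the default namespace for unprefixed names (3.8+).
node = root.find("./metadata/licenseUrl", nsmap)
print(nsmap)
print(node.text)
```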


Thanks for the idea I will try it out


That’s a very helpful hint!

