Parse XML with namespace attribute changing in Python

Reposted by: bug小助手 · Updated: 2023-10-28 09:38:22

I am making a request to a URL, and in the XML response I get, the namespace in the xmlns attribute changes from time to time. As a result, finding an element returns None when I hardcode the namespace.

For instance, I get the following XML:

<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>

Now look at the xmlns attribute. The URI is otherwise identical, but the date part (here '2012/06') differs between responses. My Python script is below; see the line ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}. I can't hardcode the namespace like that. Are there any alternatives, such as using regular expressions, to map the namespace? Only the date part changes: it is 2013/05 in some responses, 2012/04 in others, and so on.

def fetch_nuget_spec(self, versioned_package):
        name = versioned_package.package.name.lower()
        version = versioned_package.version.lower()
        url = f'https://api.nuget.org/v3-flatcontainer/{name}/{version}/{name}.nuspec'
        response = requests.get(url)
        metadata = ET.fromstring(response.content)
        ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
        license = metadata.find('./nuspec:metadata/nuspec:license', ns)
        if license is None:
            license_url=metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
            if license_url is None:
                return { 'license': 'Not Found'  }
            return {'license':license_url.text}
        else:
            if len(license.text)==0:
                print('SHIT')
            return { 'license': license.text  }

Comments

stackoverflow.com/questions/14853243/…

Recommended answers

If using lxml is an option, you can list the document's namespaces like this:

from lxml import etree
doc = etree.parse("tmp.xml")
# get namespaces excluding the default 'xml'
ns = { ('nuspec' if t[0] is None else t[0]): t[1] for t in doc.xpath('/*/namespace::*[name()!="xml"]')}
print(ns)
# {'nuspec': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}

Using both lxml and xml.etree.ElementTree could mean parsing the document twice, so use only lxml if possible, since it has a more complete XML and XPath implementation.

If that's not possible, ET can be applied to the result of the lxml parse:

>>> tree = ET.ElementTree(doc)
>>> tree.find('./nuspec:metadata/nuspec:licenseUrl', ns)
<Element {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}licenseUrl at 0x7fe019ea1cc8>

The xml.etree.ElementTree implementation lacks support for the XPath namespace axis.

Without another module, using only xml.etree.ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('xml_str.xml')
root = tree.getroot()

ns = dict([node for _, node in ET.iterparse('xml_str.xml', events=['start-ns'])])
print(ns)

licenseUrl = root.find(".//licenseUrl", ns).text
print("LicenseUrl: ", licenseUrl)

Output:

{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt

Option 2, if parsing time is important:

import xml.etree.ElementTree as ET

nsmap = {}
for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):
    
    if event == 'start-ns':
        ns, url = node
        nsmap[ns] = url
        print(nsmap)

    if event == 'end' and node.tag == f"{{{url}}}licenseUrl":
        print(node.text)

Output:

{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
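
A further stdlib-only variant, offered as a minimal sketch rather than as part of the answer above: since Python 3.8, ElementTree paths accept the `{*}` wildcard, which matches a tag in any namespace, so the dated URI never has to be known. The two trimmed-down documents and their example.org URLs are made up for illustration, and `find_license_url` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative documents that differ only in the dated namespace URI.
DOC_2012 = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE-2012</licenseUrl></metadata>
</package>"""
DOC_2013 = """<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE-2013</licenseUrl></metadata>
</package>"""


def find_license_url(xml_text):
    """Return the licenseUrl text, or None, whatever the namespace URI is."""
    root = ET.fromstring(xml_text)
    # '{*}' matches the tag in any namespace (or none); needs Python 3.8+.
    node = root.find("./{*}metadata/{*}licenseUrl")
    return None if node is None else node.text


print(find_license_url(DOC_2012))
print(find_license_url(DOC_2013))
```

The trade-off is that the wildcard also matches elements from unrelated namespaces, which is acceptable only if the document is known to be a nuspec file.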

Don't hardcode the namespace. You can extract it from the root tag with a regex:

import xml.etree.ElementTree as ET
import re

xml = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>"""

root = ET.fromstring(xml)

# Find namespace with regex
ns = re.match(r'{.*}', root.tag).group(0)
print("Namespace: ", ns)

licenseUrl = root.find(f".//{ns}licenseUrl").text
print("LicenseUrl: ", licenseUrl)

Output:

Namespace:  {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
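
The same idea can be folded back into the lookup from the question. The sketch below is one possible shape, not the asker's actual code: it reads the namespace from the root tag with plain string handling instead of a regex, then tries license before licenseUrl. `namespace_of`, `license_from_nuspec`, and the sample document are hypothetical names and data introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative response body; the example.org URL is made up.
SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata>
    <id>Demo</id>
    <licenseUrl>https://example.org/LICENSE.txt</licenseUrl>
  </metadata>
</package>"""


def namespace_of(element):
    """Return the namespace URI of an element's tag, or '' if it has none."""
    # ElementTree stores qualified tags as '{uri}localname'.
    if element.tag.startswith("{"):
        return element.tag[1:].partition("}")[0]
    return ""


def license_from_nuspec(xml_bytes):
    """Mirror the question's fallback: license first, then licenseUrl."""
    root = ET.fromstring(xml_bytes)
    uri = namespace_of(root)
    qualifier = f"{{{uri}}}" if uri else ""  # '{uri}' prefix, or nothing
    for name in ("license", "licenseUrl"):
        node = root.find(f"./{qualifier}metadata/{qualifier}{name}")
        if node is not None and node.text:
            return {"license": node.text}
    return {"license": "Not Found"}


print(license_from_nuspec(SAMPLE))
```

In the question's method this would be called with response.content in place of SAMPLE.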

You need to be aware that the reason they put the date in the namespace URI is that the format of the XML can change from one version to another, so if you're going to write code that works with any version, you need to make sure it is tested properly against multiple versions. (Generally people advise against versioning namespace URIs, for exactly the reasons you are seeing, but not everyone follows that advice, and that appears to include Microsoft.)

My own preference when trying to handle multiple versions of an input document format is to insert a normalisation step into your processing pipeline: this should transform the incoming documents into a common format so that the rest of your processing doesn't need to worry about the variations. As well as changing the namespaces, this phase could handle any other differences you encounter in the formats.
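
A normalisation step of this kind could be sketched in plain Python as follows. This is an assumption-laden illustration, not the answerer's implementation: the canonical URI is an arbitrary choice, the prefix/suffix test for "a dated nuspec namespace" is a guess at the URI pattern based on the examples in the question, and `normalise_nuspec` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Arbitrary choice of canonical namespace (assumption for this sketch).
CANONICAL_NS = "http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd"
NS_PREFIX = "http://schemas.microsoft.com/packaging/"
NS_SUFFIX = "/nuspec.xsd"


def normalise_nuspec(root):
    """Rewrite every dated nuspec namespace in the tree to CANONICAL_NS, in place."""
    for element in root.iter():
        tag = element.tag
        if not isinstance(tag, str) or not tag.startswith("{"):
            continue  # unqualified tag, or a comment/PI node
        uri, _, local = tag[1:].partition("}")
        if uri.startswith(NS_PREFIX) and uri.endswith(NS_SUFFIX):
            element.tag = f"{{{CANONICAL_NS}}}{local}"
    return root


doc = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE.txt</licenseUrl></metadata>
</package>"""

root = normalise_nuspec(ET.fromstring(doc))
# Downstream code can now use one fixed namespace mapping.
ns = {"nuspec": CANONICAL_NS}
print(root.find("./nuspec:metadata/nuspec:licenseUrl", ns).text)
```

After this step the rest of the pipeline can keep a hardcoded mapping, and any other per-version differences can be handled in the same function.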

My other preference is to do as much of the processing as possible in XSLT, and an XSLT step that normalizes the namespace is pretty easy to write, especially if you use XSLT 3.0.

Please don't follow the advice of processing XML using regular expressions. It can only lead to tears. For example, if someone posts a nuspec document containing an older namespace commented out, it's very likely to throw your processing off completely.
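
To make the failure mode concrete, here is a small sketch of a regex run over the raw document text, compared with reading the namespace from the parsed root. The document, with its commented-out 2011/08 declaration, is invented for illustration; the risk applies to regexes over raw XML text, not to a pattern applied to an already-parsed tag.

```python
import re
import xml.etree.ElementTree as ET

# Invented document: an older declaration survives inside a comment.
DOC = """<!-- <package xmlns="http://schemas.microsoft.com/packaging/2011/08/nuspec.xsd"> -->
<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata><id>Demo</id></metadata>
</package>"""

# Naive approach: take the first xmlns="..." found in the raw text.
naive = re.search(r'xmlns="([^"]+)"', DOC).group(1)

# Parser-based approach: the parser skips the comment, so the root tag
# carries the namespace that is actually in effect.
root = ET.fromstring(DOC)
parsed = root.tag[1:].partition("}")[0]

print("regex over raw text:", naive)
print("from parsed root:   ", parsed)
```

The regex reports the commented-out 2011/08 URI, while the parser reports the 2013/05 URI that the elements really use.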

Comments

The drawback here is that the document is parsed twice. A mitigation could be to write a function that handles the iterparse action and returns as soon as the namespace is found.

What I'm saying is that all events will be parsed until the end of the document, so the loop should break as soon as the namespace is found to avoid that.

ET.iterparse() doesn't know how many start-ns events it will find, so it looks at all events until the end of the document unless you break out of the loop. The difference might be negligible on small documents but significant if there are hundreds or thousands of elements. As a test, print something on every end event.
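
The early exit described in these comments might look like the sketch below. `first_namespace` is a hypothetical helper; it assumes the namespace of interest is the first one declared, and it takes an open file object so that abandoning the iterator does not leave a file opened by iterparse itself. Note that iterparse reads its input in chunks, so "stopping early" saves work at chunk granularity rather than byte by byte.

```python
import io
import xml.etree.ElementTree as ET


def first_namespace(source):
    """Return the first namespace URI declared in the document, or None."""
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            _prefix, uri = payload
            return uri  # stop as soon as a declaration is seen
        # A 'start' event arrived first: the root declares no namespace.
        return None
    return None


doc = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><id>Demo</id></metadata>
</package>"""

print(first_namespace(io.BytesIO(doc)))
```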

The problem is that the file is parsed twice: once by ET.parse(file) and once by ET.iterparse(). That's inefficient; it may be acceptable for small documents but is bad for medium to large ones.
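
One way to avoid the double parse, sketched here under the assumption that the whole tree is needed anyway, is to let a single iterparse pass both collect the namespace declarations and build the tree; the iterator exposes the finished tree through its root attribute once the input is exhausted. `parse_with_namespaces` is a hypothetical name, and the default-namespace lookup via an empty-string key needs Python 3.8+.

```python
import io
import xml.etree.ElementTree as ET


def parse_with_namespaces(source):
    """Parse once, returning (root element, {prefix: uri}) together."""
    nsmap = {}
    events = ET.iterparse(source, events=("start-ns",))
    for _event, (prefix, uri) in events:
        nsmap.setdefault(prefix, uri)  # keep the first declaration per prefix
    # After the loop finishes, the iterator holds the fully built tree.
    return events.root, nsmap


doc = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE.txt</licenseUrl></metadata>
</package>"""

root, nsmap = parse_with_namespaces(io.BytesIO(doc))
print(nsmap)
# The '' key acts as the default namespace for unprefixed names in the path.
print(root.find(".//licenseUrl", nsmap).text)
```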

Thanks for the idea, I will try it out.

That’s a very helpful hint!
