Parse XML with namespace attribute changing in Python

Reposted by: bug小助手. Updated: 2023-10-28 09:47:09

I am making a request to a URL, and the namespace in the xmlns attribute of the XML response changes from time to time. As a result, finding an element returns None when I hardcode the namespace.

For instance I get the following XML:

<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>

Now look at the xmlns attribute. The URI is otherwise the same, but the date part ('2012/06' here) differs between responses. I have the following Python script. See the line ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}. I can't hardcode the namespace like that. Are there alternatives, such as regular expressions, to map the namespace? Only the date part changes: it is 2013/05 in some responses, 2012/04 in others, and so on.

import requests
import xml.etree.ElementTree as ET  # assumed: the question only shows the ET alias

def fetch_nuget_spec(self, versioned_package):
    name = versioned_package.package.name.lower()
    version = versioned_package.version.lower()
    url = f'https://api.nuget.org/v3-flatcontainer/{name}/{version}/{name}.nuspec'
    response = requests.get(url)
    metadata = ET.fromstring(response.content)
    # Hardcoded namespace: the lookups below return None whenever the
    # response declares a different date in its xmlns URI.
    ns = {'nuspec': 'http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd'}
    license = metadata.find('./nuspec:metadata/nuspec:license', ns)
    if license is None:
        license_url = metadata.find('./nuspec:metadata/nuspec:licenseUrl', ns)
        if license_url is None:
            return {'license': 'Not Found'}
        return {'license': license_url.text}
    # license.text is None for an empty element, so test truthiness rather than len()
    if not license.text:
        print('Empty license element')
    return {'license': license.text}

More replies

stackoverflow.com/questions/14853243/…

Recommended answers

If lxml is an option, it can list the namespaces declared on the root element:

from lxml import etree
doc = etree.parse("tmp.xml")
# get namespaces excluding the default 'xml'
ns = { ('nuspec' if t[0] is None else t[0]): t[1] for t in doc.xpath('/*/namespace::*[name()!="xml"]')}
print(ns)
# {'nuspec': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}

Using both lxml and xml.etree.ElementTree could mean that the document is parsed twice, so use only lxml if possible, since it has a more complete XML and XPath implementation.

If that's not possible, ElementTree can be used on the result of the lxml parse:

>>> tree = ET.ElementTree(doc)
>>> tree.find('./nuspec:metadata/nuspec:licenseUrl', ns)
<Element {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}licenseUrl at 0x7fe019ea1cc8>

The xml.etree.ElementTree implementation lacks namespace axis support.

Without another module, using only xml.etree.ElementTree:

import xml.etree.ElementTree as ET

tree = ET.parse('xml_str.xml')
root = tree.getroot()

ns = dict([node for _, node in ET.iterparse('xml_str.xml', events=['start-ns'])])
print(ns)

licenseUrl = root.find(".//licenseUrl", ns).text
print("LicenseUrl: ", licenseUrl)

Output:

{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt

Option 2, if parsing time is important:

import xml.etree.ElementTree as ET

nsmap = {}
for event, node in ET.iterparse('xml_str.xml', events=['start-ns', 'end']):
    
    if event == 'start-ns':
        ns, url = node
        nsmap[ns] = url
        print(nsmap)

    if event == 'end' and node.tag == f"{{{url}}}licenseUrl":
        print(node.text)

Output:

{'': 'http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd'}
https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
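
If the namespace URI itself is never needed, a further option is to skip the lookup entirely. Since Python 3.8, ElementTree paths accept `{*}name`, which matches `name` in any namespace. A minimal sketch, assuming Python 3.8 or later; the two trimmed-down documents and the `example.org` license URLs are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Two illustrative nuspec fragments that differ only in the dated namespace URI.
DOC_2012 = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE-A</licenseUrl></metadata>
</package>"""
DOC_2013 = """<package xmlns="http://schemas.microsoft.com/packaging/2013/05/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE-B</licenseUrl></metadata>
</package>"""

def license_url(xml_text):
    """Return the licenseUrl text regardless of which namespace the document uses."""
    root = ET.fromstring(xml_text)
    # "{*}name" matches elements called "name" in any namespace (or in none).
    node = root.find("./{*}metadata/{*}licenseUrl")
    return None if node is None else node.text

print(license_url(DOC_2012))
print(license_url(DOC_2013))
```

The trade-off is that the wildcard also matches a `licenseUrl` element from an unrelated namespace, so it suits documents whose vocabulary is known in advance.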

Don't hardcode the namespace. You can extract it from the root element's tag with a regular expression:

import xml.etree.ElementTree as ET
import re

xml = """<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
<metadata>
<id>SharpZipLib</id>
<version>1.1.0</version>
<authors>ICSharpCode</authors>
<owners>ICSharpCode</owners>
<requireLicenseAcceptance>false</requireLicenseAcceptance>
<licenseUrl>https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt</licenseUrl>
<projectUrl>https://github.com/icsharpcode/SharpZipLib</projectUrl>
<description>SharpZipLib (#ziplib, formerly NZipLib) is a compression library for Zip, GZip, BZip2, and Tar written entirely in C# for .NET. It is implemented as an assembly (installable in the GAC), and thus can easily be incorporated into other projects (in any .NET language)</description>
<releaseNotes>Please see https://github.com/icsharpcode/SharpZipLib/wiki/Release-1.1 for more information.</releaseNotes>
<copyright>Copyright © 2000-2018 SharpZipLib Contributors</copyright>
<tags>Compression Library Zip GZip BZip2 LZW Tar</tags>
<repository type="git" url="https://github.com/icsharpcode/SharpZipLib" commit="45347c34a0752f188ae742e9e295a22de6b2c2ed"/>
<dependencies>
<group targetFramework=".NETFramework4.5"/>
<group targetFramework=".NETStandard2.0"/>
</dependencies>
</metadata>
</package>"""

root = ET.fromstring(xml)

# Find namespace with regex
ns = re.match(r'{.*}', root.tag).group(0)
print("Namespace: ", ns)

licenseUrl = root.find(f".//{ns}licenseUrl").text
print("LicenseUrl: ", licenseUrl)

Output:

Namespace:  {http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd}
LicenseUrl:  https://github.com/icsharpcode/SharpZipLib/blob/master/LICENSE.txt
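
ElementTree stores a qualified tag as `{uri}local`, so the same namespace can also be read with plain string methods and then plugged into the prefixed paths from the question. A minimal sketch, assuming the root element is namespace-qualified as in the nuspec above; `root_namespace`, `extract_license` and the sample document are illustrative names and data, not part of the original code:

```python
import xml.etree.ElementTree as ET

def root_namespace(root):
    """Return the namespace URI taken from the root tag '{uri}local'."""
    if not root.tag.startswith("{"):
        raise ValueError("root element is not namespace-qualified")
    return root.tag[1:].partition("}")[0]

def extract_license(xml_bytes):
    """Mirror the question's lookup: prefer <license>, fall back to <licenseUrl>."""
    root = ET.fromstring(xml_bytes)
    ns = {"nuspec": root_namespace(root)}  # built per document, never hardcoded
    for path in ("./nuspec:metadata/nuspec:license",
                 "./nuspec:metadata/nuspec:licenseUrl"):
        node = root.find(path, ns)
        # Treating empty text as missing is a choice made in this sketch.
        if node is not None and node.text:
            return {"license": node.text}
    return {"license": "Not Found"}

SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/04/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE</licenseUrl></metadata>
</package>"""

print(extract_license(SAMPLE))
```

Because the prefix map is rebuilt for every response, the path expressions stay identical to the ones in the question.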

You need to be aware that the reason they put the date in the namespace URI is that the format of the XML can change from one version to another, so if you're going to write code that works with any version, you need to make sure it is tested properly against multiple versions. (Generally people advise against versioning namespace URIs, for exactly the reasons you are seeing, but not everyone follows that advice, and that appears to include Microsoft).

My own preference when trying to handle multiple versions of an input document format is to insert a normalisation step into your processing pipeline: this should transform the incoming documents into a common format so that the rest of your processing doesn't need to worry about the variations. As well as changing the namespaces, this phase could handle any other differences you encounter in the formats.
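
Such a normalisation step could be sketched in Python as follows (the answer itself prefers XSLT for this). Every element whose namespace looks like a dated nuspec URI is moved into one fixed namespace, so later code can use a single prefix map. The canonical URI `urn:example:nuspec`, the prefix and suffix constants, and the function name are assumptions made for illustration:

```python
import xml.etree.ElementTree as ET

# Assumed shape of the versioned URIs: fixed prefix, a date, fixed suffix.
NUSPEC_PREFIX = "http://schemas.microsoft.com/packaging/"
NUSPEC_SUFFIX = "/nuspec.xsd"
CANONICAL_NS = "urn:example:nuspec"  # hypothetical target namespace

def normalise(root):
    """Rewrite every dated nuspec namespace on element tags to CANONICAL_NS, in place."""
    for element in root.iter():
        tag = element.tag
        if not (isinstance(tag, str) and tag.startswith("{")):
            continue  # unqualified element, or a comment/processing instruction
        uri, _, local = tag[1:].partition("}")
        if uri.startswith(NUSPEC_PREFIX) and uri.endswith(NUSPEC_SUFFIX):
            element.tag = f"{{{CANONICAL_NS}}}{local}"
    return root

NS = {"n": CANONICAL_NS}
doc = """<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><id>SharpZipLib</id></metadata>
</package>"""
root = normalise(ET.fromstring(doc))
print(root.find("./n:metadata/n:id", NS).text)
```

This sketch only touches element names. Other differences between format versions, as the answer notes, would be handled in the same phase.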

My other preference is to do as much of the processing as possible in XSLT, and an XSLT step that normalizes the namespace is pretty easy to write, especially if you use XSLT 3.0.

Please don't follow the advice of processing XML using regular expressions. It can only lead to tears. For example, if someone posts a nuspec document containing an older namespace commented out, it's very likely to throw your processing off completely.

More replies

The drawback here is that the document is parsed twice. A mitigation could be to write a function that handles the iterparse action and returns as soon as the namespace is found.

What I'm saying is that all events will be parsed until the end of the document so it should break out of the loop as soon as the namespace was found to avoid that.

ET.iterparse() doesn't know how many start-ns events it will find, so it will look at all events until the end of the document unless you break out of the loop. The difference might be negligible on small documents but significant if there are hundreds or thousands of elements. As a test, print something on all end events.
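
The early exit described in these comments could look like the sketch below. It relies on iterparse reporting an element's `start-ns` events before that element's `start` event, so the first `start` event marks the point where the root's declarations are complete. The function name and the sample bytes are illustrative:

```python
import io
import xml.etree.ElementTree as ET

def root_namespaces(source):
    """Collect the namespace declarations on the root element, then stop reading."""
    nsmap = {}
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            prefix, uri = payload  # prefix is '' for a default namespace
            nsmap[prefix] = uri
        else:
            # First "start" event is the root element itself: nothing more to collect.
            break
    return nsmap

SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><id>SharpZipLib</id></metadata>
</package>"""

print(root_namespaces(io.BytesIO(SAMPLE)))
```

iterparse reads the source in chunks, so the saving only becomes visible on documents larger than one chunk.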

The problem is that the file is parsed twice: ET.parse(file) and ET.iterparse(). That's not efficient and could be acceptable/ignored for small documents but bad for medium to large documents.
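
One way to avoid the second parse is to let a single iterparse run produce both the namespace map and the tree: once the iterator is exhausted, its `root` attribute holds the root element. A minimal sketch; the function name, the `nuspec` fallback prefix for the default namespace, and the sample bytes are assumptions for illustration:

```python
import io
import xml.etree.ElementTree as ET

def parse_with_namespaces(source, default_prefix="nuspec"):
    """Parse once, returning (root element, prefix-to-URI map)."""
    nsmap = {}
    events = ET.iterparse(source, events=("start-ns",))
    for _event, (prefix, uri) in events:
        # Give the default namespace a usable prefix; keep the first URI seen per prefix.
        nsmap.setdefault(prefix or default_prefix, uri)
    # The iterator has consumed the whole document, so the tree is complete.
    return events.root, nsmap

SAMPLE = b"""<package xmlns="http://schemas.microsoft.com/packaging/2012/06/nuspec.xsd">
  <metadata><licenseUrl>https://example.org/LICENSE</licenseUrl></metadata>
</package>"""

root, ns = parse_with_namespaces(io.BytesIO(SAMPLE))
print(ns)
print(root.find("./nuspec:metadata/nuspec:licenseUrl", ns).text)
```

Unlike an early exit, this reads the whole document, but it does so only once.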

Thanks for the idea I will try it out

That’s a very helpful hint!
