python - Itemize a large XML file in Python

Reposted by: 行者123 Updated: 2023-11-30 22:01:18

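I am designing an ETL pipeline, and as a first step I want to split the input XML dataset into separate XML files, one per item. The input dataset is basically an export of metadata under a specific model (EDM in the current example). I am fairly comfortable with XSLT and would like to use it, to avoid writing too much Python for a problem that should not be that complicated.

I have gone through many threads, including Liza Daly's fast_iter (see https://www.ibm.com/developerworks/xml/library/x-hiperfparse/ ). I tried different approaches, but I always get stuck when writing the files (no output, or serialization problems). I am looking for feedback from someone more experienced!

Dataset structure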

<rdf:RDF ...many namespaces...>
    <!--ITEM1 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url"/>
        <ore:Aggregation rdf:about="http://some/url">
            <...>
        </ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">
            <...>
        </ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">
            <...>      
        </edm:EuropeanaAggregation>
    </ore:aggregates>

    <!--ITEM2 NODE-->
    <ore:aggregates>
        <...>      
    </ore:aggregates>

    <!--ITEM3 NODE-->
    <ore:aggregates>
        <...>      
    </ore:aggregates>
</rdf:RDF>

Expected result

<!--ITEM 1-->
<rdf:RDF ...many namespaces...>
    <edm:ProvidedCHO rdf:about="http://some/url"/>
    <ore:Aggregation rdf:about="http://some/url">
        <...>
    </ore:Aggregation>
    <ore:Proxy rdf:about="http://some/url">
        <...>
    </ore:Proxy>
    <edm:EuropeanaAggregation rdf:about="http://some/url">
        <...>      
    </edm:EuropeanaAggregation>
</rdf:RDF>

Current attempts

Attempt with lxml, applying an item-by-item XSLT (script + XSLT)

from lxml import etree as ET

# source and xsl_filename are paths defined elsewhere in the script
dom = ET.parse(source)
xslt = ET.parse(xsl_filename)
transform = ET.XSLT(xslt)
newdom = transform(dom)
print(ET.tostring(newdom, pretty_print=True))

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet exclude-result-prefixes="xsi xlink xml" version="2.0"
    xmlns:many="namespaces">

    <xsl:output encoding="UTF-8" indent="yes"/>

    <!--<xsl:param name="output" select="'/Users/yep/Code/+dev/test data/output/'"/>-->
    <xsl:param name="output" select="'/home/yep/data/split/'"/>
    <xsl:param name="children" select="/rdf:RDF/ore:aggregates"/>

    <!-- ROOT MATCH -->
    <xsl:template match="/">
        <xsl:for-each select="$children">
            <xsl:call-template name="itemize"/>
        </xsl:for-each>
    </xsl:template>

    <xsl:template name="itemize">

            <xsl:variable name="uri" select="translate(ore:Proxy/dc:identifier, ' ', '_')"/>
            <xsl:variable name="ns"/>
            <xsl:variable name="fullOutput" select="concat($output, $uri)"/>
            <xsl:result-document href="{$fullOutput}.xml" method="xml">
                <xsl:element name="rdf:RDF">
                    <xsl:copy-of select="namespace::*"/>
                    <xsl:copy-of select="*"/>
                </xsl:element>
            </xsl:result-document>
    </xsl:template>

</xsl:stylesheet>

...no output. I also tried "write", but that did not work either.

Attempt with ElementTree

import json
import xml.etree.ElementTree as ET

root = ET.parse(source).getroot()

# namespaces variable generated from a json file
jsonFile = open("application/models/namespaces.json")
jsonStr = jsonFile.read()
namespaces = json.loads(jsonStr)

for i, item in enumerate(root.findall("ore:aggregates", namespaces)):
    newTree = ET.parse("/home/yep/application/services/create/sample.xml")
    newroot = newTree.getroot()

    for node in item.findall("edm:ProvidedCHO", namespaces):
        newroot.append(node)
        # passes an Element where a tag name is expected; write() then fails
        ET.SubElement(newroot, node)

    filename = "/home/yep/data/split/" + str(i) + ".xml"
    newTree.write(filename)

TypeError: cannot serialize <Element '{http://www.europeana.eu/schemas/edm/}ProvidedCHO' at 0x7f4768a03688> (type Element)

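I think the problem comes from my not handling namespaces correctly, or maybe from my still taking an XSLT approach when the data is already in Python... Some help would be much appreciated :)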
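For reference, the TypeError in the ElementTree attempt does not appear to come from namespace handling. `ET.SubElement(parent, tag)` expects a tag name string as its second argument; passing an existing Element creates a child whose tag is that Element object, and the serializer cannot write it. The preceding `newroot.append(node)` call already attaches the node. A minimal sketch of the same loop without that call, using only the standard library, is shown below. The inline sample data, the `split_items` helper, the numbered file names, and the placeholder namespace URIs are assumptions for illustration, not part of the original question.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Placeholder namespace URIs (assumed); replace with the real EDM/ORE/RDF URIs.
NS = {
    "rdf": "http://some_rdf_uri",
    "edm": "http://some_edm_uri",
    "ore": "http://some_ore_uri",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep readable prefixes in the output

SAMPLE = """<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <ore:aggregates>
    <edm:ProvidedCHO rdf:about="http://some/url">from item1</edm:ProvidedCHO>
    <ore:Proxy rdf:about="http://some/url">from item1</ore:Proxy>
  </ore:aggregates>
  <ore:aggregates>
    <edm:ProvidedCHO rdf:about="http://some/url">from item2</edm:ProvidedCHO>
    <ore:Proxy rdf:about="http://some/url">from item2</ore:Proxy>
  </ore:aggregates>
</rdf:RDF>"""


def split_items(xml_text, out_dir):
    """Write one rdf:RDF file per ore:aggregates element and return the paths."""
    root = ET.fromstring(xml_text)
    written = []
    for i, item in enumerate(root.findall("ore:aggregates", NS), start=1):
        # Fresh wrapper element with the same tag and attributes as the source root.
        new_root = ET.Element(root.tag, dict(root.attrib))
        # Attach the existing child Elements directly; no SubElement call needed.
        new_root.extend(list(item))
        path = Path(out_dir) / f"{i}.xml"
        ET.ElementTree(new_root).write(str(path), encoding="utf-8", xml_declaration=True)
        written.append(path)
    return written


out_dir = tempfile.mkdtemp()
paths = split_items(SAMPLE, out_dir)
print([p.name for p in paths])
```

This still loads the whole input into memory, so it only addresses the serialization error, not the size of the dataset.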

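Best answer

Since you are running the XSLT through lxml, you are limited to XSLT 1.0. XSLT 1.0 does not support xsl:result-document, so you have to use the EXSLT document extension (exsl:document) instead, which lxml fortunately supports.

Here is an example...

XML input (test.xml)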

<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
    <!--ITEM1 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url">from item1</edm:ProvidedCHO>
        <ore:Aggregation rdf:about="http://some/url">from item1</ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">from item1</ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">from item1</edm:EuropeanaAggregation>
    </ore:aggregates>

    <!--ITEM2 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url">from item2</edm:ProvidedCHO>
        <ore:Aggregation rdf:about="http://some/url">from item2</ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">from item2</ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">from item2</edm:EuropeanaAggregation>
    </ore:aggregates>

    <!--ITEM3 NODE-->
    <ore:aggregates>
        <edm:ProvidedCHO rdf:about="http://some/url">from item3</edm:ProvidedCHO>
        <ore:Aggregation rdf:about="http://some/url">from item3</ore:Aggregation>
        <ore:Proxy rdf:about="http://some/url">from item3</ore:Proxy>
        <edm:EuropeanaAggregation rdf:about="http://some/url">from item3</edm:EuropeanaAggregation>
    </ore:aggregates>
</rdf:RDF>

XSLT 1.0 (test.xsl)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl">
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*/*">
    <xsl:apply-templates select=".." mode="copy">
      <xsl:with-param name="target_id" select="generate-id()"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="/*" mode="copy">
    <xsl:param name="target_id"/>
    <exsl:document href="{$target_id}.xml" indent="yes">
      <xsl:copy>
        <xsl:copy-of select="@*|*[generate-id()=$target_id]/*"/>
      </xsl:copy>      
    </exsl:document>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("test.xml")
xslt = etree.parse("test.xsl")

tree.xslt(xslt)

Output (the file names are based on generated IDs, so they may differ when you run the code.)

idm253366124.xml

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <edm:ProvidedCHO rdf:about="http://some/url">from item1</edm:ProvidedCHO>
  <ore:Aggregation rdf:about="http://some/url">from item1</ore:Aggregation>
  <ore:Proxy rdf:about="http://some/url">from item1</ore:Proxy>
  <edm:EuropeanaAggregation rdf:about="http://some/url">from item1</edm:EuropeanaAggregation>
</rdf:RDF>

idm219411756.xml

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <edm:ProvidedCHO rdf:about="http://some/url">from item2</edm:ProvidedCHO>
  <ore:Aggregation rdf:about="http://some/url">from item2</ore:Aggregation>
  <ore:Proxy rdf:about="http://some/url">from item2</ore:Proxy>
  <edm:EuropeanaAggregation rdf:about="http://some/url">from item2</edm:EuropeanaAggregation>
</rdf:RDF>

idm219410244.xml

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" xmlns:ore="http://some_ore_uri">
  <edm:ProvidedCHO rdf:about="http://some/url">from item3</edm:ProvidedCHO>
  <ore:Aggregation rdf:about="http://some/url">from item3</ore:Aggregation>
  <ore:Proxy rdf:about="http://some/url">from item3</ore:Proxy>
  <edm:EuropeanaAggregation rdf:about="http://some/url">from item3</edm:EuropeanaAggregation>
</rdf:RDF>


Update for a dynamic path...

XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:rdf="http://some_rdf_uri" xmlns:edm="http://some_edm_uri" 
  xmlns:ore="http://some_ore_uri"
  xmlns:exsl="http://exslt.org/common"
  extension-element-prefixes="exsl">
  <xsl:strip-space elements="*"/>

  <xsl:key name="elem_by_id" match="*" use="generate-id()"/>

  <xsl:template match="/*" name="root">
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:apply-templates select="/*" mode="copy">
      <xsl:with-param name="target_id" select="generate-id()"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="/*" mode="copy">
    <xsl:param name="target_id"/>
    <exsl:document href="temp/{$target_id}.xml" indent="yes">
      <xsl:copy>
        <xsl:copy-of select="@*|key('elem_by_id',$target_id)/*"/>
      </xsl:copy>      
    </exsl:document>
  </xsl:template>

</xsl:stylesheet>

Python

from lxml import etree

tree = etree.parse("test.xml")
xslt = etree.parse("test.xsl")

target_path = "/rdf:RDF/ore:aggregates"

try:
    elem = xslt.xpath("/xsl:stylesheet/xsl:template[@name='root']/xsl:apply-templates",
                      namespaces={"xsl": "http://www.w3.org/1999/XSL/Transform"})[0]
    elem.attrib["select"] = target_path
except IndexError:
    print("Could not find xsl:template to update.")

tree.xslt(xslt)
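Because the question concerns large files and mentions fast_iter, note that the XSLT approach above parses the whole document into memory before transforming it. A streaming alternative in plain Python can be sketched as follows. It is not part of the answer above: it swaps XSLT for the standard library's `xml.etree.ElementTree.iterparse`, assumes the items are direct children of the root, and uses placeholder namespace URIs, an inline sample, and hypothetical `item<N>.xml` file names and a `split_streaming` helper purely for illustration.

```python
import io
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Placeholder namespace URIs (assumed), matching test.xml above.
RDF = "http://some_rdf_uri"
EDM = "http://some_edm_uri"
ORE = "http://some_ore_uri"
for prefix, uri in (("rdf", RDF), ("edm", EDM), ("ore", ORE)):
    ET.register_namespace(prefix, uri)

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:edm="{EDM}" xmlns:ore="{ORE}">
  <ore:aggregates><edm:ProvidedCHO rdf:about="http://some/url">from item1</edm:ProvidedCHO></ore:aggregates>
  <ore:aggregates><edm:ProvidedCHO rdf:about="http://some/url">from item2</edm:ProvidedCHO></ore:aggregates>
  <ore:aggregates><edm:ProvidedCHO rdf:about="http://some/url">from item3</edm:ProvidedCHO></ore:aggregates>
</rdf:RDF>"""


def split_streaming(source, out_dir, item_tag=f"{{{ORE}}}aggregates"):
    """Stream `source`, write one file per item_tag element, return the item count."""
    count = 0
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the first start event is the document root
            continue
        if elem.tag != item_tag:
            continue  # descendants stay attached until their item is complete
        count += 1
        # Wrap the item's children in a copy of the root element.
        wrapper = ET.Element(root.tag, dict(root.attrib))
        wrapper.extend(list(elem))
        target = Path(out_dir) / f"item{count}.xml"
        ET.ElementTree(wrapper).write(str(target), encoding="utf-8", xml_declaration=True)
        elem.clear()  # drop the item's children so memory use stays bounded
    return count


out_dir = tempfile.mkdtemp()
count = split_streaming(io.BytesIO(SAMPLE.encode("utf-8")), out_dir)
print(count, sorted(p.name for p in Path(out_dir).iterdir()))
```

The emptied item elements remain attached to the root after `elem.clear()`, which is usually negligible; lxml's fast_iter pattern goes one step further and deletes processed siblings as well.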

About python - Itemize a large XML file in Python: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54095195/
