
java - DOM vs. SAX XML parsing for large files

Reposted. Author: 行者123. Last updated: 2023-12-02 09:43:44

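Background:

I have a large OWL (Web Ontology Language) file (about 125 MB, or 1.5 million lines) that I want to parse into a set of tab-separated values. I have been researching the SAX and DOM XML parsers and found the following:

  • SAX reads the document node by node, so the whole document is never in memory.
  • DOM loads the whole document into memory at once, but has a very large overhead.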

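SAX vs. DOM for large files:

As I understand it,

  • If I use SAX, I will have to iterate through all 1.5 million lines, node by node.
  • If I use DOM, I will pay a large overhead, but the results will come back quickly.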
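
To make the SAX option concrete, here is a minimal sketch of streaming a file to tab-separated rows. It assumes, purely for illustration, that the wanted values are the `rdf:about` attribute of each `owl:Class` and the text of its `rdfs:label` child; the class name `SaxToTsv` is hypothetical, and nested `owl:Class` elements are not handled.

```java
import java.io.InputStream;
import java.io.PrintWriter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the choice of elements to extract is an assumption.
class SaxToTsv {
    static final String OWL = "http://www.w3.org/2002/07/owl#";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    /** Streams the input and writes one "about<TAB>label" row per rdfs:label. */
    static void parse(InputStream in, PrintWriter out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URI + local name
        SAXParser parser = factory.newSAXParser();

        parser.parse(in, new DefaultHandler() {
            private String about;        // rdf:about of the current owl:Class, or null
            private StringBuilder label; // non-null only while inside rdfs:label

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (OWL.equals(uri) && "Class".equals(local)) {
                    about = atts.getValue(RDF, "about");
                } else if (about != null && RDFS.equals(uri) && "label".equals(local)) {
                    label = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser may deliver text in several chunks, so accumulate.
                if (label != null) {
                    label.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (label != null && RDFS.equals(uri) && "label".equals(local)) {
                    out.print(about + "\t" + label.toString().trim() + "\n");
                    label = null;
                } else if (OWL.equals(uri) && "Class".equals(local)) {
                    about = null;
                }
            }
        });
        out.flush();
    }
}
```

Calling `SaxToTsv.parse(Files.newInputStream(path), writer)` keeps only the current element's state in memory, so memory use should stay roughly flat regardless of file size.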

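Question:

I need to be able to use this parser multiple times on similar files of the same length.

Therefore, which parser should I use?

Bonus points: Does anyone know of any good parsers for JavaScript? I realize many are made for Java, but I would prefer JavaScript.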

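Best answer

Meet StAX

Like SAX, StAX follows a streaming programming model for parsing XML. However, it is a cross between DOM's bidirectional read/write support and ease of use, and SAX's CPU and memory efficiency.

SAX is read-only, and push parsing forces you to handle events and errors immediately while the input is being parsed. StAX, on the other hand, is a pull parser that lets the client call methods on the parser when needed. This also means that an application can read multiple XML files at the same time.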
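
The pull model described above can be sketched with the StAX cursor API (`XMLStreamReader`). As an illustration only, this assumes the goal is one row per `rdfs:label` inside an `owl:Class`, keyed by the class's `rdf:about` attribute; the class name `StaxToTsv` and the choice of elements are assumptions, not something stated in the question.

```java
import java.io.InputStream;
import java.io.PrintWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: the choice of elements to extract is an assumption.
class StaxToTsv {
    static final String OWL = "http://www.w3.org/2002/07/owl#";
    static final String RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String RDFS = "http://www.w3.org/2000/01/rdf-schema#";

    /** Pulls events one at a time and writes "about<TAB>label" rows. */
    static void parse(InputStream in, PrintWriter out) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        String about = null; // rdf:about of the enclosing owl:Class, or null
        try {
            // The application drives the loop: nothing is parsed until next() is called.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String uri = reader.getNamespaceURI();
                    String local = reader.getLocalName();
                    if (OWL.equals(uri) && "Class".equals(local)) {
                        about = reader.getAttributeValue(RDF, "about");
                    } else if (about != null && RDFS.equals(uri) && "label".equals(local)) {
                        // getElementText() reads up to the matching end tag.
                        out.print(about + "\t" + reader.getElementText().trim() + "\n");
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && OWL.equals(reader.getNamespaceURI())
                        && "Class".equals(reader.getLocalName())) {
                    about = null;
                }
            }
        } finally {
            reader.close();
        }
        out.flush();
    }
}
```

Because the caller owns the loop, two `XMLStreamReader` instances over different files can be advanced alternately in the same method, which is awkward to do with SAX callbacks.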

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?

StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.
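
The point that an entire document need not be read can be illustrated with a small sketch: the caller stops pulling events as soon as it has what it needs. The class and method names here (`StaxFirstMatch`, `firstText`) are hypothetical, and the sketch assumes the target element contains only text.

```java
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper illustrating early termination with a pull parser.
class StaxFirstMatch {
    /** Returns the text of the first element with the given local name, or null if none. */
    static String firstText(InputStream in, String localName) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())) {
                    // Returning here means no further events are requested
                    // for the remainder of the document.
                    return reader.getElementText();
                }
            }
            return null; // reached end of document without a match
        } finally {
            reader.close();
        }
    }
}
```

With SAX, stopping early typically means throwing an exception out of the handler; with StAX the loop simply ends. How much of the underlying stream has been buffered at that point depends on the parser implementation.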

A similar question about java - DOM vs. SAX XML parsing for large files can be found on Stack Overflow: https://stackoverflow.com/questions/17310543/
