
Java - SaxParser/DocumentBuilder "failing" to get the correct tag body

Reposted. Author: 行者123. Updated: 2023-12-01 04:54:57

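I have a situation where I need to read multiple XML files and build a single model from them. Unfortunately, these files are generated by a legacy system that I am absolutely not able to change.

One of the XML files giving me trouble looks more or less like this (modified to remove proprietary data):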

<resource lang="en" dataId="900">
numbered content here, 900-919 ...

<string name="920-name">Document Shredder</string>
<string name="920-desc">A machine ideal for destroying documents that deserve it. It can cross-shred anything from tissue paper to small netbooks with minimal noise. Remember, hackers can't access the documents if you've shredded the drives.</string>
<string name="920-cat">office,appliance</string>
<string name="921-name">Plastic Ladle</string>
<string name="921-desc">This is a big plastic ladle, ideal for soups and sauces.</string>
<string name="921-cat">kitchen,utensils</string>

... similar numbered content here, 922-934 ...

<string name="935-name">Green Laser Pointer</string>
<string name="935-desc">A High-Powered green laser pointer, ideal for irritating cats.</string>
<string name="935-cat">office,tool</string>
<string name="936-name">Black Metal Filing Cabinet</string>
<string name="936-desc">A large, metal cabinet (black) built to store hanging file folders.</string>
<string name="936-cat">office,storage</string>

... similar numbered content here, 937-994
</resource>

I am parsing this into a List<CString>, where CString.java is:

public class CString {
    public String name;
    public String body; // text content of the <string> element

    @Override
    public String toString() {
        return "CString {!name: " + name + " !body: " + body + "}\n";
    }
}

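I have tried using a DocumentBuilder and, when that didn't work, just a plain SaxParser. But no matter how I approach it, when I look back at my CStrings, some of the bodies actually contain unparsed tags from different parts of the document. For example, printing out the List<CString> I mentioned earlier might produce something like this: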

[ CStrings for 900-919 ...

, CString {!name: 920-name !body: Document Shredder}
, CString {!name: 920-desc !body: irritating cats.</string>
<string name="935-cat">office,tool</string>
<string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.}
, CString {!name: 920-cat !body: office,appliance}
, CString {!name: 921-name !body: Plastic Ladle}
, CString {!name: 921-desc !body: This is a big plastic ladle, ideal for soups and sauces.}
, CString {!name: 921-cat !body: kitchen,utensils}

... CStrings for 922-934 ...

, CString {!name: 935-name !body: Green Laser Pointer}
, CString {!name: 935-desc !body: A High-Powered green laser pointer, ideal for irritating cats.}
, CString {!name: 935-cat !body: office,tool}
, CString {!name: 936-name !body: Black Metal Filing Cabinet}
, CString {!name: 936-desc !body: A large, metal cabinet (black) built to store hanging file folders.}
, CString {!name: 936-cat !body: office,storage}

... CStrings for 937-994
]

In the SaxParser version of my code, I have the following characters method in my DefaultHandler:

public void characters(char[] ch, int start, int length) throws SAXException {
    String value = new String(ch, start, length).trim();
    switch (currentQName.toString()) { // currentQName is a StringBuilder that holds just the current XML element's name
        case "string":
            if (value.contains("</string")) {
                System.err.println("!!! Parse Error !!! " + value);
            }
            break;
    }
}

Which, as you have probably guessed, produces:

!!! Parse Error !!! irritating cats.</string>
<string name="935-cat">office,tool</string>
<string name="936-name">Black Metal Filing Cabinet</e. Remember, hackers can't access the documents if you've shredded the drives.

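I don't usually ask such esoteric questions, especially when I can't provide concrete data and code, but Googling doesn't seem to turn up anything I've been able to pin down, and, of course, the code doesn't throw (or suppress) any exceptions.

One thing I have noticed is that when bad data is present, as in the CString for 920-desc above, the bad data is 138 characters long and, not coincidentally, the correct data picks back up exactly 139 characters in and becomes what it should be. That makes me think this is some kind of buffer problem. However, whether I let DocumentBuilder manage the buffers or try to manage them more manually with a plain SaxParser, I still get exactly the same bad text in the same places every time. Finally, I haven't noticed any bad text when processing the shorter strings, the names and cats, which I think also points to a character buffer problem.

Any ideas would help!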
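The buffer hypothesis above can be checked against how SAX delivers text: a parser is allowed to hand the content of one element to characters() in several chunks, and the char array is only valid within the given range for the duration of that call. The following is a minimal sketch, not the poster's actual handler, that copies each chunk into a StringBuilder and only reads the full text in endElement. The class name AccumulatingHandler, its entries field, and the parse helper are hypothetical names introduced for illustration, and whether chunking is the real cause here is an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects {name, body} pairs from <string> elements.
public class AccumulatingHandler extends DefaultHandler {
    public final List<String[]> entries = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String currentName; // non-null only while inside a <string> element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("string".equals(qName)) {
            currentName = atts.getValue("name");
            text.setLength(0); // start a fresh body for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called more than once per element, so append instead of assigning.
        if (currentName != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("string".equals(qName)) {
            // Only here is the element's text known to be complete.
            entries.add(new String[] { currentName, text.toString().trim() });
            currentName = null;
        }
    }

    // Convenience helper (an assumption, not from the original post).
    public static List<String[]> parse(String xml) {
        AccumulatingHandler handler = new AccumulatingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML could not be parsed", e);
        }
        return handler.entries;
    }
}
```

If the bad text disappears with a handler shaped like this, the problem was in how chunks were combined; if it persists, the input itself is the more likely culprit.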

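Best answer

You almost certainly don't have well-formed XML. (Your comment about absolutely not being allowed to change the source system is a bad sign, but you aren't the only one stuck in this situation.)

See this question: How to parse badly formed XML in Java?

If I were you, I would use raw string manipulation and/or regular expressions to either extract the data directly or repair it into well-formed XML. As an aside, JAXB is better suited to working with XML in Java (but it still requires the XML to be well-formed).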
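One way to test the well-formedness hypothesis is to run the file through a DocumentBuilder with an ErrorHandler that rethrows every problem, then report the position of the first failure. This is only a sketch: the class name WellFormedCheck and the method firstError are hypothetical, and the XML is passed in as a String for simplicity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

// Hypothetical helper: reports where, if anywhere, the XML stops being well-formed.
public class WellFormedCheck {

    /** Returns null if the XML parses cleanly, otherwise a message with the error position. */
    public static String firstError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow everything so no problem is silently printed or skipped.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            // Configuration or I/O problems, reported without a position.
            return e.toString();
        }
    }
}
```

A null result would suggest the file is well-formed after all and that the problem lies elsewhere, for example in how the handler collects text.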
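As a rough illustration of the regular-expression route, the sketch below pulls name/body pairs straight out of the raw text without involving an XML parser. It assumes every entry has the exact shape `<string name="...">body</string>` shown in the question. The class name RegexStringExtractor and the method extract are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor: reads <string name="...">body</string> pairs from raw text.
public class RegexStringExtractor {
    // Group 1 is the name attribute, group 2 the body.
    // DOTALL lets a body span several lines; the reluctant .*? stops at the first </string>.
    private static final Pattern STRING_TAG =
            Pattern.compile("<string\\s+name=\"([^\"]*)\"\\s*>(.*?)</string>", Pattern.DOTALL);

    /** Returns {name, body} pairs in document order. Entities such as &amp; are NOT decoded. */
    public static List<String[]> extract(String raw) {
        List<String[]> out = new ArrayList<>();
        Matcher m = STRING_TAG.matcher(raw);
        while (m.find()) {
            out.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return out;
    }
}
```

Because this bypasses the parser entirely, it tolerates damage elsewhere in the file, but it also silently skips any entry that does not match the pattern, so comparing the number of extracted entries against the expected count is a sensible check.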

Regarding Java - SaxParser/DocumentBuilder "failing" to get the correct tag body, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/14347968/
