I'm looking for a hint on how to quickly find specific nodes in an XML document and remove the entire parent node (with its children) if certain values don't match the input parameters.
For example, given the XML shown below:
<someparent attr="123" filters="+F1">
  <filter id="F1">
    <width>
      <paper size="a4" val="10" />
      <paper size="a3" val="12" />
    </width>
    <height>
      <paper size="a4" val="10" />
      <paper size="a3" val="12" />
    </height>
  </filter>
</someparent>
I should apply some rules:
- if filters has a value starting with + (e.g. +F1), then parameters that match the listed sizes and values (such as a4/10 or a3/12) should leave the someparent node in place; any other size should cause the node to be removed
- if filters has a value starting with - (e.g. -F1), then parameters that match the listed sizes and values (such as a4/10 or a3/12) should remove the someparent node; any other size should leave the node intact
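The two rules reduce to a small decision on the sign prefix and on whether the input size/value pair was found. A minimal Java sketch, assuming a hypothetical helper named keepParent and assuming that a missing or unrecognised prefix means "keep" (the question does not say what should happen in that case):

```java
class FilterRule {
    // Hypothetical helper, not from the original post.
    // filtersAttr: value of the "filters" attribute, e.g. "+F1" or "-F1".
    // matched: whether the input size/value pair was found in the filter.
    static boolean keepParent(String filtersAttr, boolean matched) {
        if (filtersAttr == null || filtersAttr.isEmpty()) {
            return true; // assumption: no filter reference means keep
        }
        char sign = filtersAttr.charAt(0);
        if (sign == '+') {
            return matched;  // "+": keep only when the parameters match
        }
        if (sign == '-') {
            return !matched; // "-": remove when the parameters match
        }
        return true; // assumption: unknown prefix means keep
    }

    public static void main(String[] args) {
        System.out.println(keepParent("+F1", true));
        System.out.println(keepParent("-F1", true));
    }
}
```

Keeping the decision separate from the parsing code makes it easy to swap parsers while prototyping.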
However, I think that may be irrelevant at this point. The most important part is quickly finding the filter nodes and removing their parent nodes if needed.
Extra notes:
- XPath is way too slow, literally unacceptable. Iterating over every single node is relatively quick, and that is how it currently works, but I'd like to improve on it. I'm pretty sure it can be improved.
- the filter node(s) may not exist in the file at all
My plan is to create some prototypes. However, I'd appreciate any hints that may help me.
More replies
XSLT would be the first choice
XML parsing is done in document order, so the node to keep could be the last one, and parsing efficiency could be similar in any case. Or, as you said, the node may not exist at all, yet the whole document gets parsed anyway. Fast or slow is relative with XML. Moreover, finding the node might not be significant compared to writing the document after removing a node. All in all, showing just a fragment of the XML without the code used to parse it is not enough to give advice.
Recommended answers
In general, the built-in parsers are SAX, StAX and DOM (https://rdayala.wordpress.com/dom-vs-sax-parsers/).
- DOM is the slow one (it loads everything into memory) and is the one used with XPath.
- SAX is a pain to use.
- StAX actually has 2 APIs:
  - the iterator API, e.g. XMLEventReader (easier)
  - the cursor API, e.g. XMLStreamReader (more efficient)
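As an illustration of the cursor API, here is a minimal sketch that scans a document once with XMLStreamReader and reports whether any paper element inside a filter carries a given size and val. The class name PaperScan and the method hasPaper are made up for this example, and it deliberately ignores which filter id the someparent element refers to:

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PaperScan {
    // Hypothetical helper: true if a <paper size=... val=...> with the given
    // values appears anywhere inside a <filter> element.
    static boolean hasPaper(String xml, String size, String val) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            int filterDepth = 0; // > 0 while the cursor is inside a <filter>
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if ("filter".equals(name)) {
                        filterDepth++;
                    } else if (filterDepth > 0 && "paper".equals(name)
                            && size.equals(r.getAttributeValue(null, "size"))
                            && val.equals(r.getAttributeValue(null, "val"))) {
                        return true; // stop early, no need to read the rest
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "filter".equals(r.getLocalName())) {
                    filterDepth--;
                }
            }
            return false;
        } finally {
            r.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<someparent attr=\"123\" filters=\"+F1\"><filter id=\"F1\"><width>"
                + "<paper size=\"a4\" val=\"10\"/></width></filter></someparent>";
        System.out.println(hasPaper(xml, "a4", "10"));
    }
}
```

No tree is built here, so memory use stays flat regardless of document size; the trade-off is that the state (here just filterDepth) has to be tracked by hand.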
You could also try using XSLT, but the built-in processor isn't necessarily the best performing one, and you may need to pay for a premium one to use all of its features (such as streamed processing):
https://docs.oracle.com/javase/tutorial/jaxp/xslt/transformingXML.html
This XSLT 1.0 stylesheet will do the job:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <!-- identity transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- "+F1": drop someparent unless filter/width lists both a4/10 and a3/12 -->
  <xsl:template match="someparent[@filters='+F1'][not(filter/width[paper[concat(@size,'/',@val)='a4/10'] and paper[concat(@size,'/',@val)='a3/12']])]"/>
  <!-- "-F1": drop someparent when filter/width lists both a4/10 and a3/12 -->
  <xsl:template match="someparent[@filters='-F1'][(filter/width[paper[concat(@size,'/',@val)='a4/10'] and paper[concat(@size,'/',@val)='a3/12']])]"/>
</xsl:stylesheet>
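To run a stylesheet like this from Java, the built-in JAXP Transformer is probably enough for a first prototype. A minimal sketch, where the class name ApplyStylesheet and the in-memory strings are assumptions made for illustration (real code would more likely read from files or streams), and the inline stylesheet in main is a cut-down stand-in rather than the full one above:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ApplyStylesheet {
    // Hypothetical helper: applies an XSLT stylesheet (given as text) to an
    // XML document (given as text) and returns the serialized result.
    static String transform(String xslt, String xml) throws TransformerException {
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // Identity transform plus one removal rule, for demonstration only.
        String xslt = "<xsl:stylesheet version=\"1.0\""
                + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
                + "<xsl:template match=\"@*|node()\"><xsl:copy>"
                + "<xsl:apply-templates select=\"@*|node()\"/>"
                + "</xsl:copy></xsl:template>"
                + "<xsl:template match=\"someparent[@filters='-F1']\"/>"
                + "</xsl:stylesheet>";
        String xml = "<root><someparent filters=\"-F1\"/><keep/></root>";
        System.out.println(transform(xslt, xml));
    }
}
```

If the same stylesheet is applied to many documents, compiling it once with TransformerFactory.newTemplates and creating a Transformer per document would avoid re-parsing the stylesheet each time.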
Obviously, you need to parse the whole document, and the fastest way to arrive at a solution is to not include the "filtered out" elements in the document building process. Both DOM4J and JDOM are good alternatives for this, since they allow custom document builders that can defer or allow the tree construction based on previously obtained conditions. SAX/StAX is of course also an alternative, but it works at a lower level and requires more infrastructure code to get a result.
Search this site for DOM4J/JDOM and builder, I may already have given the answer ;)
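The idea of leaving rejected elements out while the output is being produced can also be sketched with plain StAX from the JDK, without DOM4J or JDOM. The sketch below is an assumption-laden illustration, not the DOM4J/JDOM builder approach itself: it buffers the events of each someparent subtree, then either writes the buffer or discards it. The class name StreamingFilter is made up, any matching paper under the someparent counts as a match, and the filter id referenced by the filters attribute is not checked:

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

class StreamingFilter {
    // Copies the document, dropping each <someparent> subtree that fails the rule.
    static String filter(String xml, String size, String val) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        List<XMLEvent> buffer = new ArrayList<>();
        int depth = 0;          // nesting depth inside the current someparent, 0 = outside
        String filters = null;  // "filters" attribute of the current someparent
        boolean matched = false;
        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();
            boolean opensParent = ev.isStartElement()
                    && "someparent".equals(ev.asStartElement().getName().getLocalPart());
            if (depth == 0 && !opensParent) {
                writer.add(ev); // outside any someparent: pass straight through
                continue;
            }
            buffer.add(ev);
            if (ev.isStartElement()) {
                StartElement se = ev.asStartElement();
                if (depth == 0) {
                    filters = attr(se, "filters");
                    matched = false;
                } else if ("paper".equals(se.getName().getLocalPart())
                        && size.equals(attr(se, "size")) && val.equals(attr(se, "val"))) {
                    matched = true;
                }
                depth++;
            } else if (ev.isEndElement() && --depth == 0) {
                if (keep(filters, matched)) {
                    for (XMLEvent buffered : buffer) writer.add(buffered);
                }
                buffer.clear();
            }
        }
        writer.flush();
        writer.close();
        return out.toString();
    }

    // "+" keeps on match, "-" removes on match; anything else is kept (assumption).
    static boolean keep(String filters, boolean matched) {
        if (filters == null || filters.isEmpty()) return true;
        if (filters.charAt(0) == '+') return matched;
        if (filters.charAt(0) == '-') return !matched;
        return true;
    }

    static String attr(StartElement se, String name) {
        Attribute a = se.getAttributeByName(new QName(name));
        return a == null ? null : a.getValue();
    }
}
```

Only one someparent subtree is held in memory at a time, so this stays cheap as long as individual subtrees are small. A StAX implementation such as Woodstox could be dropped in through the same factory calls.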
More replies
OK, thanks. I'm going to try XMLStreamReader first as the performance is the highest priority for me.
Apparently the Woodstox implementation of StAX is pretty fast, but you could also try Aalto XML.
I never thought it could be done that way. Interesting. I'll put that on the list of things to test. Thank you!
I'm going to try that. I'll edit my post above when I get some concrete results. Thank you!