sparql - How can I optimize a SPARQL query that returns optional properties?

Repost. Author: 行者123. Updated: 2023-12-02 06:39:11

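How can I optimize a SPARQL query like the one below?

The intent of this query is to:

  1. Select a specific resource (the country resource where countryCode = "US").
  2. Fetch the optional properties defined on that resource.

Unfortunately, the OPTIONAL blocks are evaluated before the parent block, which causes the query engine to load all the data for every country.

What I want is behavior like a LEFT OUTER JOIN, but the query engine does not process it that way.

How can I improve the query's performance?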

SELECT *
WHERE
{
  ?type (rdfs:subClassOf)* gj:Country .
  ?this_0 rdf:type ?type ;
          gn:countryCode "US" .
  # each of these blocks is executed as a standalone query in the engine
  OPTIONAL { ?this_0 gn:countryCode ?countryCode_1 }
  OPTIONAL { ?this_0 gn:name ?name_2 }
  OPTIONAL { ?this_0 gj:cscId ?cscId_3 }
}
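
For reference, the LEFT OUTER JOIN behavior that OPTIONAL is expected to have can be sketched in a few lines of Python. This is only an illustration of the semantics, not of how MarkLogic evaluates the query; the function name left_join and the toy bindings (ex:US, ex:FR and their values) are made up for the example.

```python
from typing import Dict, List

Binding = Dict[str, str]


def left_join(left: List[Binding], right: List[Binding]) -> List[Binding]:
    """Keep every left solution; extend it with each compatible right solution."""
    out: List[Binding] = []
    for l in left:
        # Two solutions are compatible when they agree on all shared variables.
        matches = [
            {**l, **r}
            for r in right
            if all(l[k] == r[k] for k in l.keys() & r.keys())
        ]
        # With no compatible right solution, the left solution survives unchanged.
        out.extend(matches if matches else [l])
    return out


# Hypothetical toy data: one required match and two optional property tables.
required = [{"this_0": "ex:US"}]
names = [
    {"this_0": "ex:US", "name_2": "United States"},
    {"this_0": "ex:FR", "name_2": "France"},
]
csc_ids = [{"this_0": "ex:FR", "cscId_3": "42"}]

solutions = left_join(left_join(required, names), csc_ids)
print(solutions)
```

The result is the same whichever side is computed first, but the cost is not: if the right-hand side is materialized for every subject before the join, most of that work is thrown away.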

I am using the SPARQL REST endpoint in MarkLogic 8.4.

Update:

I tried running the query with the optimize=2 option, but it did not give me a significant performance improvement:

/v1/graphs/sparql?optimize=2

Related: How do I specify options in the SPARQL REST endpoint for MarkLogic?
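
As a minimal sketch of how such a request URL could be assembled, the snippet below appends the optimize parameter to the /v1/graphs/sparql path. The host and port (http://localhost:8000), the helper name sparql_endpoint_url, and the sample query are assumptions for illustration; passing the query text in a query parameter is also an assumption about how the endpoint is called here. Nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sparql_endpoint_url(base: str, query: str, optimize: int) -> str:
    """Build a GET URL for the SPARQL endpoint with an optimize level."""
    # urlencode percent-encodes the query text so it is safe in a URL.
    params = urlencode({"query": query, "optimize": optimize})
    return f"{base}/v1/graphs/sparql?{params}"


# Hypothetical host, port, and query, used only to show the URL shape.
sample_query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
url = sparql_endpoint_url("http://localhost:8000", sample_query, 2)

# Decode the URL again to confirm the parameters round-trip intact.
decoded = parse_qs(urlsplit(url).query)
print(url)
```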

Update 2:

Even if I make one of the optional properties required, the query still runs slowly:

WHERE
{
  ?type (rdfs:subClassOf)* gj:Country .
  ?this_0 rdf:type ?type ;
          gn:countryCode "US" ;
          gj:cscId ?cscId_3 .
}

Do I need to do anything special to index this gj:cscId property?

Update 3:

Here is the profile information from Query Console.

Query profile

Update 4:

Here is the diagnostic trace information:

2017-04-27 13:30:17.238 Info: [Event:id=SPARQL Value Frequencies] sessionKey=13846462700334370907 namedGraphs=0 values=
2017-04-27 13:30:17.238 Info: <triple-value-statistics count="154569757" unique-subjects="25445373" unique-predicates="104" unique-objects="67520361" xmlns="cts:triple-value-statistics">
2017-04-27 13:30:17.238 Info: <triple-value-entries>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="181">
2017-04-27 13:30:17.238 Info: <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="179" unique-subjects="179" unique-predicates="4"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="15">
2017-04-27 13:30:17.238 Info: <triple-value>http://www.w3.org/2000/01/rdf-schema#subClassOf</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="15" unique-subjects="15" unique-objects="5"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="8739716">
2017-04-27 13:30:17.238 Info: <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="8359510" unique-subjects="8341619" unique-objects="14"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="8697064">
2017-04-27 13:30:17.238 Info: <triple-value>http://www.geonames.org/ontology#countryCode</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="8323137" unique-subjects="8323137" unique-objects="517"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="1" unique-subjects="1" unique-predicates="1"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="2119305">
2017-04-27 13:30:17.238 Info: <triple-value datatype="http://www.w3.org/2001/XMLSchema#string">US</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="2061783" unique-subjects="2061783" unique-predicates="3"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: <triple-value-entry count="13946907">
2017-04-27 13:30:17.238 Info: <triple-value>http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId</triple-value>
2017-04-27 13:30:17.238 Info: <subject-statistics count="3" unique-predicates="3" unique-objects="3"/>
2017-04-27 13:30:17.238 Info: <predicate-statistics count="11739004" unique-subjects="11739004" unique-objects="11739004"/>
2017-04-27 13:30:17.238 Info: <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-04-27 13:30:17.238 Info: </triple-value-entry>
2017-04-27 13:30:17.238 Info: </triple-value-entries>
2017-04-27 13:30:17.238 Info: </triple-value-statistics>
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.239 Info: initialPlan=SPARQLModule[
2017-04-27 13:30:17.239 Info: Prolog[]
2017-04-27 13:30:17.239 Info: SPARQLSelect[SPARQLProject[order()
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.239 Info: SPARQLLeftNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info: SPARQLNestedLoopJoin[order() hash(1==1) scatter(1 = 1)
2017-04-27 13:30:17.239 Info: SPARQLScatterJoin[order(0,1) hash(0==0) scatter(0 = 0)
2017-04-27 13:30:17.239 Info: SPARQLZeroOrOne[
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info: SPARQLScatterOneOrMore[
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.239 Info: TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.239 Info: GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.239 Info: TriplePattern[order(0,1) OPS
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.239 Info: GraphNode[Var type 0]]]
2017-04-27 13:30:17.239 Info: TriplePattern[order(1) SOP
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.239 Info: GraphNode[Literal "US"]]]
2017-04-27 13:30:17.239 Info: TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.239 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.239 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.239 Info: GraphNode[Var cscId_3 2]]]]]]
2017-04-27 13:30:17.239 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=30 seed=7088858925989728751
2017-04-27 13:30:17.239 Info: initialCost=(m:5.99223e+11,r:0,io:(52.9404/167736/1.17487e+09),cpu(1):(0/1.77017e+08/1.18652e+12),mem:8185,c:1.03266e+07,crd:[14,2.06178e+06,1.03266e+07])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=0
2017-04-27 13:30:17.320 Info: cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.320 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=1
2017-04-27 13:30:17.320 Info: cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907 diff=-5.98971e+11 diff%=-99.958 r=2
2017-04-27 13:30:17.326 Info: cost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL Cost Analysis] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info: bestCost=(m:2.51757e+08,r:0,io:(52.9404/322.031/4.68406e+07),cpu(4):(0/159/3.51041e+07),mem:415.68,c:6.46969e+06,crd:[14,2.06178e+06,6.46969e+06])
2017-04-27 13:30:17.326 Info: [Event:id=SPARQL AST] sessionKey=13846462700334370907
2017-04-27 13:30:17.326 Info: plan=SPARQLModule[
2017-04-27 13:30:17.326 Info: Prolog[]
2017-04-27 13:30:17.326 Info: SPARQLSelect[SPARQLProject[order(1,0)
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[Var cscId_3 2]
2017-04-27 13:30:17.326 Info: SPARQLRightMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info: TriplePattern[order(1,2) PSO
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#cscId>]
2017-04-27 13:30:17.326 Info: GraphNode[Var cscId_3 2]]
2017-04-27 13:30:17.326 Info: SPARQLHashJoin[order(1,0) hash(0==0) scatter()
2017-04-27 13:30:17.326 Info: SPARQLZeroOrOne[
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info: SPARQLBloomOneOrMore[
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#Country>]
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON7634081659815295853 1]
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]
2017-04-27 13:30:17.326 Info: TriplePattern[order(0,1) PSO
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON16629111911678922088 0]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://www.w3.org/2000/01/rdf-schema#subClassOf>]
2017-04-27 13:30:17.326 Info: GraphNode[Var ANON7634081659815295853 1]]]]
2017-04-27 13:30:17.326 Info: SPARQLMergeJoin[order(1,0) hash(1==1) scatter()
2017-04-27 13:30:17.326 Info: TriplePattern[order(1) OPS
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://www.geonames.org/ontology#countryCode>]
2017-04-27 13:30:17.326 Info: GraphNode[Literal "US"]]
2017-04-27 13:30:17.326 Info: TriplePattern[order(1,0) PSO
2017-04-27 13:30:17.326 Info: GraphNode[Var this_0 1]
2017-04-27 13:30:17.326 Info: GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-04-27 13:30:17.326 Info: GraphNode[Var type 0]]]]]]]]

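Update 5:

In some use cases, I found that I can eliminate the ?type property path expression from the query. In that case, performance improves by two orders of magnitude: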

WHERE
{
  ?this_0 rdf:type gj:Country ;
          gn:countryCode "US" .
  # each of these blocks is executed as a standalone query in the engine
  OPTIONAL { ?this_0 gn:countryCode ?countryCode_1 }
  OPTIONAL { ?this_0 gn:name ?name_2 }
  OPTIONAL { ?this_0 gj:cscId ?cscId_3 }
}

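Because this solution changes the query's output, it does not solve all of our use cases.

The problem does not appear to be OPTIONAL itself. Rather, the property path expression confuses the query planner, so the properties in the OPTIONAL blocks are looked up independently (which does not perform well).

Best answer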

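The query optimizer relies on usage statistics to determine the best order of operations. Usually there is one restrictive triple pattern that can be used to limit further operations by means of a scatter join.

In your case, the statistics do not offer such an obviously restrictive triple pattern. Looking at the triple value statistics output, you can see that the string "US" appears as an object 2061783 times, so it is not very restrictive.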
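
To put those numbers side by side, here is a rough calculation using only counts copied from the trace in Update 4. The dictionary layout and the idea of dividing by the total triple count are just an illustrative way of comparing the values; the optimizer's real cost model is not shown in the trace and is not reproduced here.

```python
# Counts copied from the triple-value-statistics trace above.
TOTAL_TRIPLES = 154569757
object_counts = {
    '"US"': 2061783,     # object-statistics count for the literal "US"
    "gj:Country": 179,   # object-statistics count for the gj:Country IRI
}

# Fraction of all triples that carry each value in the object position.
selectivity = {value: n / TOTAL_TRIPLES for value, n in object_counts.items()}

# Most restrictive value first (smallest fraction of the store).
ranked = sorted(selectivity.items(), key=lambda item: item[1])

for value, fraction in ranked:
    print(f"{value}: {fraction:.6%} of all triples")

# How many times more often "US" occurs as an object than gj:Country.
ratio = object_counts['"US"'] / object_counts["gj:Country"]
print(f'"US" is about {ratio:,.0f} times more frequent than gj:Country')
```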

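The gj:Country IRI is restrictive (it appears 179 times in the object position), but unfortunately you need to use it on the right-hand side of a transitive closure operator. It is hard to predict how many results a transitive closure operator will return, because that depends heavily on the actual data.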
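
A small sketch of why the result size of rdfs:subClassOf* is data-dependent: the same closure over two made-up class hierarchies returns very different numbers of classes. The function name subclasses_star and all ex: class names are hypothetical, and this breadth-first walk is only an illustration, not MarkLogic's implementation.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def subclasses_star(edges: Iterable[Tuple[str, str]], target: str) -> Set[str]:
    """Return every class C such that C rdfs:subClassOf* target holds."""
    children = defaultdict(set)
    for sub, sup in edges:
        children[sup].add(sub)
    seen = {target}  # the zero-length path: target matches itself
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:  # also guards against cycles
                seen.add(child)
                queue.append(child)
    return seen


# Hypothetical hierarchies: (subclass, superclass) pairs.
narrow = [("ex:Republic", "gj:Country")]
wide = [(f"ex:Kind{i}", "gj:Country") for i in range(50)] + [("ex:Sub", "ex:Kind0")]

narrow_size = len(subclasses_star(narrow, "gj:Country"))
wide_size = len(subclasses_star(wide, "gj:Country"))
print(narrow_size, wide_size)
```

The pattern is identical in both cases, so an optimizer cannot tell from the pattern alone which size to expect.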

You may find that using a property path like the one below lets MarkLogic avoid the zero-or-one operator, which may give a small performance gain:

?this_0 a/rdfs:subClassOf* gj:Country .

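Going further, if you know (for example) that there is only one gj:Country with the country code "US", you can add a limit to that part of the query to hint to the optimizer how the query should be handled, i.e.: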

select * {
  {
    select * {
      ?this_0 a/rdfs:subClassOf* gj:Country .
      ?this_0 gn:countryCode 'US' .
    } limit 1
  }
  OPTIONAL { ?this_0 gj:cscId ?cscId_3 }
}
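
The same idea can be extended to all three optional properties from the question. The sketch below generates such a query as a string; the helper name build_query is made up, the gn: and gj: prefix IRIs are taken from the trace in the question, and the LIMIT 1 is only valid under the stated assumption that exactly one resource matches. The country code is interpolated without escaping, so this is not suitable for untrusted input.

```python
from typing import List, Tuple

# Prefix IRIs: rdfs is the standard namespace; gn and gj appear in the trace.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "gn": "http://www.geonames.org/ontology#",
    "gj": "http://kb.everest.cscglobal.com/geonames-jurisdiction/1.0/schema#",
}


def build_query(country_code: str, optional_props: List[Tuple[str, str]]) -> str:
    """Build a query whose required part is a LIMIT 1 subselect."""
    prolog = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in PREFIXES.items())
    # One OPTIONAL block per (predicate, variable name) pair.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?this_0 {prop} ?{var} }}" for prop, var in optional_props
    )
    return f"""{prolog}
SELECT * {{
  {{
    SELECT * {{
      ?this_0 a/rdfs:subClassOf* gj:Country .
      ?this_0 gn:countryCode "{country_code}" .
    }} LIMIT 1
  }}
{optionals}
}}"""


query = build_query(
    "US",
    [
        ("gn:countryCode", "countryCode_1"),
        ("gn:name", "name_2"),
        ("gj:cscId", "cscId_3"),
    ],
)
print(query)
```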

Regarding "sparql - How can I optimize a SPARQL query that returns optional properties?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43642183/
