
r - Parsing XML based on attributes and text values of related nodes

Reposted. Author: 行者123. Updated: 2023-12-04 05:00:15

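I have used the XML package to parse HTML and XML before, and I have a basic understanding of XPath. However, I have been asked to work with XML data in which the important pieces are determined by a combination of an element's own text and attributes and the attributes of related nodes. I have never done this before. For example: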

[Updated example, slightly expanded]

<Catalogue>
<Bookstore id="ID910705541">
<location>foo bar</location>
<books>
<book category="A" id="1">
<title>Alpha</title>
<author ref="1">Matthew</author>
<author>Mark</author>
<author>Luke</author>
<author ref="2">John</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Beta</title>
<author ref="1">Huey</author>
<author>Duey</author>
<author>Louie</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Gamma</title>
<author ref="1">Tweedle Dee</author>
<author ref="2">Tweedle Dum</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910700051">
<location>foo</location>
<books>
<book category="A" id="1">
<title>Happy</title>
<author>Dopey</author>
<author>Bashful</author>
<author>Doc</author>
<author ref="1">Grumpy</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Ni</title>
<author ref="1">John</author>
<author ref="2">Paul</author>
<author ref="3">George</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>San</title>
<author ref="1">Ringo</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910715717">
<location>bar</location>
<books>
<book category="A" id="1">
<title>Un</title>
<author ref="1">Winkin</author>
<author>Blinkin</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Deux</title>
<author>Nod</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Trois</title>
<author>Manny</author>
<author>Moe</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
</Catalogue>

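I want to extract all author names where:
1) the location element has a text value containing "NY"
2) the author element has no "ref" attribute; that is, ref is absent from the author tag

Ultimately I need to concatenate the extracted authors within a given bookstore, so that my resulting data frame has one row per bookstore. I want to keep the bookstore id as an additional field in the data frame so that I can refer to each store uniquely.
Since only the first bookstore is in NY, the result for this simple example would look like this: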
1 Jane Smith John Doe Karl Pearson William Gosset

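If another bookstore contained "NY" in its location, it would make up a second row, and so on.

Am I asking too much of R to parse under these complex conditions?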

Best answer

require(XML)

xdata <- xmlParse(apptext)
xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')
#[[1]]
#<author>Jane Smith</author>

#[[2]]
#<author>John Doe</author>

#[[3]]
#<author>Karl Pearson</author>

#[[4]]
#<author>William Gosset</author>

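Breaking it down:

Get all location nodes whose text contains "NY"
//*/location[text()[contains(.,"NY")]]

Get the books siblings of those nodes
/following-sibling::books

From those nodes, get all authors that have no ref attribute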
/.//author[not(@ref)]
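
The three steps above can be checked one at a time. The following is a minimal sketch on a small made-up document (the ids `S1`/`S2` and the author names are illustrative and not taken from the question); it assumes the XML package is installed and uses only `xmlParse`, `getNodeSet` and `xmlValue`.

```r
library(XML)

# Hypothetical miniature catalogue with the same shape as the question's data
doc <- xmlParse('<Catalogue>
  <Bookstore id="S1">
    <location>NY north</location>
    <books>
      <book><author ref="1">Ann</author><author>Bob</author></book>
    </books>
  </Bookstore>
  <Bookstore id="S2">
    <location>LA</location>
    <books>
      <book><author>Cid</author></book>
    </books>
  </Bookstore>
</Catalogue>', asText = TRUE)

# Step 1: location nodes whose text contains "NY"
loc_path <- '//*/location[text()[contains(.,"NY")]]'
step1 <- getNodeSet(doc, loc_path)

# Step 2: the <books> siblings that follow those locations
books_path <- paste0(loc_path, '/following-sibling::books')
step2 <- getNodeSet(doc, books_path)

# Step 3: authors under those <books> nodes that carry no ref attribute
author_path <- paste0(books_path, '/.//author[not(@ref)]')
step3 <- getNodeSet(doc, author_path)

# Pull the text out of the matched author nodes
authors <- sapply(step3, xmlValue)
print(authors)
```

Building the path in pieces like this makes it easier to see which step drops a node when the final result is not what was expected.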

If you want the text, use xmlValue:
> xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
[1] "Jane Smith" "John Doe" "Karl Pearson" "William Gosset"

Update:
child.nodes <- xpathSApply(xdata,'//*/location[text()[contains(.,"NY")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
xpathSApply(x,'.//ancestor::bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
# id id id id
#"1" "1" "1" "1"

Update 2:

Using your changed data:
xdata <- '<Catalogue>
<Bookstore id="ID910705541">
<location>foo bar</location>
<books>
<book category="A" id="1">
<title>Alpha</title>
<author ref="1">Matthew</author>
<author>Mark</author>
<author>Luke</author>
<author ref="2">John</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Beta</title>
<author ref="1">Huey</author>
<author>Duey</author>
<author>Louie</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Gamma</title>
<author ref="1">Tweedle Dee</author>
<author ref="2">Tweedle Dum</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910700051">
<location>foo</location>
<books>
<book category="A" id="1">
<title>Happy</title>
<author>Dopey</author>
<author>Bashful</author>
<author>Doc</author>
<author ref="1">Grumpy</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Ni</title>
<author ref="1">John</author>
<author ref="2">Paul</author>
<author ref="3">George</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>San</title>
<author ref="1">Ringo</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
<Bookstore id="ID910715717">
<location>bar</location>
<books>
<book category="A" id="1">
<title>Un</title>
<author ref="1">Winkin</author>
<author>Blinkin</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="B" id="10">
<title>Deux</title>
<author>Nod</author>
<year>2005</year>
<price>29.99</price>
</book>
<book category="D" id="100">
<title>Trois</title>
<author>Manny</author>
<author>Moe</author>
<year>2005</year>
<price>29.99</price>
</book>
</books>
</Bookstore>
</Catalogue>'

Note that you previously had bookstore and now have Bookstore. NY is gone, so I used foo instead.
require(XML)
xdata <- xmlParse(xdata)
child.nodes <- getNodeSet(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]')

ans.func<-function(x){
xpathSApply(x,'.//ancestor::Bookstore[@id]/@id')
}

sapply(child.nodes,ans.func)
# id id id id id
#"ID910705541" "ID910705541" "ID910705541" "ID910705541" "ID910700051"
# id id
#"ID910700051" "ID910700051"

xpathSApply(xdata,'//*/location[text()[contains(.,"foo")]]/following-sibling::books/.//author[not(@ref)]',xmlValue)
# [1] "Mark" "Luke" "Duey" "Louie" "Dopey" "Bashful" "Doc"
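
The question asks for a data frame with one row per bookstore, holding the store id and the concatenated authors. One way to get there is to select the matching Bookstore nodes first and then query the authors relative to each one. This is a sketch only: `authors_by_store` is a hypothetical helper name, the sample document is made up, and the pattern is interpolated into the XPath as-is, so it is assumed not to contain double quotes.

```r
library(XML)

# Hypothetical helper: one row per Bookstore whose <location> contains `pattern`
authors_by_store <- function(doc, pattern) {
  # Select the Bookstore elements themselves, filtered on their location child
  store_path <- sprintf('//Bookstore[location[contains(., "%s")]]', pattern)
  stores <- getNodeSet(doc, store_path)
  rows <- lapply(stores, function(store) {
    # Relative query: only authors inside this store, without a ref attribute
    authors <- xpathSApply(store, './books//author[not(@ref)]', xmlValue)
    data.frame(id = xmlGetAttr(store, "id"),
               authors = paste(authors, collapse = " "),
               stringsAsFactors = FALSE)
  })
  # Returns NULL when no bookstore matches
  do.call(rbind, rows)
}

# Made-up sample with the same structure as the data above
sample_doc <- xmlParse('<Catalogue>
  <Bookstore id="S1"><location>foo bar</location><books>
    <book><author ref="1">Ann</author><author>Bob</author></book>
    <book><author>Cat</author></book>
  </books></Bookstore>
  <Bookstore id="S2"><location>bar</location><books>
    <book><author>Dan</author></book>
  </books></Bookstore>
  <Bookstore id="S3"><location>foo</location><books>
    <book><author>Eve</author><author ref="2">Fay</author></book>
  </books></Bookstore>
</Catalogue>', asText = TRUE)

res <- authors_by_store(sample_doc, "foo")
print(res)
```

Called as `authors_by_store(xdata, "foo")` on the parsed `xdata` above, this should group the same seven names under ID910705541 and ID910700051. A store that matches the location test but has no ref-less authors would still get a row, with an empty authors string.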

Regarding "r - Parsing XML based on attributes and text values of related nodes", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16238043/
