
xml - Extracting XML nodes and attributes in R

Reposted. Author: 数据小太阳. Updated: 2023-10-29 02:15:11

I have an XML dataset that looks like this:

<protocol ID='.'>
<HEAD></HEAD>
<block ID='...'>
<HEAD></HEAD>
<trial ID='.....'>
<HEAD></HEAD>
<seq ID=''>
<HEAD></HEAD>
<calibration CLASS='affine-calibration' ID='New Calibration'>
<AX>.........</AX>
<BX>-........</BX>
<AY>.........</AY>
<BY>.........</BY>
<type>'por'</type>
</calibration>
<POR TIME='......'>
<PUPIL>.</PUPIL>
<BLINK>.</BLINK>
<V>...</V>
<H>...</H>
<PLANEINTRWV>...</PLANEINTRWV>
<PLANEINTRWH>...</PLANEINTRWH>
<PLANE>.</PLANE>
</POR>
<POR TIME='......'>
<PUPIL>.</PUPIL>
<BLINK>.</BLINK>
<V>...</V>
<H>...</H>
<PLANEINTRWV>...</PLANEINTRWV>
<PLANEINTRWH>...</PLANEINTRWH>
<PLANE>.</PLANE>
</POR>
<POR TIME='......'>
<PUPIL>.</PUPIL>
<BLINK>.</BLINK>
<V>...</V>
<H>...</H>
<PLANEINTRWV>...</PLANEINTRWV>
<PLANEINTRWH>...</PLANEINTRWH>
<PLANE>.</PLANE>
</POR>
</seq>
</trial>
<trial ID='.....'>
<HEAD></HEAD>
<seq ID=''>
<HEAD></HEAD>
<calibration CLASS='affine-calibration' ID='New Calibration'>
<AX>.........</AX>
<BX>-........</BX>
<AY>.........</AY>
<BY>.........</BY>
<type>'por'</type>
</calibration>
<POR TIME='......'>
<PUPIL>.</PUPIL>
<BLINK>.</BLINK>
<V>...</V>
<H>...</H>
<PLANEINTRWV>...</PLANEINTRWV>
<PLANEINTRWH>...</PLANEINTRWH>
<PLANE>.</PLANE>
</POR>
<POR TIME='......'>
<PUPIL>.</PUPIL>
<BLINK>.</BLINK>
<V>...</V>
<H>...</H>
<PLANEINTRWV>...</PLANEINTRWV>
<PLANEINTRWH>...</PLANEINTRWH>
<PLANE>.</PLANE>
</POR>
</seq>
</trial>
</block>
</protocol>

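Using the XML package, what is the cleanest way to extract the child tags of the POR tags together with the tag attributes?

I cobbled together the following, which works, but it is slow (most likely because of the xpathSApply call) and hard to read.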

trackToDataFrame = function(file) {
  doc2 = xmlParse(file)
  # Name and attributes of every node that carries a TIME attribute
  timeStamps = t(xpathSApply(doc2, '//*[@TIME]', function(x) c(name=xmlName(x), xmlAttrs(x))))
  # Child values of each POR node, one row per node
  dd2 = xmlToDataFrame(getNodeSet(doc2, "//POR"), colClasses=c(rep("integer", 7)))
  dd2 = cbind(dd2, timeStamps)
  dd2
}

Calling it on the dataset returns:

  PUPIL BLINK  V  H PLANEINTRWV PLANEINTRWH PLANE name   TIME
1    NA    NA NA NA          NA          NA    NA  POR ......
2    NA    NA NA NA          NA          NA    NA  POR ......
3    NA    NA NA NA          NA          NA    NA  POR ......
4    NA    NA NA NA          NA          NA    NA  POR ......
5    NA    NA NA NA          NA          NA    NA  POR ......

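I think the whole thing could be done with a single xmlToDataFrame call, but I am not familiar enough with the XML package to make that work.

What I really need is the "TIME" column plus all the columns produced by the xmlToDataFrame call.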

Best answer

library(XML)

# Fun1: build the data frame with xmlToDataFrame, then attach the TIME
# attribute from a second XPath query.
Fun1 <- function(xdata){
  dum <- xmlParse(xdata)
  xDf <- xmlToDataFrame(nodes = getNodeSet(dum, "//*/POR"), stringsAsFactors = FALSE)
  xattrs <- xpathSApply(dum, "//*/POR/@TIME")
  xDf$name <- "POR"
  xDf$TIME <- xattrs
  xDf
}

# Fun2: visit each POR node once and collect its child values, its name
# and its attributes in a single pass.
Fun2 <- function(xdata){
  dumFun <- function(x){
    xname <- xmlName(x)
    xattrs <- xmlAttrs(x)
    c(sapply(xmlChildren(x), xmlValue), name = xname, xattrs)
  }
  dum <- xmlParse(xdata)
  as.data.frame(t(xpathSApply(dum, "//*/POR", dumFun)), stringsAsFactors = FALSE)
}

> identical(Fun1(xdata), Fun2(xdata))
[1] TRUE

library(rbenchmark)

benchmark(Fun1(xdata), Fun2(xdata))

         test replications elapsed relative user.self sys.self user.child sys.child
1 Fun1(xdata)          100   1.047    2.069     1.044        0          0         0
2 Fun2(xdata)          100   0.506    1.000     0.504        0          0         0
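
To try the single-pass approach without the original file, here is a minimal self-contained sketch. The sample XML and its numeric values are made up for illustration (the real values are not shown in the question), and the names por_row and por are hypothetical. It also converts the extracted strings to numbers with type.convert, which is an addition and not part of the answer above.

```r
library(XML)

# Hypothetical sample data: these values are invented for illustration only.
xdata <- "<protocol ID='1'>
  <block ID='1'>
    <trial ID='1'>
      <seq ID=''>
        <POR TIME='100'><PUPIL>3</PUPIL><BLINK>0</BLINK><V>120</V><H>240</H></POR>
        <POR TIME='116'><PUPIL>4</PUPIL><BLINK>1</BLINK><V>121</V><H>238</H></POR>
      </seq>
    </trial>
  </block>
</protocol>"

# asText = TRUE tells xmlParse that xdata holds XML content, not a file name.
doc <- xmlParse(xdata, asText = TRUE)

# For one POR node: the text of each child element plus the node's attributes.
por_row <- function(node) {
  c(sapply(xmlChildren(node), xmlValue), xmlAttrs(node))
}

# Each node yields a vector of equal length, so xpathSApply simplifies the
# result to a matrix with one column per node; transpose to get one row per node.
rows <- xpathSApply(doc, "//POR", por_row)
por <- as.data.frame(t(rows), stringsAsFactors = FALSE)

# All values arrive as character strings; let type.convert pick numeric types.
por[] <- lapply(por, type.convert, as.is = TRUE)
print(por)
```

The result has one row per POR node, with the child tags and the TIME attribute as columns. If the POR nodes in a real file do not all have the same children, xpathSApply returns a list instead of a matrix and the transpose step would need to be replaced.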

Regarding xml - Extracting XML nodes and attributes in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/16805050/
