r - Creating a data frame from XML with different numbers of elements

Reposted. Author: 行者123. Last updated: 2023-12-04 12:16:05

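I'm new to R and am trying to get some data from an XML document into a data frame. It works fine when every node has the same number of child nodes, but I run into problems when they don't.

The R code I'm using is as follows: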

library(XML)

xml.m_data <- xmlParse(file="mdata.xml")

df.m_data <- xmlToDataFrame(xml.m_data,nodes = getNodeSet(xml.m_data,"//Row"),collectNames=T)

nodeset <- getNodeSet(xml.m_data,"//Row")[[1]]
colnames <- c()
i <- NULL
for(i in 1:(length(df.m_data))){
x <- toString.XMLNode(nodeset[i])
x <- strsplit(x,"\"")[[1]][2]
colnames[i] <- x
}
colnames(df.m_data) <- colnames
rm(colnames)

The result I'm trying to get looks like this (the result from the second XML):

   CompanyID ProdConsID      UnitID  UnitName Commodity Facility     Source Commercialisation           StartDate
1 COMPANY001    E000001 E000001-001  Name_001     Power Producer Fossil Gas             False 2010-01-31T23:00:00
2 COMPANY002    E000002 E000002-001  Name_002     Power Producer Fossil Gas             False 2010-01-31T23:00:00
3 COMPANY003    E000003 E000003-001 Name_003A     Power Producer Fossil Gas              True 2009-10-25T23:00:00
4 COMPANY003    E000003 E000003-002 Name_003B     Power Producer Fossil Gas              True 2009-10-25T23:00:00

There are two XML files. I can process the first one, but not the second.

For the second one, I get the following error:

Error in `[<-.data.frame`(`*tmp*`, i, names(nodes[[i]]), value = c("COMPANY001",  : duplicate subscripts for columns
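
The call quoted in the error message indexes columns by names(nodes[[i]]), and in these files every child of a Row is a Field element, so those names would all be identical. The following is a minimal sketch for inspecting this, assuming the XML package is installed; the two shortened rows and the variable names are hypothetical stand-ins for mdata.xml.

```r
library(XML)

# Shortened, hypothetical rows: the first has two fields, the second only one
txt <- '<Results><Result><Row><Field Name="UnitID">E000001-001</Field><Field Name="EndDate">2015-12-09T23:00:00</Field></Row><Row><Field Name="UnitID">E000003-001</Field></Row></Result></Results>'
doc  <- xmlParse(txt, asText = TRUE)
rows <- getNodeSet(doc, "//Row")

# Number of child elements in each row (differs between rows here)
sizes <- sapply(rows, xmlSize)

# Tag names of the children of the first row: all of them are "Field",
# so they cannot serve as unique column names
child_names <- unname(sapply(xmlChildren(rows[[1]]), xmlName))

# The distinguishing label lives in the Name attribute instead
field_labels <- unname(sapply(xmlChildren(rows[[1]]), xmlGetAttr, "Name"))

print(sizes)
print(child_names)
print(field_labels)
```

If the tag names are duplicated like this, building column names from the Name attribute rather than from the tag is one way around the problem.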

Example 1:

<Results>
<Result>
<Row>
<Field Name="CompanyID">COMPANY001</Field>
<Field Name="ProdConsID">E000001</Field>
<Field Name="UnitID">E000001-001</Field>
<Field Name="UnitName">Name_001</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">False</Field>
<Field Name="StartDate">2010-01-31T23:00:00</Field>
</Row>
<Row>
<Field Name="CompanyID">COMPANY002</Field>
<Field Name="ProdConsID">E000002</Field>
<Field Name="UnitID">E000002-001</Field>
<Field Name="UnitName">Name_002</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">False</Field>
<Field Name="StartDate">2010-01-31T23:00:00</Field>
</Row>
<Row>
<Field Name="CompanyID">COMPANY003</Field>
<Field Name="ProdConsID">E000003</Field>
<Field Name="UnitID">E000003-001</Field>
<Field Name="UnitName">Name_003A</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">True</Field>
<Field Name="StartDate">2009-10-25T23:00:00</Field>
</Row>
<Row>
<Field Name="CompanyID">COMPANY003</Field>
<Field Name="ProdConsID">E000003</Field>
<Field Name="UnitID">E000003-002</Field>
<Field Name="UnitName">Name_003B</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">True</Field>
<Field Name="StartDate">2009-10-25T23:00:00</Field>
</Row>
</Result>
</Results>

Example 2:

<Results>
<Result>
<Row>
<Field Name="CompanyID">COMPANY001</Field>
<Field Name="ProdConsID">E000001</Field>
<Field Name="UnitID">E000001-001</Field>
<Field Name="UnitName">Name_001</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">False</Field>
<Field Name="StartDate">2010-01-31T23:00:00</Field>
<Field Name="EndDate">2015-12-09T23:00:00</Field>
</Row>
<Row>
<Field Name="CompanyID">COMPANY002</Field>
<Field Name="ProdConsID">E000002</Field>
<Field Name="UnitID">E000002-001</Field>
<Field Name="UnitName">Name_002</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">False</Field>
<Field Name="StartDate">2010-01-31T23:00:00</Field>
<Field Name="EndDate">2015-12-09T23:00:00</Field>
</Row>
<Row>
<Field Name="CompanyID">COMPANY003</Field>
<Field Name="ProdConsID">E000003</Field>
<Field Name="UnitID">E000003-001</Field>
<Field Name="UnitName">Name_003A</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">True</Field>
<Field Name="StartDate">2009-10-25T23:00:00</Field>
</Row>
<Row>
<Field Name="CompanyID">COMPANY003</Field>
<Field Name="ProdConsID">E000003</Field>
<Field Name="UnitID">E000003-002</Field>
<Field Name="UnitName">Name_003B</Field>
<Field Name="Commodity">Power</Field>
<Field Name="Facility">Producer</Field>
<Field Name="Source">Fossil Gas</Field>
<Field Name="Commercialisation">True</Field>
<Field Name="StartDate">2009-10-25T23:00:00</Field>
</Row>
</Result>
</Results>

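Any helpful insight is much appreciated.

Best answer

I prefer the xml2 package over XML because its syntax is simpler. The solution here reads all of the "Row" parent nodes, parses each one into a one-row data frame, and then combines the results into the final answer.
The bind_rows() function from the dplyr package can handle the missing columns.
See the code comments for details.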
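
As a small illustration of the bind_rows() behaviour mentioned above, here is a minimal sketch assuming dplyr is installed. The two one-row data frames are hypothetical and only mimic rows with and without an EndDate field.

```r
library(dplyr)

# Two hypothetical one-row data frames; the second lacks the EndDate column
a <- data.frame(UnitID = "E000001-001", EndDate = "2015-12-09T23:00:00",
                stringsAsFactors = FALSE)
b <- data.frame(UnitID = "E000003-001", stringsAsFactors = FALSE)

# bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(a, b)
print(combined)
```

Base rbind() would stop with an error here because the column sets differ, which is why bind_rows() is convenient for rows of varying length.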

library(xml2)
library(dplyr)

#list of files to process
fnames<-"results2.xml"

doc<-read_xml(fnames)

#find parent nodes
parents<-xml_find_all(doc, ".//Row")

dfs<-lapply(parents, function(parent) {

  #find all of the nodes/records under each parent node
  #(xml_attr() and xml_text() come from xml2; html_attr()/html_text() would need rvest)
  titles <- xml_children(parent) %>% xml_attr("Name")
  values <- xml_children(parent) %>% xml_text()

  #make data frame of the values and column headings
  df<-as.data.frame(t(values), stringsAsFactors = FALSE)
  names(df)<-titles
  df
})

#Make combined dataframe
answer<-bind_rows(dfs)
answer



   CompanyID ProdConsID      UnitID  UnitName Commodity Facility     Source Commercialisation           StartDate             EndDate
1 COMPANY001    E000001 E000001-001  Name_001     Power Producer Fossil Gas             False 2010-01-31T23:00:00 2015-12-09T23:00:00
2 COMPANY002    E000002 E000002-001  Name_002     Power Producer Fossil Gas             False 2010-01-31T23:00:00 2015-12-09T23:00:00
3 COMPANY003    E000003 E000003-001 Name_003A     Power Producer Fossil Gas              True 2009-10-25T23:00:00                <NA>
4 COMPANY003    E000003 E000003-002 Name_003B     Power Producer Fossil Gas              True 2009-10-25T23:00:00                <NA>
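
For trying the approach without a results2.xml file on disk, the same steps can be wrapped in a function and fed an inline XML string. This is only a sketch under assumptions: rows_to_df() is a hypothetical helper name, the sample document is a shortened stand-in for the real data, and both xml2 and dplyr are assumed to be installed.

```r
library(xml2)
library(dplyr)

# Hypothetical helper: turn every <Row> of a parsed document into one data frame row
rows_to_df <- function(doc) {
  rows <- xml_find_all(doc, ".//Row")
  dfs <- lapply(rows, function(row) {
    fields <- xml_children(row)
    # Name each value by its Name attribute, then make each field a column
    values <- setNames(xml_text(fields), xml_attr(fields, "Name"))
    as.data.frame(as.list(values), stringsAsFactors = FALSE)
  })
  # Columns missing from a row are filled with NA
  bind_rows(dfs)
}

# Shortened, hypothetical sample: the second row has no EndDate field
sample_xml <- '<Results><Result>
  <Row><Field Name="UnitID">E000001-001</Field><Field Name="EndDate">2015-12-09T23:00:00</Field></Row>
  <Row><Field Name="UnitID">E000003-001</Field></Row>
</Result></Results>'

result <- rows_to_df(read_xml(sample_xml))
print(result)
```

Because as.data.frame() checks column names by default, a Name attribute that is not a syntactically valid R name would be altered; passing check.names = FALSE avoids that if needed.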

Regarding "r - Creating a data frame from XML with different numbers of elements", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60622303/
