
python - Creating a pandas DataFrame from a nested XML file

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:22:05

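Below is a small part of an XML file. I want to build a database table from it in which every tag becomes a uniquely named column and no data is duplicated.

Using lxml, the best I have managed so far is a DataFrame that looks like this: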

"    
SRCSGT
DATE 11112017
AGENCY Department of Veterans Affairs
OFFICE Canandaigua VAMC
LOCATION Department of Veterans Affairs Medical Center
ZIP 14424
etc, etc, "

XML

<?xml version="1.0" encoding="UTF-8"?>
<NOTICES>
<SRCSGT>
<DATE>11112017</DATE>
<AGENCY><![CDATA[Department of Veterans Affairs]]></AGENCY>
<OFFICE><![CDATA[Canandaigua VAMC]]></OFFICE>
<LOCATION><![CDATA[Department of Veterans Affairs Medical Center]]></LOCATION>
<ZIP>14424</ZIP>
<CLASSCOD>H</CLASSCOD>
<NAICS>238210</NAICS>
<OFFADD><![CDATA[Department of Veterans Affairs;400 Fort Hill Ave.;Canandaigua NY 14424]]></OFFADD>
<SUBJECT><![CDATA[H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQUIPMENT AT THE SYRACUSE VA MEDICAL CENTER AND THE ROME COMMUNITY BASED OUTPATIENT CLINIC ]]></SUBJECT>
<SOLNBR><![CDATA[9069]]></SOLNBR>
<RESPDATE>11172017</RESPDATE>
<ARCHDATE>12172017</ARCHDATE>
<CONTACT><![CDATA[COiyiyS, JUhhiuN<a href="mailto:Juggyui@va.gov">CONTRACT SPECIALIST</a>]]></CONTACT>
<DESC><![CDATA[This is a Sources Sought Notice. (a) The Government does not intend to award a contract on the basis of this Sources Sought or to otherwise pay for the information solicited.(b) Although "proposal," "offeror," contractor, and "offeror" may be used in this sources sought notice, any response will be treated as information only. It shall not be used as a proposal.Attachment(s) if applicable. ]]></DESC>
<LINK><![CDATA[https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532/9069/listing.html]]></LINK>
<EMAIL>
<ADDRESS><![CDATA[Jigjhgjas@va.gov]]></ADDRESS>
<DESC><![CDATA[CONTRACT SPECIALIST]]></DESC>
</EMAIL>
<SETASIDE>N/A</SETASIDE>
<RECOVERY_ACT>N</RECOVERY_ACT>
<DOCUMENT_PACKAGES>
<PACKAGE><![CDATA[Attachment]]></PACKAGE>
</DOCUMENT_PACKAGES>
</SRCSGT>
</NOTICES>

The code I wrote

from lxml import etree as et
import pandas as pd

trees = et.parse('test.xml')  # get xml file
root = trees.getroot()        # get to root of file

tags = []   # list for holding all tags
datas = []  # list for holding all data in tags

for child in root:  # root is a list of all elements in the xml file
    # print(child.tag)
    tt = child.tag  # reads xml tag
    tags.append(tt)
    datas.append(child.text)  # read xml tag data
    for c in child.findall('./'):  # ./ finds children
        tt1 = c.tag
        tags.append(str(tt1))
        datas.append(c.text)
        for i in c.findall('./'):  # each child node loads a new list of elements
            tt2 = i.tag
            tags.append(str(tt1) + '_' + str(tt2))
            datas.append(i.text)
            for j in i.findall('./'):
                tt3 = j.tag
                tags.append(str(tt1) + '_' + str(tt2) + '_' + str(tt3))
                datas.append(j.text)
                for k in j.findall('./'):
                    tt4 = k.tag
                    tags.append(str(tt1) + '_' + str(tt2) + '_' + str(tt3) + '_' + str(tt4))
                    datas.append(k.text)

df = pd.DataFrame({"tags": tags, "values": datas})
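
The hand-written nesting levels in the code above can be collapsed into one recursive function that handles any depth. The following is only a minimal sketch, not the asker's code: it uses the standard library's `xml.etree.ElementTree` (whose basic element API lxml largely mirrors), and the helper name `walk` and the shortened inline sample are illustrative assumptions. Unlike the original loops, it does not record the `SRCSGT` tag itself.

```python
import xml.etree.ElementTree as ET

# Shortened inline sample (hypothetical, modelled on the file in the question).
SAMPLE = """<NOTICES>
  <SRCSGT>
    <DATE>11112017</DATE>
    <EMAIL>
      <ADDRESS>Jigjhgjas@va.gov</ADDRESS>
      <DESC>CONTRACT SPECIALIST</DESC>
    </EMAIL>
  </SRCSGT>
</NOTICES>"""


def walk(element, prefix=""):
    """Yield (path, text) for every descendant of `element`, at any depth."""
    for child in element:
        # Join parent and child tags with '_' as the question's code does.
        path = f"{prefix}_{child.tag}" if prefix else child.tag
        text = (child.text or "").strip()  # .text is None for empty elements
        yield path, text
        yield from walk(child, path)  # recurse instead of nesting another loop


root = ET.fromstring(SAMPLE)
pairs = [pair for record in root for pair in walk(record)]
tags = [path for path, _ in pairs]
datas = [text for _, text in pairs]
```

Passing `tags` and `datas` to `pd.DataFrame({"tags": tags, "values": datas})` would then give the same two-column layout as the question, one row per tag.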

The desired result looks like this:

date    agency  office
1/1/10  A1      O1
1/1/10  A2      O2
1/1/10  A3      O3

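So basically the tags should become the column headers, with the values filled in beneath them. Column names must not repeat, so that I can create a standard database table.

Best answer

Consider nested XPath loops: the outer loop visits every <SRCSGT> node, and the inner loop collects all of that node's descendants into an inner dictionary, which is appended to a list. That list is then passed to the DataFrame constructor: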

from lxml import etree as et
import pandas as pd

trees = et.parse('test.xml')

d = []
for srcsgt in trees.xpath('//SRCSGT'):    # iterate through the root's children
    inner = {}
    # The leading '.' keeps the search inside this record; a bare '//*'
    # would scan the whole document for every record.
    for elem in srcsgt.xpath('.//*'):     # iterate through this record's descendants
        # elem.text is None for empty elements, so guard before stripping
        if elem.text and elem.text.strip():   # keep only nodes with non-empty text
            inner[elem.tag] = elem.text

    d.append(inner)

df = pd.DataFrame(d)
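
One caveat with keying the dictionary on the bare tag: the sample record has both a top-level `<DESC>` and an `<EMAIL><DESC>`, so the later value overwrites the earlier one, which is why the `DESC` column ends up holding `CONTRACT SPECIALIST`. If every column has to be unique, one option is to prefix nested tags with their parent's name. The sketch below is a variant under stated assumptions, not the answer's method: it uses the standard library's `xml.etree.ElementTree` rather than lxml, a hypothetical helper `record_to_row`, and a made-up two-record sample.

```python
import xml.etree.ElementTree as ET

# Made-up two-record sample, modelled on the file in the question.
SAMPLE = """<NOTICES>
  <SRCSGT>
    <DATE>11112017</DATE>
    <DESC>This is a Sources Sought Notice.</DESC>
    <EMAIL>
      <ADDRESS>Jigjhgjas@va.gov</ADDRESS>
      <DESC>CONTRACT SPECIALIST</DESC>
    </EMAIL>
  </SRCSGT>
  <SRCSGT>
    <DATE>11122017</DATE>
    <DESC>Second notice.</DESC>
    <EMAIL/>
  </SRCSGT>
</NOTICES>"""


def record_to_row(record):
    """Flatten one record into {column_name: text}, prefixing nested tags."""
    row = {}
    # Explicit stack of (element, column_name); reversed so that elements
    # are popped in document order.
    stack = [(child, child.tag) for child in reversed(list(record))]
    while stack:
        elem, name = stack.pop()
        text = (elem.text or "").strip()  # .text is None for empty elements
        if text:  # skip pure-container nodes such as <EMAIL>
            row[name] = text
        stack.extend((sub, f"{name}_{sub.tag}") for sub in reversed(list(elem)))
    return row


root = ET.fromstring(SAMPLE)
rows = [record_to_row(record) for record in root.findall("SRCSGT")]
```

Passing `rows` to `pd.DataFrame(rows)` would give one row per `<SRCSGT>`, with `NaN` wherever a record lacks a column. Repeated sibling tags (for example several `<PACKAGE>` elements) would still overwrite one another and would need separate handling.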

Output

print(df)

#             ADDRESS                          AGENCY  ARCHDATE  CLASSCOD  \
# 0  Jigjhgjas@va.gov  Department of Veterans Affairs  12172017         H

#                                              CONTACT      DATE  \
# 0  COiyiyS, JUhhiuN<a href="mailto:Juggyui@va.gov...  11112017

#                   DESC                                               LINK  \
# 0  CONTRACT SPECIALIST  https://www.fbo.gov/spg/VA/CaVAMC532/CaVAMC532...

#                                         LOCATION   NAICS  \
# 0  Department of Veterans Affairs Medical Center  238210

#                                               OFFADD            OFFICE  \
# 0  Department of Veterans Affairs;400 Fort Hill A...  Canandaigua VAMC

#       PACKAGE  RECOVERY_ACT  RESPDATE  SETASIDE  SOLNBR  \
# 0  Attachment             N  11172017       N/A    9069

#                                              SUBJECT    ZIP
# 0  H--3 YEAR TESTING/MAINTENANCE OF ELECTRICAL EQ...  14424

Regarding "python - Creating a pandas DataFrame from a nested XML file", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47874196/
