python - Nested XML file to pandas DataFrame

Reposted. Author: 太空宇宙. Updated: 2023-11-04 00:20:04

I am having trouble parsing my XML file so that I can convert it to a pandas DataFrame. Here is a sample entry:

<p>


<persName id="t17200427-2-defend31" type="defendantName">
Alice
Jones
<interp inst="t17200427-2-defend31" type="surname" value="Jones"/>
<interp inst="t17200427-2-defend31" type="given" value="Alice"/>
<interp inst="t17200427-2-defend31" type="gender" value="female"/>
</persName>

, of <placeName id="t17200427-2-defloc7">St. Michael's Cornhill</placeName>
<interp inst="t17200427-2-defloc7" type="placeName" value="St. Michael's Cornhill"/>
<interp inst="t17200427-2-defloc7" type="type" value="defendantHome"/>
<join result="persNamePlace" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-defloc7"/>, was indicted for <rs id="t17200427-2-off8" type="offenceDescription">
<interp inst="t17200427-2-off8" type="offenceCategory" value="theft"/>
<interp inst="t17200427-2-off8" type="offenceSubcategory" value="shoplifting"/>
privately stealing a Bermundas Hat, value 10 s. out of the Shop of

<persName id="t17200427-2-victim33" type="victimName">
Edward
Hillior
<interp inst="t17200427-2-victim33" type="surname" value="Hillior"/>
<interp inst="t17200427-2-victim33" type="given" value="Edward"/>
<interp inst="t17200427-2-victim33" type="gender" value="male"/>
<join result="offenceVictim" targOrder="Y" targets="t17200427-2-off8 t17200427-2-victim33"/>
</persName>



</rs> , on the <rs id="t17200427-2-cd9" type="crimeDate">21st of April</rs>
<join result="offenceCrimeDate" targOrder="Y" targets="t17200427-2-off8 t17200427-2-cd9"/> last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her <rs id="t17200427-2-verdict10" type="verdictDescription">
<interp inst="t17200427-2-verdict10" type="verdictCategory" value="guilty"/>
<interp inst="t17200427-2-verdict10" type="verdictSubcategory" value="theftunder1s"/>
Guilty to the value of 10 d.
</rs>
<rs id="t17200427-2-punish11" type="punishmentDescription">
<interp inst="t17200427-2-punish11" type="punishmentCategory" value="transport"/>
<join result="defendantPunishment" targOrder="Y" targets="t17200427-2-defend31 t17200427-2-punish11"/>
Transportation
</rs> .</p>

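I want a DataFrame with columns for gender, offence, and the text of the trial. I have already extracted all of the data into a DataFrame, but I cannot get the text that sits between the tags.

Here is the sample code: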

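import xml.etree.ElementTree as ET
import pandas as pd

def table_of_cases(xml_file_name):
    file = ET.ElementTree(file=xml_file_name)
    iterate = file.iter()  # getiterator() was removed in Python 3.9
    i = 1
    num = str(i)
    labels = []
    table = pd.DataFrame()
    for element in iterate:
        if element.tag == "persName":
            t = element.attrib['type']
            try:
                # persName carries only id and type in the sample above, so this
                # lookup raises KeyError and the except clause silently skips it
                val = [element.attrib['value']]
                if t not in labels:
                    table[t] = val
                elif t+num not in labels:
                    table[t+num] = val
                elif t+num in labels:
                    num = str(i+1)
                    table[t+num] = val
            except Exception:
                pass
            labels = list(table.columns.values)
            num = str(i)

    return table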

** I have more than 1,000 files in the same XML format that need to go into a single DataFrame.

Best answer

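Because your XML is quite complex, with text values spilling across nodes, consider XSLT, a special-purpose language designed to transform XML files, particularly complex ones, into simpler ones.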

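Python's third-party module, lxml, can run XSLT 1.0 as well as XPath 1.0, so you can parse the transformed result and migrate it into a pandas DataFrame. Alternatively, you can use an external XSLT processor that Python calls with subprocess.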
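As a rough sketch of the external-processor route, the snippet below shells out to xsltproc, the command-line tool that ships with libxslt, and assumes it is installed and on PATH. The helper names build_xsltproc_command and run_external_xslt are hypothetical and not part of the original answer.

```python
import shutil
import subprocess


def build_xsltproc_command(xsl_path, xml_path, out_path):
    """Return the argument list for: xsltproc -o OUT STYLESHEET SOURCE."""
    return ["xsltproc", "-o", out_path, xsl_path, xml_path]


def run_external_xslt(xsl_path, xml_path, out_path):
    """Run the transformation with xsltproc and return the output path."""
    # Assumption: xsltproc is the external processor; others take different flags
    if shutil.which("xsltproc") is None:
        raise FileNotFoundError("xsltproc was not found on PATH")
    # check=True raises CalledProcessError on a non-zero exit status
    subprocess.run(build_xsltproc_command(xsl_path, xml_path, out_path), check=True)
    return out_path
```

The transformed file written to out_path can then be parsed like any other XML document. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.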

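Specifically, the XSLT below uses XPath's descendant::* axis to extract the necessary attributes for the defendant and the victim, plus the text value of the whole paragraph. It starts at the root and assumes <p> is a child of it.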
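Before committing to a full stylesheet, the same XPath expressions can be tried directly from Python with lxml. This is only an illustrative sketch: the SAMPLE string is a trimmed-down, hypothetical stand-in for one trial record, wrapped in an assumed <root> element so that <p> is a child of the root.

```python
import lxml.etree as et

# Hypothetical, shortened record; ids and wording are simplified for the demo
SAMPLE = """<root><p>
<persName id="d1" type="defendantName">
Alice
Jones
<interp inst="d1" type="surname" value="Jones"/>
<interp inst="d1" type="gender" value="female"/>
</persName>
, was indicted for <rs id="o1" type="offenceDescription">
<interp inst="o1" type="offenceCategory" value="theft"/>
privately stealing a Hat
</rs>.</p></root>"""

p = et.fromstring(SAMPLE).find("p")

# normalize-space() collapses the line breaks inside the element's text
name = p.xpath("normalize-space(descendant::persName[@type='defendantName'])")

# A path ending in an attribute step returns a list of attribute values
gender = p.xpath("descendant::persName[@type='defendantName']/interp[@type='gender']/@value")
offence = p.xpath("descendant::interp[@type='offenceCategory']/@value")

# Text of the whole paragraph with all markup stripped
trial_text = p.xpath("normalize-space(.)")

print(name, gender, offence)
print(trial_text)
```

Note that in the last expression the context node (.) is the <p> element itself, so the text of every nested element is included.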

XSLT (save as an .xsl file, which is a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/*">
    <xsl:apply-templates select="p"/>
  </xsl:template>

  <xsl:template match="p">
    <data>
      <defendantName><xsl:value-of select="normalize-space(descendant::persName[@type='defendantName'])"/></defendantName>
      <defendantGender><xsl:value-of select="descendant::persName[@type='defendantName']/interp[@type='gender']/@value"/></defendantGender>
      <offenceCategory><xsl:value-of select="descendant::interp[@type='offenceCategory']/@value"/></offenceCategory>
      <offenceSubCategory><xsl:value-of select="descendant::interp[@type='offenceSubcategory']/@value"/></offenceSubCategory>

      <victimName><xsl:value-of select="normalize-space(descendant::persName[@type='victimName'])"/></victimName>
      <victimGender><xsl:value-of select="descendant::persName[@type='victimName']/interp[@type='gender']/@value"/></victimGender>
      <verdictCategory><xsl:value-of select="descendant::interp[@type='verdictCategory']/@value"/></verdictCategory>
      <verdictSubCategory><xsl:value-of select="descendant::interp[@type='verdictSubcategory']/@value"/></verdictSubCategory>
      <punishmentCategory><xsl:value-of select="descendant::interp[@type='punishmentCategory']/@value"/></punishmentCategory>

      <trialText><xsl:value-of select="normalize-space(.)"/></trialText>
    </data>
  </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et
import pandas as pd

# LOAD XML AND XSL
doc = et.parse("Source.xml")
xsl = et.parse("XSLT_Script.xsl")

# RUN TRANSFORMATION
transformer = et.XSLT(xsl)
result = transformer(doc)

# OUTPUT TO CONSOLE
print(result)

# COLLECT EACH <data> ELEMENT'S CHILDREN INTO A DICTIONARY
data = []
for i in result.xpath('/*'):
    inner = {}
    for j in i.xpath('*'):
        inner[j.tag] = j.text

    data.append(inner)

trial_df = pd.DataFrame(data)

print(trial_df)

For the 1,000 similar XML files, loop over this process and append each one-row trial_df DataFrame to a list, then stack them with pd.concat.
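The loop described above could be sketched roughly as follows. The function name trials_to_frame and the extra sourceFile column are hypothetical additions for illustration, and the sketch assumes every file yields a single <data> element, as in the stylesheet above.

```python
import os

import lxml.etree as et
import pandas as pd


def trials_to_frame(xml_paths, xsl_path):
    """Transform each XML file with the stylesheet and stack the one-row frames."""
    transformer = et.XSLT(et.parse(xsl_path))  # compile the stylesheet once
    frames = []
    for path in xml_paths:
        result = transformer(et.parse(path))
        root = result.getroot()
        if root is None:  # the stylesheet produced no output for this file
            continue
        # One column per child element of <data>
        row = {child.tag: child.text for child in root}
        row["sourceFile"] = os.path.basename(path)  # hypothetical extra column
        frames.append(pd.DataFrame([row]))
    if not frames:
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 instead of repeating index 0
    return pd.concat(frames, ignore_index=True)
```

A call might look like trials_to_frame(sorted(glob.glob("trials/*.xml")), "XSLT_Script.xsl"), where the trials folder is a placeholder. Collecting the frames in a list and concatenating once at the end is generally cheaper than growing a DataFrame inside the loop.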

XML output

<?xml version="1.0"?>
<data>
<defendantName>Alice Jones</defendantName>
<defendantGender>female</defendantGender>
<offenceCategory>theft</offenceCategory>
<offenceSubCategory>shoplifting</offenceSubCategory>
<victimName>Edward Hillior</victimName>
<victimGender>male</victimGender>
<verdictCategory>guilty</verdictCategory>
<verdictSubCategory>theftunder1s</verdictSubCategory>
<punishmentCategory>transport</punishmentCategory>
<trialText>Alice Jones , of St. Michael's Cornhill, was indicted for privately stealing a Bermundas Hat, value 10 s. out of the Shop of Edward Hillior , on the 21st of April last. The Prosecutor's Servant deposed that the Prisner came into his Master's Shop and ask'd for a Hat of about 10 s. price; that he shewed several, and at last they agreed for one; but she said it was to go into the Country, and that she would stop into Bishopsgate-street. and if the Coach was not gone she would come and fetch it; that she went out of the Shop but he perceiving she could hardly walk fetcht her back again, and the Hat mentioned in the Indictment fell from between her Legs. Another deposed that he saw the former Evidence take the Hat from under her Petticoats. The Prisoner denyed the Fact, and called two Persons to her Reputation, who gave her a good Character, and said that she rented a House of 10 l. a Year in Petty France, at Westminster, but she had told the Justice that she liv'd in King-Street. The Jury considering the whole matter, found her Guilty to the value of 10 d. Transportation .</trialText>
</data>

DataFrame output

#   defendantGender defendantName offenceCategory offenceSubCategory  \
# 0          female   Alice Jones           theft        shoplifting

#   punishmentCategory                                          trialText  \
# 0          transport  Alice Jones , of St. Michael's Cornhill, was i...

#   verdictCategory verdictSubCategory victimGender      victimName
# 0          guilty       theftunder1s         male  Edward Hillior

Regarding python - Nested XML file to pandas DataFrame, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49439081/
