python - Create a DataFrame from nested XML and generate a CSV

Reposted. Author: 行者123. Last updated: 2023-12-01 02:07:12

I have an XML file like this:

<?xml version="1.0"?>
<PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="lp_pkg_cla" Object_Num="1001p">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="6666p" Attr_Name="LP_Portable">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Class Def" MessageType="Integration Object">
<ListOf_Class_Def>
<ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
<Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable">
</Object_Arrt>
<Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless">
</Object_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Class_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Laptop" Object_Num="2008a">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1001p" Ancestor_Name="lp_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
<ListOfObject_Def>
<Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
<PropertySet NumOutputObjects="1" >
<Message IntObjectName="Prod Def" MessageType="Integration Object">
<ListOf_Prod_Def>
<ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
<ListOfObject_Def>
<Object_Def Ancestor_Num="" Ancestor_Name="">
</Object_Def>
</ListOfObject_Def>
<ListOfObject_Arrt>
</ListOfObject_Arrt>
</ImpExp>
</ListOf_Prod_Def>
</Message>
</PropertySet>
</PropertySet>

I would like to extract the Name, Object_Num, Orig_Id and Attr_Name attributes from it using Python and convert them to .csv format.

The .csv layout I would like to see is simple:

ProductId  Product  AttributeId  Attribute
2008a      Laptop   6666p        LP_Portable
2987d      Mouse    7010p        O_Portable
2987d      Mouse    7012j        O_wireless
5463g      Speaker  ""           ""

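The XML tags are related as follows:

  1. All products are in tags of the form ImpExp Type="PROD_DEF"..
  2. All attributes are in tags of the form ImpExp Type="CLASS_DEF"..
  3. If a product has attributes, it contains a tag such as
    <Object_Def Ancestor_Num="1023i".. >

  4. That Ancestor_Num equals the Object_Num of the tag with Type="CLASS_DEF"..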
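
The four rules above amount to a left join of products onto class definitions. As a rough sketch only, using the standard library instead of lxml or pandas, the lookup could be written as follows. The function names `product_attribute_rows` and `to_csv_text` and the trimmed `SAMPLE` document are assumptions made for illustration; they are not part of the original post.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed, hypothetical sample that follows the structure of the XML above.
SAMPLE = """\
<PropertySet>
  <PropertySet><Message><ListOf_Class_Def>
    <ImpExp Type="CLASS_DEF" Name="M_pkg_cla" Object_Num="1023i">
      <ListOfObject_Arrt>
        <Object_Arrt Orig_Id="7010p" Attr_Name="O_Portable"/>
        <Object_Arrt Orig_Id="7012j" Attr_Name="O_wireless"/>
      </ListOfObject_Arrt>
    </ImpExp>
  </ListOf_Class_Def></Message></PropertySet>
  <PropertySet><Message><ListOf_Prod_Def>
    <ImpExp Type="PROD_DEF" Name="Mouse" Object_Num="2987d">
      <ListOfObject_Def>
        <Object_Def Ancestor_Num="1023i" Ancestor_Name="M_pkg_cla"/>
      </ListOfObject_Def>
      <ListOfObject_Arrt/>
    </ImpExp>
    <ImpExp Type="PROD_DEF" Name="Speaker" Object_Num="5463g">
      <ListOfObject_Def>
        <Object_Def Ancestor_Num="" Ancestor_Name=""/>
      </ListOfObject_Def>
      <ListOfObject_Arrt/>
    </ImpExp>
  </ListOf_Prod_Def></Message></PropertySet>
</PropertySet>
"""


def product_attribute_rows(xml_text):
    """Return (ProductId, Product, AttributeId, Attribute) tuples."""
    root = ET.fromstring(xml_text)

    # Rules 2 and 4: index each CLASS_DEF's attributes by its Object_Num.
    attrs_by_class = {}
    for imp in root.iter("ImpExp"):
        if imp.get("Type") == "CLASS_DEF":
            attrs_by_class[imp.get("Object_Num")] = [
                (attr.get("Orig_Id"), attr.get("Attr_Name"))
                for attr in imp.iter("Object_Arrt")
            ]

    rows = []
    for imp in root.iter("ImpExp"):
        # Rule 1: only PROD_DEF entries are products.
        if imp.get("Type") != "PROD_DEF":
            continue
        # Rule 3: the product points at its class through Ancestor_Num.
        obj_def = imp.find("ListOfObject_Def/Object_Def")
        ancestor = obj_def.get("Ancestor_Num", "") if obj_def is not None else ""
        # A product without a matching class keeps one row with blank attribute fields.
        for orig_id, attr_name in attrs_by_class.get(ancestor) or [("", "")]:
            rows.append((imp.get("Object_Num"), imp.get("Name"), orig_id, attr_name))
    return rows


def to_csv_text(rows):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["ProductId", "Product", "AttributeId", "Attribute"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv_text(product_attribute_rows(SAMPLE)))
```

Building the class index first keeps the product loop to a single dictionary lookup per product, and the `or [("", "")]` fallback is what preserves attribute-less products such as Speaker.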

I have tried this:

from lxml import etree
import pandas
import HTMLParser

inFile = "./newm.xml"
outFile = "./new.csv"

ctx1 = etree.iterparse(inFile, tag=("ImpExp", "ListOfObject_Def", "ListOfObject_Arrt",))


hp = HTMLParser.HTMLParser()
csvData = []
csvData1 = []
csvData2 = []
csvData3 = []
csvData4 = []
csvData5 = []

for event, elem in ctx1:
    value1 = elem.get("Type")
    value2 = elem.get("Name")
    value3 = elem.get("Object_Num")
    value4 = elem.get("Ancestor_Num")
    value5 = elem.get("Orig_Id")
    value6 = elem.get("Attr_Name")
    if value1 == "PROD_DEF":
        csvData.append(value2)
        csvData1.append(value3)
        for event, elem in ctx1:
            if value4 is not None:
                csvData2.append(value4)
    elem.clear()

df = pandas.DataFrame({'Product':csvData, 'ProductId':csvData1, 'AncestorId':csvData2})

for event, elem in ctx1:
    if value1 == "Class Def":
        csvData3.append(value3)
        csvData4.append(value5)
        csvData5.append(value6)
    elem.clear()

df1 = pandas.DataFrame({'AncestorId':csvData3, 'AttribId':csvData4, 'AttribName':csvData5})

dff = pandas.merge(df, df1, on="AncestorId")
dff.to_csv(outFile, index = False)

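Best answer

Consider XSLT, a special-purpose language designed to transform XML files. It can convert XML directly to CSV (that is, a text file) without a pandas DataFrame as an intermediary. Python's third-party module lxml (which you are already using) can run XSLT 1.0 scripts, so no for loops or if logic are needed on the Python side. However, because aligning products with attributes is complex, the XSLT relies on some fairly long XPath searches.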

XSLT (save as an .xsl file, which is a special kind of .xml file)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="no" method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:param name="delimiter">,</xsl:param>

  <!-- Root element: write the header row, then walk the tree -->
  <xsl:template match="/PropertySet">
    <xsl:text>ProductId,Product,AttributeId,Attribute&#xa;</xsl:text>
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <xsl:template match="PropertySet|Message|ListOf_Class_Def|ListOf_Prod_Def|ImpExp">
    <xsl:apply-templates select="*"/>
  </xsl:template>

  <!-- A product with no attributes and no ancestor class gets a row with blank attribute fields -->
  <xsl:template match="ListOfObject_Arrt">
    <xsl:apply-templates select="Object_Arrt"/>
    <xsl:if test="name(*) != 'Object_Arrt' and preceding-sibling::ListOfObject_Def/Object_Def/@Ancestor_Name = ''">
      <xsl:value-of select="concat(ancestor::ImpExp/@Object_Num, $delimiter,
                                   ancestor::ImpExp/@Name, $delimiter,
                                   '', $delimiter,
                                   '')"/><xsl:text>&#xa;</xsl:text>
    </xsl:if>
  </xsl:template>

  <!-- Each class attribute is paired with the product whose Ancestor_Name is this class -->
  <xsl:template match="Object_Arrt">
    <xsl:variable name="attrName" select="ancestor::ImpExp/@Name"/>
    <xsl:value-of select="concat(/PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
                                 ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Object_Num, $delimiter,

                                 /PropertySet/PropertySet/Message[@IntObjectName='Prod Def']/ListOf_Prod_Def/
                                 ImpExp[ListOfObject_Def/Object_Def/@Ancestor_Name = $attrName]/@Name, $delimiter,

                                 @Orig_Id, $delimiter,
                                 @Attr_Name)"/><xsl:text>&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as et

# LOAD XML AND XSL
xml = et.parse('Input.xml')
xsl = et.parse('XSLT_Script.xsl')

# RUN TRANSFORMATION
transform = et.XSLT(xsl)
result = transform(xml)

# OUTPUT TO FILE
with open('Output.csv', 'wb') as f:
    f.write(result)

Output

ProductId,Product,AttributeId,Attribute
2008a,Laptop,6666p,LP_Portable
2987d,Mouse,7010p,O_Portable
2987d,Mouse,7012j,O_wireless
5463g,Speaker,,
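
To check a file like this, it can be read back with Python's standard csv module. The sketch below is illustrative only: `CSV_TEXT` is a hypothetical in-memory sample whose columns follow the header row (in practice the text would come from the generated file), and `attributes_by_product` is an assumed helper name, not something from the original answer.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the header's column order; in practice this text
# would be read from the generated file, e.g. open('Output.csv', newline='').
CSV_TEXT = """\
ProductId,Product,AttributeId,Attribute
2008a,Laptop,6666p,LP_Portable
2987d,Mouse,7010p,O_Portable
2987d,Mouse,7012j,O_wireless
5463g,Speaker,,
"""


def attributes_by_product(csv_text):
    """Group attribute names by product name, keeping attribute-less products."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        attrs = grouped[row["Product"]]  # creates the entry even without attributes
        if row["Attribute"]:             # blank fields are read back as ''
            attrs.append(row["Attribute"])
    return dict(grouped)


if __name__ == "__main__":
    for product, attrs in attributes_by_product(CSV_TEXT).items():
        print(product, attrs)
```

Blank trailing fields, as in the Speaker row, come back as empty strings, so a product without attributes still appears in the result with an empty list.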

Regarding "python - Create a DataFrame from nested XML and generate a CSV", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48941023/