python - DataFrame to hierarchical XML

Reposted by: 行者123 · Updated: 2023-12-01 07:21:20

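I am reading a CSV into a DataFrame and then converting it to XML with the lxml library.

This is my first time working with XML, and it seems to be only partly working. Any help would be appreciated.

The CSV file used to create the DataFrame: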

Parent,Element,Text,Attribute
,TXLife,"
    ",{'Version': '2.25.00'}
TXLife,UserAuthRequest,"
        ",{}
UserAuthRequest,UserLoginName,*****,{}
UserAuthRequest,UserPswd,"
            ",{}
UserPswd,CryptType,None,{}
UserPswd,Pswd,****,{}
TXLife,TXLifeRequest,"
        ",{'PrimaryObjectID': 'Policy_1'}
TXLifeRequest,TransRefGUID,706D67C1-CC4D-11CF-91FB444554540000,{}
TXLifeRequest,TransType,Holding Change,{'tc': '502'}
TXLifeRequest,TransExeDate,2006-11-19,{}
TXLifeRequest,TransExeTime,13:15:33-07:00,{}
TXLifeRequest,ChangeSubType,"
            ",{}
ChangeSubType,ChangeTC,Change Participant,{'tc': '9'}
TXLifeRequest,OLifE,"
            ",{}
OLifE,Holding,"
                ",{'id': 'Policy_1'}
Holding,HoldingTypeCode,Policy,{'tc': '2'}
Holding,Policy,"
                    ",{}
Policy,PolNumber,1234567,{}
Policy,LineOfBusiness,Annuity,{'tc': '2'}
Policy,Annuity,,{}
OLifE,Party,"
                ",{'id': 'Beneficiary_1'}
Party,PartyTypeCode,Organization,{'tc': '2'}
Party,FullName,The Smith Trust,{}
Party,Organization,"
                    ",{}
Organization,OrgForm,Trust,{'tc': '16'}
Organization,DBA,The Smith Trust,{}
OLifE,Relation,"
                ","{'id': 'Relation_1', 'OriginatingObjectID': 'Policy_1', 'RelatedObjectID': 'Beneficiary_1'}"
Relation,OriginatingObjectType,Holding,{'tc': '4'}
Relation,RelatedObjectType,Party,{'tc': '6'}
Relation,RelationRoleCode,Primary Beneficiary,{'tc': '34'}
Relation,BeneficiaryDesignation,Named,{'tc': '1'}

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv') .fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1]

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt

print(etree.tostring(tree, pretty_print=True).decode())

Actual result:

<TXLife>
  <UserAuthRequest>
    <UserLoginName>*****</UserLoginName>
    <UserPswd>
            </UserPswd>
    <CryptType>None</CryptType>
    <Pswd>xxxxxx</Pswd>
  </UserAuthRequest>
  <TXLifeRequest>
    <TransRefGUID>706D67C1-CC4D-11CF-91FB444554540000</TransRefGUID>
    <TransType tc="502">Holding Change</TransType>
    <TransExeDate>11/19/2006</TransExeDate>
    <TransExeTime>13:15:33-07:00</TransExeTime>
    <ChangeSubType>
            </ChangeSubType>
    <ChangeTC tc="9">Change Participant</ChangeTC>
    <OLifE>
            </OLifE>
    <Holding id="Policy_1">
                </Holding>
    <HoldingTypeCode tc="2">Policy</HoldingTypeCode>
    <Policy>
                    </Policy>
    <PolNumber>1234567</PolNumber>
    <LineOfBusiness tc="2">Annuity</LineOfBusiness>
    <Annuity>NA</Annuity>
    <Party id="Beneficiary_1">
                </Party>
    <PartyTypeCode tc="2">Organization</PartyTypeCode>
    <FullName>The Smith Trust</FullName>
    <Organization>
                    </Organization>
    <OrgForm tc="16">Trust</OrgForm>
    <DBA>The Smith Trust</DBA>
    <Relation OriginatingObjectID="Policy_1" RelatedObjectID="Beneficiary_1" id="Relation_1">
                </Relation>
    <OriginatingObjectType tc="4">Holding</OriginatingObjectType>
    <RelatedObjectType tc="6">Party</RelatedObjectType>
    <RelationRoleCode tc="34">Primary Beneficiary</RelationRoleCode>
    <BeneficiaryDesignation tc="1">Named</BeneficiaryDesignation>
  </TXLifeRequest>
</TXLife>

Desired result:

<TXLife Version="2.25.00">
    <UserAuthRequest>
        <UserLoginName>*****</UserLoginName>
        <UserPswd>
            <CryptType>None</CryptType>
            <Pswd>****</Pswd>
        </UserPswd>
    </UserAuthRequest>
    <TXLifeRequest PrimaryObjectID="Policy_1">
        <TransRefGUID>706D67C1-CC4D-11CF-91FB444554540000</TransRefGUID>
        <TransType tc="502">Holding Change</TransType>
        <TransExeDate>2006-11-19</TransExeDate>
        <TransExeTime>13:15:33-07:00</TransExeTime>
        <ChangeSubType>
            <ChangeTC tc="9">Change Participant</ChangeTC>
        </ChangeSubType>
        <OLifE>
            <Holding id="Policy_1">
                <HoldingTypeCode tc="2">Policy</HoldingTypeCode>
                <Policy>
                    <PolNumber>1234567</PolNumber>
                    <LineOfBusiness tc="2">Annuity</LineOfBusiness>
                    <Annuity></Annuity>
                </Policy>
            </Holding>
            <Party id="Beneficiary_1">
                <PartyTypeCode tc="2">Organization</PartyTypeCode>
                <FullName>The Smith Trust</FullName>
                <Organization>
                    <OrgForm tc="16">Trust</OrgForm>
                    <DBA>The Smith Trust</DBA>
                </Organization>
            </Party>
            <Relation id="Relation_1" OriginatingObjectID="Policy_1" RelatedObjectID="Beneficiary_1">
                <OriginatingObjectType tc="4">Holding</OriginatingObjectType>
                <RelatedObjectType tc="6">Party</RelatedObjectType>
                <RelationRoleCode tc="34">Primary Beneficiary</RelationRoleCode>
                <BeneficiaryDesignation tc="1">Named</BeneficiaryDesignation>
            </Relation>
        </OLifE>
    </TXLifeRequest>
</TXLife>

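How can I get the hierarchical result shown above?

Best answer

You're off to a good start! I think the easiest approach is to go through your code bit by bit, explain what needs adjusting, and suggest a few improvements:

Reading and cleaning the data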

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna('NA')
# # Remove rows with comments
# dfc = dfc[~dfc['Element'].str.contains("<cyfunction")].fillna('')
dfc['Attribute'] = dfc['Attribute'].apply(lambda x: x.replace("'", '"'))

.apply works fine here, but there is also a .str.replace() method, which is tidier and clearer (.str lets you treat the column's values as strings and operate on them accordingly).
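
The quote swap in this step exists only so that the Attribute strings can later be parsed as JSON. As a hedged alternative (not what this answer uses), the standard library's ast.literal_eval can parse the Python-style dict literals directly, which avoids breaking on values that contain an apostrophe. The second sample string below is made up for illustration and does not come from the CSV.

```python
import ast
import json

# Attribute string as it appears in the CSV (a Python dict literal).
raw_plain = "{'tc': '502'}"
# Hypothetical value containing an apostrophe, not present in the original data.
raw_tricky = '''{'note': "O'Brien"}'''

# The quote-swap approach works for the plain case...
swapped = json.loads(raw_plain.replace("'", '"'))

# ...but turns the tricky case into invalid JSON.
try:
    json.loads(raw_tricky.replace("'", '"'))
    swap_ok = True
except json.JSONDecodeError:
    swap_ok = False

# ast.literal_eval evaluates literals only, so it cannot run arbitrary code.
parsed_plain = ast.literal_eval(raw_plain)
parsed_tricky = ast.literal_eval(raw_tricky)

print(swapped, swap_ok, parsed_plain, parsed_tricky)
```

For the data shown in the question both approaches give the same dictionaries, so this only matters if attribute values might contain quote characters.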

Adding the root

# Add the root element for xml
root = etree.Element(dfc['Element'][0])
tree = root.getroottree()

All good here!

Iterating over the rows

for prnt, elem, txt, attr in dfc[['Parent', 'Element', 'Text', 'Attribute']][1:].values:

Since you are retrieving all of the columns anyway, there is no need to index into dfc to select them, so that part can be dropped:

for prnt, elem, txt, attr in dfc[1:].values:

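This works well, but there are built-in ways to iterate over the rows of a DataFrame, and we can use itertuples(). It returns a namedtuple for each row, with the index (basically the row number) as the first item, so we need to unpack one extra value: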

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():

Setting up variables

    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # list(root) = root.getchildren()
    children = [item for item in str(list(root)).split(' ')]
    rootstring = str(root).split(' ')[1]

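Replacing the single quotes with double quotes earlier is a nice trick, because it lets us use json to convert the attributes into a dictionary.
Every Element has a .tag attribute that gives us its name directly, which is what we want here: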

children = [item.tag for item in root]
rootstring = root.tag

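Both list(root) and root.getchildren() give us a list of root's child elements (getchildren() is deprecated in lxml), but we can also iterate over them directly with for ... in root, as the comprehension above does.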
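
A minimal sketch of reading names through .tag. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, an assumption made only so that the snippet runs without third-party packages; both expose .tag and direct iteration over children.

```python
import xml.etree.ElementTree as ET

root = ET.Element("TXLife")
ET.SubElement(root, "UserAuthRequest")
ET.SubElement(root, "TXLifeRequest")

# .tag returns the element name as a plain string; no parsing of str(root) needed.
rootstring = root.tag

# Iterating over an element yields its direct children only.
children = [item.tag for item in root]

print(rootstring, children)
```

Going through .tag also avoids depending on the exact format of str(element), which is a debugging representation rather than a stable interface.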

Adding elements to the tree

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == str(root).split(' ')[1]:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif not prnt == rootstring and prnt in children:
        child = etree.SubElement(parent, elem, attrib).text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [str(item).split(' ') for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib).text = txt

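str(root).split(' ')[1] is exactly what we set rootstring to above, so we can use that instead.
Since we already checked prnt == rootstring in the first if statement, we know it cannot be equal by the time we reach the first elif, so there is no need to check it again.
When creating the child, the line performs two assignments at once. It does somehow succeed in creating the child with its text (!), but it leaves child bound to the text rather than to the new SubElement. It is better to do this in two steps.
When looking for the parent, the code currently builds a list of lists (split() returns a list), so the membership test will not work as intended. We need each item's tag.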
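
The chained-assignment point can be checked in isolation. This is a small sketch using the standard library's xml.etree.ElementTree in place of lxml (an assumption for portability; the behaviour shown is plain Python assignment semantics, not something specific to either library). The text "secret" is a made-up value.

```python
import xml.etree.ElementTree as ET

root = ET.Element("UserAuthRequest")

# Chained assignment: both targets receive the right-hand value "*****".
child = ET.SubElement(root, "UserLoginName").text = "*****"
chained_type = type(child).__name__  # child is the text, not an element

# Two-step version: child2 stays bound to the new element.
child2 = ET.SubElement(root, "UserPswd")
child2.text = "secret"  # hypothetical text value

print(chained_type, root[0].text, child2.tag)
```

The element is still created and gets its text in the chained form, which is why the original code appeared to work; only the variable child ends up holding the wrong thing.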

Making all of these changes gives us:

#     If the parent is root then add the element as child (appaers to work?)
    if prnt == rootstring:
        parent = etree.SubElement(root, elem)

    # If the parent is not root but is one of its children then add the elements to the parent
    elif prnt in children:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

#     # If the parent is not in root's descendents then add the childern to the parents
    elif not prnt in [item.tag for item in root.iterdescendants()]:
        child = etree.SubElement(parent, elem, attrib)
        child.text = txt

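But there are still two problems here.

The first part (the if statement) is fine.

In the second part (the first elif), we check whether the new element's parent is one of root's children. If it is, we add the new element as a child of parent. parent is definitely one of root's children, but we never actually check that it is the right one; it is simply whatever we last added to root. Luckily, because the CSV lists all of the elements in order, it happens to be the right one, but it is better to be explicit about this.

In the third part (the second elif), it is a good idea to check whether prnt already exists further down the tree. But as written, if prnt does not exist, we just add the element to parent anyway, even though that is not its actual parent! And if prnt does exist, we do not add the element at all (so we would need an else clause here).

Solution

Thankfully there is a simple fix: we can use .find() to locate the prnt element anywhere in the tree and then add the new element to it. This also makes the whole loop shorter!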

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Convert attributes to json (dictionary)
    attrib = json.loads(attr)
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    child = etree.SubElement(parent, elem, attrib)
    child.text = txt

The .// in root.find(".//" + prnt) means the search covers root's descendants at any depth, looking for a matching element tag (read more here: https://lxml.de/tutorial.html#elementpath).
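
A short sketch of how the ".//" lookup behaves, again using the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption for portability; the ElementPath expressions used here are the basic ones shared by both). The tag NoSuchTag is made up.

```python
import xml.etree.ElementTree as ET

root = ET.Element("TXLife")
auth = ET.SubElement(root, "UserAuthRequest")
pswd = ET.SubElement(auth, "UserPswd")
ET.SubElement(pswd, "CryptType")

# ".//" matches descendants at any depth below root.
found = root.find(".//UserPswd")

# A bare tag only matches direct children, so the nested element is missed.
direct_only = root.find("UserPswd")

# The search covers descendants, not root itself, which is why the loop
# handles the `prnt == root.tag` case separately.
root_lookup = root.find(".//TXLife")

# find() returns None when nothing matches.
missing = root.find(".//NoSuchTag")

print(found is pswd, direct_only, root_lookup, missing)
```

Because a failed lookup returns None, a row whose parent has not been created yet would make the following SubElement call fail, so checking for None first may be worth it on less tidy input.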

Final script

import lxml.etree as etree
import pandas as pd
import json

# Read the csv file
dfc = pd.read_csv('test_data_txlife.csv').fillna("NA")
dfc['Attribute'] = dfc['Attribute'].str.replace("'", '"').apply(lambda s: json.loads(s))

# Add the root element for xml
root = etree.Element(dfc['Element'][0], dfc['Attribute'][0])

for idx, prnt, elem, txt, attr in dfc[1:].itertuples():
    # Fix text
    text = txt.strip()
    if not text:
        text = None
    # Find parent element
    if prnt == root.tag:
        parent = root
    else:
        parent = root.find(".//" + prnt)
    # Create element
    child = etree.SubElement(parent, elem, attr)
    child.text = text

xml_string = etree.tostring(root, pretty_print=True).decode().replace(">NA<", "><")
print(xml_string)

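I made a few more changes:

I moved the json.loads step for the attribute dictionaries up to where the quotes are changed, chaining it on the end with apply. We need it there so that the dictionary is available when creating the root element.
Getting the pretty printing to work took some fiddling, which is what the "Fix text" section is for (see this Stack Overflow question for the problem I ran into).
Using .fillna("") (filling with empty strings) would be the most sensible choice, but if we do that we end up with <Annuity/> instead of <Annuity></Annuity> (which is legal XML: if an element has no text or children, it can be written as a single self-closing tag). To get the output we want, the element needs some "content" so that separate start and end tags are written. So I kept .fillna("NA") and then removed the placeholder from the output string manually at the end.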
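
For comparison, here is a hedged sketch of the same empty-element issue using the standard library's xml.etree.ElementTree rather than lxml. Its tostring() accepts a short_empty_elements flag; this is a swapped-in technique, and no claim is made that the lxml serializer used in the script offers the same option.

```python
import xml.etree.ElementTree as ET

policy = ET.Element("Policy")
ET.SubElement(policy, "Annuity")  # no text, no children

# Default serialization collapses empty elements to a self-closing tag.
short_form = ET.tostring(policy, encoding="unicode")

# short_empty_elements=False writes explicit start and end tags instead.
long_form = ET.tostring(policy, encoding="unicode", short_empty_elements=False)

print(short_form)
print(long_form)
```

With a serializer option like this, no placeholder text is needed, so genuine 'NA' values in the data would not be at risk of being stripped.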

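It is also worth knowing that this script makes (at least) four assumptions about the input data:

Parent elements are created before any of their children (that is, they appear earlier in the CSV file).
Element names are unique (or at least no duplicated name has children, so we never call .find() where more than one match is possible; .find() always returns the first match).
There are no 'NA' text values to preserve in the final XML (they would be removed along with the placeholder 'NA' text that we strip from the Annuity element).
Text consisting only of whitespace does not need to be preserved.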
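
The first two assumptions can be checked before building the tree. The helper below is a hypothetical sketch: check_assumptions and its input format (a list of (parent, element) pairs in file order, with an empty string as the root's parent) are illustrative choices, not part of the original script.

```python
from collections import Counter

def check_assumptions(rows):
    """Return a list of problems found in (parent, element) pairs given in file order."""
    problems = []
    seen = set()
    for parent, element in rows:
        # Assumption 1: a parent must already exist when its child is added.
        if parent and parent not in seen:
            problems.append(f"{element}: parent {parent!r} not defined yet")
        seen.add(element)
    # Assumption 2: names used as parents must be unique, or find() is ambiguous.
    counts = Counter(element for _, element in rows)
    parents = {parent for parent, _ in rows}
    for name, n in counts.items():
        if n > 1 and name in parents:
            problems.append(f"{name}: appears {n} times and has children")
    return problems

good = [("", "TXLife"), ("TXLife", "OLifE"), ("OLifE", "Party")]
bad = [("", "TXLife"), ("Party", "FullName"), ("TXLife", "Party"), ("TXLife", "Party")]
print(check_assumptions(good))
print(check_assumptions(bad))
```

With the DataFrame from the script, the pairs could be taken from the Parent and Element columns, after mapping the root row's missing parent to an empty string.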

Regarding python - DataFrame to hierarchical XML, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57682123/
