gpt4 book ai didi

python - 将分层信息从 XML 解析为 sqlite

转载 作者:太空宇宙 更新时间:2023-11-03 16:23:42 26 4
gpt4 key购买 nike

我发布的脚本会很长,请您耐心等待。我相信对于熟悉XML结构复杂性的人来说很容易解决。我有一个高度分层的 XML 文件。我很想基于它制作 sqlite 数据库。到目前为止,我拥有的两个文件从具有此结构的 XML 文件中提取同级信息(注意 - 工作频率从 0 到 4 或 5 不等;节点也可以为空):

<program>
<id>28798031</id>
<programID>12345</programID>
<orchestra>New York Philarmonic</orchestra>
<season>1842-43</season>
<concertInfo>
<eventType>Subscription Season</eventType>
<Location>Manhattan, NY</Location>
<Venue>Apollo Rooms</Venue>
<Date>1842-12-07T05:00:00Z</Date>
<Time>8:00PM</Time>
</concertInfo>
<worksInfo>
<work ID="52446*">
<composerName>Beethoven, Ludwig van</composerName>
<workTitle>SYMPHONY NO. 5 IN C MINOR, OP.67</workTitle>
<conductorName>Hill, Ureli Corelli</conductorName>
</work>
<work ID="8834*4">
<composerName>Weber, Carl Maria Von</composerName>
<workTitle>OBERON</workTitle>
<movement>"Ozean, du Ungeheuer" (Ocean, thou mighty monster), Reiza (Scene and Aria), Act II</movement>
<conductorName>Timm, Henry C.</conductorName>
<soloists>
<soloist>
<soloistName>Otto, Antoinette</soloistName>
<soloistInstrument>Soprano</soloistInstrument>
<soloistRoles>S</soloistRoles>
</soloist>
</soloists>
</work>
</worksInfo>
</program>

我有两个脚本可以从兄弟节点(相同的层次结构级别)中提取数据。

第一

import sqlite3
import xml.etree.cElementTree

def insert_program(db_conn, id, programID, orchestra, season, concertInfo, worksInfo):
curs = db_conn.cursor()
curs.execute("insert into program values (?,?,?,?,?,?)", (id, programID, orchestra, season, concertInfo, worksInfo))
db_conn.commit()

def program_data_from_element(element):
id = element.find("id").text
programID = element.find("programID").text
orchestra = element.find("orchestra").text
season = element.find("season").text
concertInfo = element.find("concertInfo").text
worksInfo = element.find("worksInfo").text
return id, programID, orchestra, season, concertInfo, worksInfo

## add the main loop to get all the programs from the XML file
if __name__ == "__main__":
conn = sqlite3.connect("program.sqlite3")
program = xml.etree.cElementTree.parse("complete.xml")
program = program.findall("program")
for index, element in enumerate(program):
id, programID, orchestra, season, concertInfo, worksInfo = program_data_from_element(element)
insert_program(conn, id, programID, orchestra, season, concertInfo, worksInfo)

第二

import sqlite3
import xml.etree.cElementTree

def insert_work(db_conn, workID, composerName, workTitle, movement, conductorName):
curs = db_conn.cursor()
curs.execute("insert into work values (?,?,?,?,?)", (workID, composerName, workTitle, movement, conductorName))
db_conn.commit()

def work_data_from_element(element):
workID = element.get("ID")
if workID != None:
workID = workID
else:
workID = ''
composerName = element.find("composerName")
if composerName != None:
composerName = composerName.text
else:
composerName = ''
workTitle = element.find("workTitle")
if workTitle != None:
workTitle = workTitle.text
else:
workTitle = ''
movement = element.find("movement")
if movement != None:
movement = movement.text
else:
movement = ''
conductorName = element.find("conductorName")
if conductorName != None:
conductorName = conductorName.text
else:
conductorName = ''
return workID, composerName, workTitle, movement, conductorName


## add the main loop to get all the work information from the XML file
if __name__ == "__main__":
conn = sqlite3.connect("work.sqlite3")
programs = xml.etree.cElementTree.parse("complete.xml")
work = programs.findall("program/worksInfo/work")
for index, element in enumerate(work):
workID, composerName, workTitle, movement, conductorName = work_data_from_element(element)
insert_work(conn, workID, composerName, workTitle, movement, conductorName)

我的问题是——如何将两者连接在一起,以便我可以同时知道节目 ID、季节、工作 ID 和指挥名称等?谢谢!!

最佳答案

每当使用需要扁平化为二维格式(如数据库表)的嵌套分层 XML 文件时,请考虑使用 XSLT 解决方案。据了解,XSLT是一种专用编程语言,旨在转换 XML 文件,与大多数通用语言一样,Python 可以处理 XSLT 1.0 脚本,特别是使用 lxml模块。

因此,对于 programwork SQLite 表,请考虑运行以下 XSLT 脚本来转换源 XML 文件,然后使用 XPath(XSLT 的同级)解析数据库将所有内容导入一个 Python 脚本中。顺便说一句,XSLT 是格式良好的 XML 文件,因此可以像任何其他 XML 一样从文件或字符串加载。 XSLT 的优点还在于您可以对可能存在/可能不存在的元素进行硬编码,以便为丢失的节点返回空文本。

对结构进行一些更改:

  1. 连接 <concerntInfo> 的所有子项(也许用作单独的表格?)
  2. 已删除 <worksInfo>来自程序解析,因为它反射(reflect)在工作
  3. 添加了相应的程序<id> 工作解析以用作表中的外键
<小时/>

编写XSLT (另存为.xsl文件以便在Python中调用)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

<xsl:template match="/">
<program>
<xsl:apply-templates select="program"/>
</program>
</xsl:template>

<xsl:template match="program">
<id><xsl:value-of select="id"/></id>
<programID><xsl:value-of select="programID"/></programID>
<orchestra><xsl:value-of select="orchestra"/></orchestra>
<season><xsl:value-of select="season"/></season>
<concertInfo><xsl:value-of select="concat(concertInfo/eventType, ' ',
concertInfo/Location, ' ', concertInfo/Venue, ' ',
concertInfo/Date, ' ', concertInfo/Time)"/></concertInfo>
</xsl:template>

</xsl:transform>

工作 XSLT (保存为 .xsl 以在 Python 中调用)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

<xsl:template match="program">
<data>
<xsl:apply-templates select="worksInfo"/>
</data>
</xsl:template>

<xsl:template match="worksInfo">
<xsl:apply-templates select="work"/>
</xsl:template>

<xsl:template match="work">
<xsl:copy>
<programID><xsl:value-of select="ancestor::program/id"/></programID>
<workID><xsl:value-of select="@workID"/></workID>
<composerName><xsl:value-of select="composerName"/></composerName>
<workTitle><xsl:value-of select="workTitle"/></workTitle>
<movement><xsl:value-of select="movement"/></movement>
<conductorName><xsl:value-of select="conductorName"/></conductorName>
</xsl:copy>
</xsl:template>

</xsl:transform>

Python 脚本

import lxml.etree as ET
import sqlite3

def insert_program(db_conn, id, programID, orchestra, season, concertInfo):
curs = db_conn.cursor()
curs.execute("insert into program values (?,?,?,?,?,?)", (id, programID, orchestra, season, concertInfo))
db_conn.commit()

def program_data_from_element(element):
id = element.find("id").text
programID = element.find("programID").text
orchestra = element.find("orchestra").text
season = element.find("season").text
concertInfo = element.find("concertInfo").text

return id, programID, orchestra, season, concertInfo

def insert_work(db_conn, programID, workID, composerName, workTitle, movement, conductorName):
curs = db_conn.cursor()
curs.execute("insert into work values (?,?,?,?,?)", (programID, workID, composerName, workTitle, movement, conductorName))
db_conn.commit()

def work_data_from_element(element):
programID = element.find("programID").text
workID = element.find("workID").text
composerName = element.find("composerName").text
workTitle = element.find("workTitle").text
movement = element.find("movement").text
conductorName = element.find("conductorName").text

return programID, workID, composerName, workTitle, conductorName, movement

if __name__ == "__main__":
conn = sqlite3.connect("program.sqlite3")
xml = ET.parse("complete.xml")

# PROGRAM PARSE
xslt = ET.parse("program.xsl")
transform = ET.XSLT(xslt)
newdom = transform(xml)
program = newdom.xpath("//program")

for index, element in enumerate(program):
id, programID, orchestra, season, concertInfo = program_data_from_element(element)
insert_program(conn, id, programID, orchestra, season, concertInfo)

# WORK PARSE
xslt = ET.parse("work.xsl")
transform = ET.XSLT(xslt)
newdom = transform(xml)
work = newdom.xpath("//work")

for index, element in enumerate(work):
programID, workID, composerName, workTitle, conductorName, movement = work_data_from_element(element)
insert_work(conn, programID, workID, composerName, workTitle, conductorName, movement)

关于python - 将分层信息从 XML 解析为 sqlite,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/38187781/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com