
Parsing a non-standard XML file in Python

Reposted. Author: 太空狗. Updated: 2023-10-29 23:55:50

My input file is actually several XML files appended into a single file (from Google Patents). It has the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python's xml.dom.minidom cannot parse this non-standard file. What is a better way to parse it? I am not sure whether the code below performs well.

for line in infile:
    if line == '<?xml version="1.0" encoding="UTF-8"?>':
        xmldoc = minidom.parse(XMLstring)
    else:
        XMLstring += line
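
For context, here is a minimal sketch of why the parser rejects such a file: a well-formed XML document has exactly one root element, and an XML declaration is only allowed at the very start. The two-record sample and the helper name `try_parse` are made up for illustration and are not part of the original question.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: two complete XML documents appended to each other.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<root_node>first</root_node>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<root_node>second</root_node>\n'
)

def try_parse(data):
    """Return the parsed DOM, or None if data is not well-formed XML."""
    try:
        return minidom.parseString(data)
    except ExpatError:
        return None

# The concatenated file as a whole is rejected by the parser...
whole_file = try_parse(SAMPLE)
# ...while the first record on its own is a valid document.
one_record = try_parse(SAMPLE.split(b'\n<?xml')[0])
```

So the file has to be cut into its individual documents before any standard parser can handle it.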

Best answer

Here is my take on it, using a generator and lxml.etree. The information extracted is purely an example.

import os
import zipfile
from urllib.request import urlopen

from lxml import etree

def xmlSplitter(data, separator=lambda x: x.startswith(b'<?xml')):
    """Yield each appended XML document in data as one bytes string."""
    buff = []
    for line in data:
        if separator(line):
            if buff:
                yield b''.join(buff)
                buff[:] = []
        buff.append(line)
    yield b''.join(buff)

def first(seq, default=None):
    """Return the first item from sequence, seq or the default(None) value"""
    for item in seq:
        return item
    return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

# Download the archive once and reuse the local copy afterwards.
if not os.path.exists(filename):
    with open(filename, 'wb') as file_write:
        r = urlopen(datasrc)
        file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

# zf.open() yields bytes lines, so the separator matches a bytes prefix.
count = 0
for item in xmlSplitter(zf.open(xml_file)):
    count += 1
    if count > 10:
        break
    doc = etree.XML(item)
    docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
    title = first(doc.xpath('//invention-title/text()'))
    assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
    print("DocID: {0}\nTitle: {1}\nAssignee: {2}\n".format(docID, title, assignee))

Output:

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.
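
The splitting idea can also be tried without downloading the archive. The following is only a sketch under a few assumptions: it targets Python 3, it swaps lxml for the standard library's xml.etree.ElementTree so that nothing needs to be installed, and the two-record sample as well as the name `split_documents` are invented for illustration. Unlike the generator in the answer, this variant yields nothing for empty input.

```python
import io
import xml.etree.ElementTree as ET

def split_documents(lines, is_start=lambda line: line.startswith(b'<?xml')):
    """Yield each appended XML document as a single bytes object."""
    buff = []
    for line in lines:
        # A new XML declaration closes the document collected so far.
        if is_start(line) and buff:
            yield b''.join(buff)
            buff = []
        buff.append(line)
    if buff:
        yield b''.join(buff)

# Hypothetical two-record sample shaped like the file in the question.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    b'<us-patent-grant><invention-title>Glove backhand</invention-title></us-patent-grant>\n'
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>\n'
    b'<us-patent-grant><invention-title>Belt sleeve</invention-title></us-patent-grant>\n'
)

titles = []
# io.BytesIO stands in for the file object returned by zf.open().
for chunk in split_documents(io.BytesIO(SAMPLE)):
    doc = ET.fromstring(chunk)
    titles.append(doc.findtext('.//invention-title'))

print(titles)
```

Collecting lines in a list and joining them once per document avoids the repeated string concatenation of the loop in the question, and only one document is held in memory at a time.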

This question and answer about parsing a non-standard XML file in Python originally appeared on Stack Overflow: https://stackoverflow.com/questions/7335560/
