python - Is there a simpler way to parse an XML file into a nested array?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:27:48

Given an input file such as:

<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>

The desired result is a nested dictionary that stores:

/setid
    /docid
        /segid
            text

I have been using a defaultdict and reading the XML file with BeautifulSoup and nested loops, i.e.

from io import StringIO
from collections import defaultdict

from bs4 import BeautifulSoup

srcfile = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="2">India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.</seg>
<seg id="3">Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
<seg id="5">India is also reportedly hoping for a deal on defence collaboration between the two nations.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
<seg id="2">The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.</seg>
<seg id="3">The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.</seg>
<seg id="4">At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.</seg>
</p>
</doc>
<srcset>"""

#ntok = NISTTokenizer()

eval_docs = defaultdict(lambda: defaultdict(dict))

with StringIO(srcfile) as fin:
    bsoup = BeautifulSoup(fin.read(), 'html5lib')
    setid = bsoup.find('srcset')['setid']
    for doc in bsoup.find_all('doc'):
        docid = doc['docid']
        for seg in doc.find_all('seg'):
            segid = seg['id']
            eval_docs[setid][docid][segid] = seg.text

[out]:

>>> eval_docs

defaultdict(<function __main__.<lambda>>,
{'newstest2015': defaultdict(dict,
{'1012-bbc': {'1': 'India and Japan prime ministers meet in Tokyo',
'2': "India's new prime minister, Narendra Modi, is meeting his Japanese counterpart, Shinzo Abe, in Tokyo to discuss economic and security ties, on his first major foreign visit since winning May's election.",
'3': 'Mr Modi is on a five-day trip to Japan to strengthen economic ties with the third largest economy in the world.',
'4': 'High on the agenda are plans for greater nuclear co-operation.',
'5': 'India is also reportedly hoping for a deal on defence collaboration between the two nations.'},
'1018-lenta.ru': {'1': 'FANO Russia will hold a final Expert Session',
'2': 'The Federal Agency of Scientific Organizations (FANO Russia), in joint cooperation with RAS, will hold the third Expert Session on “Evaluating the effectiveness of activities of scientific organizations”.',
'3': 'The gathering will be the final one in a series of meetings held by the agency over the course of the year, reports a press release delivered to the editorial offices of Lenta.ru.',
'4': 'At the third meeting, it is planned that the results of the work conducted by the Expert Session over the past year will be presented and that a final checklist to evaluate the effectiveness of scientific organizations will be developed.'}})})

Is there a simpler way to read the file and get the same eval_docs nested dictionary?

Can it be done easily without BeautifulSoup?

Note that the example contains only one setid and two docids, but the actual file has more of each.

Best answer

Since what you have is HTML that merely looks like XML, you cannot use XML-based tools. In most cases, your options are:

  • Implement a SAX parser
  • Use BS4 (which you are already using)
  • Use lxml
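
As a rough illustration of the first option, here is a minimal event-driven sketch. It is a stand-in rather than a true SAX parser: it uses the standard library's html.parser.HTMLParser, which tolerates the unclosed final <srcset> tag, instead of xml.sax. The class name SegCollector and the shortened sample are made up for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser


class SegCollector(HTMLParser):
    """Event-driven (SAX-style) collector; the class name is hypothetical."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.eval_docs = defaultdict(lambda: defaultdict(dict))
        self.setid = None
        self.docid = None
        self.segid = None  # not None only while inside a <seg> element
        self.chunks = []   # text pieces of the current <seg>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "srcset" and "setid" in attrs:
            # The stray trailing <srcset> has no setid, so it is ignored.
            self.setid = attrs["setid"]
        elif tag == "doc":
            self.docid = attrs.get("docid")
        elif tag == "seg":
            self.segid = attrs.get("id")
            self.chunks = []

    def handle_data(self, data):
        if self.segid is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "seg" and self.segid is not None:
            text = "".join(self.chunks)
            self.eval_docs[self.setid][self.docid][self.segid] = text
            self.segid = None


# Shortened copy of the sample input, keeping the unclosed final <srcset>.
sample = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>
<doc sysid="ref" docid="1018-lenta.ru" genre="news" origlang="ru">
<p>
<seg id="1">FANO Russia will hold a final Expert Session</seg>
</p>
</doc>
<srcset>"""

collector = SegCollector()
collector.feed(sample)
collector.close()
eval_docs = collector.eval_docs
```

This drops the BeautifulSoup dependency, but at roughly three times the length of the original loop.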

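Either way, you will end up spending more time and effort, and you will need more code to handle this. What you have is really sleek and easy. If I were you, I would not look for another solution.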

PS: What could be simpler than 10 lines of code!
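
For comparison only: if the real files happen to be well-formed XML, the standard library's xml.etree.ElementTree may be enough. That is an assumption, because the sample as pasted ends with <srcset> rather than </srcset> and so is not well-formed. The sketch below uses a shortened, corrected sample; it is not the method recommended in the answer.

```python
import xml.etree.ElementTree as ET

# Assumption: the input is well-formed, i.e. the root is closed with </srcset>.
sample = """<srcset setid="newstest2015" srclang="any">
<doc sysid="ref" docid="1012-bbc" genre="news" origlang="en">
<p>
<seg id="1">India and Japan prime ministers meet in Tokyo</seg>
<seg id="4">High on the agenda are plans for greater nuclear co-operation.</seg>
</p>
</doc>
</srcset>"""

root = ET.fromstring(sample)

# Element.iter() also yields the element itself, so this works whether
# <srcset> is the root or is nested under some wrapper element.
eval_docs = {
    srcset.get("setid"): {
        doc.get("docid"): {
            seg.get("id"): "".join(seg.itertext())
            for seg in doc.iter("seg")
        }
        for doc in srcset.iter("doc")
    }
    for srcset in root.iter("srcset")
}
```

On malformed input such as the original sample, ET.fromstring raises a ParseError, which is the situation the answer above warns about.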

Regarding "python - Is there a simpler way to parse an XML file into a nested array?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49932795/
