
python - Iteratively parsing a huge XML file with Python raises an error

Reposted. Author: 行者123. Updated: 2023-11-30 22:56:45

I am trying to parse a huge XML file with Python, but I get this error:

      File "parser.py", line 6, in <module>
        event, root = text.next()
      File "C:\Python27\lib\xml\etree\ElementTree.py", line 1281, in next
        self._root = self._parser.close()
      File "C:\Python27\lib\xml\etree\ElementTree.py", line 1654, in close
        self._raiseerror(v)
      File "C:\Python27\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
        raise err
    xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

My code currently looks like this:

    import xml.etree.ElementTree as ET
    from StringIO import StringIO

    text = ET.iterparse(StringIO('Posts.xml'), events=('start', 'end', 'start-ns', 'end-ns'))
    text = iter(text)
    event, root = text.next()

    for event, elem in text:
        currId = elem.get('PostTypeId')
        if (currId != '1'):
            root.remove(elem)

    tree.write('cut.xml')

The XML file I am trying to parse looks like this:

    <posts>

<row FavoriteCount="4" CommentCount="4" AnswerCount="7" Tags="<discussion><answers>" Title="Why would anyone accept an answer?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-09-03T00:42:07.733" LastEditorUserId="99" OwnerUserId="4" Body="<p>I'm looking at the questions proposed during the Area 51 process:</p> <ul> <li>My supervisor thinks that all <code>If</code> statements should include <code>else</code> statements. Do you agree?</li> <li>What are common mistakes in Software Development?</li> <li>Tabs vs. Spaces: What is the one proper indentation character for everything, in every situation, ever?</li> <li>What programming language should I teach to my 4 year old son?</li> <li>What was the turning point of your programming career?</li> </ul> <p>None of these have an answer that should be accepted. The questions are interesting, and the answers would also be informative if the answer was well written and explained why the answerer thinks his method or idea is better. But I can't really see being able to accept an answer to any of these questions.</p> <p>So, if I ask a question, how do I decide if or how to accept an answer? There is no right or wrong answer and just because it works for me doesn't mean I should be floating that answer to the top - unless I'm overlooking something, the questions that are on topic here are very subjective. On Stack Overflow, there are often multiple right solutions to a problem. Here, we have a problem with an infinite number of solutions, none of which are arguably better or worse than any others.</p> <p>Thoughts?</p> " ViewCount="1582" Score="30" CreationDate="2010-09-01T19:32:45.710" PostTypeId="1" Id="1"/>

<row CommentCount="0" AnswerCount="4" Tags="<discussion><site-attributes><faq-contents><top-7>" Title="What should our FAQ contain?" LastActivityDate="2015-03-18T19:19:24.887" LastEditDate="2015-03-18T19:19:24.887" LastEditorUserId="25936" OwnerUserId="9" Body="<p>One of the big 7 questions.</p> " ViewCount="318" Score="6" CreationDate="2010-09-01T19:34:51.797" PostTypeId="1" Id="2" CommunityOwnedDate="2010-09-02T03:42:26.083"/>

<row FavoriteCount="8" CommentCount="8" AnswerCount="32" Tags="<discussion><top-7><site-attributes>" Title="What should our domain name be?" LastActivityDate="2014-04-23T09:14:37.103" LastEditDate="2010-12-20T02:46:31.950" LastEditorUserId="2314" OwnerUserId="9" Body="<blockquote> <p><strong>Possible Duplicate:</strong><br> <a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline">Write an Elevator Pitch / Tagline</a> </p> </blockquote> <h2>Note:</h2> <p>We are closing this domain naming thread. It is asking the <em>entirely</em> wrong question. See this blog post for details: <a href="http://blog.stackoverflow.com/2010/10/domain-names-the-wrong-question/" rel="nofollow">Domain Names: Wrong Question</a> </p> <p>We're going to keep the name programmers.stackexchange.com. But we WILL be setting up redirects from the more "popular" domains names. (e.g. seasonedadvice.com to cooking.stackexchange.com, basicallymoney.com to money.stackexchange.com, and others as we go through the list).</p> <p>New question: "<strong>Write an Elevator Pitch / Tagline!</strong>"</p> <p><a href="http://meta.programmers.stackexchange.com/questions/412/write-an-elevator-pitch-tagline"><strong>Click here to contribute ideas and vote.</strong></a> </p> <p><em>[original message text below]</em></p> <p>One of the big 7 questions.</p> <ul> <li>One answer per answer please</li> <li>Only .com domain names please</li> <li>Only untaken domain names please (use whois)</li> </ul> <p>Please use <strong>lowercase characters only</strong> in domain name!<br> DomainName.com is more readable, but we have to register domainname.com!</p> " ViewCount="1146" Score="16" CreationDate="2010-09-01T19:36:08.390" PostTypeId="1" Id="3" CommunityOwnedDate="2010-09-02T03:40:00.467" ClosedDate="2010-10-08T21:02:50.313"/>
...

</posts>

Best answer

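ElementTree.iterparse needs a source: either a filename or a file-like object. You are giving it a string buffer whose contents are the literal text Posts.xml, not the actual contents of the file Posts.xml. That literal text is obviously not valid XML, which is why the parser reports a syntax error at line 1, column 0.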
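To make the difference concrete, here is a minimal sketch, assuming Python 3 (the question uses Python 2.7, where `StringIO` lives in the `StringIO` module rather than `io`). The helper `try_parse` and the one-row sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def try_parse(source):
    """Hypothetical helper: return 'ok' if source parses, else the error text."""
    try:
        for _event, _elem in ET.iterparse(source, events=("start", "end")):
            pass
        return "ok"
    except ET.ParseError as exc:
        return "ParseError: {}".format(exc)


# This buffer holds the nine characters 'Posts.xml', not the file's contents.
literal_name = try_parse(io.StringIO("Posts.xml"))

# A buffer that holds actual XML text parses without complaint.
real_xml = try_parse(io.StringIO('<posts><row Id="1" PostTypeId="1"/></posts>'))

print(literal_name)
print(real_xml)
```

The first call fails in the same way as the traceback in the question, because the parser is asked to treat a filename as if it were a document.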

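So just drop the StringIO call and pass the filename directly; ElementTree will open the file for you. However, your input file has some further problems that will prevent it from parsing correctly (see sverasch's answer).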
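A sketch of the corrected script, under a few assumptions that are not in the original post: it targets Python 3 (so `next(context)` replaces `text.next()`), the function name `keep_questions` is hypothetical, and the sample file is a tiny well-formed stand-in for Posts.xml. It also listens only for `start` and `end` events, because `start-ns` and `end-ns` events do not carry an element, so calling `elem.get` on them would fail. The question's `tree` variable was never defined, so the sketch writes the result through `ET.ElementTree(root)` instead.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def keep_questions(src_path, dst_path):
    """Write to dst_path only the <row> elements whose PostTypeId is '1'."""
    # Passing a filename lets ElementTree open and read the file itself.
    context = iter(ET.iterparse(src_path, events=("start", "end")))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        # Act only on fully parsed <row> elements that are not questions.
        if event == "end" and elem.tag == "row" and elem.get("PostTypeId") != "1":
            root.remove(elem)
    ET.ElementTree(root).write(dst_path)


# Small illustrative input (hypothetical data, not the real dump).
sample = (
    '<posts>'
    '<row Id="1" PostTypeId="1"/>'
    '<row Id="2" PostTypeId="2"/>'
    '<row Id="3" PostTypeId="1"/>'
    '</posts>'
)
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "Posts.xml")
dst = os.path.join(workdir, "cut.xml")
with open(src, "w") as handle:
    handle.write(sample)

keep_questions(src, dst)
kept_ids = [row.get("Id") for row in ET.parse(dst).getroot()]
print(kept_ids)
```

Note that this still keeps every matching row in memory until the final write, so for a truly huge file it may be preferable to write each kept row out as it is parsed and then discard it.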

Regarding "python - Iteratively parsing a huge XML file with Python raises an error", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36924383/
