
python - BeautifulSoup: How do I include the encoding in the output?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 12:09:58

I want to use BeautifulSoup.BeautifulStoneSoup to include an encoding declaration in an XML document, but I'm not sure how!

<?xml version="1.0" encoding="UTF-8"?>
<mytag>stuff</mytag>

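When I read a document that already has an encoding declaration, the declaration appears in the output, but I'm making a new soup.

Thanks!

EDIT: Here is an example of what I'm currently doing.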

from BeautifulSoup import BeautifulStoneSoup, Tag
soup = BeautifulStoneSoup()
mytag = Tag(soup, 'mytag')
soup.append(mytag)

str(soup)
# '<mytag></mytag>'

soup.prettify() # No encoding given
# '<mytag>\n</mytag>'

soup.prettify(encoding='UTF-8')
# '<mytag>\n</mytag>' # Where's the encoding?

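Even if I make the soup with BeautifulStoneSoup(fromEncoding='UTF-8'), there is still no <?xml?> tag.

Is there another way to get that tag without creating it directly and passing it in as a string, or is that the only way?

Best answer

Do you mean something like this?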

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup('<?xml version="1.0" encoding="UTF-8"?>')
# make some more soup

Or,

soup = BeautifulStoneSoup()
# make some more soup
soup.insert(0, '<?xml version="1.0" encoding="UTF-8"?>')
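
Both variants come down to writing the declaration yourself as a string. As a rough illustration of that idea in plain Python 3, without BeautifulSoup, the sketch below prepends a declaration and encodes the result with the same codec. The helper name `with_xml_declaration` is hypothetical and not part of any library.

```python
def with_xml_declaration(body, encoding="UTF-8"):
    """Prepend an XML declaration to `body` and return encoded bytes.

    Hypothetical helper for illustration; not a BeautifulSoup API.
    """
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    # Encode with the same codec the declaration names, so the declared
    # encoding and the actual bytes cannot drift apart.
    return (declaration + "\n" + body).encode(encoding)


document = with_xml_declaration("<mytag>stuff</mytag>")
print(document.decode("UTF-8"))
```

Deriving both the declaration and the byte encoding from one argument is a deliberate choice here: it avoids a document that claims one encoding while being stored in another.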

From the BeautifulSoup documentation:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252

Beautiful Soup will almost always guess right if it can make a guess at all. But for documents with no declarations and in strange encodings, it will often not be able to guess.

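Note item 2, which I read as: BeautifulSoup will automatically use the encoding from the XML declaration if you did not explicitly specify one with the fromEncoding argument. YMMV.

The documentation quoted above also has other Unicode-related examples that may be useful.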
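
The lookup order quoted above can be approximated in a few lines of standard-library Python 3. This is only a sketch of the described priority, not BeautifulSoup's actual implementation: the names `candidate_encodings` and `to_unicode` are made up for illustration, the chardet step is skipped, and the byte sniffing only checks byte-order marks (no EBCDIC or ASCII detection).

```python
import codecs
import re

# Byte-order marks used for the "first few bytes" step. UTF-32 comes first
# because the UTF-32-LE mark begins with the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Encoding named in an XML declaration at the very start of the data.
_DECLARED = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def candidate_encodings(data, from_encoding=None):
    """Yield encodings to try, roughly in the order the quote lists."""
    if from_encoding:
        yield from_encoding  # 1. explicitly requested encoding
    match = _DECLARED.search(data)
    if match:
        yield match.group(1).decode("ascii")  # 2. declared in the document
    for bom, name in _BOMS:
        if data.startswith(bom):
            yield name  # 3. sniffed from the leading bytes
            break
    # 4. chardet is an optional third-party library and is omitted here.
    yield "utf-8"  # 5.
    yield "windows-1252"  # 6.


def to_unicode(data, from_encoding=None):
    """Decode `data` with the first candidate that works."""
    for encoding in candidate_encodings(data, from_encoding):
        try:
            text = data.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            continue  # unknown codec name or wrong guess: try the next one
        # Drop a decoded byte-order mark, if any, before returning.
        return text.lstrip("\ufeff"), encoding
    raise ValueError("no candidate encoding could decode the data")


data = '<?xml version="1.0" encoding="euc-jp"?><mytag>\u3042</mytag>'.encode("euc-jp")
print(to_unicode(data, from_encoding="utf-8"))
```

In the usage line, the explicit `utf-8` is tried first, fails on the EUC-JP bytes, and the declared encoding is tried next, which matches the reading of item 2 above.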


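EDIT: @TorelTwiddler, if there is another way to add the XML declaration with BeautifulSoup without passing the tag directly as a string, I'm not aware of it.

That said, consider the following: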

from BeautifulSoup import BeautifulStoneSoup, Tag
soup = BeautifulStoneSoup('<?xml version="1.0" encoding=""?>') # <- no encoding
mytag = Tag(soup, 'mytag')
soup.append(mytag)

print str(soup)
# "<?xml version='1.0' encoding='utf-8'?><mytag></mytag>"
# Wha!? :)
print soup.prettify(encoding='euc-jp')
# <?xml version='1.0' encoding='euc-jp'?>
# <mytag>
# </mytag>
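
The output above suggests that the declaration's encoding attribute is filled in to match whatever encoding the document is serialized with. A minimal Python 3 sketch that mimics this observable behavior with the standard library follows. It is not how BeautifulSoup does it internally, and the function name `serialize` is an assumption made for this example.

```python
import re

# Matches the encoding attribute of an XML declaration, with either quote style.
_ENCODING_ATTR = re.compile(r'(<\?xml[^>]*encoding=)(["\'])[^"\']*\2')


def serialize(document, encoding="utf-8"):
    """Return `document` as bytes, with its XML declaration naming `encoding`."""
    rewritten = _ENCODING_ATTR.sub(
        # Keep the original quote character and swap in the output encoding.
        lambda m: m.group(1) + m.group(2) + encoding + m.group(2),
        document,
        count=1,
    )
    return rewritten.encode(encoding)


source = '<?xml version="1.0" encoding=""?><mytag></mytag>'
print(serialize(source))
print(serialize(source, encoding="euc-jp"))
```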

Maybe this will help you get where you want to go.

Regarding python - BeautifulSoup: How do I include the encoding in the output?, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/7100514/
