gpt4 book ai didi

python - 使用 LXML 编写 XML header

转载 作者:太空狗 更新时间:2023-10-29 22:30:19 25 4
gpt4 key购买 nike

我目前正在编写一个脚本,将一堆 XML 文件从各种编码转换为统一的 UTF-8。

我首先尝试使用 LXML 确定编码:

def get_source_encoding(self):
tree = etree.parse(self.inputfile)
encoding = tree.docinfo.encoding
self.inputfile.seek(0)
return (encoding or '').lower()

如果那是空白,我会尝试从 chardet 获取它:

def guess_source_encoding(self):
chunk = self.inputfile.read(1024 * 10)
self.inputfile.seek(0)
return chardet.detect(chunk).lower()

然后我使用codecs 来转换文件的编码:

def convert_encoding(self, source_encoding, input_filename, output_filename):
chunk_size = 16 * 1024

with codecs.open(input_filename, "rb", source_encoding) as source:
with codecs.open(output_filename, "wb", "utf-8") as destination:
while True:
chunk = source.read(chunk_size)

if not chunk:
break;

destination.write(chunk)

最后,我正在尝试重写 XML header 。如果 XML header 最初是

<?xml version="1.0"?>

<?xml version="1.0" encoding="windows-1255"?>

我想把它变成

<?xml version="1.0" encoding="UTF-8"?>

我当前的代码似乎不起作用:

def edit_header(self, input_filename):
output_filename = tempfile.mktemp(suffix=".xml")

with open(input_filename, "rb") as source:
parser = etree.XMLParser(encoding="UTF-8")
tree = etree.parse(source, parser)

with open(output_filename, "wb") as destination:
tree.write(destination, encoding="UTF-8")

我当前正在测试的文件有一个未指定编码的 header 。如何让它以指定的编码正确输出 header ?

最佳答案

尝试:

tree.write(destination, xml_declaration=True, encoding='UTF-8')

来自 the API docs :

xml_declaration controls if an XML declaration should be added to the file. Use False for never, True for always, None for only if not US-ASCII or UTF-8 (default is None).

来自 ipython 的示例:

In [15]:  etree.ElementTree(etree.XML('<hi/>')).write(sys.stdout, xml_declaration=True, encoding='UTF-8')
<?xml version='1.0' encoding='UTF-8'?>
<hi/>

仔细想想,我觉得你太努力了。 lxml 自动检测编码并根据该编码正确解析文件。

所以你真正需要做的(至少在 Python2.7 中)是:

def convert_encoding(self, source_encoding, input_filename, output_filename):
tree = etree.parse(input_filename)
with open(output_filename, 'w') as destination:
tree.write(destination, encoding='utf-8', xml_declaration=True)

关于python - 使用 LXML 编写 XML header ,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/25798900/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com