
python - How to save a list of lxml.etree._ElementTree objects to a file

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:32:42

I've run into an annoying problem with the lxml library and can't figure out how to solve it.

I have a list of lxml.etree._ElementTree trees and a list of lxml.html.HtmlElement elements that belong to those trees. The XPath of each element is stored in a list named paths:

element_found = [True if len(tree.xpath(path)) > 0 else False for tree,path in zip(trees,paths)]
print(element_found.count(False)) # == 0

The problem shows up when I try to save the paths and trees so that I can restore this state later:

trees_to_save = [{'tree': lxml.etree.tostring(tree, pretty_print=True)} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('trees.csv')

EncodeForamt = lxml.html.HTMLParser(encoding='utf-8')

trees_from_file = pd.read_csv('trees.csv')
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: etree.HTML(literal_eval(x),EncodeForamt).getroottree())

Then I run the same test:

element_found = [True if len(tree.xpath(path)) > 0 else False for tree,path in zip(trees_from_file['tree'],paths)]
print(element_found.count(False)) # == 6 (out of 12k)

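In short, I need every path to still be found after reloading, so something is clearly wrong with the string round trip and with the way I save the trees. I have tried various approaches from the lxml library: tree.write instead of tostring, plain .encode('utf-8') instead of literal_eval, with and without pretty_print, and also etree.fromstring(). They all give the same result...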

Worryingly, this also raises an XML syntax error:

trees = [etree.fromstring(etree.tostring(t)) for t in trees]

I'm at a bit of a loss as to how to save these trees correctly...

Best answer

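OK, after trying everything I could find, I figured out how to get this done: the trees need to be serialized with method='html' and read back with parse (using an HTML parser):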

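import io
import lxml.etree
import pandas as pd

# Serialize as HTML text (str, not bytes) so each CSV cell holds the markup itself rather than a b'...' repr.
trees_to_save = [{'tree': lxml.etree.tostring(tree,encoding='unicode',method='html')} for tree in trees]
t2sdf = pd.DataFrame(trees_to_save)
t2sdf.to_csv('location_trees.csv')

trees_from_file = pd.read_csv('location_trees.csv')
EncodeForamt = lxml.etree.HTMLParser(encoding='utf-8')
# lxml.etree.parse expects a filename or file-like object, so wrap the encoded markup in BytesIO.
trees_from_file['tree'] = trees_from_file['tree'].apply(lambda x: lxml.etree.parse(io.BytesIO(x.encode('utf-8')),parser=EncodeForamt))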
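
To check the round trip in isolation, here is a minimal self-contained sketch. It is not the original poster's code. The sample pages, the paths, and the helper names dump_trees and load_trees are hypothetical, and the standard-library csv module with an in-memory buffer stands in for pandas and a file on disk to keep the example small. It serializes each tree as HTML text, re-parses it with an HTML parser, and repeats the XPath check from the question.

```python
import csv
import io

import lxml.etree
import lxml.html

# Hypothetical sample data standing in for the real `trees` and `paths`.
pages = [
    "<html><body><div id='a'><p>first</p></div></body></html>",
    "<html><body><ul><li>x</li><li class='hit'>caf\u00e9</li></ul></body></html>",
]
paths = ["//div[@id='a']/p", "//li[@class='hit']"]
trees = [lxml.html.fromstring(page).getroottree() for page in pages]


def dump_trees(trees, handle):
    """Write one HTML-serialized tree per CSV row, as text rather than bytes."""
    writer = csv.writer(handle)
    writer.writerow(["tree"])
    for tree in trees:
        markup = lxml.etree.tostring(tree, encoding="unicode", method="html")
        writer.writerow([markup])


def load_trees(handle):
    """Re-parse every row with an HTML parser and return _ElementTree objects."""
    parser = lxml.etree.HTMLParser(encoding="utf-8")
    reader = csv.DictReader(handle)
    return [
        # parse() wants a file-like object, so wrap the encoded markup.
        lxml.etree.parse(io.BytesIO(row["tree"].encode("utf-8")), parser)
        for row in reader
    ]


buffer = io.StringIO()  # in-memory stand-in for a CSV file on disk
dump_trees(trees, buffer)
buffer.seek(0)
restored = load_trees(buffer)

# Same check as in the question: every stored path should still match.
element_found = [bool(tree.xpath(path)) for tree, path in zip(restored, paths)]
missing = element_found.count(False)
```

With real pages, a single serialized tree can exceed the csv module's default field size limit; csv.field_size_limit can raise it if you stay with the csv module instead of pandas.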

Regarding python - How to save a list of lxml.etree._ElementTree objects to a file, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47168397/
