python - Preserving line breaks when extracting text from XML

Reposted by: 太空宇宙 | Updated: 2023-11-04 01:02:20


My XML (extracted from a .docx):

<w:p>
  <w:pPr>
    <w:pStyle w:val="Normal"/>
    <w:rPr/>
  </w:pPr>
  <w:r>
    <w:rPr/>
    <w:t>0 things and stuff</w:t>
  </w:r>
</w:p>
<w:p>
  <w:pPr>
    <w:pStyle w:val="Normal"/>
    <w:rPr/>
  </w:pPr>
  <w:r>
    <w:rPr/>
    <w:t>1 things and stuff</w:t>
  </w:r>
</w:p>

Expected output:

0 things and stuff
1 things and stuff

Actual output:

0 things and stuff1 things and stuff

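I tried the lxml package, hoping that its tostring method with pretty_print would give better results than the default xml package.

While researching the problem, I found that using method='text' in tostring causes all formatting to be lost.

My code: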

from lxml import etree

tree = etree.fromstring(xml_content)
docx_text = etree.tostring(tree, method='text')

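I have tried pretty_print=True, tostringlist, and tounicode. Am I just looking for functionality that doesn't exist in this package?

Best answer

You need a parser that understands the business logic behind the semantics of docx XML, for example that these two lines of text should appear on separate lines because they sit in different paragraphs.

You could try doing this yourself, but I recommend using something like docx, or at least reading the getdocumenttext() function in its source code for one way to solve the problem.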

import os
from docx import getdocumenttext
from lxml import etree

# get `xml_content` from word doc...    

tree = etree.fromstring(xml_content)
paragraphs = getdocumenttext(tree)
print(os.linesep.join(paragraphs))
# Result: 
# 0 things and stuff
# 1 things and stuff
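
For the do-it-yourself route, here is a minimal sketch of the paragraph-aware idea: treat each w:p element as one output line and join the w:t runs inside it. It uses the standard-library xml.etree.ElementTree rather than lxml or the docx package, and the helper name paragraphs_text is hypothetical, not part of any library. It ignores tabs, explicit line breaks, and paragraphs nested inside text boxes, so treat it as an illustration rather than a complete extractor.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the "w" prefix in WordprocessingML documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_text(xml_content):
    """Return one string per <w:p> element, in document order (hypothetical helper)."""
    root = ET.fromstring(xml_content)
    lines = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # Join the text of every <w:t> run that belongs to this paragraph.
        runs = (t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        lines.append("".join(runs))
    return lines


# Compact sample with the same two paragraphs as in the question.
sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>0 things and stuff</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>1 things and stuff</w:t></w:r></w:p>"
    "</w:body></w:document>"
)

print("\n".join(paragraphs_text(sample)))
# 0 things and stuff
# 1 things and stuff
```

The line break comes from the join over paragraphs, not from the XML itself, which is why a plain text serialization of the tree runs the two strings together.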

Update: see below for a fully reproducible example

import os
from docx import getdocumenttext, opendocx
from lxml import etree

## load the xml tree from word document ##
# EITHER:
tree = opendocx('/path/to/my/file.docx')

# OR
xml_content = """<?xml version="1.0" encoding="utf-8"?>
<w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
 <w:body>
  <w:p>
   <w:r>
    <w:t>0 things and stuff</w:t>
   </w:r>
  </w:p>
  <w:p>
   <w:r>
    <w:t>1 things and stuff</w:t>
   </w:r>
  </w:p>
 </w:body>
</w:document>
"""
# Encode to bytes: lxml rejects str input that carries an encoding declaration.
tree = etree.fromstring(xml_content.encode('utf-8'))
##

paragraphs = getdocumenttext(tree)
print(os.linesep.join(paragraphs))
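
If you would rather obtain xml_content yourself than rely on a library loader, a .docx file is a ZIP archive whose main body is stored in the member word/document.xml. The sketch below reads that member with the standard-library zipfile module. The helper name read_document_xml is hypothetical, and the in-memory archive is only a stand-in so the example runs without a real Word file.

```python
import io
import zipfile


def read_document_xml(docx_file):
    """Return the raw bytes of word/document.xml from a .docx path or file object."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


# Build a tiny stand-in archive in memory (not a complete, valid .docx).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
body = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>0 things and stuff</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", body)
buffer.seek(0)

xml_content = read_document_xml(buffer)
print(type(xml_content).__name__, len(xml_content))
```

The result is a bytes object, which can be passed to a parser's fromstring directly even when the XML starts with an encoding declaration.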

Regarding python - Preserving line breaks when extracting text from XML, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32279927/
