
python - Is there a way to read a .docx file, including auto-numbering, using python-docx

Reposted. Author: 太空狗. Updated: 2023-10-29 17:20:47

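Problem statement: extract sections from a .docx file, including their automatic numbering.

I tried using python-docx to extract the text from a .docx file, but it leaves out the automatic numbering.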

from docx import Document

document = Document("wadali.docx")

# Style-name prefixes of the paragraphs to extract.
STYLE_PREFIXES = ('Agt', 'TOC', 'Heading', 'Title', 'Table Normal', 'List')


def iter_items(paragraphs):
    """Yield each paragraph whose style name starts with one of the prefixes."""
    for paragraph in paragraphs:
        # str.startswith accepts a tuple, so each paragraph is yielded at most once.
        if paragraph.style.name.startswith(STYLE_PREFIXES):
            yield paragraph


for item in iter_items(document.paragraphs):
    print(item.text)

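Best answer

It currently looks as though python-docx v0.8 does not fully support numbering. You will need to do some hacking.

First, for the demo, you need to write your own iterator to walk the document's paragraphs. Here is something functional: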

import docx.document
import docx.oxml.table
import docx.oxml.text.paragraph
import docx.table
import docx.text.paragraph


def iter_paragraphs(parent, recursive=True):
    """
    Yield each paragraph within *parent*, in document order, descending
    into tables when *recursive* is true.
    Each returned value is an instance of Paragraph. *parent*
    would most commonly be a reference to a main Document object, but
    also works for a _Cell object, which itself can contain paragraphs and tables.
    """
    if isinstance(parent, docx.document.Document):
        parent_elm = parent.element.body
    elif isinstance(parent, docx.table._Cell):
        parent_elm = parent._tc
    else:
        raise TypeError(repr(type(parent)))

    for child in parent_elm.iterchildren():
        if isinstance(child, docx.oxml.text.paragraph.CT_P):
            yield docx.text.paragraph.Paragraph(child, parent)
        elif isinstance(child, docx.oxml.table.CT_Tbl):
            if recursive:
                # Walk every cell of the table and yield its paragraphs too.
                table = docx.table.Table(child, parent)
                for row in table.rows:
                    for cell in row.cells:
                        for child_paragraph in iter_paragraphs(cell):
                            yield child_paragraph

You can use it to find all of the document's paragraphs, including those inside table cells.

For example:

import docx

document = docx.Document("sample.docx")
for paragraph in iter_paragraphs(document):
    print(paragraph.text)

To access the numbering properties, you need to look in the "protected" member paragraph._p.pPr.numPr, which is a docx.oxml.numbering.CT_NumPr object:

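for paragraph in iter_paragraphs(document):
    p_pr = paragraph._p.pPr
    # pPr is None for paragraphs that carry no paragraph properties at all.
    num_pr = p_pr.numPr if p_pr is not None else None
    if num_pr is not None:
        print(num_pr)  # type: docx.oxml.numbering.CT_NumPr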
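The numPr object on its own only prints as an XML element, so a small helper that pulls out the list id and the indent level may be easier to work with. The sketch below assumes the numId and ilvl child accessors (each with a val attribute) that python-docx's CT_NumPr appears to expose; numbering_ref is a hypothetical name, and treating a missing w:ilvl as level 0 is an assumption of this sketch.

```python
def numbering_ref(paragraph):
    """Return (num_id, ilvl) for a numbered paragraph, or None if it has none."""
    p_pr = paragraph._p.pPr
    if p_pr is None or p_pr.numPr is None:
        return None
    num_pr = p_pr.numPr
    # w:numId and w:ilvl are both optional children of w:numPr.
    num_id = num_pr.numId.val if num_pr.numId is not None else None
    # Assumption: a missing w:ilvl is treated as the top level (0).
    ilvl = num_pr.ilvl.val if num_pr.ilvl is not None else 0
    return num_id, ilvl
```

It could be called as numbering_ref(paragraph) for each paragraph produced by iter_paragraphs(document), skipping the paragraphs for which it returns None.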

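Note that numPr only references a numbering definition (through its numId and ilvl values); the definitions themselves live in the numbering.xml file inside the docx, if it exists.

To access it, you need to read your docx file as a package. For example: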

import docx.package
import docx.parts.document
import docx.parts.numbering

package = docx.package.Package.open("sample.docx")

main_document_part = package.main_document_part
assert isinstance(main_document_part, docx.parts.document.DocumentPart)

numbering_part = main_document_part.numbering_part
assert isinstance(numbering_part, docx.parts.numbering.NumberingPart)

ct_numbering = numbering_part._element
print(ct_numbering)  # CT_Numbering
for num in ct_numbering.num_lst:
    print(num)  # CT_Num
    print(num.abstractNumId)  # CT_DecimalNumber
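The loop above stops at the abstractNumId, while the per-level format lives in the matching w:abstractNum element. As a rough illustration, the sketch below resolves each numId to its level definitions using only the standard library on the raw XML rather than python-docx. read_levels and _val are hypothetical names, and w:lvlOverride and linked numbering styles are deliberately ignored, so treat it as a starting point rather than a complete reader.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element, tag):
    """Return the w:val attribute of the child w:*tag*, or None if absent."""
    child = element.find("w:" + tag, NS)
    return None if child is None else child.get("{%s}val" % W)


def read_levels(numbering_xml):
    """Map each numId to {ilvl: (start, numFmt, lvlText)} from numbering.xml content."""
    root = ET.fromstring(numbering_xml)
    # Abstract definitions hold the per-level formatting.
    abstract = {}
    for abstract_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            ilvl = int(lvl.get("{%s}ilvl" % W))
            start = _val(lvl, "start")
            # start stays None when w:start is absent; no default is guessed here.
            levels[ilvl] = (
                int(start) if start is not None else None,
                _val(lvl, "numFmt"),
                _val(lvl, "lvlText"),
            )
        abstract[abstract_num.get("{%s}abstractNumId" % W)] = levels
    # Concrete w:num entries point at one abstract definition each.
    result = {}
    for num in root.findall("w:num", NS):
        abstract_id = _val(num, "abstractNumId")
        result[int(num.get("{%s}numId" % W))] = abstract.get(abstract_id, {})
    return result
```

The XML can be obtained with zipfile.ZipFile("sample.docx").read("word/numbering.xml"), assuming the part sits at its usual location; that call raises KeyError when the document has no numbering part.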

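More information can be found in the Office Open XML documentation.

Regarding "python - Is there a way to read a .docx file, including auto-numbering, using python-docx", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52094242/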
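Word does not store the rendered numbers anywhere, so they have to be recomputed by counting paragraphs. The sketch below shows one simplified way to do that. It assumes a hypothetical input shape: a sequence of (num_id, ilvl) pairs in document order, and a mapping num_id -> {ilvl: (start, num_fmt, lvl_text)} with integer start values. Only the "decimal" format is rendered, counters are kept per numId, and overrides and restarts defined elsewhere in the specification are ignored.

```python
def number_labels(refs, levels):
    """
    Yield a label string for each (num_id, ilvl) reference, in document order.

    *levels* maps num_id -> {ilvl: (start, num_fmt, lvl_text)}. Only the
    "decimal" format is rendered; anything else yields None.
    """
    counters = {}  # (num_id, ilvl) -> current value
    for num_id, ilvl in refs:
        list_levels = levels.get(num_id, {})
        if ilvl not in list_levels:
            yield None
            continue
        start, num_fmt, lvl_text = list_levels[ilvl]
        key = (num_id, ilvl)
        # First item at this level begins at start, later ones increment.
        counters[key] = counters[key] + 1 if key in counters else start
        # A new item at this level restarts every deeper level of the same list.
        for other in [k for k in counters if k[0] == num_id and k[1] > ilvl]:
            del counters[other]
        if num_fmt != "decimal":
            yield None
            continue
        # lvlText uses %1, %2, ... as placeholders for levels 0, 1, ...
        label = lvl_text
        for level in range(ilvl + 1):
            fallback = list_levels.get(level, (1,))[0]
            value = counters.get((num_id, level), fallback)
            label = label.replace("%" + str(level + 1), str(value))
        yield label
```

For a two-level decimal list with level texts "%1." and "%1.%2.", this produces labels such as "1.", "1.1.", "1.2.", "2.", "2.1." for the corresponding sequence of paragraphs.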
