
Python-docx: extracted string is missing a word

Reposted. Author: 行者123. Updated: 2023-12-04 21:04:18

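I don't understand why the word "Delaware" is not extracted by the code below. Every other character is extracted. Can anyone provide code that extracts the word "Delaware" from the Docx file below, without manually altering the file?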

Input:

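import docx
import io
import requests

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
# Download the document into memory so python-docx can open it like a file.
file = io.BytesIO(requests.get(url).content)

# Print the text of every paragraph in the document body.
for paragraph in docx.Document(file).paragraphs:
    print(paragraph.text)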

Output:

APPLICABLE LAW This Agreement is to be construed and interpreted according to the laws of the State of , excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.



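The strangest part is that if I do anything to the word "Delaware" in the document (e.g., bold/unbold it, or retype the word) and then save, "Delaware" is no longer missing the next time I run the code. Simply saving the file without changing the word, however, does not fix the problem. You might say the solution is to change the word manually, but I am actually dealing with thousands of these documents, and fixing each one by hand is not practical.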

The answer to "Missing document text when using python-docx" seems to explain why "Delaware" cannot be extracted, but it does not offer a solution. Thanks.

Best answer

I believe @smci is right. This is most likely explained by the answer to "Missing document text when using python-docx". However, that answer does not provide a solution.
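To make that explanation concrete, here is a minimal sketch of how a word can go missing. It assumes, as I understand the linked answer, that the paragraph text is assembled only from runs sitting directly under the paragraph, and that "Delaware" lives in a run wrapped in some other element. The `w:smartTag` wrapper, the sample XML, and both helper names are my own assumptions for illustration; they are not taken from the actual file.

```python
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

# Hypothetical paragraph: the middle run is wrapped in <w:smartTag>
# (an assumption about the cause, not a dump of the real document).
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">State of </w:t></w:r>'
    '<w:smartTag w:uri="urn:schemas-microsoft-com:office:smarttags" w:element="State">'
    '<w:r><w:t>Delaware</w:t></w:r>'
    '</w:smartTag>'
    '<w:r><w:t>, excluding</w:t></w:r>'
    '</w:p>'
)

def direct_run_text(p):
    """Join text only from runs that are direct children of the paragraph."""
    return ''.join(
        t.text or ''
        for r in p.findall(W + 'r')
        for t in r.findall(W + 't')
    )

def all_run_text(p):
    """Join text from every w:t element anywhere below the paragraph."""
    return ''.join(t.text or '' for t in p.iter(W + 't'))

paragraph = XML(SAMPLE)
print(direct_run_text(paragraph))  # State of , excluding
print(all_run_text(paragraph))     # State of Delaware, excluding
```

If this is what is happening, it would also fit the observation that retyping or reformatting the word fixes the file: Word may rewrite the run as a plain child of the paragraph when it is edited.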

I think our only option in this case is to fall back to reading the XML directly. Consider, for example, this (simplified) function from http://etienned.github.io/posts/extract-text-from-word-docx-simply/:

try:
    # cElementTree was removed in Python 3.9; fall back to ElementTree.
    from xml.etree.cElementTree import XML
except ImportError:
    from xml.etree.ElementTree import XML
import zipfile
import io
import requests

def get_docx_text(path):
    """Take the path of a docx file (or a file-like object), return its text."""

    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'

    # A .docx file is a zip archive; the body lives in word/document.xml.
    document = zipfile.ZipFile(path)
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    # iter() replaces getiterator(), which was removed in Python 3.9.
    for paragraph in tree.iter(PARA):
        # Collect every text node below the paragraph, however deeply nested.
        texts = [n.text for n in paragraph.iter(TEXT) if n.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

url = 'https://github.com/python-openxml/python-docx/files/1996979/Delaware_Test.docx'
file = io.BytesIO(requests.get(url).content)
print(get_docx_text(file))

We get:
APPLICABLE LAW

This Agreement is to be construed and interpreted according to the laws of the State of Delaware, excluding its conflict of laws provisions. The provisions of the U. N. Convention on Contracts for the International Sale of Goods shall not apply to this Agreement.
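Since the question mentions thousands of documents, the same idea could be applied to a whole folder. The following is only a sketch under my own assumptions: the files sit in one directory with a `.docx` extension, and `docx_paragraphs` and `extract_folder` are hypothetical helper names, not part of python-docx or of the linked blog post.

```python
import zipfile
from pathlib import Path
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'

def docx_paragraphs(source):
    """Return the non-empty paragraph texts of a .docx (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = []
    for p in tree.iter(W + 'p'):
        # Gather all text nodes below the paragraph, including wrapped runs.
        text = ''.join(t.text or '' for t in p.iter(W + 't'))
        if text:
            paragraphs.append(text)
    return paragraphs

def extract_folder(folder):
    """Map each .docx file name in `folder` to its list of paragraph texts.

    Files that are not zip archives, or that lack word/document.xml,
    map to None so that one bad file does not stop the batch.
    """
    results = {}
    for path in sorted(Path(folder).glob('*.docx')):
        try:
            results[path.name] = docx_paragraphs(path)
        except (zipfile.BadZipFile, KeyError):
            results[path.name] = None
    return results
```

Calling `extract_folder('contracts')` (a hypothetical directory name) would return a dictionary keyed by file name. Note that this reads only the main document body; headers, footers, and footnotes are stored in other parts of the archive.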

A similar question about this python-docx missing-word issue can be found on Stack Overflow: https://stackoverflow.com/questions/50301279/
