python - How do I load unicode strings with the Textract library in Python?

Reposted by: 行者123. Last updated: 2023-11-30 22:55:22

I am using Textract and am relatively new to Python. I want to load the extracted text as a unicode string rather than a UTF-8 encoded one. Is there a way to do this?

I tried

text = textract.process(file)

but this loads a UTF-8 string, and I would prefer unicode. I then tried

text = textract.process(file, encoding="unicode")

but that raises an error.

Error
Traceback (most recent call last):
  File "/home/moha/dev/intellij-ws/pyqadi/tests/test_file2txt.py", line 11, in test_process
    str=f2t.to_txt(file)
  File "/home/moha/dev/intellij-ws/pyqadi/textsearcher/file2txt.py", line 10, in to_txt
    text = textract.process(file, encoding="unicode")
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/__init__.py", line 57, in process
    return parser.process(filename, encoding, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 46, in process
    return self.encode(unicode_string, encoding)
  File "/usr/local/lib/python2.7/dist-packages/textract/parsers/utils.py", line 31, in encode
    return text.encode(encoding, 'ignore')
LookupError: unknown encoding: unicode

Best answer

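Textract uses the encoding argument to select the output encoding (the input encoding is inferred with chardet). It must therefore be the name of a codec that Python knows, and "unicode" is not one.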
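The traceback in the question ends in text.encode(encoding, 'ignore'), so whatever is passed as encoding goes through Python's codec registry. A minimal sketch of that lookup, using only the standard library; the helper name is_known_codec is hypothetical and only for illustration:

```python
import codecs


def is_known_codec(name):
    """Return True if Python's codec registry recognizes `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# "unicode" is not a registered codec, which is what the LookupError reports.
print(is_known_codec("unicode"))         # False
print(is_known_codec("utf-8"))           # True
print(is_known_codec("unicode_escape"))  # True
```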

These are the Unicode-related codec names you can pass as the encoding:

unicode_escape, unicode_internal, raw_unicode_escape
text = textract.process(file, encoding = 'unicode_escape')

An exhaustive list of codec names is given in the standard encodings table of the Python codecs documentation.
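Note that these codecs still produce an encoded byte string, with non-ASCII characters written as backslash escapes, rather than a unicode object. A small sketch with a stand-in string (no Textract call is involved, and Python 3 syntax is assumed):

```python
# Stand-in for text extracted from a document (not real Textract output).
sample = "caf\u00e9"

escaped = sample.encode("unicode_escape")
print(escaped)                 # b'caf\\xe9'
print(type(escaped).__name__)  # bytes

# Decoding with the same codec recovers the original string.
restored = escaped.decode("unicode_escape")
print(restored == sample)      # True
```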

The underlying data is UTF-8. You can call textract.process to get the UTF-8 byte string and then decode it to Unicode on a separate line, like this:

text = textract.process(file)

Utext = unicode(text,'utf-8')
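The call to unicode() above is Python 2 only, which matches the python2.7 paths in the traceback. In Python 3 that built-in no longer exists and bytes.decode does the same job. A minimal sketch, assuming textract.process returns UTF-8 bytes as described above; the helper to_unicode and the sample bytes are hypothetical and stand in for real Textract output:

```python
def to_unicode(raw, encoding="utf-8"):
    """Decode a byte string (such as the result of textract.process) to str."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw.decode(encoding, errors="replace")


# Stand-in bytes; with Textract installed this would be textract.process(file).
raw = "na\u00efve text".encode("utf-8")
utext = to_unicode(raw)
print(type(utext).__name__)  # str
```

Using errors="replace" is a design choice for this sketch; drop it if a decoding failure should raise instead of being papered over.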

Regarding "python - How do I load unicode strings with the Textract library in Python?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37448074/
