
Python 3 UnicodeDecodeError : 'charmap' codec can't decode byte 0x9d

Reposted. Author: IT老高. Updated: 2023-10-28 20:31:04

I want to build a search engine and am following a tutorial I found online. As a first step I want to test parsing HTML:

from bs4 import BeautifulSoup

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
    with open(filename) as infile:
        html = BeautifulSoup(infile, "html.parser", from_encoding='utf-8')
        d = {'text': html.pre.text}
        if html.title is not None:
            d['title'] = html.title.text
        for meta in html.findAll('meta'):
            try:
                if meta['name'] in ('Author', 'Title'):
                    d[meta['name'].lower()] = meta['content']
            except KeyError:
                continue
    return d

parse_html("C:\\pdf\\pydf\\data\\muellner2011.html")

Running it produces this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 867: character maps to <undefined>

I have seen some solutions online that use encode(), but I don't know how to fit the encode() function into my code. Can anyone help?

Best answer

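In Python 3, files are opened as text by default, so their contents are decoded to Unicode as you read them; you don't need to tell BeautifulSoup which codec to decode from.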
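To illustrate the difference between bytes on disk and decoded text, here is a minimal sketch using only the standard library. The sample string and the file name are hypothetical, and the codec is passed explicitly so the sketch behaves the same on every system. It suggests why calling encode() is not what is needed here: reading a file is a decoding step, and open() performs it in text mode.

```python
import os
import tempfile

# Hypothetical sample content; U+201C and U+201D are curly quotes.
sample = "<pre>\u201cquoted\u201d</pre>"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.html")  # hypothetical file name
    with open(path, "wb") as outfile:
        outfile.write(sample.encode("utf-8"))

    # Binary mode: read() returns the raw bytes, nothing is decoded.
    with open(path, "rb") as infile:
        raw = infile.read()

    # Text mode: open() decodes while reading, so read() returns str.
    with open(path, encoding="utf-8") as infile:
        text = infile.read()

print(type(raw).__name__, type(text).__name__)
```

A parser that receives the text-mode file object gets already-decoded text, so it has no decoding left to do.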

If decoding the data fails, it is because you did not tell open() which codec to use when reading the file; pass the correct codec with the encoding parameter:

with open(filename, encoding='utf8') as infile:
    html = BeautifulSoup(infile, "html.parser")

Otherwise the file is opened with your system's default codec, which depends on the operating system.
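The following sketch shows one plausible way the error in the question arises. It assumes the system default codec was cp1252 (a common Windows default, consistent with the 'charmap' wording in the message) and that the file contained a character such as U+201D, whose UTF-8 encoding ends in the byte 0x9d. Neither assumption is confirmed by the question, and the helper decodes_with is a hypothetical name introduced for illustration.

```python
import locale

# Assumption: the file contained a right double quotation mark (U+201D).
# Its UTF-8 encoding is the three bytes e2 80 9d.
raw = "\u201d".encode("utf-8")

def decodes_with(data, codec):
    """Return True if data decodes cleanly with the given codec."""
    try:
        data.decode(codec)
    except UnicodeDecodeError:
        return False
    return True

# The codec open() falls back to when no encoding argument is given.
print("default codec:", locale.getpreferredencoding(False))

# cp1252 has no character assigned to byte 0x9d, so decoding fails.
print("cp1252 ok:", decodes_with(raw, "cp1252"))

# The same bytes are valid UTF-8.
print("utf-8 ok:", decodes_with(raw, "utf-8"))
```

Printing the default codec on your own machine is a quick way to check whether an explicit encoding argument is needed.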

This question, Python 3 UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d, is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/30750843/
