
python - Extracting plain text from HTML using Python

Repost. Author: 行者123. Updated: 2023-12-01 03:45:52

I am trying to extract plain text from a website using Python. My code looks like this (a slightly modified version of code I found here):

import requests
import urllib
from bs4 import BeautifulSoup
url = "http://www.thelatinlibrary.com/vergil/aen1.shtml"
r = requests.get(url)
k = r.content
file = open('C:\\Users\\Anirudh\\Desktop\\NEW2.txt','w')
soup = BeautifulSoup(k)
for script in soup(["Script","Style"]):
    script.exctract()
text = soup.get_text
file.write(repr(text))

This does not seem to work. My guess is that BeautifulSoup does not accept r.content. What can I do to fix this?

Here is the error -

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 8 of the file C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py. To get rid of this warning, change code that looks like this:

BeautifulSoup([your markup])

to this:

BeautifulSoup([your markup], "html.parser")

markup_type=markup_type))
Traceback (most recent call last):
File "C:/Users/Anirudh/PycharmProjects/untitled/test/__init__.py", line 12, in <module>
file.write(repr(text))
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 2130: character maps to <undefined>

Process finished with exit code 1

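Best answer

The "error" is only a warning and has no consequences. Use soup = BeautifulSoup(k, 'html.parser') to silence it.

There also appears to be a typo: in script.exctract(), the word extract is misspelled.

The actual error appears to be that the content is a byte string, but you are writing in text mode. The source contains an em dash, which shows up in the traceback as '\x97', and the file's default cp1252 codec cannot encode that character.

You can encode with soup.encode("utf-8"). This means hard-coding the encoding into the script (which is bad). Alternatively, try opening the file in binary mode with open(..., 'wb'), or convert the content to a string before passing it to Beautiful Soup, using the correct encoding for that page, with k = str(r.content, "utf-8").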
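The failure can be reproduced without any network access. The sketch below uses only the standard library and assumes, as the traceback suggests, that the problem character is '\x97' and that the output file defaulted to cp1252. The variable names are illustrative only.

```python
# '\x97' is the character named in the traceback.
ch = "\x97"

# cp1252 has no mapping for the code point U+0097, so encoding fails.
try:
    ch.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can encode every Unicode code point, including this one.
utf8_bytes = ch.encode("utf-8")

# The *byte* 0x97, read as cp1252, is the em dash (U+2014). This hints
# that the page's bytes were decoded with a different encoding upstream.
as_dash = b"\x97".decode("cp1252")

print(cp1252_ok)
print(utf8_bytes)
print(as_dash == "\u2014")
```

This is why choosing an encoding that covers the character, or writing raw bytes, avoids the UnicodeEncodeError.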

Regarding python - extracting plain text from HTML using Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38942474/
