
python - Memory leak when parsing HTML page source with BeautifulSoup and Requests

Reposted · Author: 太空狗 · Updated: 2023-10-30 00:12:42

The basic idea is to make GET requests to a list of URLs and parse the text out of each page source, using BeautifulSoup to strip out the HTML tags and scripts. Python version: 2.7.

The problem is that the parser function keeps accumulating memory on every request; the process size grows steadily.

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    # page_source is the raw markup (e.g. response.content), passed straight to the parser
    soup = BeautifulSoup(page_source, 'html.parser')
    # soup = BeautifulSoup(page_source, "lxml")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()  # rip it out

    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each (split on double spaces)
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text
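
For reference, a quick sanity check of the function on an inline snippet (the HTML string here is made up for illustration):

sample_html = ('<html><head><script>var x = 1;</script></head>'
               '<body><p>  Hello   </p><p>World</p></body></html>')
print(get_text_from_page_source(sample_html))
# prints "Hello" and "World" on separate lines; the script body is stripped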

Memory leaks even when parsing local text files. For example:

# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 300 MB

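To pin down the growth, one can print the process's peak memory after each request. Below is a minimal sketch using the standard-library resource module (Unix only); the urls list and timeout value are placeholders, not from the original post:

import resource

import requests

urls = ['https://example.com/page1', 'https://example.com/page2']  # placeholder URLs

for url in urls:
    response = requests.get(url, timeout=10)
    parsed = get_text_from_page_source(response.content)
    # ru_maxrss is reported in kilobytes on Linux (bytes on macOS)
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print('peak RSS after %s: %d' % (url, peak))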

Best Answer

You can try calling the garbage collector:

import gc

response.close()   # release the underlying connection
response = None    # drop the reference so the object can be reclaimed
gc.collect()       # force a collection of any lingering reference cycles
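
Putting the pieces together, here is a minimal sketch that combines the cleanup above with the decompose() tip from the thread linked below; the function name get_clean_text and its default timeout are made up for illustration:

import gc

import requests
from bs4 import BeautifulSoup

def get_clean_text(url, timeout=10):
    response = requests.get(url, timeout=timeout)
    soup = BeautifulSoup(response.content, 'html.parser')
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = soup.get_text()
    # free the parse tree and the response explicitly, then force a
    # collection so the cycles BeautifulSoup creates are reclaimed now
    soup.decompose()
    response.close()
    del soup, response
    gc.collect()
    return text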

This may also help: Python high memory usage with BeautifulSoup

On the topic of "python - Memory leak when parsing HTML page source with BeautifulSoup and Requests", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51894849/
