
python - Memory allocation failed: growing buffer - Python


I am working on a script that scrapes thousands of different web pages. Since these pages are usually different (from different sites), I use multithreading to speed up the scraping.

Edit: a short summary

-------

I am loading 300 urls (html) in a pool of 300 workers. Since the size of the html is variable, the sum of the sizes can sometimes be too large, and Python raises: internal buffer error : Memory allocation failed : growing buffer. I want some way to check whether this is about to happen and, if so, wait until the buffer is no longer full.

-------

This approach works, but sometimes Python starts throwing:

internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer
internal buffer error : Memory allocation failed : growing buffer

into the console. I guess this happens because the HTML I keep in memory can add up to 300 * (e.g. 1 MB) = 300 MB.

Edit:

I know I can reduce the number of workers, and I will. But that is not a solution; it only makes the error less likely. I want to avoid this error entirely...

I started logging the size of the html:
ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))

The result (an excerpt):
2017-03-05 13:02:04,914 DEBUG SIZE: 243940
2017-03-05 13:02:05,023 DEBUG SIZE: 138384
2017-03-05 13:02:05,026 DEBUG SIZE: 1185964
2017-03-05 13:02:05,141 DEBUG SIZE: 1203715
2017-03-05 13:02:05,213 DEBUG SIZE: 291415
2017-03-05 13:02:05,213 DEBUG SIZE: 287030
2017-03-05 13:02:05,224 DEBUG SIZE: 1192165
2017-03-05 13:02:05,230 DEBUG SIZE: 1193751
2017-03-05 13:02:05,234 DEBUG SIZE: 359193
2017-03-05 13:02:05,247 DEBUG SIZE: 23703
2017-03-05 13:02:05,252 DEBUG SIZE: 24606
2017-03-05 13:02:05,275 DEBUG SIZE: 302388
2017-03-05 13:02:05,329 DEBUG SIZE: 334925
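
These numbers make the 300 MB guess easy to check: the largest logged page is about 1.2 MB, so with 300 workers the worst case is roughly 300 * 1.2 MB ≈ 360 MB held at once. A small sketch for turning such a log into a worst-case figure (estimate_peak_bytes is a hypothetical helper of mine, and the parsing assumes the exact "DEBUG SIZE: n" format shown above):

import re

def estimate_peak_bytes(log_path, workers=300):
    # Worst case: every worker holds the largest observed page at the same time.
    sizes = []
    with open(log_path) as f:
        for line in f:
            m = re.search(r'DEBUG SIZE: (\d+)', line)
            if m:
                sizes.append(int(m.group(1)))
    return workers * max(sizes)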

Here is my simplified scraping method:
from multiprocessing.dummy import Pool  # presumably a thread pool, matching the multithreading described above

def scrape_chunk(chunk):
    pool = Pool(300)  # one worker per url in the chunk
    results = pool.map(scrape_chunk_item, chunk)
    pool.close()
    pool.join()
    return results

def scrape_chunk_item(item):
    root_result = _load_root(item.get('url'))
    # parse using xpath and return
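
For context, a minimal driver that feeds the thousands of URLs through scrape_chunk in batches of 300 might look like this (my sketch; scrape_all is a hypothetical name, not part of the original code):

def scrape_all(items, chunk_size=300):
    # Process the full list in fixed-size chunks, one pool of workers per chunk.
    results = []
    for start in range(0, len(items), chunk_size):
        results.extend(scrape_chunk(items[start:start + chunk_size]))
    return results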

And the function that loads the HTML:
import sys
import traceback

import requests
from lxml import etree

# settings, ua and ram_logger are project-level objects defined elsewhere;
# ua is presumably a fake_useragent.UserAgent() instance.

def _load_root(url):
    for i in xrange(settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS):
        try:
            headers = requests.utils.default_headers()
            headers['User-Agent'] = ua.chrome
            # pass the custom User-Agent along with the request
            r = requests.get(url, headers=headers,
                             timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i),
                             verify=False)
            r.raise_for_status()
        except requests.Timeout:
            if i >= settings.ENGINE_NUMBER_OF_CONNECTION_ATTEMPTS - 1:
                tb = traceback.format_exc()
                return {'success': False, 'root': None, 'error': 'timeout', 'traceback': tb}
        except Exception:
            tb = traceback.format_exc()
            return {'success': False, 'root': None, 'error': 'unknown_error', 'traceback': tb}
        else:
            break

    r.encoding = 'utf-8'
    html = r.content
    ram_logger.debug('SIZE: {}'.format(sys.getsizeof(html)))
    try:
        root = etree.fromstring(html, etree.HTMLParser())
    except Exception:
        tb = traceback.format_exc()
        return {'success': False, 'root': None, 'error': 'root_error', 'traceback': tb}

    return {'success': True, 'root': root}

Do you know how to make this safe? Something that would make the workers wait if a buffer overflow is about to happen?

Best Answer

You can limit each worker so that it only starts when there are X bytes of memory available...

Untested:

import contextlib
import threading
import time

lock = threading.Lock()
total_mem = 1024 * 1024 * 500  # 500MB spare memory

@contextlib.contextmanager
def ensure_memory(size):
    global total_mem
    while 1:
        with lock:
            if total_mem > size:
                total_mem -= size
                break
        time.sleep(1)  # or something else...
    try:
        yield
    finally:
        # return the reservation even if the body raises
        with lock:
            total_mem += size

def _load_root(url):
    ...
    # stream=True defers downloading the body until it is accessed
    r = requests.get(url, timeout=(settings.ENGINE_SCRAPER_REQUEST_TIMEOUT + i, 10 + i), verify=False, stream=True)
    ...
    # Content-Length arrives as a string and may be absent (e.g. chunked responses)
    size = int(r.headers.get('content-length', 0))
    with ensure_memory(size):
        # now do stuff here :)
        html = r.content
        ...
        return {'success': True, 'root': root}
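
One caveat with this sketch: the throttle is only as accurate as the Content-Length header. Servers that use chunked transfer encoding omit it, so such responses reserve 0 bytes and are effectively unthrottled; if that matters, reserving a conservative default size for responses without the header keeps the guarantee.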

total_mem could also be computed automatically, so you don't have to guess the right value for each machine...
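
For example, a sketch using the third-party psutil library (my suggestion, not part of the original answer) to derive total_mem from the memory actually free at startup:

import psutil

# Use half of the memory available at startup as the budget;
# the 0.5 factor is an arbitrary safety margin, not a measured value.
total_mem = int(psutil.virtual_memory().available * 0.5)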

Regarding python - Memory allocation failed: growing buffer - Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42608232/
