- html - 出于某种原因,IE8 对我的 Sass 文件中继承的 html5 CSS 不友好?
- JMeter 在响应断言中使用 span 标签的问题
- html - 在 :hover and :active? 上具有不同效果的 CSS 动画
- html - 相对于居中的 html 内容固定的 CSS 重复背景?
我正在尝试运行 tensorflow 项目,但在大学 HPC 集群上遇到内存问题。我必须为数百个不同长度的输入运行预测作业。我们有具有不同数量 vmem 的 GPU 节点,所以我试图以一种不会在 GPU 节点 - 输入长度的任何组合中崩溃的方式设置脚本。
在网上搜索解决方案后,我尝试了 TF_FORCE_UNIFIED_MEMORY、XLA_PYTHON_CLIENT_MEM_FRACTION、XLA_PYTHON_CLIENT_PREALLOCATE 和 TF_FORCE_GPU_ALLOW_GROWTH,以及 tensorflow 的 0x1046。据我了解,通过统一内存,我应该能够使用比 GPU 本身更多的内存。
这是我的最终解决方案(仅相关部分)
os.environ['TF_FORCE_UNIFIED_MEMORY']='1'
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION']='2.0'
#os.environ['XLA_PYTHON_CLIENT_PREALLOCATE']='false'
os.environ['TF_FORCE_GPU_ALLOW_GROWTH ']='true' # as I understood, this is redundant with the set_memory_growth part :)
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
if gpus:
try:
# Currently, memory growth needs to be the same across GPUs
for gpu in gpus:
print(gpu)
tf.config.experimental.set_memory_growth(gpu, True)
logical_gpus = tf.config.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
# Memory growth must be set before GPUs have been initialized
print(e)
我用
set_memory_growth
(slurm 作业调度程序)和
--mem=30G
在集群上提交它。
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5582 MB memory) -> physical GPU (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:02:00.0, compute capability: 3.5)
2021-08-24 09:22:02.053935: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 12758286336 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:03.738635: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 11482457088 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:05.418059: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 10334211072 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:07.102411: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 9300789248 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:08.784349: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 8370710016 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:10.468644: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 7533638656 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:22:12.150588: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:764] failed to alloc 6780274688 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-08-24 09:23:10.326528: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.33GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "env/lib/python3.7/site-packages/jax/_src/api.py", line 402, in cache_miss
donated_invars=donated_invars, inline=inline)
File "env/lib/python3.7/site-packages/jax/core.py", line 1561, in bind
return call_bind(self, fun, *args, **params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1552, in call_bind
outs = primitive.process(top_trace, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 1564, in process
return trace.process_call(self, fun, tracers, params)
File "env/lib/python3.7/site-packages/jax/core.py", line 607, in process_call
return primitive.impl(f, *tracers, **params)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 608, in _xla_call_impl
*unsafe_map(arg_spec, args))
File "env/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
ans = call(fun, *args)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 758, in _xla_callable
compiled = compile_or_get_cached(backend, built, options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 76, in compile_or_get_cached
return backend_compile(backend, computation, compile_options)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "scripts/script.py", line 654, in <module>
prediction_result, (r, t) = cf.to(model_runner.predict(processed_feature_dict, random_seed=seed), "cpu")
File "env/lib/python3.7/site-packages/alphafold/model/model.py", line 134, in predict
result, recycles = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "env/lib/python3.7/site-packages/jax/interpreters/xla.py", line 373, in backend_compile
return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 4649385984 bytes.
对于如何让它工作并使用所有可用内存的任何想法,我会很高兴。
最佳答案
看起来您的 GPU 不完全支持统一内存。支持是有限的,实际上 GPU 将所有数据保存在其内存中。
描述见这篇文章:https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
特别是:
On systems with pre-Pascal GPUs like the Tesla K80, calling cudaMallocManaged() allocates size bytes of managed memory on the GPU device that is active when the call is made. Internally, the driver also sets up page table entries for all pages covered by the allocation, so that the system knows that the pages are resident on that GPU.
Since these older GPUs can’t page fault, all data must be resident on the GPU just in case the kernel accesses it (even if it won’t).
关于tensorflow - 未能分配 X 字节的统一内存;结果 : CUDA_ERROR_OUT_OF_MEMORY: out of memory,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/68902851/
我正在为期末考试学习,但我无法理解这个 FC 算法: 我理解你标准化每条规则的部分。然后我认为下一行是说对于满足广义 Modus Ponens (p'_iTheta = p_iTheta) 的每个 t
我有一个 3d 世界,它有一个 simpel 平台和一个代表玩家的立方体。当我旋转平台时,立方体会滑动并按照您预期的方式执行,增加和减少物理 Material 中的摩擦力。 我希望立方体在输入例如 f
所以我的 Unity 项目有一个大问题。我昨天工作,我没有做备份今天,在我打开项目后,我的笔记本电脑因电池电量不足而关机。之后,当我进入项目时,我得到了这个:加载“Assets/MyScene.uni
好的,我正在尝试创建一个函数来确定元组列表是否是可传递的,即如果 (x,y) 和 (y,z) 在列表中,那么 (x,z) 也在列表中。 例如,[(1,2), (2,3), (1,3)]是传递的。 现在
这个问题在这里已经有了答案: How to pass data between scenes in Unity (5 个回答) 9 个月前关闭。 我有一个游戏,我有一个队列匹配系统。 我想向玩家展示他
我现在正在为我的游戏创建一个 keystore (统一)但是当我按下添加键按钮时,会弹出一个错误 Java Development Kit (JDK) directory is not set or
我想将YouTube流视频放入Cardboard(适用于Android和iOS)应用中。我知道这些插件可以执行类似的操作,例如“Easy Movie Texture”,但它们不支持YouTube流媒体
我需要限制 ConfigurableJoint 的目标旋转以避免关节变形或破坏。 为了了解角度限制的工作原理,我做了一个实验。 在场景中放置一个人形模型。 为骨骼添加ConfigurableJoint
尝试实现一种有限形式的匹配统一。 尝试匹配两个公式匹配如果我们能找到替代出现在公式中的变量使得两者在句法上是等价。 我需要写一个函数来判断一个对应于基本项的常数,例如 Brother(George)
我正在使用 Unity 和 C#我想在运行时将输出日志文件发送到我的电子邮件,我使用了来自 this question 的 ByteSheep 答案和来自 this question 的 Arkane
关闭。这个问题需要debugging details .它目前不接受答案。 编辑问题以包含 desired behavior, a specific problem or error, and th
我希望能够将鼠标悬停在游戏对象(代理)上并在右键或左键单击时创建一个类似于 Windows 右键单击菜单的 float 菜单。我试过结合使用 OnGUI() 和 OnMouseOver() 但我要
我正在为 oculus Gear VR 开发游戏(考虑内存管理),我需要在特定时间(以秒为单位)后加载另一个屏幕 void Start () { StartCoroutine (loadSce
我设法生成了敌人,但它们一直在生成。如何设置限制,避免不断生成? 我已经尝试添加 spawnLimit 和 spawnCounter 但无法让它工作。 var playerHealth = 100;
我正在参加使用 Unity 进行游戏开发的在线类(class),讲师有时会含糊不清。我的印象是使用游戏对象与使用游戏对象名称(在本例中为 MusicPlayer)相同,但是当我尝试将 MusicPla
关闭。这个问题需要更多focused .它目前不接受答案。 想改进这个问题吗? 更新问题,使其只关注一个问题 editing this post . 关闭 6 年前。 Improve this qu
为了好玩,我正在(用 Java)开发一个使用统一算法的应用程序。 我选择了我的统一算法返回所有可能的统一。例如,如果我尝试解决 添加(X,Y)=成功(成功(0)) 返回 {X = succ(succ(
如何让对象在一段时间后不可见(或只是删除)?使用 NGUI。 我的示例(更改): public class scriptFlashingPressStart : MonoBehaviour {
我有下一个错误: The type or namespace name 'NUnit' could not be found (are you missing a using directive or
这是可以做到的 但是属性 autoSizeTextType 只能用于 API LEVEL >= 26,并且 Android Studio 会显示有关该问题的烦人警告。 为了摆脱这个问题,我想以编程方
我是一名优秀的程序员,十分优秀!