数据中心冷却的safe-RL，基于对action的事后修正技术-6ren

数据中心冷却的safe-RL，基于对action的事后修正技术

转载作者：我是一只小鸟更新时间：2023-03-05 14:31:22

32

4

一个总述
摘要
1 intro
2 related work
3 preliminaries
4 performance of reward shaping
5 the safari approach
- 5.1 CMDP Formulation & Approach Overview
- 5.2 off-line IL
- 5.3 Online Post-hoc Rectification
6 performance evaluation

一个总述

关于本篇笔记:

论文题目：Toward Physics-Guided Safe Deep Reinforcement Learning for Green Data Center Cooling Control
- 在线看： https://ieeexplore.ieee.org/abstract/document/9797658
阅读笔记是 ~~量子速读法~~ 的产物，只能起辅助阅读的功效，无法替代论文原文。

关于论文:

motivation：减少 RL 试错过程中的 unsafe behavior。
基本思想：先 imitation learning，再在 on-line learning 时强行改可能 unsafe 的 action。
关键技术：1. 怎么判断 action 可能 unsafe，2. 怎么改 action。
- 1 再训一个 coarse model。
- 2 启发式寻找最小的改动，直到 coarse model 判断安全。
- 1 2 是结合在一起做的。
局限性：把整个房间设为同一个温度（使用 EnergyPlus 仿真）

摘要

声称自己提出的：safety-aware DRL framework，for single-hall data center cooling control。
技术路线：off-line imitation learning + online post-hoc rectification。
rectification：使用基于 historical safe operation traces 来 fit 的 thermal state transition model，还能外推 unsafe state。

1 intro

Post-hoc rectification：去修正 DRL 提供的 action，来保证 won't drive the system to the unsafe region，采用最小的修正（rectification）。
Safari（offline 模仿学习 + online 修正）的优势：拟合状态转换模型时的低开销和对数据的低需求（即只需要安全数据）。

simplex 单纯形：进入到 unsafe region，再回退；post-hoc rec：主动修改 action 使其安全.

3 preliminaries

用 EnergyPlus 仿真，假设整个房间的温度是一样的（uniform distribution）。
设定点（action）好像是 mass flow rate 和 T_in。

4 performance of reward shaping

本章内容：MDP setting 和 RL（DDPG）.

state:

T_z 室内温度, T_in 空调温度, P_c ACU 功率, P_IT 负载, T_o 户外温度。
假设 P_IT 和 T_o 都是 Markovian（？）

reward:

1 goal 项：exp(ΔT²) 惩罚的超温 - P_DC（总功率 = P_IT + P_c）。
2 shaping 项：用于控制 T_z 在 T_L T_U 之间，两个 max(0, ΔT_{过高/过低} ) 相加。

（迄今为止我都不知道 ACU P_c 是怎么算的）。

5 the safari approach

5.1 CMDP Formulation & Approach Overview

如 subsection 标题。定义了 constrained MDP 的问题.

5.2 off-line IL

发现模仿学习可以让前三天不犯错，但后面 RL 继续试错还是会犯错的.

5.3 Online Post-hoc Rectification

用以前的数据再 train 一个 coarse model：灰盒或黑盒.

LSTM：MAE 在 0.5℃。需要 unsafe 数据（exploratory data）。
safari1：
- 关于 KKT 条件： https://zhuanlan.zhihu.com/p/556832103

                        
                          - 好像就是解了一个 KKT 条件。让数学形式如此简单的关键，是它的 DC 状态方程简单。

safari2： heuristic 方法。
- For T_in(t) and f(t), we adopt their setpoints as their approximations. 又干了这种直接拿 setpoint 当真实值的事情。
- 然后直接把 Eq.(1) 积分了！
- 然后，为了减少拿 setpoint 近似 flow rate 和温度的影响，用这段时间的 T_z 的积分作为最终的 T_z 值。
safari3：
- 搞了一个所有 IT power trace 的上包络线( maximum ramp-up function)。
- 然后直接用这个作为输入 Q，重新 forward DRL 模型。

比较:

LSTM using unsafe data 的性能最好，其次是 transient 的 safari2。

6 performance evaluation

没有什么要说的.

最后此篇关于数据中心冷却的safe-RL，基于对action的事后修正技术的文章就讲到这里了,如果你想了解更多关于数据中心冷却的safe-RL，基于对action的事后修正技术的内容请搜索CFSDN的文章或继续浏览相关文章，希望大家以后支持我的博客！。

32

4

0

文章推荐：自己做一个ChatGPT微信小程序(代码开源)

文章推荐：【故障公告】攻击式巨量并发请求再次来袭，引发博客站点故障

文章推荐：跨平台`ChatGpt`客户端

文章推荐： C++重载底层原理

python - 如何在碰撞后实现超时/冷却
我一直在制作一个 pygame 游戏，其中两名玩家尝试将球击入球网。我在我的游戏中有一个提升功能，但是，我想要一些提升垫，当它们被收集时，我得到 +3 提升。在我下面的代码中，有一个巨大的黄色，当有人
javascript - 无法在本地获取/冷却 Heroku nodejs 教程
这是我的 package.json 文件的依赖项，我在其中添加了“cool-ascii-faces”。然后我需要更新我的 index.js 文件以获取/cool 页面，这样在每次重新加载时我都会看到a

首页

博学

6Ren·AI

商城

数据中心冷却的safe-RL，基于对action的事后修正技术

一个总述

摘要

1 intro

3 preliminaries

4 performance of reward shaping

5 the safari approach

5.1 CMDP Formulation & Approach Overview

5.2 off-line IL

5.3 Online Post-hoc Rectification

6 performance evaluation

首页

博学

6Ren·AI

商城

数据中心冷却的safe-RL，基于对action的事后修正技术

一个总述

摘要

1 intro

2 related work

3 preliminaries

4 performance of reward shaping

5 the safari approach

5.1 CMDP Formulation & Approach Overview

5.2 off-line IL

5.3 Online Post-hoc Rectification

6 performance evaluation