
vowpalwabbit - Vowpal Wabbit: question on training contextual bandit on historical data


I know from this page that there is an option to train a contextual bandit VW model on historical contextual bandit data collected using some exploration policy:

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not implement exploration, it assumes it can only use the currently available data logged using an exploration policy.


This is done by specifying --cb and passing data lines in the format action:cost:probability | features:
1:2:0.4 | a c  
3:0.5:0.2 | b d
4:1.2:0.5 | a b c
2:1:0.3 | b c
3:1.5:0.7 | a d
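
For reference, here is a minimal sketch of how logged lines like these might be fed to VW through its Python bindings (an assumption on my part: the vowpalwabbit package with the Workspace class; older releases construct the learner as pyvw.vw(...) instead):

# Train a --cb model on logged action:cost:probability | features lines.
import vowpalwabbit

# 4 actions, matching the largest action id in the logged data
vw = vowpalwabbit.Workspace("--cb 4 --quiet")

logged = [
    "1:2:0.4 | a c",
    "3:0.5:0.2 | b d",
    "4:1.2:0.5 | a b c",
    "2:1:0.3 | b c",
    "3:1.5:0.7 | a d",
]

for line in logged:
    vw.learn(line)

# Predicting on a new context returns the action the learned policy would choose
print(vw.predict("| a c"))

vw.finish()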
My question is, is there a way to leverage historical data that was not collected under a contextual bandit policy, using --cb (or some other method) together with some policy evaluation method? Say the actions were chosen according to some deterministic, non-exploratory (edit: biased) heuristic? In that case I would have the action and the cost, but I would not have the probability (or it would be equal to 1).
I tried an approach where I use an exploratory method and assume the historical data is fully labelled (assigning a reward of zero for the unknown rewards), but the PMF seems to collapse to zero over most actions.

Best Answer

My question is, is there a way to leverage historical data that was not based on a contextual bandit policy using --cb (or some other method) and some policy evaluation method? Let's say actions were chosen according to some deterministic, non-exploratory heuristic? In this case, I would have the action and the cost, but I wouldn't have the probability (or it would be equal to 1).



Yes, set the probability to 1. There are no theoretical guarantees with a degenerate logging policy, but in practice this may help with initialization. Going forward, you will want some uncertainty in your logging policy, otherwise you will never improve.
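
As a concrete illustration (under the same assumptions about the Python bindings as in the sketch above), a deterministic log can be encoded with the probability fixed at 1.0; the tuple layout of the log below is purely hypothetical:

import vowpalwabbit

vw = vowpalwabbit.Workspace("--cb 4 --quiet")

# (features, action chosen by the deterministic heuristic, observed cost)
deterministic_log = [
    ("a c", 1, 2.0),
    ("b d", 3, 0.5),
    ("b c", 2, 1.0),
]

for features, action, cost in deterministic_log:
    # probability is 1.0 because the logging policy was non-exploratory
    vw.learn(f"{action}:{cost}:1.0 | {features}")

vw.finish()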

I've tried a method where I use an exploratory approach and assume that the historical data is fully labelled (assign reward of zero for unknown rewards) but the PMF collapses to zero over most actions.



If you genuinely have fully labelled historical data, you can use the warm start functionality. If you are only pretending to have fully labelled data, I'm not sure this is better than setting the probability to 1.

Regarding vowpalwabbit - Vowpal Wabbit: question on training contextual bandit on historical data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61670224/
