Off-Policy Selection (OPS) aims to select the best policy from a set of policies trained using offline Reinforcement Learning. In this work, we describe our custom OPS method and its successful application in Samsung Instant Plays for optimizing ad delivery timing. We propose a custom OPS method because traditional Off-Policy Evaluation (OPE) methods often exhibit enormous variance, leading to unreliable results. We applied our OPS method to initialize policies for our custom pseudo-online training pipeline. The final policy resulted in a substantial 49% lift in the number of watched ads while maintaining a similar retention rate.
Samsung's Instant Plays is a mobile application which allows users to play various games free of charge. The application generates revenue by displaying ads to users during gameplay. There are two major types of ads on our platform, and in this project we take control over one of them: interstitial ads. During a gaming session, the user plays through a number of stages. Between stages, the system decides whether to display an ad. If the system displays ads too often for a specific user, that user might churn. Conversely, if a user does not mind watching ads and we do not display them, we might lose revenue.
The decision process of displaying interstitial ads is sequential in nature. At each timestep where there is a slot for an interstitial ad, the system must decide whether to display it. The system might withhold an ad at a particular moment for the sake of better future moments. Thus, the Reinforcement Learning (RL) paradigm fits our use case. We aim to train an RL agent that utilizes contextual information (user embedding, user profile, game embedding, time since the last ad, number of ads shown in the last 10 minutes, etc.) to select the best slots for displaying ads. It is worth mentioning that our system selects only the timing of an ad, not its content. Thus, our problem is orthogonal to content recommendation, but equally important.
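To make the decision step concrete, the sketch below frames a single ad slot as one observation and a binary action. The feature names, the build_observation helper, and the q_function interface are our own illustrative assumptions, not the production schema.

```python
import numpy as np

# Illustrative sketch only: the feature names and dimensions are our own
# assumptions, not the production schema of Samsung Instant Plays.
def build_observation(user_embedding, game_embedding,
                      minutes_since_last_ad, ads_last_10_min):
    """Concatenate contextual features describing one potential ad slot."""
    return np.concatenate([
        np.asarray(user_embedding, dtype=np.float32),          # learned user representation
        np.asarray(game_embedding, dtype=np.float32),          # learned game representation
        np.array([minutes_since_last_ad], dtype=np.float32),   # time since the last ad
        np.array([ads_last_10_min], dtype=np.float32),         # short-term ad frequency
    ])

# Binary action space: 0 = withhold the ad in this slot, 1 = show it.
def decide(q_function, observation):
    """Greedy decision given a learned action-value function (hypothetical interface)."""
    q_values = q_function(observation)  # expected shape (2,): [Q(skip), Q(show)]
    return int(np.argmax(q_values))
```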
In order to initialize Reinforcement Learning agents for our system, we used the offline RL paradigm and conducted the initial training, experimenting with off-policy methods such as Rainbow DQN [4] and MARWIL [7]. However, it is known that models naively pretrained on offline data may underperform when deployed in the real world [6, 8]. Thus, we would like to use a reliable off-policy evaluation (OPE) method, in which the performance of trained policies is evaluated solely using offline data. Unfortunately, we tested common OPE methods including Importance Sampling [2] and Doubly Robust [3], and it turned out that getting accurate point estimates from OPE is nearly impossible due to their unrealistic estimates and huge variance (more on that in section 4.1). In such cases, OPS tasks, such as selecting the best performing policy from a set of policies, become equally important and are often easier to realize (fig. 1).
Figure 1. Off-policy selection (OPS) aims to select the best policy out of a pool of policies. This is a common scenario in real-world recommender systems, where running multiple policies in an environment may be impossible.
We propose an Off-Policy Selection method based on measuring similarities between trajectories in the offline dataset and those that the trained policy would take. Suppose we have an offline dataset $D$ consisting of $M$ trajectories and a trained policy $\pi$. For each trajectory $\tau = (o_0, a_0, r_1, o_1, a_1, \ldots)$ from $D$, we apply the trained policy to the underlying observations and generate a series of new actions $\bar{a}_i = \pi(o_i)$. We compare the logged actions $(a_0, a_1, \ldots)$ with the newly generated ones $(\bar{a}_0, \bar{a}_1, \ldots)$ by computing the Euclidean distance between corresponding elements and assign a distance $d_\tau$ to the trajectory. Given all such distances, we sort them in ascending order, $d'_1 \le \cdots \le d'_M$, and define the final score of the trained policy as the average return over the $p\%$ most similar trajectories, i.e. $\frac{1}{N}\sum_{i=1}^{N} r'_i$, where $r'_i$ is the return corresponding to the distance $d'_i$ and $N$ is the number of trajectories in the top $p\%$.
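A minimal Python sketch of this scoring procedure is shown below. The trajectory layout, the policy interface, and the choice to average (rather than sum) the per-step distances into $d_\tau$ are our own assumptions.

```python
import numpy as np

def ops_score(trajectories, policy, p=10.0):
    """Sketch of the similarity-based OPS score described above.

    Assumptions (not taken from the paper): each trajectory is a dict with
    'observations' (T x obs_dim), 'actions' (T x act_dim) and 'return' (float),
    and `policy(obs)` returns the action the trained policy would take.
    """
    distances, returns = [], []
    for traj in trajectories:
        logged_actions = np.asarray(traj["actions"], dtype=np.float32)
        # Actions the trained policy would have taken on the same observations.
        policy_actions = np.stack([np.asarray(policy(o), dtype=np.float32)
                                   for o in traj["observations"]])
        # Per-step Euclidean distances, aggregated into one distance per trajectory.
        d = np.linalg.norm(logged_actions - policy_actions, axis=-1).mean()
        distances.append(d)
        returns.append(traj["return"])

    distances = np.asarray(distances)
    returns = np.asarray(returns)

    # Keep the p% most similar trajectories (smallest distances) ...
    n_keep = max(1, int(np.ceil(len(trajectories) * p / 100.0)))
    most_similar = np.argsort(distances)[:n_keep]
    # ... and score the policy by their average return.
    return returns[most_similar].mean()
```

Given a pool of candidate policies, the one with the highest ops_score would be selected.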
In order to realize our ideas, we needed an RL framework which would support offline Reinforcement Learning and Off-Policy Evaluation methods. We started with the ReAgent suite [1] to develop a proof of concept. However, we eventually switched to Ray [5] because this framework is more developer-friendly, supports more off-policy methods, is better suited for heavy-duty applications, contains modules for serving RL models, and is overall very well maintained.
Since we used off-policy methods, we needed to store data for training in a convenient way. We utilized a replay buffer, updated it with historical data, and retrained the RL models every 4 hours. The RL models were hosted using the Ray Serve module, which allowed us to respond to the Samsung Instant Plays application within 100 ms. The workflow of our system is summarised in fig. 2.
Figure 2. The workflow of our system. Responses are expected to be delivered within 100 ms. The replay buffer update and retraining are done every 4 hours.
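For the serving side, a minimal Ray Serve deployment could look like the following sketch. The AdTimingPolicy class, its request schema, and the stubbed policy are our own assumptions rather than the production code.

```python
import numpy as np
from ray import serve
from starlette.requests import Request

@serve.deployment(num_replicas=2)
class AdTimingPolicy:
    """Hypothetical deployment wrapping a trained RL policy."""

    def __init__(self):
        # In production the policy would be restored from a checkpoint;
        # here it is stubbed with a random scorer purely for illustration.
        self.policy = lambda obs: np.random.rand(2)

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        obs = np.asarray(payload["observation"], dtype=np.float32)
        q_values = self.policy(obs)          # [Q(skip), Q(show)]
        return {"show_ad": bool(int(np.argmax(q_values)))}

app = AdTimingPolicy.bind()
# serve.run(app)  # exposes the policy over HTTP for the game client
```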
To validate our Off-Policy Selection method described in section 2.2, we computed estimates with our method and compared them against real online experiments. We trained an RL agent on historical data, holding out the last week for evaluating the OPE methods. In table 1, we summarise the estimated lifts in the number of interstitial ads watched on the one-week test split.
Table 1. Estimated lift in the number of watched interstitial ads according to different OPE methods.
We hypothesize that such extreme policy value estimates from the above popular OPE methods are due to the nature of our use case. Game sessions for some users may last for an extended period, resulting in long sequences of actions. OPE methods correct the estimate using a product of inverse propensity scores over the trajectory, which leads to a compounding error.
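For reference, the standard trajectory-wise importance sampling estimator (general OPE background, not a formula taken from this paper) multiplies one propensity ratio per step, so its variance grows with trajectory length:

$$\hat{V}_{\mathrm{IS}}(\pi) \;=\; \frac{1}{M} \sum_{m=1}^{M} \Biggl( \prod_{t=0}^{T_m - 1} \frac{\pi(a^m_t \mid o^m_t)}{\beta(a^m_t \mid o^m_t)} \Biggr) R^m,$$

where $\beta$ is the behavior policy, $T_m$ the length of trajectory $m$, and $R^m$ its return. Even mild per-step mismatch between $\pi$ and $\beta$ makes the product explode or vanish for long sessions, which is consistent with the extreme estimates in table 1.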
We conducted A/B tests in which we compared our RL models against the rule-based system that was used at that time. We focused on two metrics during the online evaluation: the lift in the absolute number of watched interstitial ads and the retention rate. We ran the A/B test for a month and, similar to the offline setting, aggregated the results from the last week of the test. In the final results, we obtained a 49% lift in the absolute number of watched ads while maintaining a very similar retention rate, which slightly rose by 2%.
[1] Jason Gauci, Edoardo Conti, Yitao Liang, Kittipat Virochsiri, Zhengxing Chen, Yuchen He, Zachary Kaden, Vivek Narayanan, and Xiaohui Ye. 2018. Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform. arXiv preprint arXiv:1811.00260 (2018).
[2] Josiah P. Hanna, Scott Niekum, and Peter Stone. 2019. Importance Sampling Policy Evaluation with an Estimated Behavior Policy. arXiv:1806.01347 [cs.LG]
[3] Nan Jiang and Lihong Li. 2015. Doubly Robust Off-policy Evaluation for Reinforcement Learning. CoRR abs/1511.03722 (2015). arXiv:1511.03722 http://arxiv.org/abs/1511.03722
[4] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602 [cs.LG]
[5] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: a distributed framework for emerging AI applications. In Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation. 561–577.
[6] Ikechukwu Uchendu, Ted Xiao, Yao Lu, Banghua Zhu, Mengyuan Yan, Joséphine Simon, Matthew Bennice, Chuyuan Fu, Cong Ma, Jiantao Jiao, et al. 2023. Jump-start reinforcement learning. In International Conference on Machine Learning. PMLR, 34556–34583.
[7] Qing Wang, Jiechao Xiong, Lei Han, Peng Sun, Han Liu, and Tong Zhang. 2018. Exponentially Weighted Imitation Learning for Batched Historical Data. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2018/file/4aec1b3435c52abbdf8334ea0e7141e0-Paper.pdf
[8] Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Dong Yan, and Jun Zhu. 2023. On the Reuse Bias in Off-Policy Reinforcement Learning. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23, Edith Elkind (Ed.). International Joint Conferences on Artificial Intelligence Organization, 4513–4521. https://doi.org/10.24963/ijcai.2023/502 Main Track.