1.4.2 Challenges
Off-Line Learning
Training is often not possible directly online, so learning happens offline, using logs from a previous version of the control system. We would generally like the new version of the system to perform better than the old one, which means we need off-policy evaluation (estimating a policy’s performance without running it on the real system). There are several approaches to this, including importance sampling. The deployment of the first RL version (the initial policy) is a special case: there is often a minimum performance threshold to be met before deployment is allowed. Warm-start performance is therefore another important quality to be able to evaluate.
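Off-policy evaluation from logged data is commonly done with importance sampling. The following is a minimal sketch of the ordinary (unweighted) importance sampling estimator, assuming each logged step stores the probability that the logging policy assigned to the action it took. The function name, the tuple layout, and the `target_prob` callable are hypothetical and chosen only for illustration.

```python
def importance_sampling_estimate(trajectories, target_prob, gamma=1.0):
    """Ordinary importance sampling estimate of the new policy's return.

    trajectories: list of episodes; each episode is a list of
        (state, action, reward, behavior_prob) tuples, where behavior_prob
        is the probability the logging policy gave to the logged action.
    target_prob: hypothetical callable (state, action) -> probability of
        that action under the new policy being evaluated.
    gamma: discount factor.
    """
    if not trajectories:
        raise ValueError("at least one logged episode is required")
    total = 0.0
    for episode in trajectories:
        weight = 1.0    # cumulative likelihood ratio for this episode
        ret = 0.0       # discounted return observed in the log
        discount = 1.0
        for state, action, reward, behavior_prob in episode:
            weight *= target_prob(state, action) / behavior_prob
            ret += discount * reward
            discount *= gamma
        total += weight * ret
    return total / len(trajectories)
```

The estimate is unbiased only if the logging policy gave non-zero probability to every action the new policy might take, and its variance grows quickly with episode length, which is one reason weighted or per-decision variants are often preferred in practice.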
Learning From Limited Samples
Many real systems do not have separate training and evaluation environments. All training data comes from the real system, and the agent cannot run a separate exploration policy during training because its exploratory actions do not come for free. Given this higher cost of exploration, and the fact that the logs available for learning are likely to cover very little of the state space, policy learning needs to be data-efficient. Control frequencies (opportunities to take an action) may be as low as one step per hour or even one step every several months, with even longer reward horizons. One simple way to measure a model’s data efficiency is to look at the amount of data needed to reach a given performance threshold.
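The data-efficiency measure described above can be read directly off a learning curve. The sketch below assumes the curve is available as a list of (number of samples, performance) pairs sorted by sample count; the function name and data layout are illustrative assumptions.

```python
def samples_to_threshold(learning_curve, threshold):
    """Return the sample count at which performance first reaches threshold.

    learning_curve: list of (num_samples, performance) pairs, sorted by
        num_samples in increasing order.
    Returns None if the threshold is never reached.
    """
    for num_samples, performance in learning_curve:
        if performance >= threshold:
            return num_samples
    return None
```

A smaller returned value means the learner is more data-efficient for that threshold; comparing two algorithms requires using the same threshold and the same evaluation protocol.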
High-Dimensional State and Action Spaces
Many practical real-world problems have large and continuous state and action spaces, which can pose serious problems for traditional RL algorithms. One technique is to generate a candidate action vector and then run a nearest-neighbor search to find the closest available real action.
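The nearest-neighbor step can be sketched as follows, assuming actions are represented as equal-length numeric vectors and that Euclidean distance is a sensible similarity measure (both are assumptions for illustration). A brute-force search is shown for clarity; a large action set would normally use an approximate nearest-neighbor index instead.

```python
import math

def nearest_valid_action(candidate, valid_actions):
    """Return the valid action closest (in Euclidean distance) to candidate.

    candidate: the continuous action vector proposed by the policy.
    valid_actions: non-empty list of action vectors the system can execute.
    """
    return min(valid_actions, key=lambda a: math.dist(candidate, a))
```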
Safety Constraints
Many control systems must operate under safety constraints, even during exploratory learning phases. Constrained MDPs (Markov Decision Processes) make it possible to specify constraints on states and actions. Budgeted MDPs let constraint levels be learned rather than hard-wired, so that the trade-off between constraint level and performance can be explored. Another solution is to add a safety layer to the network that prevents any safety violations.
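As a deliberately simplified illustration of a layer that corrects unsafe actions, the sketch below assumes the safe set is a fixed box of per-dimension bounds and projects the proposed action onto it by clipping. This is an assumption made for brevity: safety layers in practice usually depend on the current state and on learned or modelled constraint functions.

```python
def safety_layer(action, lower, upper):
    """Project a proposed action onto a box of safe values.

    action: list of action components proposed by the policy.
    lower, upper: per-component safe bounds (assumed fixed and known).
    """
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, lower, upper)]
```

Because the correction is applied after the policy output, it holds during exploration as well as after training, which is the property the text asks for.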
Partial Observability
Almost all real systems in which we would like to deploy reinforcement learning are partially observable. For example, the efficiency of mechanical parts may degrade over time, ‘identical’ widgets may show performance variations given the same control inputs, or the state of some parts of the system may simply be unknown (e.g. the mental state of the users of a recommender system).
Two common strategies for dealing with partial observability are including history in the input and modelling history with recurrent networks in the model. In addition, Robust MDP formalisms provide explicit mechanisms to ensure that agents are robust to sensor and action noise and delays. If a given deployment environment may have initially unknown but learnable noise sources, then system identification techniques can be used to train a policy that learns which environment it is operating in.
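The first strategy, feeding recent history to the policy, can be sketched with a fixed-length buffer. The class name, the zero padding at the start of an episode, and the flat list output are illustrative assumptions rather than a prescribed design.

```python
from collections import deque

class HistoryBuffer:
    """Keeps the last k observations and exposes them as one flat input."""

    def __init__(self, k, obs_dim):
        self.k = k
        self.obs_dim = obs_dim
        self.frames = deque(maxlen=k)
        self.reset()

    def reset(self):
        # Start each episode with k zero observations as padding.
        self.frames.clear()
        for _ in range(self.k):
            self.frames.append([0.0] * self.obs_dim)

    def push(self, obs):
        # Add the newest observation; the oldest one is dropped automatically.
        self.frames.append(list(obs))
        # Concatenate oldest-first so the newest observation comes last.
        return [x for frame in self.frames for x in frame]
```

The policy then receives the output of `push` instead of the raw observation. The window length `k` trades input size against how far back the agent can see; a recurrent model removes the need to fix it in advance.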
Reward Functions
In many cases, system or product owners do not have a clear picture of what they want to optimize. The reward function is often multidimensional and involves several sub-goals that must be balanced. Another important insight here (one that echoes discussions of system latency) is that ‘average performance’ (i.e. expectation) is often an insufficient metric: the system needs to perform well for all task instances. A common approach to evaluating the full distribution of rewards across groups is to use a Conditional Value at Risk (CVaR) objective, which looks at a given percentile of the reward distribution rather than the expected reward.
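An empirical CVaR can be computed from a sample of episode rewards as the mean of the worst fraction of outcomes. The sketch below uses the lower tail, since low rewards are the bad outcomes; the function name and the default `alpha` are assumptions for illustration.

```python
import math

def cvar(rewards, alpha=0.1):
    """Empirical CVaR: mean of the worst alpha-fraction of rewards.

    rewards: non-empty list of observed episode rewards.
    alpha: tail fraction in (0, 1]; alpha=1 gives the plain mean.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    ordered = sorted(rewards)                        # worst outcomes first
    k = max(1, math.ceil(alpha * len(ordered)))      # size of the tail
    tail = ordered[:k]
    return sum(tail) / len(tail)
```

Two policies with the same mean reward can have very different CVaR values, which is exactly the distinction that an expectation-only metric hides.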
Explainability/Interpretability
Real systems are owned and operated by humans, who need to be informed about the actions of the controller and need insight into failure cases. For this reason, policy explainability is critical for real-world policies. To obtain stakeholder buy-in, it is essential to understand the longer-term intent of the policy, particularly in cases where the policy finds an alternative and unexpected approach to controlling a system.
Real-Time Inference
Policy inference has to take place within the system’s control interval. This may be on the order of milliseconds or shorter, which rules out computationally expensive methods that cannot meet the constraint (for example, certain types of model-based planning). Systems with longer control intervals pose the opposite problem: we cannot run the task faster than real time in order to speed up data generation.
Delayed Rewards
Most real systems have delays in sensing the state, in the actuators, or in the reward feedback. Examples include delays in the effect of a braking system, or delays between the choices presented by a recommendation system and the resulting user behavior. There are a number of possible ways to deal with this, including memory-based agents that use a memory retrieval system to assign credit to distant past events that are useful for prediction [1, 15].
1.5 Conclusion
Deep reinforcement learning is the combination of reinforcement learning (RL) and deep learning. This field of research has been able to solve a wide range of complex decision-making tasks that were previously out of reach for a machine. Deep RL thus opens up many new applications in areas such as medicine, automation, smart grids, banking, and many more. In this chapter, we gave an overview of the deep RL paradigm and the choice of learning algorithms. We began with the background of deep learning and reinforcement learning and the Markov decision process formulation. We then summarized some popular applications in various fields and, finally, discussed some of the challenges facing the future development of DRL.
References
1. Arulkumaran, K., Deisenroth, M., Brundage, M., Bharath, A., A Brief Survey of Deep Reinforcement Learning. IEEE Signal Process. Mag., 34, 1–16, 2017, 10.1109/MSP.2017.2743240.
2. Botvinick, M., Wang, J., Dabney, W., Miller, K., Kurth-Nelson, Z., Deep Reinforcement Learning and its Neuroscientific Implications. Neuron, 107, 603–616, 2020.
3. Duryea, E., Ganger, M., Hu, W., Exploring Deep Reinforcement Learning with Multi Q-Learning. Intell. Control Autom., 07, 129–144, 2016, 10.4236/ica.2016.74012.
4. Fenjiro, Y. and Benbrahim, H., Deep Reinforcement Learning Overview of the State of the Art. J. Autom. Mob. Robot. Intell. Syst., 12, 20–39, 2018, 10.14313/JAMRIS_3-2018/15.
5. Francois, V., Henderson, P., Islam, R., Bellemare, M., Pineau, J., An Introduction to Deep Reinforcement Learning, Foundations and Trends in Machine Learning, Boston—Delft, 2018, 10.1561/2200000071.
6. Haj Ali, A., Ahmed, N., Willke, T., Gonzalez, J., Asanovic, K., Stoica, I., A View on Deep Reinforcement Learning in System Optimization, arXiv:1908.01275v3 Intel Labs, University of California, Berkeley, 2019.
7.