The divergence of Q-value estimation has been a prominent issue in offline reinforcement learning (offline RL), where the agent has no access to real dynamics. Traditional beliefs attribute this instability to querying out-of-distribution actions when bootstrapping value targets. Although this issue can be alleviated with policy constraints or conservative Q estimation, a theoretical understanding of the underlying mechanism causing the divergence has been absent. In this work, we aim to thoroughly understand this mechanism and obtain an improved solution. We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. Then, we propose a novel Self-Excite Eigenvalue Measure (SEEM) metric based on the Neural Tangent Kernel (NTK) to measure the evolving property of the Q-network during training, which provides an intriguing explanation for the emergence of divergence. For the first time, our theory can reliably decide whether training will diverge at an early stage, and even predict the growth order of the estimated Q-value and the model's norm, as well as the crashing step when an SGD optimizer is used. Our experiments demonstrate perfect alignment with this theoretical analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely improving the model's architecture for better extrapolation behavior. Through extensive empirical studies, we identify LayerNorm as a good solution that effectively avoids divergence without introducing detrimental bias, leading to superior performance. Experimental results show that it still works in the most challenging settings, e.g., using only 1% of the transitions in the dataset, where all previous methods fail. Moreover, it can be easily plugged into modern offline RL methods and achieves SOTA results on many challenging tasks. We also provide unique insights into its effectiveness.
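As a concrete illustration of this architectural fix, below is a minimal PyTorch sketch of a Q-network with LayerNorm inserted after each hidden layer. The class name, layer widths, and activation choice are illustrative assumptions, not necessarily the exact architecture used in the paper.

```python
# Minimal sketch: a Q-network with LayerNorm after each hidden linear layer.
# Layer widths, activation, and the class name are illustrative assumptions.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),   # normalizes hidden features, taming extrapolation
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),   # scalar Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```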
In our paper, we theoretically prove that Q-value divergence happens when the maximal eigenvalue of the following matrix \(A_t\) (namely SEEM) is greater than 0: \(A_t\) is determined by the discount factor \(\gamma\) and the NTK between \(X\) and \(X^*_t\), where \(X\) is the vector composed of the state-action pairs \((s, a)\) in the offline dataset, and \(X^*_t\) is the vector of the state-action pairs \((s^\prime, \pi_{\theta_t}(s^\prime))\) induced by the current policy \(\pi_{\theta_t}\). The NTK matrix, which measures the strength of the generalization bond between \(X\) and \(X^\prime\), is defined as \(G_{\theta}(X, X^\prime) = \phi_\theta(X)^\top \phi_\theta(X^\prime)\), where \(\phi_\theta(X) := \nabla_\theta Q_\theta(X)\).
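For concreteness, here is a minimal PyTorch sketch of how these quantities could be estimated numerically on a small batch. The helper names (`flat_grad`, `ntk_matrix`, `seem`) are ours, and the composition \(A_t = \gamma\, G_{\theta_t}(X^*_t, X) - G_{\theta_t}(X, X)\) is the instantiation consistent with the semi-gradient TD update implied by the definitions above; see the paper for the exact statement.

```python
# Sketch of estimating the NTK blocks and the SEEM value on a small batch.
# `q_net` is any Q-network mapping (state, action) -> scalar Q. The helper
# names are ours; A_t = gamma * G(X*, X) - G(X, X) is the instantiation
# consistent with the semi-gradient TD update (see the paper for the exact form).
import torch


def flat_grad(q_net, state, action):
    """phi_theta(x) = grad_theta Q_theta(x), flattened into one vector."""
    q = q_net(state.unsqueeze(0), action.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(q, list(q_net.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])


def ntk_matrix(q_net, X, X_prime):
    """G_theta(X, X')[i, j] = phi_theta(x_i)^T phi_theta(x'_j)."""
    Phi = torch.stack([flat_grad(q_net, s, a) for s, a in X])              # (N, P)
    Phi_prime = torch.stack([flat_grad(q_net, s, a) for s, a in X_prime])  # (N, P)
    return Phi @ Phi_prime.T


def seem(q_net, X, X_star, gamma: float = 0.99) -> float:
    """Largest real part among the eigenvalues of A_t = gamma*G(X*, X) - G(X, X)."""
    A = gamma * ntk_matrix(q_net, X_star, X) - ntk_matrix(q_net, X, X)
    return torch.linalg.eigvals(A).real.max().item()
```

In practice, for a large dataset one would evaluate the kernel on a sampled subset of \((s, a)\) pairs rather than the full replay buffer.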
The intuitive interpretation of SEEM is as follows. If SEEM is positive, the generalization bond between \(Q_\theta(X)\) and \(Q_\theta(X^*_t)\) is strong. When the value of \(Q_\theta(X)\) is updated towards \(r + \gamma Q_\theta(X^*_t)\), the strong generalization of the neural network causes the Q-value iteration to inadvertently increase \(Q_\theta(X^*_t)\) even more than \(Q_\theta(X)\) itself. Consequently, the TD error \(r + \gamma Q_\theta(X^*_t) - Q_\theta(X)\) expands rather than shrinks, because the target value moves away faster than the predicted value, which encourages the above procedure to repeat. This forms a positive feedback loop and causes self-excitation. Such a mirage-like property causes the model's parameters and its predicted values to diverge.
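To make this feedback loop concrete, here is a toy numerical sketch of our own (not from the paper): a linear Q-function with two hand-picked feature vectors whose inner product is large enough that \(\gamma\, k(x^*, x) > k(x, x)\), so the scalar SEEM is positive and repeated semi-gradient TD updates on a single dataset point make the TD error and the Q-value grow without bound.

```python
# Toy illustration of self-excitation (ours, not from the paper): a linear
# Q-function Q(x) = theta^T phi(x) with fixed features. The dataset point x
# and the bootstrapped point x* are strongly correlated, so the scalar SEEM
# gamma * k(x*, x) - k(x, x) is positive and the TD error keeps growing.
import numpy as np

gamma, lr, r = 0.99, 0.1, 1.0
phi_x = np.array([1.0, 0.5])     # features of the in-dataset pair (s, a)
phi_xs = np.array([1.2, 0.9])    # features of (s', pi(s')), strongly correlated
theta = np.zeros(2)

seem_value = gamma * phi_xs @ phi_x - phi_x @ phi_x   # scalar A_t in this one-point case
print(f"SEEM = {seem_value:.3f}  (positive, so the TD error is expected to grow)")

for step in range(1, 201):
    td_error = r + gamma * (theta @ phi_xs) - theta @ phi_x
    theta += lr * td_error * phi_x                    # semi-gradient update on the dataset point
    if step % 50 == 0:
        print(f"step {step:3d}: TD error = {td_error:.3e}, Q(x) = {theta @ phi_x:.3e}")
```

Each update multiplies the TD error by \(1 + \eta\,(\gamma k(x^*, x) - k(x, x))\), so a positive scalar SEEM directly yields exponential growth of both the TD error and the predicted Q-value.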
Therefore, we can monitor the SEEM value to know whether the training will diverge.
Beyond predicting whether the training will diverge, SEEM is also able to predict the growth order of the estimated Q-value and the model's norm: