🎯 REINFORCE Policy Gradient Algorithm Derivation (Complete)
1. Objective Function Definition

We want to maximize the expected return of the policy:

$$
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
$$
where:

- $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$ is a trajectory;
- $R(\tau) = \sum_{t=0}^{T} r_t$ is the total return of the trajectory;
- $\pi_\theta(a_t \mid s_t)$ is the policy: for a continuous action space it is a probability density value, for a discrete action space it is a probability (e.g. a softmax output).
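To make the notation concrete, here is a minimal Python sketch of collecting a single trajectory and its total return; the `env` object with a simplified `reset()`/`step()` interface and the `policy` callable are illustrative placeholders, not part of the derivation.

```python
# Minimal sketch of collecting one trajectory tau = (s_0, a_0, ..., s_T, a_T)
# and its total return R(tau) = sum_t r_t. The `env` object (simplified
# reset()/step() interface) and the `policy` callable are placeholders.

def collect_trajectory(env, policy):
    states, actions, rewards = [], [], []
    s = env.reset()
    done = False
    while not done:
        a = policy(s)                  # sample a_t ~ pi_theta(. | s_t)
        s_next, r, done = env.step(a)  # environment transition and reward r_t
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards, sum(rewards)   # ..., R(tau)
```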
2. Trajectory Probability

The probability of a trajectory is:

$$
P(\tau) = \rho(s_0) \cdot \prod_{t=0}^{T} \pi_\theta(a_t \mid s_t) \cdot P(s_{t+1} \mid s_t, a_t)
$$
where:

- $\rho(s_0)$ is the initial state distribution;
- $P(s_{t+1} \mid s_t, a_t)$ is the state transition probability, which does not depend on $\theta$.

Intuitively, which action gets chosen is described by a probability (the policy), and which state the environment jumps to after that action is also uncertain, so it too is described by a probability (the transition model).
3. Differentiating the Objective Function

We want to update the policy parameters $\theta$ by gradient ascent:

$$
\nabla_\theta J(\theta) = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
$$
Problem: how do we compute this gradient? Since the sampling distribution $\pi_\theta$ itself depends on $\theta$, we cannot push the gradient inside the expectation directly.
The likelihood ratio trick is derived as follows:

$$
\nabla_\theta \, \mathbb{E}_{x \sim p_\theta(x)}[f(x)] = \nabla_\theta \int f(x)\, p_\theta(x)\, dx = \int f(x)\, \nabla_\theta p_\theta(x)\, dx
$$

We do not differentiate $f(x)$ here because in reinforcement learning $f(x)$ is the reward: a scalar obtained by interacting with the environment, which does not depend on $\theta$ directly.
Using the chain rule (the log-derivative identity):

$$
\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)
$$
Substituting back gives:

$$
\int f(x)\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \mathbb{E}_{x \sim p_\theta(x)}\left[ f(x)\, \nabla_\theta \log p_\theta(x) \right]
$$
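As a quick numerical sanity check of this identity, take the illustrative choice $p_\theta = \mathcal{N}(\theta, 1)$ and $f(x) = x^2$, for which $\mathbb{E}[f(x)] = \theta^2 + 1$ and the exact gradient is $2\theta$; the Monte Carlo score-function estimate should approximate it:

```python
import numpy as np

# Sanity check of the likelihood-ratio (score function) identity with
# p_theta = N(theta, 1) and f(x) = x^2 (illustrative choices):
# E[f(x)] = theta^2 + 1, so the exact gradient is 2 * theta.
rng = np.random.default_rng(0)
theta = 1.5
x = rng.normal(loc=theta, scale=1.0, size=1_000_000)

score = x - theta                       # grad_theta log N(x | theta, 1)
grad_estimate = np.mean(x**2 * score)   # E[ f(x) * grad_theta log p_theta(x) ]

print(grad_estimate)   # approximately 3.0
print(2 * theta)       # exact gradient: 3.0
```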
4. Deriving the Log-Probability Term

Note that:

$$
\log P(\tau) = \log \rho(s_0) + \sum_{t=0}^{T} \left[ \log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t) \right]
$$
Since $\rho(s_0)$ and $P(s_{t+1} \mid s_t, a_t)$ do not depend on $\theta$:

$$
\nabla_\theta \log P(\tau) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$
5. The Policy Gradient Expression

Substituting back, we obtain the final gradient expression:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau) \right]
$$
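In practice the expectation is replaced by a Monte Carlo average over sampled trajectories, and the gradient is usually obtained by minimizing a surrogate loss with automatic differentiation. A minimal PyTorch sketch for a discrete-action policy, with `policy_net`, `states`, `actions`, and `rewards` as illustrative placeholders:

```python
import torch

# Monte Carlo surrogate loss for the basic REINFORCE gradient
#   grad_theta J = E[ sum_t grad_theta log pi(a_t|s_t) * R(tau) ].
# Minimizing the negated surrogate with autograd reproduces this gradient
# for a single sampled trajectory. `states` is a list of state tensors,
# `actions` a list of integer action tensors (illustrative placeholders).

def reinforce_loss(policy_net, states, actions, rewards):
    R_tau = sum(rewards)                              # R(tau), a plain scalar
    logits = policy_net(torch.stack(states))          # (T, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(torch.stack(actions))   # (T,) log pi(a_t|s_t)
    return -(log_probs.sum() * R_tau)                 # negate for gradient ascent
```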
6. Replacing the Return with the Per-Step Discounted Return $G_t$

To attribute the influence of each action more accurately, introduce the reward-to-go:

$$
G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
$$
The gradient is then rewritten as:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]
$$
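A sketch of the same surrogate with per-step reward-to-go weights, under the same illustrative assumptions as above; `discounted_returns` computes $G_t$ by the backward recursion $G_t = r_t + \gamma G_{t+1}$:

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    G, returns = 0.0, []
    for r in reversed(rewards):       # iterate t = T, T-1, ..., 0
        G = r + gamma * G             # G_t = r_t + gamma * G_{t+1}
        returns.append(G)
    returns.reverse()
    return torch.tensor(returns)      # shape (T,)

def reinforce_loss_per_step(policy_net, states, actions, rewards, gamma=0.99):
    logits = policy_net(torch.stack(states))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(
        torch.stack(actions))
    G = discounted_returns(rewards, gamma)
    return -(log_probs * G).sum()     # sum_t log pi(a_t|s_t) * G_t, negated
```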
7. Introducing a Baseline to Reduce Variance

Subtracting a baseline $b(s_t)$ that does not depend on the action:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot \left( G_t - b(s_t) \right) \right]
$$
A common baseline is the state value function:

$$
b(s_t) = V^\pi(s_t) \quad \Rightarrow \quad A_t = G_t - V(s_t)
$$
which gives the final advantage form:

$$
\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]
$$
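A sketch of the advantage form with a learned value baseline, reusing the illustrative `discounted_returns` helper above and assuming a hypothetical `value_net` that maps states to scalar values; the advantage is detached so the policy term does not backpropagate into the baseline:

```python
import torch
import torch.nn.functional as F

# Advantage form with a learned state-value baseline V(s_t).
# A_t = G_t - V(s_t) is detached so the baseline only receives gradients
# through its own regression loss. `value_net` is an assumed network.

def reinforce_with_baseline_loss(policy_net, value_net, states, actions,
                                 rewards, gamma=0.99):
    s = torch.stack(states)
    log_probs = torch.distributions.Categorical(
        logits=policy_net(s)).log_prob(torch.stack(actions))
    G = discounted_returns(rewards, gamma)       # reward-to-go, shape (T,)
    V = value_net(s).squeeze(-1)                 # baseline V(s_t), shape (T,)
    A = (G - V).detach()                         # advantage A_t = G_t - V(s_t)
    policy_loss = -(log_probs * A).sum()         # policy gradient term
    value_loss = F.mse_loss(V, G)                # fit the baseline to G_t
    return policy_loss + value_loss
```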
✅ Summary of Common Policy Gradient Forms

| Name | Expression |
| --- | --- |
| REINFORCE | $\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t \right]$ |
| Baseline form | $\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (G_t - b(s_t)) \right]$ |
| Advantage form | $\nabla_\theta J(\theta) = \mathbb{E}\left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot A_t \right]$ |
Appendix: Gradient of a Continuous-Action Gaussian Policy

Suppose the policy is Gaussian:

$$
\pi_\theta(a \mid s) = \mathcal{N}\left( \mu_\theta(s), \sigma^2 \right)
$$

Then:

$$
\log \pi_\theta(a \mid s) = -\frac{(a - \mu_\theta(s))^2}{2\sigma^2} + \text{const}
$$

and the gradient with respect to the policy parameters is:

$$
\nabla_\theta \log \pi_\theta(a \mid s) = \frac{a - \mu_\theta(s)}{\sigma^2} \cdot \nabla_\theta \mu_\theta(s)
$$
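As an illustration, this formula can be checked against autograd: with a tiny linear $\mu_\theta(s)$ and a fixed $\sigma$ (both illustrative), the analytic gradient $\frac{a - \mu_\theta(s)}{\sigma^2}\,\nabla_\theta \mu_\theta(s)$ should coincide with what PyTorch computes from $\log \mathcal{N}(a \mid \mu_\theta(s), \sigma^2)$:

```python
import torch

# Autograd check of the Gaussian policy gradient formula.
# mu_theta(s) is a tiny linear model (illustrative); sigma is a fixed constant.
torch.manual_seed(0)
mu_net = torch.nn.Linear(3, 1)
s = torch.randn(3)
a = torch.tensor([0.7])
sigma = 0.5

mu = mu_net(s)
log_prob = torch.distributions.Normal(mu, sigma).log_prob(a)
log_prob.sum().backward()

coeff = (a - mu.detach()) / sigma**2                   # (a - mu) / sigma^2
print(torch.allclose(mu_net.weight.grad, coeff * s))   # d mu / d W = s
print(torch.allclose(mu_net.bias.grad, coeff))         # d mu / d b = 1
```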