Q-Learning - RL Learning Platform

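What Is a Q-Table?

The Essence of a Q-Table

A Q-Table is like a cheat sheet. For every state, it records the expected cumulative reward (the Q-value) of taking each available action in that state.

Rows: all possible states
Columns: all possible actions
Values: the Q-values; the higher, the better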

Example row for state S, one Q-value per action: 0.2, 0.8, 0.1, 0.5
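As a minimal sketch of how such a row is read, the snippet below stores the example values in a plain Python list and picks the action with the largest Q-value. The action labels are hypothetical and only serve the illustration; the original gives just the four numbers.

```python
# Q-values of one state, one entry per action (values from the example row).
q_row = [0.2, 0.8, 0.1, 0.5]

# Hypothetical action labels, for illustration only.
actions = ["left", "down", "right", "up"]

# The greedy choice is the column holding the largest Q-value.
best_index = max(range(len(q_row)), key=lambda a: q_row[a])
best_action = actions[best_index]

print(best_index, best_action, q_row[best_index])
```

Here the second column wins, so a greedy agent in state S would take the action at index 1.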

Core Formula: The Bellman Equation

$$Q(s,a) \leftarrow (1-\alpha) \underbrace{Q(s,a)}_{\text{old value}} + \alpha [\underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{target}}]$$

Target

$r + \gamma \max_{a'} Q(s', a')$: the reward $r$ just received, plus the discounted estimate of the best value attainable from the next state $s'$.
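A worked example may make the update concrete. The numbers below are arbitrary assumptions chosen for easy arithmetic, not values from the original. The sketch also checks that the blended form of the update and the incremental form `Q += alpha * (target - Q)` give the same result.

```python
import math

alpha = 0.1       # learning rate (assumed value)
gamma = 0.9       # discount factor (assumed value)
q_old = 0.5       # current estimate Q(s, a)
reward = 1.0      # reward r just received
max_q_next = 2.0  # max over a' of Q(s', a')

# Target: immediate reward plus the discounted best estimate for the next state.
target = reward + gamma * max_q_next

# Blended form: weighted mix of the old value and the target.
q_blend = (1 - alpha) * q_old + alpha * target

# Incremental form: move the old value a fraction alpha toward the target.
q_step = q_old + alpha * (target - q_old)

assert math.isclose(q_blend, q_step)
print(round(target, 4), round(q_blend, 4))
```

With these inputs the target is 2.8 and the estimate moves from 0.5 to 0.73, one tenth of the way toward the target.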

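Learning Rate (Alpha, $\alpha$)

How much should we trust new experience? The larger $\alpha$ is, the faster the agent learns, but the less stable its estimates become.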
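The speed-versus-stability trade-off can be sketched with a simplified update that drops the future term and just tracks a stream of rewards. The alternating reward sequence and the helper `track` are hypothetical, made up for this illustration.

```python
def track(rewards, alpha, q=0.0):
    """Apply q <- q + alpha * (r - q) for each reward; return every estimate."""
    history = []
    for r in rewards:
        q += alpha * (r - q)
        history.append(q)
    return history

# Hypothetical noisy signal: rewards alternate between 0 and 2 (mean 1).
noisy = [0.0, 2.0] * 100
# Hypothetical clean signal: the reward is always 1.
clean = [1.0] * 10

for alpha in (0.1, 0.9):
    after_ten = track(clean, alpha)[-1]   # progress toward 1 after 10 updates
    tail = track(noisy, alpha)[-20:]      # estimates once the run has settled
    swing = max(tail) - min(tail)         # how much the estimate still moves
    print(f"alpha={alpha}: after 10 clean updates={after_ten:.3f}, swing={swing:.3f}")
```

A large $\alpha$ gets close to the clean value within a few updates, yet on the noisy signal it keeps jumping with every sample, while a small $\alpha$ is slower but stays near the mean.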

Discount Factor (Gamma, $\gamma$)

How much is a future reward worth today? The closer $\gamma$ is to 1, the more far-sighted the agent.
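To put numbers on "far-sighted": a reward that arrives $k$ steps in the future is weighted by $\gamma^k$. The sketch below compares a few values of $\gamma$ for a reward of 1 that is 10 steps away; the helper name `present_value` and the chosen values are assumptions for illustration.

```python
def present_value(reward, gamma, steps):
    """Value today of a reward received `steps` time steps in the future."""
    return reward * gamma ** steps

for gamma in (0.5, 0.9, 0.99):
    print(gamma, round(present_value(1.0, gamma, 10), 4))
```

With $\gamma = 0.5$ the distant reward is worth about a thousandth of its face value, while with $\gamma = 0.99$ it keeps roughly 90% of it.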

Algorithm Steps

1. Initialization

Create a Q-table filled with zeros.

Q = np.zeros((state_size, action_size))

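2. Action Selection (Epsilon-Greedy)

To explore the environment, the agent sometimes takes a random action (exploration) and otherwise takes the action it currently believes is best (exploitation).

if random.random() < epsilon:
    action = env.action_space.sample()  # random (explore)
else:
    action = np.argmax(Q[state])  # greedy (exploit)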
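One detail worth noting: `np.argmax` returns the first maximal index, so on a freshly zeroed row the greedy branch always picks action 0. The sketch below is one possible variant, not the original's method, that breaks ties at random. It uses plain Python lists in place of a NumPy array, and the function name `epsilon_greedy` is an assumption.

```python
import random

def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one Q-table row.

    With probability epsilon, explore uniformly at random; otherwise exploit,
    breaking ties between equally good actions at random.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    candidates = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(candidates)

rng = random.Random(0)  # seeded so repeated runs behave the same
fresh_row = [0.0, 0.0, 0.0, 0.0]

# Even with epsilon = 0, an all-zero row no longer collapses onto action 0.
picks = {epsilon_greedy(fresh_row, epsilon=0.0, rng=rng) for _ in range(200)}
print(sorted(picks))
```

Once the row contains a unique maximum, the greedy branch behaves exactly like `argmax`.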

3. Updating the Q-Value

Update the entry $Q(s, a)$ that was just visited, using the Bellman equation.

target = reward + gamma * np.max(Q[next_state])
Q[state, action] += alpha * (target - Q[state, action])
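The update can also be wrapped in a small function that drops the future term when the episode has terminated. This is a sketch under assumptions: the name `q_update` and its `terminated` flag are not from the original, and the table is a list of lists rather than a NumPy array. With an all-zero table whose terminal rows are never updated the check makes no difference, but it matters as soon as the table starts from non-zero values, as in the example.

```python
def q_update(Q, state, action, reward, next_state, terminated, alpha, gamma):
    """Apply one Q-learning update in place and return the new Q[state][action].

    Q is a list of rows (one per state), each a list of per-action values.
    """
    # No future value is added once the episode has terminated.
    future = 0.0 if terminated else max(Q[next_state])
    target = reward + gamma * future
    Q[state][action] += alpha * (target - Q[state][action])
    return Q[state][action]

# Two states, two actions, non-zero initial values (assumed for illustration).
Q_end = [[1.0, 1.0], [1.0, 1.0]]
Q_mid = [[1.0, 1.0], [1.0, 1.0]]

# Same transition, once treated as terminal and once as non-terminal.
ended = q_update(Q_end, 0, 1, reward=0.0, next_state=1,
                 terminated=True, alpha=0.5, gamma=0.9)
ongoing = q_update(Q_mid, 0, 1, reward=0.0, next_state=1,
                   terminated=False, alpha=0.5, gamma=0.9)
print(ended, ongoing)
```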

Full Code: FrozenLake

q_learning.py

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)
Q = np.zeros((env.observation_space.n, env.action_space.n))

alpha = 0.1   # learning rate
gamma = 0.99  # discount factor
epsilon = 0.1 # exploration rate

for episode in range(1000):
    state, _ = env.reset()
    done = False
    
    while not done:
        # Epsilon-greedy policy
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
            
        next_state, reward, term, trunc, _ = env.step(action)
        done = term or trunc
        
        # Update the Q-table
        best_next_action = np.argmax(Q[next_state])
        target = reward + gamma * Q[next_state, best_next_action]
        Q[state, action] += alpha * (target - Q[state, action])
        
        state = next_state
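
For experimenting without installing gymnasium, the same loop can be run on a toy problem. The sketch below is not FrozenLake: it assumes a hypothetical five-cell corridor where the agent starts in cell 0 and receives a reward of 1 on reaching cell 4. It uses only the standard library, breaks greedy ties at random, and uses a larger learning rate than the script above because the toy dynamics are deterministic.

```python
import random

N_STATES = 5          # corridor cells 0..4; the agent starts in cell 0
GOAL = N_STATES - 1   # reaching the last cell ends the episode with reward 1
LEFT, RIGHT = 0, 1

def step(state, action):
    """Deterministic toy dynamics: move one cell, staying inside the corridor."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    terminated = next_state == GOAL
    return next_state, (1.0 if terminated else 0.0), terminated

def choose(q_row, epsilon, rng):
    """Epsilon-greedy selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def train(episodes=500, alpha=0.5, gamma=0.99, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap, playing the role of truncation
            action = choose(Q[state], epsilon, rng)
            next_state, reward, terminated = step(state, action)
            future = 0.0 if terminated else max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * future - Q[state][action])
            state = next_state
            if terminated:
                break
    return Q

Q = train()
# Greedy policy for the non-terminal cells.
policy = [row.index(max(row)) for row in Q[:GOAL]]
print(policy)
```

With these settings the greedy policy should come out as RIGHT (index 1) in every non-terminal cell, and the Q-values should shrink by roughly a factor of $\gamma$ per cell of distance from the goal.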