Reinforcement Learning: Look No Further

Still struggling to learn reinforcement learning? One site takes you from zero to one with RL!

Theory

Learn the core concepts and principles of reinforcement learning systematically

What is Reinforcement Learning?

Reinforcement Learning (RL) is a branch of machine learning in which an agent learns an optimal policy through interaction with an environment.

Core elements:

  • Agent: the decision-maker that takes actions
  • Environment: the external world the agent interacts with
  • State: the current situation of the environment
  • Action: an operation the agent can perform
  • Reward: the environment's feedback on the agent's action
  • Policy: a mapping from states to actions
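
The loop that ties these elements together can be sketched in a few lines of Python. Everything below (the CoinFlipEnv class and random_policy function) is a made-up toy for illustration, not part of any RL library:

```python
import random

class CoinFlipEnv:
    """Toy environment: guess a hidden coin flip; reward 1 for a correct guess."""
    def reset(self):
        self.coin = random.choice([0, 1])  # hidden state
        return 0  # the observation carries no information here

    def step(self, action):
        reward = 1.0 if action == self.coin else 0.0  # environment's feedback
        self.coin = random.choice([0, 1])
        return 0, reward, False  # next state, reward, done flag

def random_policy(state):
    """A policy is a mapping from state to action; here it is uniformly random."""
    return random.choice([0, 1])

env = CoinFlipEnv()
state = env.reset()
total = 0.0
for _ in range(100):
    action = random_policy(state)           # agent chooses an action
    state, reward, done = env.step(action)  # environment responds with state and reward
    total += reward
print(f"Total reward over 100 steps: {total}")
```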

Markov Decision Process (MDP)

The MDP is the mathematical framework underlying reinforcement learning; it describes decision-making problems in an uncertain environment.

Components of an MDP:

  • S: state space
  • A: action space
  • P: state transition probabilities
  • R: reward function
  • γ: discount factor (0 ≤ γ ≤ 1)

V(s) = E[ Σ_{t=0}^{∞} γ^t R_{t+1} | S_0 = s ]
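
For small MDPs this expectation can be computed exactly by iterating the Bellman backup V ← R + γPV. A minimal sketch on a hypothetical two-state MDP with a single action (all numbers made up for illustration):

```python
import numpy as np

# Toy MDP: state 0 always transitions to state 1 with reward 1;
# state 1 is absorbing with reward 0.
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])   # P[s, s']: transition probabilities under the one action
R = np.array([1.0, 0.0])     # R[s]: reward collected on leaving state s
gamma = 0.5                  # discount factor

# Iterate the Bellman backup until it converges to the fixed point
V = np.zeros(2)
for _ in range(100):
    V = R + gamma * P @ V

print(V)  # V(1) = 0, and V(0) = 1 + 0.5 * V(1) = 1
```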

Value Functions and Policies

A value function measures how good a state or an action is; a policy defines how the agent behaves.

Key concepts:

  • State value function V(s): the expected return when starting in state s and following policy π
  • Action value function Q(s,a): the expected return when taking action a in state s and then following policy π
  • Optimal policy π*: the policy that maximizes the value function

Q*(s,a) = max_π Q^π(s,a)
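
Given a Q-table, the greedy policy and the state values it implies fall out directly. A small sketch with made-up numbers:

```python
import numpy as np

# A hand-made Q-table for 3 states x 2 actions (values are illustrative)
Q = np.array([[0.2, 0.8],
              [0.5, 0.1],
              [0.0, 0.0]])

# The greedy policy picks, in each state, the action with the largest Q value
policy = np.argmax(Q, axis=1)
print(policy)  # -> [1 0 0]

# The state value implied by acting greedily: V(s) = max_a Q(s, a)
V = np.max(Q, axis=1)
print(V)  # -> [0.8 0.5 0. ]
```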

Main Algorithms

Reinforcement learning spans many algorithms, from simple Q-learning to complex deep reinforcement learning.

Classic algorithms:

  • Q-Learning: model-free temporal-difference learning
  • SARSA: on-policy temporal-difference learning
  • Policy Gradient: methods that optimize the policy directly
  • DQN: Deep Q-Network, combining deep learning with Q-learning
  • PPO: Proximal Policy Optimization, stable and efficient

Environment Setup

Set up your reinforcement learning development environment step by step

Step 1: Install Python

Python 3.8 or later is recommended

Check your Python version
python --version
# or
python3 --version

If Python is not installed, download it from python.org

Step 2: Create a Virtual Environment (Recommended)

Using a virtual environment avoids package conflicts

Create the virtual environment
# Windows
python -m venv rl_env
rl_env\Scripts\activate

# macOS/Linux
python3 -m venv rl_env
source rl_env/bin/activate

Step 3: Install the Core Libraries

Install the main libraries needed for reinforcement learning development

Install the dependencies
# Install the base libraries
pip install numpy matplotlib

# Install the reinforcement learning libraries
pip install gymnasium
pip install stable-baselines3

# Install a deep learning framework (optional, for deep RL)
pip install torch

Step 4: Verify the Installation

Run the test code to confirm the environment is configured correctly

Test code
import gymnasium as gym
import numpy as np

# Create the environment
env = gym.make('CartPole-v1')
print("Environment created successfully!")
print(f"Observation space: {env.observation_space}")
print(f"Action space: {env.action_space}")

# Run a short test episode
obs, info = env.reset()
for _ in range(10):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"Reward: {reward}")
    if terminated or truncated:
        break
env.close()
print("Environment test complete!")

Setup Tips

  • If you run into installation problems, try upgrading pip with pip install --upgrade pip
  • For GPU acceleration, install the matching PyTorch build (see pytorch.org for the exact command)
  • Recommended IDEs: PyCharm, VS Code, or Jupyter Notebook

Demos

Run your first reinforcement learning code and see it in action

Your First Reinforcement Learning Program

This is the simplest possible reinforcement learning example: a random policy interacting with the CartPole environment.

simple_rl.py
import gymnasium as gym
import numpy as np

# Create the CartPole environment
env = gym.make('CartPole-v1', render_mode='human')

# Run several episodes
for episode in range(5):
    obs, info = env.reset()
    total_reward = 0
    steps = 0

    while True:
        # Pick a random action (the simplest possible policy)
        action = env.action_space.sample()

        # Take the action
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward
        steps += 1

        # Episode finished?
        if terminated or truncated:
            print(f"Episode {episode + 1}: steps = {steps}, total reward = {total_reward:.2f}")
            break

env.close()

Q-Learning Implementation

Train an agent with the Q-Learning algorithm to find the optimal path in the FrozenLake environment.

qlearning.py
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt

# Create the environment
env = gym.make('FrozenLake-v1', is_slippery=True)

# Q-Learning hyperparameters
learning_rate = 0.1
discount_factor = 0.95
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 2000

# Initialize the Q-table
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Track episode rewards
rewards = []

# Training loop
for episode in range(episodes):
    state, info = env.reset()
    total_reward = 0
    done = False

    while not done:
        # ε-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state, :])

        # Take the action
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Q-Learning update
        Q[state, action] = Q[state, action] + learning_rate * (
            reward + discount_factor * np.max(Q[next_state, :]) - Q[state, action]
        )

        state = next_state
        total_reward += reward

    rewards.append(total_reward)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if (episode + 1) % 200 == 0:
        avg_reward = np.mean(rewards[-200:])
        print(f"Episode {episode + 1}: average reward = {avg_reward:.2f}")

# Test the trained policy (recreate the environment with rendering enabled)
print("\nTesting the trained policy:")
env.close()
env = gym.make('FrozenLake-v1', render_mode='human', is_slippery=True)
state, info = env.reset()

for step in range(100):
    action = np.argmax(Q[state, :])
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        # On FrozenLake, reward 1 means the goal was reached; 0 means a hole or timeout
        if reward > 0:
            print(f"Reached the goal! Steps: {step + 1}")
        else:
            print(f"Episode ended without reaching the goal. Steps: {step + 1}")
        break

env.close()

Deep Q-Network (DQN)

Implement the DQN algorithm with a deep neural network via the stable-baselines3 library, and train an agent in the CartPole environment.

dqn.py
from stable_baselines3 import DQN
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym

# Create the environment
env = make_vec_env('CartPole-v1', n_envs=1)

# Create the DQN model
model = DQN(
    'MlpPolicy',
    env,
    learning_rate=1e-3,
    buffer_size=10000,
    learning_starts=1000,
    batch_size=32,
    gamma=0.99,
    target_update_interval=100,
    verbose=1
)

# Train the model
print("Training the DQN model...")
model.learn(total_timesteps=10000)
print("Training complete!")

# Save the model
model.save("dqn_cartpole")
print("Model saved as dqn_cartpole.zip")

# Test the model
print("\nTesting the trained model:")
env = gym.make('CartPole-v1', render_mode='human')
obs, info = env.reset()

for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)

    if terminated or truncated:
        print(f"Episode ended at step {i + 1}")
        obs, info = env.reset()

env.close()

SARSA Implementation

SARSA (State-Action-Reward-State-Action) is an on-policy temporal-difference learning algorithm. Its main difference from Q-Learning is that it updates Q values using the action the agent actually executes next.
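
The difference shows up in the TD target. The sketch below contrasts the two updates on a single hand-made transition (all numbers illustrative):

```python
import numpy as np

# One transition (made-up numbers): the agent got reward 1 and landed in state 1
Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])   # Q-table: 2 states x 2 actions
reward, next_state = 1.0, 1
next_action = 0              # the action the ε-greedy agent actually picks next
gamma = 0.9

# Q-learning (off-policy) bootstraps from the *best* next action:
q_target = reward + gamma * np.max(Q[next_state])           # 1 + 0.9 * 3 = 3.7

# SARSA (on-policy) bootstraps from the action *actually taken* next:
sarsa_target = reward + gamma * Q[next_state, next_action]  # 1 + 0.9 * 1 = 1.9

print(q_target, sarsa_target)
```

Because SARSA's target reflects exploratory actions, it tends to learn more conservative policies than Q-learning in risky environments.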

sarsa.py
import gymnasium as gym
import numpy as np

# Create the environment
env = gym.make('FrozenLake-v1', is_slippery=True)

# SARSA hyperparameters
learning_rate = 0.1
discount_factor = 0.95
epsilon = 1.0
epsilon_decay = 0.995
epsilon_min = 0.01
episodes = 2000

# Initialize the Q-table
Q = np.zeros([env.observation_space.n, env.action_space.n])

# Track episode rewards
rewards = []

# Training loop
for episode in range(episodes):
    state, info = env.reset()

    # Choose the initial action (ε-greedy)
    if np.random.random() < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(Q[state, :])

    total_reward = 0
    done = False

    while not done:
        # Take the action
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated

        # Choose the next action (SARSA uses the action it will actually execute)
        if np.random.random() < epsilon:
            next_action = env.action_space.sample()
        else:
            next_action = np.argmax(Q[next_state, :])

        # SARSA update (bootstraps from the next action actually taken)
        if not done:
            Q[state, action] = Q[state, action] + learning_rate * (
                reward + discount_factor * Q[next_state, next_action] - Q[state, action]
            )
        else:
            Q[state, action] = Q[state, action] + learning_rate * (reward - Q[state, action])

        state = next_state
        action = next_action
        total_reward += reward

    rewards.append(total_reward)
    epsilon = max(epsilon_min, epsilon * epsilon_decay)

    if (episode + 1) % 200 == 0:
        avg_reward = np.mean(rewards[-200:])
        print(f"Episode {episode + 1}: average reward = {avg_reward:.2f}")

# Test the trained policy (recreate the environment with rendering enabled)
print("\nTesting the trained policy:")
env.close()
env = gym.make('FrozenLake-v1', render_mode='human', is_slippery=True)
state, info = env.reset()

for step in range(100):
    action = np.argmax(Q[state, :])
    state, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        # A reward of 1 means the goal was reached; 0 means a hole or timeout
        if reward > 0:
            print(f"Reached the goal! Steps: {step + 1}")
        else:
            print(f"Episode ended without reaching the goal. Steps: {step + 1}")
        break

env.close()

Proximal Policy Optimization (PPO)

PPO (Proximal Policy Optimization) is a stable and efficient policy-gradient algorithm that limits how far each policy update can move, avoiding the instability of unconstrained updates. It is one of the most widely used deep reinforcement learning algorithms today.
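
The "limited update" idea is PPO's clipped surrogate objective, L = E[min(r·A, clip(r, 1−ε, 1+ε)·A)], where r is the new-to-old policy probability ratio and A the advantage estimate. A numeric sketch with made-up ratios and advantages:

```python
import numpy as np

# Illustrative probability ratios pi_new(a|s) / pi_old(a|s) and advantage estimates
ratio = np.array([0.8, 1.0, 1.5])
advantage = np.array([1.0, -1.0, 1.0])
clip_range = 0.2  # the epsilon in clip(r, 1 - eps, 1 + eps)

unclipped = ratio * advantage
clipped = np.clip(ratio, 1 - clip_range, 1 + clip_range) * advantage

# Taking the elementwise minimum removes any incentive to push
# the ratio outside [1 - eps, 1 + eps]
objective = np.minimum(unclipped, clipped).mean()
print(objective)  # mean of [0.8, -1.0, 1.2]
```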

ppo.py
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym

# Create the environment (vectorized environments speed up training)
env = make_vec_env('CartPole-v1', n_envs=4)

# Create the PPO model
model = PPO(
    'MlpPolicy',
    env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,
    vf_coef=0.5,
    max_grad_norm=0.5,
    verbose=1
)

# Train the model
print("Training the PPO model...")
model.learn(total_timesteps=50000)
print("Training complete!")

# Save the model
model.save("ppo_cartpole")
print("Model saved as ppo_cartpole.zip")

# Test the model
print("\nTesting the trained model:")
env = gym.make('CartPole-v1', render_mode='human')
obs, info = env.reset()

for i in range(10):
    total_reward = 0
    while True:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward

        if terminated or truncated:
            print(f"Episode {i + 1}: total reward = {total_reward:.2f}")
            obs, info = env.reset()
            break

env.close()
print("\nPPO run complete!")

Advantage Actor-Critic (A2C)

A2C (Advantage Actor-Critic) combines a value function (the critic) with policy gradients (the actor), using an advantage function to reduce variance and stabilize training.
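
The advantage measures how much better an action turned out than the critic expected: in its one-step form, A(s,a) = r + γV(s') − V(s). A tiny sketch with made-up values:

```python
# Critic's value estimates for two states (illustrative numbers)
V = {"s0": 1.0, "s1": 2.0}
reward, gamma = 0.5, 0.9

# One-step advantage: A(s, a) = r + gamma * V(s') - V(s)
advantage = reward + gamma * V["s1"] - V["s0"]  # 0.5 + 1.8 - 1.0 = 1.3
print(advantage)  # positive: the action did better than the critic expected
```

The actor's gradient is weighted by this advantage, so actions that beat the critic's baseline are reinforced and the rest are suppressed.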

a2c.py
from stable_baselines3 import A2C
from stable_baselines3.common.env_util import make_vec_env
import gymnasium as gym

# Create the environment
env = make_vec_env('CartPole-v1', n_envs=4)

# Create the A2C model
model = A2C(
    'MlpPolicy',
    env,
    learning_rate=7e-4,
    n_steps=5,
    gamma=0.99,
    gae_lambda=1.0,
    ent_coef=0.01,
    vf_coef=0.25,
    max_grad_norm=0.5,
    verbose=1
)

# Train the model
print("Training the A2C model...")
model.learn(total_timesteps=50000)
print("Training complete!")

# Save the model
model.save("a2c_cartpole")
print("Model saved as a2c_cartpole.zip")

# Test the model
print("\nTesting the trained model:")
env = gym.make('CartPole-v1', render_mode='human')
obs, info = env.reset()

for i in range(10):
    total_reward = 0
    while True:
        action, _states = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += reward

        if terminated or truncated:
            print(f"Episode {i + 1}: total reward = {total_reward:.2f}")
            obs, info = env.reset()
            break

env.close()
print("\nA2C training complete!")