Training an Agent
Breaking down the core reinforcement learning loop: Interaction + Learning
1. Initialization (Init)
Everything starts here. We need an environment (Environment) and a still-naive agent (Agent).
env = gym.make("CartPole-v1")
agent = Agent(state_dim=4, action_dim=2)
2. Reset
Before each episode begins, the environment must be restored to its initial state.
state, info = env.reset()
done = False
3. Decision (Policy)
The agent decides what to do based on the current state. Early in training this usually includes random exploration.
# The agent's method: takes a state, returns an action
action = agent.choose_action(state)
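The random exploration mentioned above is often implemented as an epsilon-greedy rule: with probability epsilon pick a random action, otherwise pick the action with the highest estimated value. A minimal sketch (the function name and the use of a plain value list are illustrative assumptions, not part of this template):

```python
import random

def choose_action(q_values, epsilon):
    """Epsilon-greedy selection over a list of action-value estimates.

    With probability `epsilon`, explore by sampling a random action index;
    otherwise exploit by returning the index of the highest value.
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

In practice epsilon usually starts near 1.0 and is decayed toward a small floor as training progresses, shifting the agent from exploration to exploitation.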
4. Interaction (Step)
Execute the action; the environment tells you what happened (New State) and how well you did (Reward).
next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated
5. Learning (Learn)
The key step! The agent updates its brain from the experience it just gathered (s, a, r, s'), so it acts smarter next time.
agent.learn(state, action, reward, next_state, done)
state = next_state  # prepare for the next iteration
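One concrete instance of a `learn` method is the tabular Q-learning update. The sketch below is illustrative (the `QAgent` class name and the `alpha`/`gamma` values are assumptions, not part of this template): when the episode has ended, the target is just the reward; otherwise it bootstraps from the best next-state value.

```python
from collections import defaultdict

class QAgent:
    """Minimal tabular Q-learning agent (for illustration only)."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.99):
        # Q-table: maps each state to a list of action-value estimates
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.alpha = alpha  # learning rate
        self.gamma = gamma  # discount factor

    def learn(self, state, action, reward, next_state, done):
        # Target: reward only if the episode ended, otherwise
        # bootstrap from the best action value in the next state.
        target = reward if done else reward + self.gamma * max(self.q[next_state])
        # Move the current estimate a fraction alpha toward the target.
        self.q[state][action] += self.alpha * (target - self.q[state][action])
```

Deep RL methods replace the table with a neural network and the increment with a gradient step on a loss, but the (s, a, r, s') update pattern is the same.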
Complete Code Template
This is a standard reinforcement learning training script template, applicable to most algorithms.
train_template.py
import gymnasium as gym

# --- Hypothetical agent class ---
class Agent:
    def __init__(self, action_space):
        self.action_space = action_space

    def choose_action(self, state):
        # A policy network would go here
        return self.action_space.sample()

    def learn(self, state, action, reward, next_state, done):
        # The core algorithm goes here (loss computation, backprop)
        pass

# --- Main training loop ---
env = gym.make("CartPole-v1")
agent = Agent(env.action_space)

for episode in range(100):
    state, _ = env.reset()
    done = False
    total_reward = 0

    while not done:
        action = agent.choose_action(state)
        next_state, reward, term, trunc, _ = env.step(action)
        done = term or trunc

        # Learning happens here
        agent.learn(state, action, reward, next_state, done)

        state = next_state
        total_reward += reward

    print(f"Episode {episode}: Total Reward = {total_reward}")

env.close()