1. Initialization (Init)

This is where everything begins. We need an environment (Environment) and an agent (Agent) that is still clueless.

env = gym.make("CartPole-v1")
agent = Agent(state_dim=4, action_dim=2)

2. Reset

Before each episode (Episode) starts, the environment must be restored to its initial state.

state, info = env.reset()
done = False

3. Decision-Making (Policy)

The agent decides what to do based on the current state. Early in training this usually involves some random exploration.

# The agent's method: takes a state, returns an action
action = agent.choose_action(state)
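The random exploration mentioned above is commonly implemented as an epsilon-greedy policy: explore with probability epsilon, otherwise pick the best-known action. A minimal sketch (the function name and Q-value list here are illustrative, not part of the template):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit (pick the action with the highest estimated value)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy:
print(epsilon_greedy([0.1, 0.9], epsilon=0.0))  # -> 1
```

In practice, epsilon usually starts near 1 and is decayed over episodes, so the agent explores less as its estimates improve.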

4. Interaction (Step)

Execute the action, and the environment tells you what happened (New State) and how well you did (Reward). Note that `terminated` means the episode ended naturally (e.g. the pole fell), while `truncated` means an external limit was hit (e.g. the step cap); either way, the episode is over.

next_state, reward, terminated, truncated, info = env.step(action)
done = terminated or truncated

5. Learning (Learn)

The key step! The agent updates its brain using the experience it just gathered (s, a, r, s'), so it will be a little smarter next time.

agent.learn(state, action, reward, next_state, done)
state = next_state  # get ready for the next step
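What `agent.learn` actually does depends on the algorithm. As a concrete illustration, here is a tabular Q-learning update, one of the simplest ways to learn from a (s, a, r, s') tuple (the table layout and hyperparameter values are illustrative assumptions, not part of the template):

```python
def q_update(Q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: nudge Q[state][action] toward the TD target."""
    # If the episode ended, there is no future value to bootstrap from
    target = reward if done else reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (target - Q[state][action])

# Toy example: 2 states x 2 actions, all values start at zero
Q = [[0.0, 0.0], [0.0, 0.0]]
q_update(Q, state=0, action=1, reward=1.0, next_state=1, done=True)
print(Q[0][1])  # -> 0.1
```

Deep RL methods such as DQN follow the same idea but replace the table with a neural network and minimize the gap to the TD target as a loss.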

Full Code Template

Here is a standard reinforcement learning training script template that works for most algorithms.

train_template.py
import gymnasium as gym

# --- A placeholder Agent class ---
class Agent:
    def __init__(self, action_space):
        self.action_space = action_space

    def choose_action(self, state):
        # The policy network would go here
        return self.action_space.sample()

    def learn(self, state, action, reward, next_state, done):
        # The core algorithm goes here (loss computation, backprop)
        pass

# --- Main training loop ---
env = gym.make("CartPole-v1")
agent = Agent(env.action_space)

for episode in range(100):
    state, _ = env.reset()
    done = False
    total_reward = 0
    
    while not done:
        action = agent.choose_action(state)
        next_state, reward, term, trunc, _ = env.step(action)
        done = term or trunc
        
        # Learning happens here
        agent.learn(state, action, reward, next_state, done)
        
        state = next_state
        total_reward += reward
        
    print(f"Episode {episode}: Total Reward = {total_reward}")

env.close()
Next up: Creating a Custom Environment